We use the CDC API to collect the COVID-19 cases and deaths dataset. An APP TOKEN is required to call their API.
The requested data come back as a JSON array, which we transform into CSV with some adjustments to make the data easier to read. Because their API enforces a paging limit, we cannot fetch everything in a single request; instead, a loop requests 5,000 records at a time and saves each page into a CSV file, as sketched below.
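A minimal sketch of that paging loop, assuming the Socrata endpoint for the CDC cases-and-deaths dataset (the dataset id, token value, and output file names are placeholders, not the report's actual values):

```python
import requests
import pandas as pd

# Assumed Socrata endpoint for the CDC cases-and-deaths dataset.
BASE_URL = "https://data.cdc.gov/resource/9mfq-cb36.json"
APP_TOKEN = "YOUR_APP_TOKEN"   # hypothetical placeholder
PAGE_SIZE = 5000

page = 0
while True:
    resp = requests.get(
        BASE_URL,
        headers={"X-App-Token": APP_TOKEN},
        params={"$limit": PAGE_SIZE, "$offset": page * PAGE_SIZE},
    )
    resp.raise_for_status()
    records = resp.json()       # one page of the JSON array
    if not records:             # empty page -> all records fetched
        break
    # Save each page as its own CSV file, as described above.
    pd.DataFrame.from_records(records).to_csv(f"cdc_page_{page}.csv", index=False)
    page += 1
```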
Column Name | Description |
---|---|
submission_date | Date of counts |
state | Jurisdiction |
tot_cases | Total number of cases |
conf_cases | Total confirmed cases |
prob_cases | Total probable cases |
new_case | Number of new cases |
pnew_case | Number of new probable cases |
tot_death | Total number of deaths |
conf_death | Total number of confirmed deaths |
prob_death | Total number of probable deaths |
new_death | Number of new deaths |
pnew_death | Number of new probable deaths |
created_at | Date and time record was created |
consent_cases | If Agree, then confirmed and probable cases are included. If Not Agree, then only total cases are included. |
consent_deaths | If Agree, then confirmed and probable deaths are included. If Not Agree, then only total deaths are included. |
Please click here to view a data sample of this dataset.
Different data types, such as float and string, may appear in the same column.
There are many missing values.
For the CDC dataset, we mainly made the following assessments (see the sketch after this list):
- The percentage of missing values for each attribute
- The fraction of noise values
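A sketch of how these two checks could be computed with pandas, assuming the pages have been combined into a single CSV (a hypothetical file name). The noise check shown here counts values in a numeric column that cannot be parsed as numbers; that criterion is an assumption, not necessarily the one used in the report:

```python
import pandas as pd

df = pd.read_csv("cdc_cases_deaths.csv")

# Percentage of missing values for each attribute.
missing_pct = df.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Fraction of "noise" values: entries in a numeric column that cannot
# be parsed as numbers (an assumed definition of noise).
def noise_fraction(series: pd.Series) -> float:
    non_null = series.dropna()
    if non_null.empty:
        return 0.0
    parsed = pd.to_numeric(non_null, errors="coerce")
    return parsed.isna().mean()

for col in ["tot_cases", "conf_cases", "prob_cases", "new_case", "tot_death"]:
    print(col, noise_fraction(df[col]))
```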
We use a Python script to download the OpenSky data for this dataset. Given the URL of a page that hosts the files we need, the method iterates over all download links in the HTML text and adds a link to the waiting list only if it is one of the CSV files we need, checked with 'endswith()'. It then downloads the files with 'requests', showing a progress bar via tqdm. Since all CSV files are compressed as .gz archives, the script also uses gzip to decompress them one by one. A sketch of this routine follows.
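A sketch of the described routine; the page URL in the usage comment and the output directory are assumptions, while endswith(), requests, tqdm, and gzip are used exactly as the text describes:

```python
import gzip
import os
import re
import shutil
from urllib.parse import urljoin

import requests
from tqdm import tqdm

def download_dataset(page_url: str, out_dir: str = "opensky") -> None:
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(page_url).text

    # Keep only the links that end with the compressed-CSV suffix.
    links = [urljoin(page_url, href)
             for href in re.findall(r'href="([^"]+)"', html)
             if href.endswith(".csv.gz")]

    for link in links:
        name = link.rsplit("/", 1)[-1]
        gz_path = os.path.join(out_dir, name)

        # Stream the download with a tqdm progress bar.
        resp = requests.get(link, stream=True)
        total = int(resp.headers.get("content-length", 0))
        with open(gz_path, "wb") as f, tqdm(total=total, unit="B",
                                            unit_scale=True, desc=name) as bar:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
                bar.update(len(chunk))

        # Decompress the .gz archive next to it.
        with gzip.open(gz_path, "rb") as src, open(gz_path[:-3], "wb") as dst:
            shutil.copyfileobj(src, dst)

# Hypothetical page hosting the flight-list files:
# download_dataset("https://opensky-network.org/datasets/covid-19/")
```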
In total, the whole dataset contains 15 columns and 71,645,420 records.
Since it is too large to upload (13.3 GB), we provide a download link for the whole dataset instead.
Column Name | Description |
---|---|
callsign | the identifier of the flight displayed on ATC screens (usually the first three letters are reserved for an airline: AFR for Air France, DLH for Lufthansa, etc.) |
number | the commercial number of the flight, when available (the matching with the callsign comes from public open APIs) |
icao24 | the transponder unique identification number |
registration | the aircraft tail number (when available) |
typecode | the aircraft model type (when available) |
origin | a four-letter code for the origin airport of the flight (when available) |
destination | a four-letter code for the destination airport of the flight (when available) |
firstseen | the UTC timestamp of the first message received by the OpenSky Network |
lastseen | the UTC timestamp of the last message received by the OpenSky Network |
day | the UTC day of the last message received by the OpenSky Network |
Please click here to view a data sample of this dataset.
This is a dataset of tweets scraped from Twitter. We scraped for more than 20 hours to collect over 450K English tweets containing the keywords covid and flight, posted since 01-01-2019. (That is an early start date, but we wanted to make sure the data is fully covered; an interesting fact is that the earliest tweet containing both words was posted on 02-19-2019, though it is unrelated to COVID-19.)
The dataset has many columns, such as datetime, user_id, username, name, tweet, language, mentions, urls, photos, replies_count, retweets_count, likes_count, hashtags, cashtags, link, retweet, quote_url, video, thumbnail, near, geo, etc. The most relevant ones are described in the table below.
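This column set matches the output of the twint scraper, so a scrape like the one described might look like the following (twint is an assumption; the report does not name the scraping tool, and the output file name is a placeholder):

```python
import twint

c = twint.Config()
c.Search = "covid flight"    # tweets containing both keywords
c.Lang = "en"                # English tweets only
c.Since = "2019-01-01"       # start date used in the report
c.Store_csv = True           # write results to CSV
c.Output = "covid_flight_tweets.csv"   # hypothetical output file
twint.run.Search(c)
```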
Column Name | Description |
---|---|
id | id of this original tweet |
conversation_id | id of the conversation that includes this tweet (if applicable); otherwise the id of this original tweet |
created_at | the date and time when this tweet was created (with time zone) |
date | the date when this tweet was created |
time | the time when this tweet was created |
timezone | the time zone of the date and time above |
user_id | id of the user who created this tweet |
username | username of the user who created this tweet |
name | the nickname of the user who created this tweet |
tweet | the plain text of the tweet |
language | the language this tweet uses |
mentions | Twitter users mentioned in this tweet |
urls | the urls that appear in this tweet |
photos | the photos that appear in this tweet |
replies_count | the number of replies to this tweet |
retweets_count | the number of retweets of this tweet |
likes_count | the number of likes of this tweet |
hashtags | the hashtags that appear in this tweet |
link | the original link to this tweet |
video | the number of videos that appear in this tweet |
thumbnail | the thumbnail of the user who created this tweet |
reply_to | the users and their ids that this tweet replies to (if applicable) |
Please click here to view a data sample of this dataset.
There are missing values in many fields.
There are noise values in the 'tweet' column; for example, many emojis and punctuation marks that are useless for the following analysis.
We clean the tweet text with the script cleaningText.py.
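A minimal sketch of the kind of cleaning cleaningText.py performs; the exact rules are assumptions, and this version strips URLs, @mentions, emojis/non-ASCII characters, and punctuation, then normalizes whitespace and case:

```python
import re
import string

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)        # drop URLs
    text = re.sub(r"@\w+", " ", text)                # drop @mentions
    text = text.encode("ascii", "ignore").decode()   # drop emojis / non-ASCII
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("Flight cancelled 😞 due to #covid!! https://t.co/xyz @airline"))
# -> "flight cancelled due to covid"
```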
The percentage of valid content: 19.478%
The percentage of useful words: 78.476%