We use the CDC API to collect the COVID-19 cases and deaths dataset. An APP TOKEN is required to call their API.
The requested data come back as a JSON array, which we transform into CSV with some adjustments to make the data easier to read. Because their API enforces a paging limit, we cannot fetch everything in a single request; instead, a loop requests 5,000 records at a time and saves each page into a CSV file, as sketched below.
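A minimal sketch of that paging loop, assuming the Socrata endpoint for the CDC cases-and-deaths dataset (the dataset id, token value, and output file names are placeholders, not the report's actual values):

```python
import requests
import pandas as pd

# Assumed Socrata endpoint for the CDC cases-and-deaths dataset.
BASE_URL = "https://data.cdc.gov/resource/9mfq-cb36.json"
APP_TOKEN = "YOUR_APP_TOKEN"   # hypothetical placeholder
PAGE_SIZE = 5000

page = 0
while True:
    resp = requests.get(
        BASE_URL,
        headers={"X-App-Token": APP_TOKEN},
        params={"$limit": PAGE_SIZE, "$offset": page * PAGE_SIZE},
    )
    resp.raise_for_status()
    records = resp.json()       # one page of the JSON array
    if not records:             # empty page -> all records fetched
        break
    # Save each page as its own CSV file, as described above.
    pd.DataFrame.from_records(records).to_csv(f"cdc_page_{page}.csv", index=False)
    page += 1
```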
Column Name | Description |
---|---|
submission_date | Date of counts |
state | Jurisdiction |
tot_cases | Total number of cases |
conf_cases | Total confirmed cases |
prob_cases | Total probable cases |
new_case | Number of new cases |
pnew_case | Number of new probable cases |
tot_death | Total number of deaths |
conf_death | Total number of confirmed deaths |
prob_death | Total number of probable deaths |
new_death | Number of new deaths |
pnew_death | Number of new probable deaths |
created_at | Date and time record was created |
consent_cases | If Agree, then confirmed and probable cases are included. If Not Agree, then only total cases are included. |
consent_deaths | If Agree, then confirmed and probable deaths are included. If Not Agree, then only total deaths are included. |
Please click here to view a data sample of this dataset.
Different data types, such as float and string, may appear in the same column.
There are many missing values.
For the CDC dataset, we mainly made the following assessments (see the sketch after this list):
- The percentage of missing values for each attribute
- The fraction of noise values
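A sketch of how these two checks could be computed with pandas, assuming the pages have been combined into a single CSV (a hypothetical file name). The noise check shown here counts values in a numeric column that cannot be parsed as numbers; that criterion is an assumption, not necessarily the one used in the report:

```python
import pandas as pd

df = pd.read_csv("cdc_cases_deaths.csv")

# Percentage of missing values for each attribute.
missing_pct = df.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Fraction of "noise" values: entries in a numeric column that cannot
# be parsed as numbers (an assumed definition of noise).
def noise_fraction(series: pd.Series) -> float:
    non_null = series.dropna()
    if non_null.empty:
        return 0.0
    parsed = pd.to_numeric(non_null, errors="coerce")
    return parsed.isna().mean()

for col in ["tot_cases", "conf_cases", "prob_cases", "new_case", "tot_death"]:
    print(col, noise_fraction(df[col]))
```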
We use a Python script to download the OpenSky data for this dataset. Given the URL of a page that hosts the files we need, the method iterates over all download links in the HTML text and adds a link to the waiting list only if it is one of the CSV files we need, checked with 'endswith()'. It then downloads the files with 'requests', showing a progress bar via tqdm. Since all CSV files are compressed as .gz archives, the script also uses gzip to decompress them one by one. A sketch of this routine follows.
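A sketch of the described routine; the page URL in the usage comment and the output directory are assumptions, while endswith(), requests, tqdm, and gzip are used exactly as the text describes:

```python
import gzip
import os
import re
import shutil
from urllib.parse import urljoin

import requests
from tqdm import tqdm

def download_dataset(page_url: str, out_dir: str = "opensky") -> None:
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(page_url).text

    # Keep only the links that end with the compressed-CSV suffix.
    links = [urljoin(page_url, href)
             for href in re.findall(r'href="([^"]+)"', html)
             if href.endswith(".csv.gz")]

    for link in links:
        name = link.rsplit("/", 1)[-1]
        gz_path = os.path.join(out_dir, name)

        # Stream the download with a tqdm progress bar.
        resp = requests.get(link, stream=True)
        total = int(resp.headers.get("content-length", 0))
        with open(gz_path, "wb") as f, tqdm(total=total, unit="B",
                                            unit_scale=True, desc=name) as bar:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
                bar.update(len(chunk))

        # Decompress the .gz archive next to it.
        with gzip.open(gz_path, "rb") as src, open(gz_path[:-3], "wb") as dst:
            shutil.copyfileobj(src, dst)

# Hypothetical page hosting the flight-list files:
# download_dataset("https://opensky-network.org/datasets/covid-19/")
```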
In total, the whole dataset contains 15 columns and 71,645,420 records.
Since it is too large to upload (13.3 GB), we provide a download link for the whole dataset instead.
Column Name | Description |
---|---|
callsign | the identifier of the flight displayed on ATC screens (usually the first three letters are reserved for an airline: AFR for Air France, DLH for Lufthansa, etc.) |
number | the commercial number of the flight, when available (the matching with the callsign comes from public open APIs) |
icao24 | the transponder unique identification number |
registration | the aircraft tail number (when available) |
typecode | the aircraft model type (when available) |
origin | a four-letter code for the origin airport of the flight (when available) |
destination | a four-letter code for the destination airport of the flight (when available) |
firstseen | the UTC timestamp of the first message received by the OpenSky Network |
lastseen | the UTC timestamp of the last message received by the OpenSky Network |
day | the UTC day of the last message received by the OpenSky Network |
Please click here to view a data sample of this dataset.
This is a dataset of tweets scraped from Twitter. We scraped for more than 20 hours to collect over 450K English tweets containing the keywords covid and flight, posted since 01-01-2019. (That is an early start date, but we wanted to make sure the data is fully covered; an interesting fact is that the earliest tweet containing both words was posted on 02-19-2019, though it is unrelated to COVID-19.)
The dataset has many columns, such as datetime, user_id, username, name, tweet, language, mentions, urls, photos, replies_count, retweets_count, likes_count, hashtags, cashtags, link, retweet, quote_url, video, thumbnail, near, geo, etc. The most relevant ones are described in the table below.
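This column set matches the output of the twint scraper, so a scrape like the one described might look like the following (twint is an assumption; the report does not name the scraping tool, and the output file name is a placeholder):

```python
import twint

c = twint.Config()
c.Search = "covid flight"    # tweets containing both keywords
c.Lang = "en"                # English tweets only
c.Since = "2019-01-01"       # start date used in the report
c.Store_csv = True           # write results to CSV
c.Output = "covid_flight_tweets.csv"   # hypothetical output file
twint.run.Search(c)
```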
Column Name | Description |
---|---|
id | id of this original tweet |
conversation_id | id of the conversation that includes this tweet (if applicable); otherwise the id of this original tweet |
created_at | the date and time when this tweet was created (with time zone) |
date | the date when this tweet was created |
time | the time when this tweet was created |
timezone | the time zone of the date and time above |
user_id | id of the user who created this tweet |
username | username of the user who created this tweet |
name | the nickname of the user who created this tweet |
tweet | the plain text of the tweet |
language | the language this tweet uses |
mentions | Twitter users mentioned in this tweet |
urls | the urls that appear in this tweet |
photos | the photos that appear in this tweet |
replies_count | the number of replies to this tweet |
retweets_count | the number of retweets of this tweet |
likes_count | the number of likes of this tweet |
hashtags | the hashtags that appear in this tweet |
link | the original link to this tweet |
video | the number of videos that appear in this tweet |
thumbnail | the thumbnail of the user who created this tweet |
reply_to | the users and their ids that this tweet replies to (if applicable) |
Please click here to view a data sample of this dataset.
There are missing values in many fields.
There are noise values in the 'tweet' column; for example, many emojis and punctuation marks that are useless for the following analysis.
We clean the tweet text with the script cleaningText.py.
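A minimal sketch of the kind of cleaning cleaningText.py performs; the exact rules are assumptions, and this version strips URLs, @mentions, emojis/non-ASCII characters, and punctuation, then normalizes whitespace and case:

```python
import re
import string

def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)        # drop URLs
    text = re.sub(r"@\w+", " ", text)                # drop @mentions
    text = text.encode("ascii", "ignore").decode()   # drop emojis / non-ASCII
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("Flight cancelled 😞 due to #covid!! https://t.co/xyz @airline"))
# -> "flight cancelled due to covid"
```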
The percentage of valid content: 19.478%
The percentage of useful words: 78.476%