
Pre-Processing of Data

We extract 8 columns from the COVID-19 dataset and 3 columns from the flight dataset.

3.1 COVID-19 Cases and Deaths Dataset

  • Handling null values

    • For null values, we decided to fill them with the column mean, because we need to preserve the relationship between the columns, such as tot_cases = conf_cases + prob_cases (a small consistency check is sketched after this list).
# fill null values in each numeric column with that column's mean
for col in data.select_dtypes("number").columns:
    data[col].fillna(data[col].mean(), inplace=True)
  • Checking for duplicates
# check for duplicate rows
print(data[data.duplicated()])
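Once imputation is done, the relationship mentioned above can be spot-checked. A minimal sketch, assuming the same data DataFrame and the CDC column names tot_cases, conf_cases, and prob_cases (the tolerance used here is illustrative):
# fraction of rows that satisfy tot_cases = conf_cases + prob_cases after imputation
diff = data["tot_cases"] - (data["conf_cases"] + data["prob_cases"])
print(f"{(diff.abs() < 1e-6).mean():.1%} of rows satisfy the relation")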

3.2 OpenSky Airline Dataset

  • Since we only need the origin, destination, icao24 and date of flights for the current project, we delete any records that contain NaN or empty values in these four columns.
  • Also, for this project we only need flights that take place in the United States, so we use a Python crawler to get the airports and their corresponding states from Wikipedia; this lets us identify and delete records containing airports that are not in the United States. Both filtering steps are sketched below.
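A minimal sketch of the two filtering steps, assuming pandas; the file names (us_airports.csv, flightlist.csv) and the column names are placeholders for illustration, not the exact ones in our scripts:
import pandas as pd

# ICAO codes of US airports produced by the Wikipedia crawler (hypothetical file name)
us_airports = set(pd.read_csv("us_airports.csv")["icao"])

# OpenSky flight list; keep only the four columns we need (names assumed here)
cols = ["origin", "destination", "icao24", "day"]
flights = pd.read_csv("flightlist.csv", usecols=cols)

# drop records with NaN or empty values in any of the four columns
flights = flights.replace("", pd.NA).dropna(subset=cols)

# keep only flights whose origin and destination are both US airports
flights = flights[flights["origin"].isin(us_airports) & flights["destination"].isin(us_airports)]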

3.3 Twitter Dataset

In this part, we need to decide what content of a tweet counts as noise. This decision should also be made with what we plan to do with the tweets in mind. After our discussion, we decided to perform sentiment analysis on all of these tweets (described further in that section).

After reading quite a few scraped tweets, we found that four kinds of noise should be cleaned: stopwords, URLs, mention tags, and hashtags.

The reason for removing stopwords is clear: there is no need to analyze the sentiment of stopwords, which usually carry no emotional meaning. The same reasoning applies to URLs and mention tags. Whether to remove hashtags took more discussion. We decided to remove them because, although they do carry meaning at times, most of the time they refer to specific objective things (usually nouns) that tend not to have emotional meaning.

We used the NLTK package to tokenize tweets, and we used its English stopword list as ours. For URLs, hashtags, and mention tags, we came up with two regular expressions to find and remove them.
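A minimal sketch of this cleaning step, assuming NLTK's punkt tokenizer and English stopword list; the regular expressions shown here are illustrative stand-ins, not the exact ones we wrote:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# requires: nltk.download("punkt"); nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # URLs
TAG_RE = re.compile(r"[@#]\w+")                 # mention tags and hashtags

def clean_tweet(text):
    text = URL_RE.sub(" ", text)                # strip URLs
    text = TAG_RE.sub(" ", text)                # strip mentions and hashtags
    tokens = word_tokenize(text.lower())        # NLTK tokenization
    return " ".join(t for t in tokens if t.isalpha() and t not in STOPWORDS)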

Every tweet is read from a CSV file named result_covid_flight.csv and then rewritten into a new CSV file named result_covid_flight_cleaning.csv. As we collected 470K records, which is too large to upload to GitHub, we created a share link on Google Drive for you to use:

https://drive.google.com/file/d/1OA1kkjQ9V0ZXQ-s8MqbLsOXATgK4k-IW/view
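The read-clean-write loop itself is straightforward. A sketch, assuming the tweet text sits in a column named text (an assumption; the real column name may differ) and reusing the clean_tweet helper sketched above:
import csv

with open("result_covid_flight.csv", newline="", encoding="utf-8") as fin, \
     open("result_covid_flight_cleaning.csv", "w", newline="", encoding="utf-8") as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["text"] = clean_tweet(row["text"])  # "text" column name is an assumption
        writer.writerow(row)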