Loading data
We extract 8 columns from the COVID-19 dataset and 3 columns from the flight dataset.
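As a rough sketch of that selection (the column names and file names below are placeholders, not the project's actual ones):

import pandas as pd

# Placeholder column and file names; the real datasets define the actual 8 + 3 columns.
covid_cols = ["date", "country", "confirmed", "deaths",
              "recovered", "active", "new_cases", "new_deaths"]
flight_cols = ["date", "airline", "num_flights"]

covid = pd.read_csv("covid19.csv", usecols=covid_cols)
flights = pd.read_csv("flights.csv", usecols=flight_cols)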
Handling null values
# fill missing values in each column with that column's mean
data[col].fillna(data[col].mean(), inplace=True)
# check for duplicate rows
print(data[data.duplicated()])
In this part, we need to decide what content of a tweet counts as noise. This decision should also be made with an eye to what we plan to do with these tweets. After our discussion, we decided to perform sentiment analysis on all of them (described further in that section).
After reading quite a few scraped tweets, we found that four kinds of noise should be removed: stopwords, URLs, mention tags, and hashtags.
The reason for removing stopwords is clear: there is no point in analyzing the sentiment of stopwords, which usually carry no emotional meaning. The same reasoning applies to URLs and mention tags. Whether to remove hashtags took more discussion; we decided to remove them because, while they do carry meaning sometimes, most of the time they refer to specific objective things (usually nouns) that tend not to have emotional meaning.
We used the nltk package to tokenize tweets and took our stopwords list from it. For URLs, hashtags, and mention tags, we came up with two regular expressions to find and remove them.
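A minimal sketch of this cleaning step, assuming a clean_tweet helper; the two regular expressions below are illustrative, not the exact patterns used in the project:

import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

nltk.download("stopwords", quiet=True)  # one-time download of the stopword list

# Illustrative patterns: one expression for URLs, one for @mentions and #hashtags.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[@#]\w+")

tokenizer = TweetTokenizer()
stop_words = set(stopwords.words("english"))

def clean_tweet(text):
    """Strip URLs, mention/hashtag tags, and stopwords from one tweet."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    tokens = tokenizer.tokenize(text)
    kept = [t for t in tokens if t.lower() not in stop_words]
    return " ".join(kept)

print(clean_tweet("Stuck at the airport again @delta #covid https://t.co/abc"))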
Every tweet is read from a CSV file named result_covid_flight.csv and then written into a new CSV file named result_covid_flight_cleaning.csv. As we collected 470K records and a file that large cannot be uploaded to GitHub, we created a shared link on Google Drive for you to use.
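A sketch of that read-clean-write pass, reusing the clean_tweet helper above; the "text" column name is an assumption, and chunked reading is optional but helps with ~470K rows:

import pandas as pd

# Clean the tweets in chunks so the whole file never has to sit in memory.
reader = pd.read_csv("result_covid_flight.csv", chunksize=50_000)

for i, chunk in enumerate(reader):
    # "text" is an assumed name for the tweet column; adjust to the real one.
    chunk["text"] = chunk["text"].astype(str).apply(clean_tweet)
    chunk.to_csv(
        "result_covid_flight_cleaning.csv",
        mode="w" if i == 0 else "a",  # overwrite on the first chunk, append after
        header=(i == 0),              # write the header row only once
        index=False,
    )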