Course Name: Natural Language Processing in Trading, Section No: 3, Unit No: 3, Unit type: Notebook
Hi Team,
Good Day,
I have three Apple stock sentiment datasets covering the years 2014, 2015, and 2016:
- "aapl_news_headline.csv": This CSV file contains the Apple sentiment dataset that I pre-processed myself from scratch. (I prepared it from the raw dataset available at this link: https://sites.google.com/view/headlinedataset/home ).
- "news_headline_sentiments_aapl.csv": This CSV file contains the Apple sentiment dataset provided to us by Quantra.
- "rows_that_are_in_my_dataset_but_not_in_quantra_dataset.csv": As the name suggests, this CSV file contains the Apple sentiment rows that are in my dataset (i.e. "aapl_news_headline.csv") but not in the dataset provided by Quantra (i.e. "news_headline_sentiments_aapl.csv"). I prepared it by comparing the two datasets with some data wrangling.
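The kind of comparison behind the third file can be sketched in pandas as an anti-join. This is only an illustration under assumptions, not necessarily the exact wrangling used: the column names `date` and `headline`, the helper name `rows_only_in_left`, and the toy rows are all hypothetical and would need to be replaced by the real columns of the two CSV files.

```python
import pandas as pd


def rows_only_in_left(left, right, keys):
    """Return the rows of `left` whose key columns have no match in `right`."""
    # Keep only the key columns of `right`, de-duplicated, so the merge
    # cannot multiply rows of `left`.
    right_keys = right[keys].drop_duplicates()
    # indicator=True adds a `_merge` column saying where each row was found.
    merged = left.merge(right_keys, on=keys, how="left", indicator=True)
    only_left = merged[merged["_merge"] == "left_only"]
    return only_left.drop(columns="_merge").reset_index(drop=True)


# Toy stand-ins for the two datasets (made-up rows, assumed column names).
mine = pd.DataFrame({
    "date": ["2014-01-02", "2014-01-02", "2014-01-03"],
    "headline": ["headline A", "headline B", "headline C"],
})
quantra = pd.DataFrame({
    "date": ["2014-01-02"],
    "headline": ["headline B"],
})

extra = rows_only_in_left(mine, quantra, ["date", "headline"])
print(extra)
```

With the real files, `mine` and `quantra` would instead be loaded with `pd.read_csv`, and the key columns would be whichever columns identify a headline in both files.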
There are roughly 42,283 data rows in my dataset (i.e. "aapl_news_headline.csv") and roughly 15,313 data rows in the dataset provided by Quantra (i.e. "news_headline_sentiments_aapl.csv"), so my dataset has roughly 26,970 more rows (42,283 - 15,313). That means some pre-processing is still missing on my side to filter my dataset down from 42,283 rows to 15,313 rows.
It would be extremely helpful if the team could help me with this. Which further pre-processing steps am I missing? I would be highly obliged if the team could share the code they used for data pre-processing when preparing this course. This is a crucial step, and it would help me improve my understanding of how to deal with these kinds of sentiment-based datasets.
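One way to narrow down where the extra rows come from is to summarize both files in the same way and compare the numbers. The sketch below is only a diagnostic under assumptions: the helper name `summarize` and the column names `date` and `headline` are hypothetical, and the toy rows are made up. It does not reproduce Quantra's pre-processing, which is not known here.

```python
import pandas as pd


def summarize(df, date_col="date", text_col="headline"):
    """Collect simple counts that help compare two headline datasets."""
    # Unparseable dates become NaT and are ignored in the per-year counts.
    dates = pd.to_datetime(df[date_col], errors="coerce")
    return {
        "rows": len(df),
        # Rows whose headline text already appeared earlier in the file.
        "duplicate_headlines": int(df.duplicated(subset=[text_col]).sum()),
        "rows_per_year": dates.dt.year.value_counts().sort_index().to_dict(),
    }


# Toy stand-in for one of the datasets (made-up rows, assumed column names).
toy = pd.DataFrame({
    "date": ["2014-01-02", "2014-01-02", "2015-03-04"],
    "headline": ["headline A", "headline A", "headline B"],
})

summary = summarize(toy)
print(summary)
```

Running the same summary on both CSV files would show whether the gap is concentrated in repeated headlines or in particular years.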
The datasets discussed above are available at the public link below:
Apple Sentiment Data