Further analysis of data preprocessing

Course Name: Natural Language Processing in Trading, Section No: 3, Unit No: 3, Unit type: Notebook



Hi Team,

Good day,

I have three Apple stock sentiment datasets covering the years 2014, 2015, and 2016:

  1. "aapl_news_headline.csv": The Apple sentiment dataset that I pre-processed from scratch. (It was prepared from the plain-vanilla dataset that I took from this link: https://sites.google.com/view/headlinedataset/home ).
  2. "news_headline_sentiments_aapl.csv": The Apple sentiment dataset that is provided to us by Quantra.
  3. "rows_that_are_in_my_dataset_but_not_in_quantra_dataset.csv": As the name suggests, this CSV file contains the rows that are present in my dataset (i.e. "aapl_news_headline.csv") but not in the dataset provided by Quantra (i.e. "news_headline_sentiments_aapl.csv"). I prepared it with some data wrangling.
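
For reference, one way such a "rows in mine but not in Quantra's" file can be produced is a left anti-join in pandas. This is only a sketch: the file names match the list above, but the key column name (`headline`) is an assumption about the CSVs' schema.

```python
import pandas as pd

def rows_only_in_first(mine: pd.DataFrame, other: pd.DataFrame,
                       key: str = "headline") -> pd.DataFrame:
    """Left anti-join: rows of `mine` whose `key` value never appears in `other`."""
    merged = mine.merge(other[[key]].drop_duplicates(),
                        on=key, how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Usage with the actual files (the key column "headline" is assumed):
# mine = pd.read_csv("aapl_news_headline.csv")
# quantra = pd.read_csv("news_headline_sentiments_aapl.csv")
# rows_only_in_first(mine, quantra).to_csv(
#     "rows_that_are_in_my_dataset_but_not_in_quantra_dataset.csv", index=False)
```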


There are roughly 42,283 rows in my dataset (i.e. "aapl_news_headline.csv") and roughly 15,313 rows in the dataset provided by Quantra (i.e. "news_headline_sentiments_aapl.csv"), a difference of about 26,970 rows. That means some pre-processing is still missing on my side to filter my dataset down from 42,283 rows to 15,313 rows. It would be extremely helpful if the team could tell me which further pre-processing step I am missing here; I am really curious to know. I would also be highly obliged if the team could share the code they used for data pre-processing while making this course. This is a crucial part of the pipeline, and it will help me learn how to deal with these kinds of sentiment-based datasets.

The link for the datasets discussed above is given below; it is public and accessible:
Apple Sentiment Data

Hello Harshvardhan,



The difference is there because the data considered on Quantra is a chunk of the whole data (as mentioned in the notebook markdown) and only covers data around September 2014.

You can check the same by doing a head() and tail() of the data being used.
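
A quick sketch of that check, assuming the file name from the thread; the `date` column name is an assumption about the CSV's schema, so adjust it to whatever the file actually uses.

```python
import pandas as pd

def show_span(csv_path: str, date_col: str = "date") -> None:
    """Print the first/last rows of a headline CSV and its date range."""
    df = pd.read_csv(csv_path)
    print(df.head())   # earliest rows in file order
    print(df.tail())   # latest rows in file order
    # Column name `date_col` is assumed; adjust to the file's schema.
    dates = pd.to_datetime(df[date_col])
    print("span:", dates.min(), "to", dates.max())

# show_span("news_headline_sentiments_aapl.csv")
```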



Your analysis, if its scope covers 2014, 2015, and 2016, will definitely have more data.



Hope this helps!

Please feel free to connect if you have any more questions!

Hi Gaurav,



Apologies, I think I didn't make it clear. The dataset link I attached in my previous comment points to the dataset given to us by Quantra; it covers 2014 to 2016 and contains 15,313 rows.




Hi Harshvardhan,



We suspect that you haven't applied the Business category as a filter. Can you try doing that?



To do the same, use the source_id column and keep only those rows that belong to the Business category.
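
A minimal sketch of that filter. The exact label stored in source_id is an assumption here; inspect `df["source_id"].unique()` first to see the real values before filtering.

```python
import pandas as pd

def keep_business_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose source_id marks the Business category."""
    # The label "Business" is an assumption; check
    # df["source_id"].unique() to confirm the actual values.
    return df[df["source_id"] == "Business"]

# Usage:
# df = pd.read_csv("aapl_news_headline.csv")
# filtered = keep_business_rows(df)
# print(len(df), "->", len(filtered))
```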



Thanks!