Data Preprocessing and Data Modeling for Sentiment Class and Sentiment Score

Harshvardhan_Singh_paTj · July 3, 2021, 8:17pm

Course Name: Natural Language Processing in Trading, Section No: 3, Unit No: 3, Unit type: Notebook

Hi Team,

Good Day,

I have two problems here:

I have two datasets of Apple stock sentiment data: one from Quantra (QuantInsti), news_headline_sentiments_aapl.csv, and another, news_headline_sentiment_data_combined_2014_2015_2016_aapl.csv, which I prepared from scratch through data preprocessing (source: https://sites.google.com/view/headlinedataset/home). My dataset combines 2014, 2015, and 2016, but that doesn't matter here. My dataset has two timestamp columns, ["start_time_stamp"] and ["end_time_stamp"], whereas Quantra's dataset has only one, ["time_stamp"]. Could you please confirm whether the ["time_stamp"] column in the Quantra dataset is the same as ["start_time_stamp"] in my dataset?
My dataset has no "sentiment_class" or "sentiment_score" column, so I would like to know the procedure for calculating "sentiment_class" and "sentiment_score" from scratch.

The link to the datasets discussed above is given below; it is public and accessible:
Apple Stock Data: 1 csv with sentiment class and score and 1 csv without sentiment class and score

Vibhu_uxX · July 5, 2021, 3:46pm

Hi Harshvardhan,

For the first part of your question:

The time_stamp column in the Quantra dataset is the time at which the news headline first appears. I believe the start_time_stamp in your dataset refers to the same thing, that is, the first appearance of the news headline. You can use start_time_stamp in your analysis or sentiment calculation.

For the second part of your question:

Yes, I can see that your dataset contains only the news headlines, and you want to predict the sentiment class of those headlines. You can use the news headline sentiment data from the course to train a model. Once the model is trained and gives satisfactory results on the test dataset, you can pass your own news headlines to the trained model to predict their sentiment class.
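To make the train-then-predict idea concrete, here is a minimal, dependency-free sketch. It is not the course's method: it assumes a simple bag-of-words Naive Bayes classifier, assumes the classes are encoded as 1 (positive) and -1 (negative), and uses a handful of invented headlines in place of the labelled Quantra data. The score definition, P(positive) minus P(negative), is also only an assumption for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy labelled headlines, invented for illustration (NOT taken from either dataset).
# Assumed label encoding: 1 = positive, -1 = negative.
train = [
    ("apple shares surge on record iphone sales", 1),
    ("apple beats earnings estimates", 1),
    ("analysts upgrade apple after strong quarter", 1),
    ("apple shares fall on weak demand", -1),
    ("apple misses revenue estimates", -1),
    ("analysts downgrade apple after weak quarter", -1),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def fit(rows):
    """Collect per-class word counts for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in rows:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict_proba(model, text):
    """Return {label: posterior probability} using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_post = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        lp = math.log(class_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                lp += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        log_post[label] = lp
    # Normalise the log-posteriors into probabilities.
    top = max(log_post.values())
    unnorm = {k: math.exp(v - top) for k, v in log_post.items()}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}


def score_headline(model, text):
    """Return (sentiment_class, sentiment_score) for one headline.

    Assumption: the score is P(positive) - P(negative), so it lies in [-1, 1].
    """
    proba = predict_proba(model, text)
    sentiment_class = max(proba, key=proba.get)
    sentiment_score = proba[1] - proba[-1]
    return sentiment_class, sentiment_score


model = fit(train)
print(score_headline(model, "apple shares surge after strong earnings"))
```

In practice you would replace the toy list with the labelled course data, hold out part of it as a test set, and probably use a library classifier (for example scikit-learn's CountVectorizer with MultinomialNB) rather than hand-written counting.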

The detailed steps are mentioned here.

If you still face any issues, let us know.

Thanks!

Harshvardhan_Singh_paTj · July 5, 2021, 6:03pm

Thank you very much Vibhu. I appreciate your assistance!