NLP in trading

This is regarding the capstone solution. I am facing some issues with the code:



Issue 1: The code below doesn't give me 2021 data; it fetches 2024 data.


# Enter the keywords to get the Apple stock news data
keywords = ['Apple-Stock', 'Apple-Revenue', 'Apple-Sales', 'Apple', 'AAPL']

# Set the date range to fetch the news data
googlenews.set_time_range('06/01/2021', '07/01/2021')



My attempted fix, which didn't work either: googlenews.set_time_range('2021-06-01', '2021-07-01')



Issue 2: When I try to merge the two dataframes aapl_stock_data and daily_sentiment_score with the code below, I get an empty dataframe.


# Join the two dataframes: aapl_stock_data and daily_sentiment_score
prices = aapl_stock_data.merge(
    daily_sentiment_score['score'].to_frame(), on='Date')
prices.dropna(inplace=True)
prices.head()



Issue 3: I ran the XGBoost model on the same dataset as provided, but got the following error:



ValueError: Invalid classes inferred from unique values of y. Expected: [0 1 2], got [-1 0 1]



So I mapped the labels. Is the step below correct?



class_map = {-1: 0, 0: 1, 1: 2}
y_train_mapped = [class_map[label] for label in y_train]

# Instantiate XGBClassifier
xg = XGBClassifier(max_depth=6, n_estimators=100, eval_metric='mlogloss')

# Fit the model on train dataset
xg_model = xg.fit(X_new_train, y_train_mapped)



I have observed that some of the solutions don't work.



I would appreciate solutions to these issues.



 

Hi,



Let me reply to each of the issues in turn.



Issue 1: GoogleNews is a bit unstable, so you might not be able to retrieve data from years in the past.

Lately, you will also get an "HTTP Error 429: Too Many Requests" if you are not using API keys.



With respect to the capstone project, you can train the model on the provided dataset and then predict on the latest data.

To make sure you don't get stuck now that Google News has become restrictive, you can use NewsAPI to retrieve the latest news data.



Note that NewsAPI requires an API key, and the free plan has only 100 requests per day. Also, the free plan only allows retrieving roughly the last month of data.



The capstone solution notebook has been updated with the code for newsapi. 
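As a rough sketch of what the NewsAPI fetch looks like (the API key, keywords, and dates here are placeholders, and the `get_everything` call is from the newsapi-python client, so it is commented out rather than run):

```python
from datetime import datetime

def to_newsapi_date(us_date):
    """Convert the 'MM/DD/YYYY' strings used with googlenews.set_time_range
    to the ISO 'YYYY-MM-DD' format that NewsAPI expects."""
    return datetime.strptime(us_date, '%m/%d/%Y').strftime('%Y-%m-%d')

# With the newsapi-python package (pip install newsapi-python) the fetch
# would look like this -- the API key is a placeholder, so it is not run here:
# from newsapi import NewsApiClient
# newsapi = NewsApiClient(api_key='YOUR_API_KEY')
# articles = newsapi.get_everything(
#     q='Apple OR AAPL',
#     from_param=to_newsapi_date('06/01/2024'),
#     to=to_newsapi_date('07/01/2024'),
#     language='en',
#     sort_by='publishedAt',
# )
```

Remember that on the free plan the from/to range has to fall within roughly the last month.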

You can get your own API key on the News API website.



This should solve the first issue.



For the second issue, if there was no common row between aapl_stock_data and daily_sentiment_score, the merge would return an empty dataframe.



If you check the solution notebook, it should work fine.
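A common cause of an empty merge is a mismatch on the 'Date' key, e.g. a string index on one side and a datetime column on the other. A minimal sketch with toy frames (the data values are made up for illustration):

```python
import pandas as pd

# Toy frames standing in for aapl_stock_data and daily_sentiment_score.
aapl_stock_data = pd.DataFrame({
    'Date': pd.to_datetime(['2021-06-01', '2021-06-02']),
    'Close': [124.3, 125.1],
})
daily_sentiment_score = pd.DataFrame(
    {'score': [0.2, -0.1]},
    index=['2021-06-01', '2021-06-02'],   # string index: a typical mismatch
)

# Normalise the sentiment index to datetime and expose it as a 'Date' column
sentiment = daily_sentiment_score.copy()
sentiment.index = pd.to_datetime(sentiment.index)
sentiment = sentiment.rename_axis('Date').reset_index()

# With both 'Date' keys as datetime, the merge matches rows as expected
prices = aapl_stock_data.merge(sentiment[['Date', 'score']], on='Date')
print(prices)
```

If both keys already have the same dtype, `merge(..., how='outer', indicator=True)` is a quick way to see which side the unmatched rows come from.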



The third issue could not be replicated on our system. Is it possible that the notebook cells were run in a different order than in the notebook you sent?




Do let us know if you have any other queries. Thanks.

Hi Rekhit, 



Thanks for your reply. I appreciate it. 



I tried to run the file, but I am still stuck with Issue 3. I ran a fresh file as given in the updated Quantra solution.



I got the same error:



ValueError: Invalid classes inferred from unique values of y. Expected: [0 1 2], got [-1 0 1]





Here's the link:



Google Colab



Please check the file. Please feel free to edit it. 



Is there another way out, such as mapping or label encoding? I have tried label encoding, but I am not sure about it.



Looking forward to your reply.



Thanks & Regards

Saurabh Kamal







 

Hi Saurabh,



This error is due to a different version of xgboost running on Google Colab. It looks like there has been some change in the library code. One way to resolve this is to use a lower xgboost version on Google Colab. You can use the following code:

pip install xgboost==1.4.1

You can also refer to this thread for more information about this.
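If you would rather stay on a recent xgboost version, the label mapping you tried is also a valid workaround; just remember to map the predictions back. A small sketch with plain dicts (the label and prediction lists here are stand-ins, and the XGBClassifier calls are the same as in the notebook, shown as comments):

```python
# Shift labels {-1, 0, 1} to {0, 1, 2} for training, and map predictions back.
class_map = {-1: 0, 0: 1, 1: 2}
inverse_map = {v: k for k, v in class_map.items()}

y_train = [-1, 0, 1, 1, -1]                       # stand-in labels
y_train_mapped = [class_map[y] for y in y_train]  # now in {0, 1, 2}

# xg_model = XGBClassifier(max_depth=6, n_estimators=100,
#                          eval_metric='mlogloss').fit(X_new_train, y_train_mapped)
# preds = xg_model.predict(X_new_test)
preds = [0, 2, 1]                                 # stand-in predictions
preds_original = [inverse_map[p] for p in preds]  # back to {-1, 0, 1}
```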



Hope this helps!

Thanks, it worked.



Regards

Saurabh

Hi,



I have a question now that I have completed the capstone project. The capstone project included Bag-of-Words and XGBoost; I added TF-IDF and Word2Vec. I want to clear my doubt, as it will help me explain my results.



The Bag-of-Words + XGBoost accuracy on the given CSV file is 54% 

The TF-IDF + XGBoost accuracy on the given CSV file is 53%

The Word2Vec + XGBoost accuracy on the given CSV file is 42%



Thereafter, I downloaded the news data from NewsAPI and finished building the model.



However, which is the best model? None of the accuracies are good, even after lemmatization, stopword removal, etc. How can I improve the model?



Will appreciate your reply.



Thanks & Regards

Saurabh Kamal



 

Hi Saurabh,



Looking at the accuracy scores, they do look low, but you have to understand the context here.

In the scientific domain, higher accuracy is expected, but in the trading domain you have to go further and look at other metrics as well, including the F1 score, recall, and precision.



Further, you should also backtest and analyse the performance of the strategy.



Having said that, you can also consider other options, like implementing a different ML model, or using an ensemble approach such as a voting classifier or a blending model.
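The voting-classifier idea can be sketched with scikit-learn as follows. The features here are synthetic (from `make_classification`) purely for illustration; in the capstone you would plug in your bag-of-words or TF-IDF features and the {-1, 0, 1} labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the text features + signal labels
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hard-voting ensemble of three different base models: each model votes,
# and the majority class wins
ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('dt', DecisionTreeClassifier(random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting='hard',
)
ensemble.fit(X_train, y_train)
print('ensemble accuracy:', ensemble.score(X_test, y_test))
```

Diverse base models tend to help the most here; three near-identical models will mostly agree and the vote adds little.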



In the domain of machine learning, you can focus on the following areas:

Input: Check if you can get additional data or create other features.

Model: You can tune the hyperparameters and see which ones give good results. You can also try deep learning here. Apply cross-validation to see if it improves the performance.
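The cross-validation step can be sketched like this; again the data is synthetic for illustration, and you would swap in your own features and model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; replace with your bag-of-words / TF-IDF features
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# 5-fold cross-validated accuracy gives a steadier estimate than one split
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=5)
print('mean CV accuracy:', scores.mean())
```

Note that for time-ordered financial data, `TimeSeriesSplit` is usually a safer choice of `cv` than the default shuffled folds, since it avoids training on the future.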



Hope this helps.