This post is regarding Word2Vec to XGBoost.
After writing a few lines of code, I am getting an error that I am not able to understand.
The error appears after the incremental model training and testing step.
Here's the code:
all_data = []
i = 0
for df in chunks:
    df.dropna(inplace=True)
    # Remove common words and tokenize
    i = i + 1
    print(i)
    # Model training on 80% of the dataset
    if i < 9:
        X, y = get_feature_and_target_variable(df)
        if i == 1:
            xg.fit(pd.DataFrame(X), y)
        # Boosted model training
        else:
            xg.fit(pd.DataFrame(X), y, xgb_model=xg.get_booster())
    # Model testing on the remaining 20% of the dataset
    if i >= 9:
        X, y = get_feature_and_target_variable(df)
        X_test = X_test.append(pd.DataFrame(X))
        y_test = np.append(y_test, y)
    if i >= 10:
        break
xg
Here's the error:
NameError: name 'news_headline_column' is not defined
The error points to cell In[7].
Please assist.
Hi Saurabh,
Unfortunately, I could not seem to replicate this error on my system.
It would be great if you could share with me your Jupyter notebook so that I can have a better understanding of what could possibly be the cause of this error.
Looking forward to helping you out with the same.
Hi Kevin,
Thanks for your reply. Here's the link: http://colab.research.google.com/drive/1VcdS7rpMWxPmbog0Bc00RkO1CJ9M4Ob3?usp=sharing
I ran the file in a Jupyter notebook and it is now giving me a different error. I have uploaded the file to Google Colab; you can download it and run it in a Jupyter notebook. I wasn't able to run it on Google Colab because of my slow upload speed, and I cannot upload the "GoogleNews-vectors-negative300-SLIM.bin" file as it is 345 MB.
Your help will be greatly appreciated.
Thanks & Regards
Saurabh
Hey Saurabh,
Thanks for sharing your Jupyter notebook.
The reason behind this error is a difference in the version of the gensim library.
On Quantra, we are currently using version 3.8.3 of the gensim library whereas you are working with version 4.0.0 on your system.
You will find all the details regarding the changelog between the two versions of gensim here.
Now in order to run the code within your notebook, you will have to perform one of the following steps:
Option 1)
Downgrade the gensim library to version 3.8.3 by running the following command:
pip install gensim==3.8.3
Once version 3.8.3 is installed, the code will run without any errors.
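You can confirm the installed version from within the notebook (after restarting the kernel):

import gensim

# Should print 3.8.3 after the downgrade
print(gensim.__version__)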
Option 2)
If you wish to continue using the updated version of gensim (i.e. 4.0.0), you can make the following changes in the code:
In Code Cell 5:
Change "if word in model.vocab:" to "if word in model.key_to_index:"
In Code Cell 7:
Change "if word in word2vec.vocab" to "if word in word2vec.key_to_index:"
Note: This is done because in version 4.0.0, the 'vocab' dict has become 'key_to_index'.
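For reference, here is a minimal version-agnostic sketch of the membership check (the word 'market' is just illustrative; the model-loading step uses the vectors file mentioned earlier in this thread):

import gensim
from gensim.models import KeyedVectors

# Load the pre-trained vectors (file shared earlier in this thread).
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300-SLIM.bin', binary=True)

word = 'market'  # illustrative example word
# gensim 4.x renamed the 'vocab' dict to 'key_to_index'.
if int(gensim.__version__.split('.')[0]) >= 4:
    print(word in model.key_to_index)  # gensim 4.x
else:
    print(word in model.vocab)         # gensim 3.x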
I hope this solves your problem.
Thanks, Kevin, for your response. I used word2vec.key_to_index and it worked, but now I am running the same commands on my own dataset and getting this error:
AttributeError: 'str' object has no attribute 'dropna'
I only changed the dataset; it has the same features, "headline" and "sentiment_class".
Here's the dataset: https://drive.google.com/file/d/1FvmWRYyG2JjiUB-gUCNozEmaABtkn7Hx/view?usp=sharing
Jupyter notebook: https://drive.google.com/file/d/1kAsgCyXlCF2XMsVmm2tlKRVmHCtJ_an-/view?usp=sharing
I am getting the error on this code:
all_data = []
i = 0
for df in chunks:
    df.dropna(inplace=True)
    # Remove common words and tokenize
    i = i + 1
    print(i)
    # Model training on 80% of the dataset
    if i < 9:
        X, y = get_feature_and_target_variable(df)
        if i == 1:
            xg.fit(pd.DataFrame(X), y)
        # Boosted model training
        else:
            xg.fit(pd.DataFrame(X), y, xgb_model=xg.get_booster())
    # Model testing on the remaining 20% of the dataset
    if i >= 9:
        X, y = get_feature_and_target_variable(df)
        X_test = X_test.append(pd.DataFrame(X))
        y_test = np.append(y_test, y)
    if i >= 10:
        break
xg
I really appreciate your reply, but I am not able to understand why I am getting this.
I tried a few things, but they didn't work out.
Thanks & Regards
Saurabh Kamal
Hey Saurabh,
My apologies for the delay in responding.
You can try specifying the 'chunksize' parameter in code cell 4 of your notebook as shown in the image below:
Also, here is a video that will help you better understand the importance of using the 'chunksize' parameter.
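For context: without the chunksize argument, pd.read_csv returns a single DataFrame, and iterating over a DataFrame yields its column names as strings; that is exactly why df.dropna() raises "'str' object has no attribute 'dropna'". A minimal sketch (the file name here is a placeholder for your dataset):

import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames,
# so each `df` in the loop is an actual chunk rather than a column name.
chunks = pd.read_csv('your_dataset.csv',  # placeholder path
                     usecols=['headline', 'sentiment_class'],
                     chunksize=480)

for df in chunks:
    df.dropna(inplace=True)  # now works: df is a DataFrame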
Feel free to reach out if you have any more doubts regarding the same.
I am trying to run a random forest on the same dataset. Even if I remove the "xgb_model=rfm.get_booster()" argument, it gives me the same accuracy as XGBoost's.
rfm.jpg - Google Drive (random forest error)
I am trying the same dataset with logistic regression too.
Hey Saurabh,
I'm sharing a few resources here that will help you understand concepts such as the XGBoost, Random Forest and Logistic Regression algorithms in a detailed manner.
Do consider exploring the same.
Guide to XGBoost with codes in Python: link
Guide for the Random Forest algorithm: link
Guide on Logistic Regression: link
Video on XGBoost and Random Forest and the differences between the two: link
Now coming to the Jupyter Notebook,
You are receiving the following error messages:
'RandomForestClassifier' object has no attribute 'get_booster'
'LogisticRegression' object has no attribute 'get_booster'
Such error messages are bound to arise because neither RandomForestClassifier nor LogisticRegression is based on boosting techniques. You will also find that no such method exists for these two estimators if you read through their documentation carefully.
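If you do want chunk-by-chunk training with scikit-learn's RandomForestClassifier, one possibility (a sketch of my own, not the course code) is warm_start, which keeps the already-fitted trees and grows additional ones on each fit() call:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# warm_start=True preserves existing trees; raising n_estimators before
# each fit() grows extra trees on the new chunk of data.
rfm = RandomForestClassifier(warm_start=True, n_estimators=0, random_state=42)

for df in chunks:                               # chunks as defined earlier
    df.dropna(inplace=True)
    X, y = get_feature_and_target_variable(df)  # helper from the notebook
    rfm.n_estimators += 25
    rfm.fit(pd.DataFrame(X), y)                 # no xgb_model/get_booster here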
Next, you also mentioned that the accuracies returned by Random Forest and XGBoost are the same.
The reason behind this is that the sample size you have chosen is very small (approx. 600).
Because of this, the model is not able to learn effectively and thus ends up predicting zero.
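You can verify this by looking at the distribution of the predicted classes (using the xg and X_test names from your notebook):

import numpy as np

# If the model has collapsed to a single class, only one unique value
# (with its count) will show up here.
preds = xg.predict(X_test)
print(np.unique(preds, return_counts=True))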
A good way to tackle this issue would be to choose a much larger dataset and increase the sample size accordingly.
You can try out the same on the dataset 'news_headline_column', which is provided in the course's downloadable zip file.
chunks = pd.read_csv('data_modules/news_sentiment_data.csv',
                     usecols=['headline', 'sentiment_class'],
                     chunksize=480)
Also note that the ideal value of the chunksize parameter in the above line should be 10% (or less) of the overall size of your dataset.
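One simple way to derive such a value (a sketch, assuming the CSV path from the snippet above):

import pandas as pd

# Count the data rows (header excluded) and take roughly 10% as the chunk size.
n_rows = sum(1 for _ in open('data_modules/news_sentiment_data.csv')) - 1
chunksize = max(1, n_rows // 10)

chunks = pd.read_csv('data_modules/news_sentiment_data.csv',
                     usecols=['headline', 'sentiment_class'],
                     chunksize=chunksize)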
I hope this answers your question.