Trading with Machine Learning: Regression

Section 3 Unit 22

Results from 'Best Fit Variable'

[ -4.82130052  -4.825139    -3.24566061  -9.65282093 -13.82274724]
-8.230913587039538

This is on XAUUSD (Gold) pair data from 1 Jan 2024 to now().

Is this result correct? The example in the unit shows the score as below:

[-0.55125259 -0.35420839 -0.27102945 -0.14834363 -0.14452836]
-0.24203918499284677

I also notice that the data are not 'imputed' in the Jupyter Notebook in this unit. Is this accidental, or is there a specific reason?

Hi Anil,



As you have mentioned, the dataset you have used is different, and hence the values you are getting are different. This is fine because stock prices change as time goes on. The decision on imputing is a matter of choice for the researcher; sometimes it can lead to changes in the statistical properties of the dataset.



Hope this helps. Thanks.

Well, my point was that as per the code/data in the unit, -0.24203918 corresponds to 24%, whereas I got -8.230913587, which would be 823%!

Is such a large difference possible with a different dataset?

Hi Anil,



In this case, your dataset might be very limited, and hence the model is not able to lower the error score. The dataset you are using covers almost 4 months, while the course dataset spans around 6 years. Essentially, you want your dataset to comprise different market regimes so that the model has more data from which to learn and find patterns.



Having a larger dataset might help in this case. 

Hi Rekhit
I am reviving this topic, as I have now upgraded my skills to integrate Python and MQL5.

Attached is the new code, which I have tested on data from Jan 2018 to the current date.
The major issues I am facing are as follows:

  1. yU_MSE is 166.8452886865009 and yD_MSE is 262.21930620451394. Both seem to be on the higher side.

  2. yU_R2 is -0.016384138165332462 and yD_R2 is -0.11694144786269844.
    Both are negative, while gold prices have been rising.

  3. Somehow I have a feeling that I am getting results with the opposite sign (positive shown as negative). May I request you to please review the code and check whether I have used it correctly? This single code file was assembled from multiple code files in the course material.

Jupyter Lab file

The Python package versions are mentioned at the top of the Jupyter file.

Hello,

The code seems mostly fine, but there is one issue: you have transformed the data X_train into a new variable X, yet you are not using this transformed data in your model fitting.
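To illustrate the pattern, here is a minimal sketch with made-up data (StandardScaler and LinearRegression are stand-ins for whatever transform and model the notebook uses):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# illustrative stand-ins for the notebook's feature and target arrays
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(100, 3)), rng.normal(size=(20, 3))
yU_train = rng.normal(size=100)

scaler = StandardScaler()
X = scaler.fit_transform(X_train)   # transformed copy of X_train

reg = LinearRegression()
# reg.fit(X_train, yU_train)        # the issue described above: the transform is never used
reg.fit(X, yU_train)                # intended: fit on the scaled features

# apply the same fitted scaler to the test data before predicting
yU_pred = reg.predict(scaler.transform(X_test))
```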

Since the code is otherwise correct, we have to take a look at the features being used; you can consider other features such as technical indicators, or even market sentiment indicators.

Further, it seems that at times the strategy returns drop while the actual prices increase. This might be because we are using features with a high lag, which are slow to react to recent information.

Hope this helps.

Hello Rekhit
Thanks for the reply.
I have been reviewing the code again, but I failed to pinpoint where I missed using the new variable X.

I have reworked the model with H4 timeframe data. The intention is to get signals from the H4 timeframe and look for entries on a lower timeframe in my MT5 bot.

I am also now focusing on the predicted values of the High and Low, instead of trying to get strategy returns, which would be totally different from whatever strategy I might actually use.

Thus the focus is to get high-quality predictions for the High and Low values. Before I can proceed, I need to clear any calculation errors remaining in the model. I am encountering the following issues:

  1. R-squared: by definition, a statistical measure that represents the goodness of fit of a regression model, whose value is said to lie between 0 and 1. How then can my model return 'negative' values for it?

yU_R2 = -0.028480009931477923 yD_R2 = -0.08523133581310605

  2. Mean Squared Error (MSE): an estimator that measures the average of the squared errors, i.e. the average squared difference between the estimated and true values. It is a risk function corresponding to the expected value of the squared error loss. It is always non-negative, and values close to zero are better.
    yU_MSE = 32.456408251055635 yD_MSE = 43.06940572221259
    Thankfully, on the H4 timeframe these at least came out positive.
    Correct me if my understanding is wrong: should I read the square root of 32.456 ≈ 5.697 as the potential error (in pips) in the predicted High/Low prices of Gold?

Model_HighLow

Edit/Update: I noticed that MSE and R² were calculated on yU & yD (deviations from the Open), while our prediction is the High and Low price.

The MSE / R² on actual vs predicted High / Low are as below:
highMSE = 32.60895876232015 lowMSE = 43.60419093713083
highR2 = 0.9996854402791678 lowR2 = 0.9995723271063687
Now the model seems to be overfitting!

Edit 2: Section 7 Unit 4
“Now we will check for the outliers by plotting Close column of gold_prices . One can use any column of choice but we are using Close for the reference.”

What are we supposed to do if we find any outliers? And is there a way to find such outliers programmatically, instead of visualising them, which goes against the spirit of algo trading?

Hi Anil,

The reason R-squared can be negative is that the model might be fitting poorly on the test data. Consider the formula for R-squared: R² = 1 - (RSS/TSS)

Where:

  • RSS = Residual Sum of Squares = Σ(y_actual - y_predicted)²
  • TSS = Total Sum of Squares = Σ(y_actual - y_mean)²

Thus, if RSS/TSS is more than 1, you will get a negative R-squared, which means that your squared errors are quite high.

One reason could be poor fitting of the model on the test data, or a difference between the training data and the test data. (Think of it this way: a model trained only on a bull run might not perform as well if the test data covers a bear run.) Or the features used in the model could be improved.
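As a quick numerical illustration (not from the course material): sklearn's r2_score goes negative as soon as the predictions are worse than simply predicting the mean, because then RSS > TSS.

```python
from sklearn.metrics import r2_score

y_actual = [1, 2, 3, 4]                 # mean = 2.5, TSS = 5
y_pred_mean = [2.5, 2.5, 2.5, 2.5]      # always predicting the mean: RSS = TSS
y_pred_bad = [4, 3, 2, 1]               # worse than the mean: RSS = 20

print(r2_score(y_actual, y_pred_mean))  # 0.0
print(r2_score(y_actual, y_pred_bad))   # -3.0, i.e. 1 - 20/5
```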

The errors and R-squared are always calculated on the actual and predicted values; this is why we are calculating them on yU and yD.

There are different ways of finding outliers programmatically. You could use the percentile/quartile method, where you treat values beyond, say, the 95th percentile as outliers.

You can also use z-score values, where any value beyond 3 standard deviations is an outlier.
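A minimal sketch of both methods in pandas (the series name and the thresholds are illustrative; in practice the series would be a column such as gold_prices['Close']):

```python
import numpy as np
import pandas as pd

# illustrative price series standing in for the real data
close = pd.Series(np.random.default_rng(1).normal(1900, 20, 1000))

# percentile method: flag values beyond the 5th/95th percentiles
lo, hi = close.quantile([0.05, 0.95])
pct_outliers = close[(close < lo) | (close > hi)]

# z-score method: flag values more than 3 standard deviations from the mean
z = (close - close.mean()) / close.std()
z_outliers = close[z.abs() > 3]
```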


Hi Rekhit
I hope you enjoyed the Holi festival.

As per your suggestion, I have replaced the 3 SMAs with NonLinearRegressionMA, QuadRegressionMA and HullMA, which are comparatively less lagging indicators than the SMAs used earlier. I have also used the PERIOD_H1 timeframe instead of daily, with hourly data from 1 Jan 2017 to 13 Mar 2025. Gold has been in a bullish trend for most of that time, and the period also includes the pre-Covid/Covid/post-Covid phases.

The scores now stand as below:
yU_MSE: 8.2146 yD_MSE: 9.6441
yU_R2: 0.0909 yD_R2: 0.1028
R² is positive now, but on the lower side. I think we need something between 0.50 and 0.90 for better reliability.
However, if you look at the Actual vs Predicted High and Low charts, they look almost perfect (or is that just an illusion?).

The model training period was 2017 to 2022 (gold had mixed trends) and the testing period was 2023 to 2024 (gold bullish).
As we used GridSearchCV with time-series splits, I was of the opinion that this would take care of the different trends during the training period.
Anyway, what is your suggestion to improve further in such cases?

How about replacing NLR & QR MAs with DeMarker and RSI or Stochastic?

Outliers: once I find them, say by z-score, should they be dropped from the training/test data? If yes, will that not affect the continuity of the time series?
What does the scaler (normalisation) method do to these outliers? If a scaler is used, do I still need to worry about outliers?

Attached are the updated code and the gold data in CSV format, in case you need to review them.

Mode_HighLow_v2

Hello,

Yes, it was a good break.
Your approach is in the right direction, considering that the change in indicators is giving better results.

If you have not done it before, you could zoom in on part of the visualisation to check whether the actual and predicted values are really "almost perfect", or whether they only look that way because the chart is zoomed out.
Maybe you can add the indicators DeMarker and RSI or Stochastic, instead of replacing the existing ones.

You are right that removing outlier values will lead to gaps in the series; you could try winsorisation instead.
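A minimal winsorisation sketch using scipy (the series and the 5% limits are just examples): the extreme values are clipped to the percentile boundaries rather than dropped, so no rows go missing.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# illustrative series standing in for the column with outliers
close = pd.Series(np.random.default_rng(2).normal(1900, 20, 1000))

# clip the lowest/highest 5% of values to the 5th/95th percentile values,
# keeping the time series continuous
close_wins = pd.Series(np.asarray(winsorize(close.to_numpy(),
                                            limits=[0.05, 0.05])),
                       index=close.index)
```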

Instead of a 70:30 or 80:20 train-test split, you can try implementing a rolling-window approach where you continuously retrain on recent data; a sketch follows below.
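A rough sketch of the rolling-window idea using sklearn's TimeSeriesSplit (the data, model and window sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# illustrative data; X would be your features, y a target such as yU
rng = np.random.default_rng(3)
X, y = rng.normal(size=(1000, 4)), rng.normal(size=1000)

# max_train_size caps the window so each refit sees only recent data
tscv = TimeSeriesSplit(n_splits=5, max_train_size=250)
scores = []
for train_idx, test_idx in tscv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print(np.mean(scores))  # average out-of-sample MSE across the windows
```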
You can also add a market regime detection component to adjust strategies based on the current trend.


Hi Rekhit
Thanks for the reply.
Can you please elaborate on how I can implement the above two?

Hi,

You can check the blog Cross Validation In Machine Learning Trading Models to get an understanding of how you can use a rolling window to train and test the model. Do note that the blog gives an example of a classification model, so you might have to work out how to adapt it to your regression model.

You can add a feature which indicates whether the market is trending or not. For example, calculate the ADX indicator, and if ADX > 25 you can consider the market (or asset) to be trending; otherwise it is not. This will be one feature in the model.
This is a simplification, but you can try it and see whether it works; see the sketch below.
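A simplified sketch of such a regime feature (this is a hand-rolled, Wilder-style ADX; exact smoothing conventions differ across platforms, and the OHLC frame here is synthetic):

```python
import numpy as np
import pandas as pd

def adx(high, low, close, n=14):
    # directional movement
    up, dn = high.diff(), -low.diff()
    plus_dm = up.where((up > dn) & (up > 0), 0.0)
    minus_dm = dn.where((dn > up) & (dn > 0), 0.0)
    # true range and Wilder-style smoothing via an exponential mean
    tr = pd.concat([high - low,
                    (high - close.shift()).abs(),
                    (low - close.shift()).abs()], axis=1).max(axis=1)
    atr = tr.ewm(alpha=1 / n, adjust=False).mean()
    plus_di = 100 * plus_dm.ewm(alpha=1 / n, adjust=False).mean() / atr
    minus_di = 100 * minus_dm.ewm(alpha=1 / n, adjust=False).mean() / atr
    dx = 100 * (plus_di - minus_di).abs() / (plus_di + minus_di)
    return dx.ewm(alpha=1 / n, adjust=False).mean()

# synthetic OHLC frame standing in for the gold data
rng = np.random.default_rng(4)
c = pd.Series(rng.normal(0, 1, 500)).cumsum() + 1900
df = pd.DataFrame({'Close': c, 'High': c + 1, 'Low': c - 1})

# regime flag: 1 when trending (ADX > 25), 0 otherwise, usable as a feature
df['trending'] = (adx(df['High'], df['Low'], df['Close']) > 25).astype(int)
```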


Hi Rekhit

I have reworked the entire code and found that classification modelling is much better suited to me than linear regression, as prices are generally non-linear.

In the classification model, is it possible to have Buy = +1 / Sell = -1 / None = 0 values?

I have a few courses from QuantInsti and will try each of them to upgrade my skills in Python and ML, and potentially prepare for EPAT in the near future.

Hi Rekhit

I have created a new thread/post for an ATR scalping strategy, and am trying to incorporate the rolling-window cross-validation approach you suggested there.

With this strategy, for the first time I am able to relate to the code, and my understanding of Python is improving. Help me out to complete this as a reliable model which I can deploy on my MT5 platform.

Thanks

That's great, Anil. We are here to guide you.


Hi Rekhit
It seems there is a different team member for each course!

I am still awaiting a reply on my ATRScalp thread; it has been over 4 or 5 days now. It is difficult for me to wait so long, and hence I have come back to this file for predicting the High and Low with the regression model.

I have incorporated the K-Fold data split and am encountering the following error:
NameError                                 Traceback (most recent call last)
Cell In[12], line 16
     13 model_yU = clf.fit(X_train, yU_train)
     14 model_yD = clf.fit(X_train, yD_train)
---> 16 accuracy_score_yU(yU_test, model_yU.predict(X_test), normalize=True)*100
     17 accuracy_score_yD(yD_test, model_yD.predict(X_test), normalize=True)*100
     19 # Append to accuracy_model the accuracy of the model
NameError: name 'accuracy_score_yU' is not defined

I have double-checked that I defined 'accuracy_score_yU', but I am still getting the error. Please help me out. The revised code is attached in the link below.

PredictHighLow_v3

Hi Anil,

Here, the issue is that you are using a decision tree regressor, which is meant for regression problems, while accuracy_score is for classification problems. You should use MSE or R-squared, as you had done earlier.
Thanks.
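For example, a sketch reusing the variable names from your traceback (X_train, yU_train, etc. are assumed to already exist in your notebook):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# note: two separate estimators are used here; in the traceback both fits
# reuse the same clf object, so model_yU and model_yD end up pointing to
# the same (last-fitted) model
model_yU = DecisionTreeRegressor(random_state=5).fit(X_train, yU_train)
model_yD = DecisionTreeRegressor(random_state=5).fit(X_train, yD_train)

print(mean_squared_error(yU_test, model_yU.predict(X_test)),
      r2_score(yU_test, model_yU.predict(X_test)))
print(mean_squared_error(yD_test, model_yD.predict(X_test)),
      r2_score(yD_test, model_yD.predict(X_test)))
```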


Hi Rekhit
The K-Fold method (in the link you gave me) used a classification model, and I was getting errors using it. Then I found the clf = tree.DecisionTreeRegressor(random_state=5) model on Google.

Can you clarify the right approach for using the K-Fold method in the model:
clf = tree.DecisionTreeClassifier(random_state=5), or
clf = tree.DecisionTreeRegressor(random_state=5) ?

When I subscribed to the short course from QuantInsti, I was told that Python knowledge is not a prerequisite and that I would be supported on it. I therefore request you to explain in more detail, please.

Step-by-step guidance (you need not write the code, but can guide me on how to change the model) will save time for both of us, as I am already confused about the next steps of the model.

How and where should I predict the yU and yD values in each run of the split data, given that I need the predicted data to calculate MSE etc.?
Should I use yU_predict = clf.predict(X_test) in each loop?
yU_predict = reg.predict(X_test) will not work now, as 'reg' is no longer available from the pipeline.

Also, do guide me on how I can calculate 'adjusted R²', so that when I add or remove a feature I can gauge its importance.

Hello Anil,

Classification and regression models are built for different purposes.

If you are predicting a continuous variable, like tomorrow's price, you might go for a regression model. If you want to predict whether the price is going up or not, then a classification model can be your choice. Both types of models have their pros and cons, so you have to decide which one to use.

K-Fold can be applied to both types of models.
If you are using a classification model, then you will use the accuracy score.
If you are using a regression model, then you will use MSE and R-squared.

For Python knowledge, you can go through the Python for Basics course on Quantra to understand more about Python, and the Introduction to Machine Learning for Trading course to brush up on the different types of machine learning models.

How and where should I predict the yU and yD values in each run of the split data, given that I need the predicted data to calculate MSE etc.?
If you are using a regression model, then you will predict the values inside the loop where the data is split into training and testing sets. First, train the model separately for yU and yD using the training features and their respective target values. Then, use the trained models to make predictions on the test data.

For evaluation, calculate metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, and Adjusted R-squared. Adjusted R-squared is useful for understanding the importance of features, as it adjusts for the number of predictors in the model. To calculate it, you need the number of test samples and the number of features.

Should I use yU_predict = clf.predict(X_test) in each loop?

Yes. After computing the metrics for each fold, store the values and compute their average at the end to get a final assessment of the model’s performance.

Adjusted R-squared is a metric just like R-squared, so you can try running the model with different numbers of features and then compare the results.
I hope this helps.
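A minimal sketch of such a loop (illustrative data and model; for price data a time-ordered splitter such as TimeSeriesSplit is usually preferable to plain KFold):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# illustrative arrays; X = features, yU = the up-deviation target
rng = np.random.default_rng(5)
X, yU = rng.normal(size=(500, 4)), rng.normal(size=500)

mse_scores, adj_r2_scores = [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=False).split(X):
    model = DecisionTreeRegressor(random_state=5)
    model.fit(X[train_idx], yU[train_idx])
    yU_predict = model.predict(X[test_idx])        # predict inside each fold

    mse_scores.append(mean_squared_error(yU[test_idx], yU_predict))
    r2 = r2_score(yU[test_idx], yU_predict)
    n, p = len(test_idx), X.shape[1]               # test samples, features
    adj_r2_scores.append(1 - (1 - r2) * (n - 1) / (n - p - 1))

# average across folds for the final assessment; repeat the loop for yD
print(np.mean(mse_scores), np.mean(adj_r2_scores))
```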


Hi Rekhit
Thanks for the detailed explanations. However, I could not figure out how to implement them.

After searching Google, I am now able to calculate the MSE and R² for each K-Fold split, along with their averages.

I need support on the following:
[A] In inp[6] I am using 'model = LinearRegression()' straightforwardly.
This is in contrast with the hyperparameter tuning of the regression model via GridSearchCV suggested in the course material.
My assumption is that the MSE/R² values calculated with K-Fold will help me conclude whether the regression model is suitable or not.
Am I correct in this assumption?

[B] With the K-Fold method, we don't create the training and test dataset(s) outside the score-calculation loop.
In inp[10] I am facing a challenge transferring (adding into X) the yU_predict and yD_predict data, with the following error. I understand that I am making a mistake with the row numbers of X and the predicted datasets, but I have failed to figure out how to resolve it. Please help me with this.

[C] Can I use Spyder instead of JupyterLab or Notebook to run the Python scripts?

PredictHighLow with K-Fold method