Trading with Machine Learning : Regression

Section 3 Unit 22

Results from 'Best Fit Variable'

[ -4.82130052  -4.825139    -3.24566061  -9.65282093 -13.82274724]
-8.230913587039538

On XAUUSD Gold pair data from 1st Jan 2024 to now()

Is this result correct? The example in the unit shows the scores below:

[-0.55125259 -0.35420839 -0.27102945 -0.14834363 -0.14452836]
-0.24203918499284677

I also noticed that the data are not 'imputed' in the Jupyter Notebook in this unit. Is that accidental, or is there a specific reason for it?

Hi Anil,

As you have mentioned, the dataset you have used is different, and hence the values you are getting are different. This is fine, because stock prices change as time goes on. The decision on imputing is a matter of choice for the researcher; sometimes it can lead to changes in the statistical properties of the dataset.

Hope this helps. Thanks.

Well, my point was that as per the code/data in the unit, -0.24203918 is about 24%, whereas I got -8.23091359, which would be 823%!

Is such a large difference possible with a different dataset?

Hi Anil,

In this case, your dataset might be very limited, and hence the model is not able to lower the error score. The dataset you are using covers almost 4 months, while the course dataset spans around 6 years. Essentially, you want your dataset to cover different market regimes so that the model has more data to learn from and can find the patterns.

Having a larger dataset might help in this case.

Hi Rekhit
I am reviving this topic, as I have now picked up the skills to integrate Python and MQL5.

Attached is the new code, which I have tested on data from Jan 2018 to the current date.
The major issues I am facing are below:

  1. yU_MSE is 166.8452886865009 and yD_MSE is 262.21930620451394. Both seem to be on the higher side.

  2. yU_R2 is -0.016384138165332462 and yD_R2 is -0.11694144786269844.
    Both are negative, even though Gold prices have been rising.

  3. Somehow I have a feeling that I am getting results with the opposite sign (positive coming out as negative). May I request you to please review the code and check whether I have used it correctly? This single code file has been assembled from multiple code files in the course material.

Jupyter Lab file

Python package versions are mentioned at the top of the Jupyter file.

Hello,

The code is mostly fine, but there is one issue: you have transformed the data X_train into a new variable X, yet you are not using this transformed data in your model fitting. A sketch of the fix is below.
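A minimal sketch of the fix, assuming a StandardScaler transform and a LinearRegression model (the stand-in data and names are illustrative; substitute whatever the notebook actually uses):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Stand-in data so the sketch runs on its own; replace with the notebook's
# actual X_train / y_train / X_test.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 3)), rng.normal(size=100)
X_test = rng.normal(size=(20, 3))

scaler = StandardScaler()
X = scaler.fit_transform(X_train)   # the transformed variable...

model = LinearRegression()
model.fit(X, y_train)               # ...must be the one passed to fit()

# Reuse the already-fitted scaler on the test set before predicting
y_pred = model.predict(scaler.transform(X_test))
```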

Beyond that, it is worth taking a look at the features being used; you can consider other features such as technical indicators, or even market sentiment indicators.

Further, it seems that at times the strategy returns drop while the actual prices increase; this might be because we are using features which have a high lag and are thus slow to react to recent information.

Hope this helps.

Hello Rekhit
Thanks for the reply.
I have reviewed the code again, but failed to pinpoint where I missed using the new variable X.

I have reworked the model with H4 timeframe data. The intention is to get signals from the H4 timeframe and look for entries on a lower timeframe in my MT5 bot.

I am also now focusing on the predicted values of High and Low, instead of trying to get strategy returns, which would be totally different from whatever strategy I might actually use.

Thus the focus is to get high-quality predictions for the High and Low values. Before I can proceed, I need to clear any calculation errors that remain in the model. I am encountering the following issues:

  1. R-squared: by definition, this is a statistical measure that represents the goodness of fit of a regression model, and its value is supposed to lie between 0 and 1. How, then, can my model return negative values for it?

yU_R2 = -0.028480009931477923 yD_R2 = -0.08523133581310605

  2. Mean Squared Error (MSE): an estimator that measures the average of squared errors, i.e. the average squared difference between the estimated values and the true values. It is a risk function corresponding to the expected value of the squared error loss. It is always non-negative, and values close to zero are better.
    yU_MSE = 32.456408251055635 yD_MSE = 43.06940572221259
    Thankfully, on the H4 timeframe I at least got it positive.
    Correct me if my understanding is wrong: should I read the square root of 32.456 ≈ 5.697 as the potential error (in price units) in the predicted High/Low prices of Gold? (See the quick check below.)
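A quick numeric check of that reading, using the yU_MSE value above (the RMSE is in the same units as the target, here price):

```python
import numpy as np

yU_MSE = 32.456408251055635
rmse = np.sqrt(yU_MSE)  # root mean squared error, same units as the High/Low prices
print(rmse)             # ~5.697
```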

Model_HighLow

Edit 1: I noticed that MSE and R² were calculated on yU & yD (deviations from the Open), while our predictions are the High and Low prices.

MSE / R² on actual vs. predicted High/Low are as below:
highMSE = 32.60895876232015 lowMSE = 43.60419093713083
highR2 = 0.9996854402791678 lowR2 = 0.9995723271063687
Now the model seems to be overfitting!

Edit 2: Section 7 Unit 4
“Now we will check for the outliers by plotting the Close column of gold_prices. One can use any column of choice but we are using Close for the reference.”

What are we supposed to do if we find any outliers? Is there a way to find such outliers programmatically, instead of visualising them, which goes against the spirit of algo trading?

Hi Anil,

R-squared can be negative because the model may be fitting poorly on the test data. Consider the formula for R-squared, which is R² = 1 - (RSS/TSS)

Where:

  • RSS = Residual Sum of Squares = Σ(y_actual - y_predicted)²
  • TSS = Total Sum of Squares = Σ(y_actual - y_mean)²

Thus, if RSS/TSS is more than 1, you will get a negative R-squared, which means that your squared errors are quite high.

One reason could be poor fitting of the model on the test data, or a difference between the training data and the test data. (Think of it this way: a model trained only on a bull run might not perform as well if the test data covers a bear run.) Or the features used in the model could be better.
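A small illustration of the mechanics, on made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([1.0, 2.0, 3.0, 4.0])
# Predictions that are worse than simply guessing the mean of y_actual
y_pred = np.array([4.0, 1.0, 5.0, 0.0])

rss = np.sum((y_actual - y_pred) ** 2)           # residual sum of squares = 30
tss = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares = 5
print(1 - rss / tss)               # -5.0: RSS/TSS > 1, so R squared is negative
print(r2_score(y_actual, y_pred))  # same value from scikit-learn
```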

The errors and R-squared are always calculated on the actual and predicted values; this is why we are calculating them on yU and yD.

There are different ways of finding outliers programmatically. You could use the percentile/quartile method, where, for example, you treat values beyond the 95th percentile as outliers.

You can also use z-score values, where any value beyond 3 standard deviations is an outlier.
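A sketch of both methods on a generic price series (the percentile cut-offs and the stand-in data are illustrative choices):

```python
import numpy as np
import pandas as pd

def find_outliers(close: pd.Series):
    # Percentile method: flag values outside the 5th-95th percentile band
    lo, hi = close.quantile(0.05), close.quantile(0.95)
    pct_outliers = close[(close < lo) | (close > hi)]

    # Z-score method: flag values more than 3 standard deviations from the mean
    z = (close - close.mean()) / close.std()
    z_outliers = close[z.abs() > 3]
    return pct_outliers, z_outliers

# Stand-in data; replace with gold_prices['Close']
prices = pd.Series(np.random.default_rng(1).normal(1900, 20, 1000))
pct_out, z_out = find_outliers(prices)
```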


Hi Rekhit
Hope you have enjoyed the Holi festival.

As per your suggestion, I have replaced the 3 SMAs with NonLinearRegressionMA, QuadRegressionMA and HullMA, which are comparatively less lagging indicators than the SMAs used earlier. I have also used the PERIOD_H1 timeframe instead of daily, with hourly data from 1 Jan 2017 to 13 Mar 2025. Gold has been in a bullish trend for most of that time, and the period also includes the pre-Covid/Covid/post-Covid phases.

Scoring data stands as below now:
yU_MSE: 8.2146 yD_MSE: 9.6441
yU_R2: 0.0909 yD_R2: 0.1028
R² is positive now, but on the lower side; I think we need somewhere between 0.50 and 0.90 for better reliability.
However, if you look at the Actual vs. Predicted High and Low charts, they look almost perfect (or is that just an illusion?).

The model training period was 2017 to 2022 (Gold had mixed trends) and the testing period was 2023 to 2024 (Gold bullish).
As we used GridSearchCV with time-series splits, I was of the opinion that this would take care of the different trends during the training period.
Anyway, what is your suggestion for improving further in such cases?

How about replacing NLR & QR MAs with DeMarker and RSI or Stochastic?

Outliers: once I find them, say by z-score, should they be dropped from the training/test data? If yes, will that not break the continuity of the time series?
And what does the scaler (normalisation) do to these outliers? If a scaler is used, do I still need to worry about outliers?

Attached are the updated code and the Gold data in CSV format, in case you need to review them.

Model_HighLow_v2

Hello,

Yes, it was a good break.
Your approach is headed in the right direction, considering that the change in indicators is giving better results.

If you have not done so already, you could zoom in on part of the visualisation to check whether the actual and predicted values really are “almost perfect”, or whether they only look that way because the chart is zoomed out.
Maybe you can add the indicators (DeMarker and RSI or Stochastic) rather than replacing the existing ones.

You are right that removing outlier values will leave gaps in the series; you could try winsorisation instead. A sketch is below.
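A minimal winsorisation sketch with scipy (the 1% limits and stand-in data are illustrative choices):

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Clip the lowest and highest 1% of values to the 1st/99th percentile levels
# instead of dropping them, so the time series stays continuous.
prices = np.random.default_rng(2).normal(1900, 20, 1000)  # stand-in data
prices_w = winsorize(prices, limits=[0.01, 0.01])
```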

Instead of a 70:30 or 80:20 train-test split, you can try implementing a rolling window approach where you continuously retrain on recent data; a sketch follows below.
You can also add a market regime detection component to adjust the strategy based on the current trend.
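A rough sketch of the rolling-window idea (the window sizes, model and stand-in data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def rolling_window_backtest(X, y, train_size=500, test_size=50):
    """Retrain on the most recent `train_size` bars, test on the next
    `test_size` bars, then slide the window forward and repeat."""
    scores = []
    for start in range(0, len(X) - train_size - test_size + 1, test_size):
        tr = slice(start, start + train_size)
        te = slice(start + train_size, start + train_size + test_size)
        model = LinearRegression().fit(X[tr], y[tr])
        scores.append(r2_score(y[te], model.predict(X[te])))
    return scores

# Stand-in data; replace with the actual feature matrix and target
rng = np.random.default_rng(3)
X, y = rng.normal(size=(2000, 4)), rng.normal(size=2000)
print(np.mean(rolling_window_backtest(X, y)))
```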


Hi Rekhit
Thanks for the reply.
Can you please elaborate on how I can implement the above two?

Hi,

You can check the blog Cross Validation In Machine Learning Trading Models to get an understanding of how you can use a rolling window to train and test the model. Do note that the blog gives an example of a classification model, so you may have to adapt it to your regression model.

You can add a feature which indicates whether the market is trending or not. For example, calculate the ADX indicator; if ADX > 25, you can consider the market (or asset) to be trending, otherwise not. This becomes one feature in the model.
This is a simplification, but you can try it and see whether it works. A sketch is below.
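A minimal sketch of such a regime feature. The ADX here uses Wilder-style exponential smoothing, so values may differ slightly from MT5's built-in ADX; the column names and stand-in data are assumptions:

```python
import numpy as np
import pandas as pd

def adx(high, low, close, n=14):
    """ADX via Wilder-style exponential smoothing (alpha = 1/n)."""
    up, down = high.diff(), -low.diff()
    plus_dm = pd.Series(np.where((up > down) & (up > 0), up, 0.0), index=high.index)
    minus_dm = pd.Series(np.where((down > up) & (down > 0), down, 0.0), index=high.index)
    # True range: the largest of the three candidate ranges
    tr = pd.concat([high - low,
                    (high - close.shift()).abs(),
                    (low - close.shift()).abs()], axis=1).max(axis=1)
    atr = tr.ewm(alpha=1 / n, adjust=False).mean()
    plus_di = 100 * plus_dm.ewm(alpha=1 / n, adjust=False).mean() / atr
    minus_di = 100 * minus_dm.ewm(alpha=1 / n, adjust=False).mean() / atr
    dx = 100 * (plus_di - minus_di).abs() / (plus_di + minus_di)
    return dx.ewm(alpha=1 / n, adjust=False).mean()

# Stand-in OHLC data; replace with the actual gold DataFrame
rng = np.random.default_rng(4)
close = pd.Series(1900 + rng.normal(0, 5, 300).cumsum())
df = pd.DataFrame({'High': close + 2, 'Low': close - 2, 'Close': close})
df['is_trending'] = (adx(df['High'], df['Low'], df['Close']) > 25).astype(int)
```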


Hi Rekhit

I have reworked the entire code and found that classification modelling is much better suited to me than linear regression, as prices are generally non-linear.

In the classification model, is it possible to have Buy = +1 / Sell = -1 / None = 0 values?
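Something like the sketch below is what I have in mind (the 0.2% threshold for the 'None' zone is just an illustrative choice):

```python
import numpy as np
import pandas as pd

# Stand-in close prices; replace with the gold Close column
close = pd.Series(1900 + np.random.default_rng(5).normal(0, 5, 500).cumsum())

future_ret = close.pct_change().shift(-1)  # next bar's return
threshold = 0.002                          # illustrative 0.2% dead zone

signal = pd.Series(0, index=close.index)   # None = 0 by default
signal[future_ret > threshold] = 1         # Buy  = +1
signal[future_ret < -threshold] = -1       # Sell = -1
```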

I have a few courses from QuantInsti and will work through each of them to upgrade my skills in Python and ML, and potentially prepare for EPAT in the near future.

Hi Rekhit

I have created a new thread/post for the ATR Scalping strategy, and I am trying to incorporate the rolling window cross-validation approach you suggested there.

With this strategy, for the first time I am able to connect with the code and improve my understanding of Python. Please help me complete it as a reliable model that I can deploy on my MT5 platform.

Thanks

That’s great, Anil. We are here to guide you.


Hi Rekhit
Please check the new thread at
Thread for ATRScalp strategy