Course Name: Trading with Machine Learning: Regression

I have been trying to improve the predictions for the daily spread between the High and Open, and between the Open and Low. In the course they are called Std_U and Std_D (actual values), and Max_U and Max_D (predicted values). The Highs resulting from these predictions are calculated by adding the predicted (High - Open) spread (Max_U) to the actual Open, and the Lows by subtracting the predicted (Open - Low) spread (Max_D) from the actual Open. When viewing the plots of the High prices compared to the predicted High prices, the results look very impressive and the R2 score is around 99%, which is extraordinary. However, if I calculate the metrics (R2 score) or plot the actual (High - Open) spread (Std_U) against the predicted (High - Open) spread (Max_U), the results don't look good. Here are the R2 scores for the predicted High (actual Open + Max_U), the predicted Low (actual Open - Max_D), Max_U and Max_D.



pred high r2 score: 99.45%

pred low r2 score: 99.48%



Max_U r2 score: -18.74%

Max_D r2 score: 4.79%



Here is the code that I am using, which is pretty much what was provided in the course, except that I retrieve Gold futures prices from Yahoo Finance and add a few "metrics" commands. In the course ML for Options trading, Section 19, Unit 9, there is an example of implied volatility prediction using a Random Forest regressor, and the results are very good. However, when it comes to predicting this High-Open spread, I just don't manage to get good results at all. I have tried different regression models, tree-based models including Random Forest and XGBoost, and even some basic neural network models like MLPRegressor and LSTM, and still I don't manage to get good results. Is there a model that can be used to predict that spread, or is it not possible to predict it? Any feedback will be highly appreciated.


import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

# Download Gold futures daily data from Yahoo Finance
Df = yf.download("GC=F", start="2021-01-01")

# Newer yfinance versions return MultiIndex columns; flatten them if present
if isinstance(Df.columns, pd.MultiIndex):
    Df.columns = Df.columns.droplevel(1)

Df = Df.dropna()

# Create the target spreads (up-move and down-move from the Open)
Df['Std_U'] = Df['High'] - Df['Open']
Df['Std_D'] = Df['Open'] - Df['Low']

# Lagged moving averages of the Close (shifted so only past data is used)
Df['S_3'] = Df['Close'].shift(1).rolling(window=3).mean()
Df['S_15'] = Df['Close'].shift(1).rolling(window=15).mean()
Df['S_60'] = Df['Close'].shift(1).rolling(window=60).mean()

# Open-to-open change, overnight gap, and a rolling price/MA correlation
Df['OD'] = Df['Open'] - Df['Open'].shift(1)
Df['OL'] = Df['Open'] - Df['Close'].shift(1)
Df['Corr'] = Df['Close'].shift(1).rolling(window=10).corr(Df['S_3'].shift(1))

X = Df[['Open', 'S_3', 'S_15', 'S_60', 'OD', 'OL', 'Corr']]
yU = Df['Std_U']
yD = Df['Std_D']


# Imputer to replace the NaN values left by the rolling windows
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Define the steps for the pipeline
steps = [('imputation', imp), ('scaler', StandardScaler()),
         ('linear', LinearRegression())]

# Define the pipeline to execute the steps
pipeline = Pipeline(steps)

# Grid over the intercept option (booleans; newer scikit-learn versions
# reject the 0/1 values used in the original course code)
parameters = {'linear__fit_intercept': [False, True]}

# Split the data into train and test sets
t = 0.8
split = int(t*len(Df))


####### For yU ######
reg = GridSearchCV(pipeline, parameters, cv=5)

# Fit regression equation for yU
reg.fit(X[:split], yU[:split])

# Best fit variable for yU
best_fit = reg.best_params_['linear__fit_intercept']

# Linear regression for yU
reg = LinearRegression(fit_intercept=best_fit)

# Impute NaN values, fitting the imputer on the training rows only
# so no test-set information leaks into the transform
imp.fit(X[:split])
X = imp.transform(X)

# Fit the model for yU
reg.fit(X[:split], yU[:split])

# Make prediction for yU
yU_predict = reg.predict(X[split:])

####### End yU ######

####### For yD ######
reg = GridSearchCV(pipeline, parameters, cv=5)

# Fit regression equation for yD
reg.fit(X[:split], yD[:split])

# Best fit variable for yD
best_fit = reg.best_params_['linear__fit_intercept']

# Linear regression for yD
reg = LinearRegression(fit_intercept=best_fit)

# X was already imputed above, so no further imputation is needed for yD

# Fit the model for yD
reg.fit(X[:split], yD[:split])

# Make prediction for yD
yD_predict = reg.predict(X[split:])

####### End yD ######


# Reset the index so rows align with the integer split position
Df.reset_index(inplace=True)

# Create prediction columns, filled for the test rows only
Df['Max_U'] = 0.0
Df['Max_D'] = 0.0
Df.loc[Df.index >= split, 'Max_U'] = yU_predict
Df.loc[Df.index >= split, 'Max_D'] = yD_predict

# A spread cannot be negative, so floor the predictions at zero
Df.loc[Df['Max_U'] < 0, 'Max_U'] = 0
Df.loc[Df['Max_D'] < 0, 'Max_D'] = 0

# Predicted High/Low: the previous day's predicted spread applied to
# today's actual Open (as in the course code)
Df['P_H'] = Df['Open'] + Df['Max_U'].shift(1)
Df['P_L'] = Df['Open'] - Df['Max_D'].shift(1)


# Check the accuracy of the model
## Real values in the range [split:]
yU_test = yU[split:]
yD_test = yD[split:]

## Predicted values in the range [split:] 
yUpred = yU_predict
yDpred = yD_predict

yU_r2_score = r2_score(yU_test, yUpred)*100
yD_r2_score = r2_score(yD_test, yDpred)*100

pH_r2_score = r2_score(Df['High'].iloc[split:], Df['P_H'].iloc[split:])*100
pL_r2_score = r2_score(Df['Low'].iloc[split:], Df['P_L'].iloc[split:])*100

print(f'pred high r2 score: {pH_r2_score:.2f}%')
print(f'pred low r2 score: {pL_r2_score:.2f}%')
print()
print(f'Max_U r2 score: {yU_r2_score:.2f}%')
print(f'Max_D r2 score: {yD_r2_score:.2f}%')


# Compare the actual High with the predicted High
fig, ax = plt.subplots(figsize=(17, 8))
plt.plot(Df['P_H'].iloc[split:], label="Predicted")
plt.plot(Df['High'].iloc[split:], label="Actual")
plt.xlabel("Date")
plt.ylabel("High")
plt.title("Predicted vs Actual High for Gold futures (GC=F)")
plt.legend()

# Compare the actual Low with the predicted Low
fig, ax = plt.subplots(figsize=(17, 8))
plt.plot(Df['P_L'].iloc[split:], label="Predicted L")
plt.plot(Df['Low'].iloc[split:], label="Actual L")
plt.xlabel("Date")
plt.ylabel("Low")
plt.title("Predicted vs Actual Low for Gold futures (GC=F)")
plt.legend()

# Compare the actual (High - Open) spread with its prediction
fig, ax = plt.subplots(figsize=(17, 8))
plt.plot(Df['Max_U'].iloc[split:], label="Predicted Max U")
plt.plot(Df['Std_U'].iloc[split:], label="Actual Std U")
plt.xlabel("Date")
plt.ylabel("Max U")
plt.title("Predicted vs Actual (High - Open) spread for Gold futures (GC=F)")
plt.legend()

# Compare the actual (Open - Low) spread with its prediction
fig, ax = plt.subplots(figsize=(17, 8))
plt.plot(Df['Max_D'].iloc[split:], label="Predicted Max D")
plt.plot(Df['Std_D'].iloc[split:], label="Actual Std D")
plt.xlabel("Date")
plt.ylabel("Max D")
plt.title("Predicted vs Actual (Open - Low) spread for Gold futures (GC=F)")
plt.legend()

plt.show()

Hey Arturo,



Since we're using the actual Open to construct the predicted High/Low, the predicted prices will always be very close to the actual values.



For example:

Actual Open = 150

Actual High-Open spread (Std_U) = 10 → Actual High = 160

Predicted High-Open spread (Max_U) = 8 → Predicted High = 158



The difference between the actual High (160) and predicted High (158) is small, resulting in a good R2 score for the High price predictions, even though the spread prediction is clearly off when you look at it on its own.



The R2 score measures how much of the target's variance the model explains relative to a simple mean baseline. The High varies over a wide range across the test period, and the actual Open already accounts for almost all of that variation, so adding even a rough spread estimate to the Open produces a very high R2 on the High. The spread itself only varies by a few dollars, so the same absolute errors explain little of its variance, and the R2 evaluated directly on the spread comes out much lower, or even negative, as in your case.
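
To see this concretely, here is a small self-contained illustration with made-up numbers (not your Gold data): even a completely uninformative spread "prediction" scores near 100% on the High once the actual Open is added back in.

import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic data: Opens drift over a wide range, spreads stay small and noisy
open_px = 150 + np.cumsum(rng.normal(0, 2, 250))  # actual Opens
spread = np.abs(rng.normal(10, 3, 250))           # actual High - Open spread
high = open_px + spread                           # actual Highs

# A deliberately uninformative prediction: always the mean spread
spread_pred = np.full_like(spread, spread.mean())
high_pred = open_px + spread_pred

print(f"R2 on the spread: {r2_score(spread, spread_pred):.2%}")  # exactly 0%
print(f"R2 on the High:   {r2_score(high, high_pred):.2%}")      # close to 100%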



The difference between the predicted and actual spread will look much greater when viewed in isolation, even though the impact on the overall price is not as great. The end objective of predicting the spread is to get the price value, which is why the evaluation is done directly on the prices instead of the spread.



If you are still keen on improving the spread prediction, you can consider adding features that are more closely aligned with volatility, such as the Average True Range (ATR); a minimal sketch follows.
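
A quick sketch of how such a feature could be computed with pandas (my own simplified version, using a plain rolling mean of the true range rather than Wilder's smoothing, shifted by one day so only past information feeds the model):

import pandas as pd

# Simplified ATR: rolling mean of the true range over n days
def add_atr(df: pd.DataFrame, n: int = 14) -> pd.DataFrame:
    prev_close = df['Close'].shift(1)
    true_range = pd.concat([
        df['High'] - df['Low'],
        (df['High'] - prev_close).abs(),
        (df['Low'] - prev_close).abs(),
    ], axis=1).max(axis=1)
    # Shift by one day so the feature only uses past information
    df['ATR'] = true_range.rolling(window=n).mean().shift(1)
    return df

# Usage with the DataFrame from your posted code:
# Df = add_atr(Df)
# X = Df[['Open', 'S_3', 'S_15', 'S_60', 'OD', 'OL', 'Corr', 'ATR']]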



Hope you find this helpful!



Thanks

Rushda

Hi Rushda,



Thanks for your reply. I fully understand how the R2 score is calculated and how the scale plays a big role in the values you get; I calculated it directly on the predicted against the actual High-Open spread precisely to get a better picture of the spread prediction itself. The code I pasted is what was provided in the course, which I used as a baseline to continue trying to predict that spread with higher accuracy.

I have added many different features in my attempts to get a higher R2 score (on the spread, not the price): moving averages, ATR, ADX, RSI, volume-based indicators, Bollinger Bands, and Z-scores, not only on the price itself but also directly on the spread. So far the best R2 score I have managed to get is around 14%, using the RandomForestRegressor and several different combinations of the features mentioned above.

As I mentioned before, the ML for Options trading course has an example of implied volatility prediction using Random Forest, and the R2 score on that prediction is around 90%, which would be very good if I managed to get it on the High-Open spread prediction. However, when I apply that code to the High-Open spread, I get a maximum R2 score of 14%, which I cannot use for the day trading strategy I am working on. I am trying to get some ideas on what else I can try to improve the accuracy of this High-Open spread prediction. Would you be able to help me with that?

Hey Arturo,



It’s great to see that you've experimented with features like ADX, ATR, etc. and managed to improve your R2 score from a negative value to 14%.



While this model performed well when predicting implied volatility in the ML for Options trading course, it's important to note that predicting spreads, especially over shorter timeframes, is inherently more challenging due to the high level of market noise and randomness.



One approach you might want to consider is conducting a feature importance analysis to identify which features have the most impact on your predictions, and then selecting the final feature set based on that (a rough sketch is below).
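
For example, something along these lines, assuming the imputed feature matrix X, the target yU, and the split index from your posted code (scikit-learn's random forests do not accept NaNs, so the imputation step matters):

from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

feature_names = ['Open', 'S_3', 'S_15', 'S_60', 'OD', 'OL', 'Corr']

rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X[:split], yU[:split])

# Impurity-based importances: fast, but can favour high-variance features
for name, score in sorted(zip(feature_names, rf.feature_importances_), key=lambda p: -p[1]):
    print(f'{name}: {score:.3f}')

# Permutation importance on the held-out rows is usually a more honest check
perm = permutation_importance(rf, X[split:], yU[split:], n_repeats=10, random_state=42)
for name, score in sorted(zip(feature_names, perm.importances_mean), key=lambda p: -p[1]):
    print(f'{name}: {score:.3f}')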



Hope this helps!



Thanks

Rushda

Rushda,



Thanks for the suggestion, but I have already done a Principal Component Analysis and run the model with that, and I did not get much better results than before. I appreciate that trying to predict this spread is very challenging, so the question is whether it is possible to predict it with some other model, or whether it will not be possible no matter what model I try.



Regards, Arturo

Hey Arturo,



Thank you for sharing that you've already tried PCA. It's great that you've explored different techniques.



Another approach you can consider is exploring ensemble methods, which can improve performance by combining the strengths of multiple models; a minimal sketch is below.
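
For instance, a stacked ensemble along these lines (again assuming the X, yU, and split from your earlier code, with NaNs already imputed; the base learners and meta-model here are just illustrative choices):

from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

# Base learners feed their out-of-fold predictions into a Ridge meta-model
stack = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=200, random_state=42)),
        ('gbr', GradientBoostingRegressor(random_state=42)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X[:split], yU[:split])
yU_stack_pred = stack.predict(X[split:])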



Hope you find this helpful!



Thanks

Rushda