I have been trying to improve the predictions for the daily spread between the High and the Open, and between the Open and the Low. In the course they are called Std_U and Std_D (actual values), and Max_U and Max_D (predicted values). The highs resulting from these predictions are calculated by adding the predicted (High - Open) spread (Max_U) to the actual Open, and the lows by subtracting the predicted (Open - Low) spread (Max_D) from the actual Open.
When I plot the actual High prices against the predicted High prices, the results look very impressive and the R2 score is around 99%, which is extraordinary. However, if I calculate the R2 score for the actual (High - Open) spread (Std_U) against the predicted (High - Open) spread (Max_U), or plot one against the other, the results don't look good. Here are the R2 scores for the predicted High (actual Open + Max_U), the predicted Low (actual Open - Max_D), Max_U, and Max_D:
pred high r2 score: 99.45%
pred low r2 score: 99.48%
Max_U r2 score: -18.74%
Max_D r2 score: 4.79%
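One thing that may be worth checking is how much of the 99% comes from the Open itself. Since predicted High = Open + Max_U and actual High = Open + Std_U, the two comparisons have exactly the same errors (Std_U - Max_U). Only the variance they are measured against differs: the variance of the price level versus the variance of the spread. Below is a minimal sketch on synthetic data. The random-walk prices, the spread distribution, and all variable names are assumptions for illustration, not the Gold futures data, and `r2` is a small helper written to the usual 1 - SS_res / SS_tot definition.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 1000

# Hypothetical open prices: a random walk starting near 1800
open_ = 1800 + np.cumsum(rng.normal(0, 15, n))
# Hypothetical actual (High - Open) spread: non-negative, mean about 10
spread = rng.exponential(10, n)
high = open_ + spread
# A spread "prediction" with no information at all: independent noise
pred_spread = rng.exponential(10, n)

r2_high = r2(high, open_ + pred_spread)   # level comparison
r2_spread = r2(spread, pred_spread)       # spread comparison
r2_naive = r2(high, open_)                # "predict" High = Open, no model

print(f"R2 of Open + predicted spread vs High: {r2_high:.4f}")
print(f"R2 of predicted spread vs actual spread: {r2_spread:.4f}")
print(f"R2 of Open alone vs High: {r2_naive:.4f}")
```

If the same pattern appears on the real data (the Open alone scoring close to the predicted High), the R2 on the price level says little about the model, and the R2 on the spread is the more informative number.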
In the course ML for Options trading, Section 19, Unit 9, there is an example of implied volatility prediction using a Random Forest regressor, and the results are very good. When it comes to predicting the High - Open spread, however, I can't get good results at all. I have tried different regression models, decision-tree models including Random Forest and XGBoost, and even some basic neural network models such as MLPRegressor and LSTM, all without success. Is there a model that can be used to predict that spread, or is it not possible to predict it? Any feedback will be highly appreciated.
Here is the code that I am using. It is pretty much what was provided in the course, except that I retrieve Gold futures prices from Yahoo Finance and add a few metrics calculations:
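import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Download daily Gold futures prices from Yahoo Finance
Df = yf.download("GC=F", start="2021-01-01")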
Df = Df.dropna()
# Create input parameters
Df['Std_U'] = Df['High']-Df['Open']
Df['Std_D'] = Df['Open']-Df['Low']
Df['S_3'] = Df['Close'].shift(1).rolling(window=3).mean()
Df['S_15'] = Df['Close'].shift(1).rolling(window=15).mean()
Df['S_60'] = Df['Close'].shift(1).rolling(window=60).mean()
Df['OD'] = Df['Open']-Df['Open'].shift(1)
Df['OL'] = Df['Open']-Df['Close'].shift(1)
Df['Corr'] = Df['Close'].shift(1).rolling(window=10).corr(Df['S_3'].shift(1))
X = Df[['Open', 'S_3', 'S_15', 'S_60', 'OD', 'OL', 'Corr']]
yU = Df['Std_U']
yD = Df['Std_D']
# Define the imputer to replace NaN values with the desired strategy.
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# Define the steps for the pipeline
steps = [('imputation', imp), ('scaler', StandardScaler()),
('linear', LinearRegression())]
# Define the pipeline to execute the steps
pipeline = Pipeline(steps)
# Define the parameter grid for the search. linear__fit_intercept takes booleans,
# so the grid was changed from [0, 1] to [False, True]
parameters = {'linear__fit_intercept': [False, True]}
# Split the data between Train and test
t = 0.8
split = int(t*len(Df))
####### For yU ######
reg = GridSearchCV(pipeline, parameters, cv=5)
# Fit regression equation for yU
reg.fit(X[:split], yU[:split])
# Best fit variable for yU
best_fit = reg.best_params_['linear__fit_intercept']
# Linear regression for yU
reg = LinearRegression(fit_intercept=best_fit)
# Impute NaN values in X (the imputer ignores the y argument); X becomes a NumPy array
X = imp.fit_transform(X, yU)
# Fit the model for yU
reg.fit(X[:split], yU[:split])
# Make prediction for yU
yU_predict = reg.predict(X[split:])
####### End yU ######
####### For yD ######
reg = GridSearchCV(pipeline, parameters, cv=5)
# Fit regression equation for yD
reg.fit(X[:split], yD[:split])
# Best fit variable for yD
best_fit = reg.best_params_['linear__fit_intercept']
# Linear regression for yD
reg = LinearRegression(fit_intercept=best_fit)
# Re-run the imputer on X (already imputed in the yU step, so no NaN values remain)
X = imp.fit_transform(X, yD)
# Fit the model for yD
reg.fit(X[:split], yD[:split])
# Make prediction for yD
yD_predict = reg.predict(X[split:])
####### End yD ######
# Reset the index
Df.reset_index(inplace=True)
# Create new columns
Df['Max_U'] = 0.0
Df['Max_D'] = 0.0
Df.loc[Df.index >= split, 'Max_U'] = yU_predict
Df.loc[Df.index >= split, 'Max_D'] = yD_predict
Df.loc[Df['Max_U'] < 0, 'Max_U'] = 0
Df.loc[Df['Max_D'] < 0, 'Max_D'] = 0
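# Predicted High and Low. Note: shift(1) pairs each day's Open with the previous row's predicted spread
Df['P_H'] = Df['Open']+Df['Max_U'].shift(1)
Df['P_L'] = Df['Open']-Df['Max_D'].shift(1)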
# Check the accuracy of the model
## Real values in the range [split:]
yU_test = yU[split:]
yD_test = yD[split:]
## Predicted values in the range [split:]
yUpred = yU_predict
yDpred = yD_predict
yU_r2_score = r2_score(yU_test, yUpred)*100
yD_r2_score = r2_score(yD_test, yDpred)*100
pH_r2_score = r2_score(Df['High'].iloc[split:], Df['P_H'].iloc[split:])*100
pL_r2_score = r2_score(Df['Low'].iloc[split:], Df['P_L'].iloc[split:])*100
print(f'pred high r2 score: {pH_r2_score:.2f}%')
print(f'pred low r2 score: {pL_r2_score:.2f}%')
print()
print(f'Max_U r2 score: {yU_r2_score:.2f}%')
print(f'Max_D r2 score: {yD_r2_score:.2f}%')
# Plot the actual High against the predicted High.
fig, ax = plt.subplots(figsize=(17, 8))
plt.plot(Df['P_H'].iloc[split:], label="Predicted")
plt.plot(Df['High'].iloc[split:], label="Actual")
plt.xlabel("Date")
plt.ylabel("GC=F High")
plt.title("Predicted vs Actual High for GC=F")
plt.legend()
# Plot the actual Low against the predicted Low.
fig, ax = plt.subplots(figsize=(17, 8))
plt.plot(Df['P_L'].iloc[split:], label="Predicted L")
plt.plot(Df['Low'].iloc[split:], label="Actual L")
plt.xlabel("Date")
plt.ylabel("GC=F Low")
plt.title("Predicted vs Actual Low for GC=F")
plt.legend()
# Plot the actual (High - Open) spread against the predicted spread.
fig, ax = plt.subplots(figsize=(17, 8))
plt.plot(Df['Max_U'].iloc[split:], label="Predicted Max U")
plt.plot(Df['Std_U'].iloc[split:], label="Actual Std U")
plt.xlabel("Date")
plt.ylabel("GC=F Max U")
plt.title("Predicted vs Actual (High - Open) spread for GC=F")
plt.legend()
# Plot the actual (Open - Low) spread against the predicted spread.
fig, ax = plt.subplots(figsize=(17, 8))
plt.plot(Df['Max_D'].iloc[split:], label="Predicted Max D")
plt.plot(Df['Std_D'].iloc[split:], label="Actual Std D")
plt.xlabel("Date")
plt.ylabel("GC=F Max D")
plt.title("Predicted vs Actual (Open - Low) spread for GC=F")
plt.legend()
# Display all figures when running as a script
plt.show()
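When comparing candidate models on the spread itself, it may help to score each one against a constant-mean baseline using time-ordered folds, so that every fold trains only on data that precedes its test block. The sketch below uses scikit-learn's `TimeSeriesSplit`, `cross_val_score`, and `DummyRegressor` on synthetic data. The features, the target, and the names `X_demo` and `y_demo` are assumptions for illustration and are not the features used above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(42)
n = 600

# Synthetic stand-ins (assumptions): three features and a spread-like target
# that depends only weakly on the first feature
X_demo = rng.normal(size=(n, 3))
y_demo = 10 + 3 * X_demo[:, 0] + rng.normal(0, 5, n)

# Each fold trains on an initial segment and tests on the block that follows it
cv = TimeSeriesSplit(n_splits=5)

# Baseline: always predict the mean of the training targets
baseline_scores = cross_val_score(
    DummyRegressor(strategy="mean"), X_demo, y_demo, cv=cv, scoring="r2"
)
# Candidate model, scored on the same folds
model_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2"
)

print("baseline R2 per fold:", np.round(baseline_scores, 3))
print("model R2 per fold:   ", np.round(model_scores, 3))
```

A constant prediction can never score above zero on R2, so a model whose out-of-sample R2 on the spread is negative, like the Max_U score above, is doing worse than simply predicting the average spread from the training period.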