Hello Anil,
Classification and regression models are built for different purposes.
If you want to predict a continuous variable, like tomorrow's price, you might go for a regression model. If you want to predict whether the price will go up or not, then a classification model can be your choice. Both types of models have their pros and cons, so you will have to decide which one suits your goal.
K-fold cross-validation can be applied to both types of models.
If you are using a classification model, you would typically evaluate it with the accuracy score.
If you are using a regression model, you would typically evaluate it with MSE and R-squared.
To learn more Python, you can go through the Python for Basics course on Quantra. To brush up on the different types of machine learning models, you can go through the Introduction to Machine Learning for Trading course on Quantra.
How and where should I predict yU and yD values in each run of split data, as to calculate MSE etc I need the predicted data?
If you are using a regression model, you predict the values inside the loop where the data is split into training and testing sets. First, train a separate model for yU and for yD using the training features and their respective target values. Then, use each trained model to make predictions on the test data of that fold.
For evaluation, calculate metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared, and Adjusted R-squared. Adjusted R-squared is useful for comparing models with different numbers of features, because it adjusts for the number of predictors in the model. To calculate it, you need the R-squared value, the number of test samples, and the number of features.
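As a small illustration of the adjusted R-squared calculation mentioned above, here is a minimal sketch. The function name adjusted_r2 and the example numbers are assumptions made for illustration; they are not from your code.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n is the number of samples the R^2 was computed on (here, the test fold)
    and p is the number of features used by the model.
    """
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical example: R^2 of 0.80 on 50 test samples with 5 features
print(adjusted_r2(0.80, 50, 5))
```

The result is always less than or equal to R-squared (when R-squared is at most 1), and the gap widens as you add features without adding samples.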
Should I use yU_predict = clf.predict(X_test) in each loop?
Yes. After computing the metrics for each fold, store the values and compute their average at the end to get a final assessment of the model's performance.
Adjusted R-squared is a metric just like R-squared, so you can try running the model with different numbers of features and then compare the results.
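To put the steps above together, here is a rough sketch of the loop, assuming scikit-learn and NumPy. The synthetic data, the choice of LinearRegression, and the helper name evaluate_kfold are assumptions for illustration only; replace them with your own features, targets, and model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): replace with your own X, yU and yD.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
yU = X @ np.array([0.5, -0.2, 0.1, 0.0]) + rng.normal(scale=0.1, size=200)
yD = X @ np.array([-0.3, 0.4, 0.0, 0.2]) + rng.normal(scale=0.1, size=200)


def evaluate_kfold(X, y, n_splits=5):
    """Return MSE, RMSE, R-squared and adjusted R-squared averaged over folds."""
    # shuffle=False keeps each test fold a contiguous block of rows
    kf = KFold(n_splits=n_splits, shuffle=False)
    scores = {"mse": [], "rmse": [], "r2": [], "adj_r2": []}
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Train on this fold's training rows, predict on its test rows
        model = LinearRegression()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Store this fold's metrics
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
        n, p = X_test.shape  # test samples, number of features
        scores["mse"].append(mse)
        scores["rmse"].append(np.sqrt(mse))
        scores["r2"].append(r2)
        scores["adj_r2"].append(1 - (1 - r2) * (n - 1) / (n - p - 1))

    # Average each metric across folds for the final assessment
    return {name: float(np.mean(vals)) for name, vals in scores.items()}


# One separate evaluation (and model) per target
results_U = evaluate_kfold(X, yU)
results_D = evaluate_kfold(X, yD)
print("yU:", results_U)
print("yD:", results_D)
```

The prediction line inside the loop plays the same role as yU_predict = clf.predict(X_test) in your question; it just runs once per fold and once per target.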
I hope this helps.