Concern over ML teaching: cross-validation for time series?

Stephen_Miller_5Ebtj · July 14, 2021, 8:14am

Hello, I have purchased a number of courses over the year (10); some are good, some are thin for the money. I have recently dived into the ML courses. I want to know why, for investing/trading, you are teaching or showing cross-validation methods such as k-fold and grid search. After a little digging, it seems this is not an appropriate method for time series data at all. You should be showing the code for TimeSeriesSplit or some form of day-chaining, walk-forward process. Any model fit with randomized CV is useless, as far as I am now led to believe. I feel having this in the course is very misleading and, to be honest, it makes me question what else is accurate.

Satyapriya_Chaudhari_63X1f · July 14, 2021, 12:13pm

Hi Stephen,

I will try to explain here why cross-validation is helpful in trading.

Intuitively, cross-validation might not make complete sense for time series data, and one limitation of this approach is that the measured performance of the strategy might not be historically accurate. However, the technique is used here for tuning the hyperparameters of the model.

There are different approaches to selecting these parameters. You can choose them based on your experience or research, you can make a random guess, or you can use cross-validation or another ML technique. Going by experience and research has its own benefits and limitations, and a random guess might not always be a good option. Parameters selected using cross-validation, however, are supported by data.

This technique is used solely for hyperparameter tuning, not for making any forecast or trading decision. Forecasts and trading decisions can only be made on the data that follows the train data. If you look at the regression trading strategy, that is how it has been done. The first step is to split the data into two parts: train and test. Using cross-validation on the train data, we arrive at a generalised model. The forecast is then made on the test data, which the model has not yet seen.
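A minimal sketch of the chronological train/test split described here, assuming the rows are already sorted oldest to newest. The helper name `chronological_split`, the 80/20 ratio and the price values are illustrative assumptions, not the course's code.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows so every test row comes after every train row."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cutoff = int(len(rows) * train_fraction)
    return rows[:cutoff], rows[cutoff:]


# Hypothetical daily closing prices, oldest first.
prices = [100.0, 101.5, 99.8, 102.3, 103.1, 102.7, 104.0, 105.2, 104.4, 106.1]
train, test = chronological_split(prices, train_fraction=0.8)
```

Hyperparameter tuning would then touch only `train`, while `test` stays aside for the final out-of-sample check.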

Along with the cross-validation technique, you can also use a time series split to split the data sequentially. You can explore the classification trading strategy, where RandomizedSearchCV is used to find the best parameters.
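One way to combine the two ideas is to pass a `TimeSeriesSplit` object as the `cv` argument of `RandomizedSearchCV`, so each candidate is scored only on folds that follow its training data. This is a sketch on synthetic data, not the code from the classification trading strategy; the estimator, the search space and the data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for lagged features and an up/down label, oldest row first.
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Hypothetical search space; the values are illustrative only.
param_distributions = {
    "max_depth": [2, 3, 5, 8],
    "min_samples_leaf": [5, 10, 20, 40],
}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=6,                        # number of random candidates to try
    cv=TimeSeriesSplit(n_splits=5),  # each fold trains on the past only
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

The only change from a default randomized search is the `cv` argument; with an integer such as `cv=5`, sklearn would fall back to a k-fold style split for this classifier.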

Having said all this, I would add that these are just techniques for finding the optimum strategy. It is extremely important to backtest and verify the strategy on unseen data (as in paper trading).

I hope this helps.

Thanks!

Stephen_Miller_5Ebtj · July 14, 2021, 12:23pm

OK, I am all for hyperparameter tuning, but surely you are tuning a model in a situation where there is data leakage. With 5-fold CV, future values are being used to tune a model that predicts past values; is that not correct? There is also the debate about whether the excluded test sample is actually representative and what its variance is, and whether a nested cross-validation approach would not be better if you are going to use 5-fold.
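The leakage concern can be checked directly by looking at the indices each splitter produces. The sketch below assumes rows are sorted oldest to newest and counts the folds in which at least one training row comes later than the start of the test block; `count_leaky_folds` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit


def count_leaky_folds(splitter, n_samples):
    """Count folds in which some training row comes after the first test row."""
    placeholder = np.zeros((n_samples, 1))  # only the row count matters here
    leaky = 0
    for train_idx, test_idx in splitter.split(placeholder):
        if train_idx.max() > test_idx.min():
            leaky += 1
    return leaky


n_samples = 100  # rows assumed sorted oldest to newest
kfold_leaks = count_leaky_folds(KFold(n_splits=5), n_samples)
tss_leaks = count_leaky_folds(TimeSeriesSplit(n_splits=5), n_samples)
print(kfold_leaks, tss_leaks)
```

With unshuffled 5-fold, every fold except the last trains on rows that come after its test block, so 4 of the 5 folds are flagged; `TimeSeriesSplit` flags none.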

I am new to coding/Python/quant, so I don't want to assume I know anything, and I assume most people here doing the courses are also new: traders trying to learn some quant. I therefore think you should be including validation techniques for time series data, especially when your feature data is often lagged indicators or returns. K-fold is not correct. You need to show time series split methods, walk-forward, day-chaining, etc.; there is also a new library called sktime which tries to tackle this problem.

If you or your team want to message me privately, please do, as I am really looking for some good time series split code I can trust.

Satyapriya_Chaudhari_63X1f · July 14, 2021, 2:03pm

Yes, you have pointed that out correctly. Simply using 5-fold cross-validation will lead to data leakage. Using a time series split is preferable, since it always makes predictions based on past data. For example, if you have data for n months, the model starts with one chosen set of hyperparameters.

The model will first validate the values for the second month based on the data for the first month and record the error.
It will then validate the values for the third month based on the data for the first two months, and record the error.
And so on, until it validates the values for the nth month based on the first (n-1) months.
Then it will compute the average of all the errors.
The above steps are repeated for each set of parameters, and the set with the lowest average error is selected.

This can be done using the TimeSeriesSplit class of sklearn. You can find the code in the classification trading strategy.
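The month-by-month procedure can be sketched as an explicit loop. This is an illustration on synthetic data rather than the course code: the `Ridge` model, the candidate `alpha` values and mean squared error as the recorded error are all assumptions, and `TimeSeriesSplit` produces equal-sized validation blocks rather than calendar months.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)

# Synthetic stand-in for lagged features and next-period returns, oldest first.
X = rng.normal(size=(240, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 0.1 * rng.normal(size=240)

candidate_alphas = [0.01, 1.0, 100.0]  # hypothetical hyperparameter sets
splitter = TimeSeriesSplit(n_splits=5)

mean_errors = {}
for alpha in candidate_alphas:
    fold_errors = []
    # Each split trains on an expanding window and validates on the next block.
    for train_idx, val_idx in splitter.split(X):
        model = Ridge(alpha=alpha)
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[val_idx])
        fold_errors.append(mean_squared_error(y[val_idx], predictions))
    # Average the recorded errors for this hyperparameter set.
    mean_errors[alpha] = float(np.mean(fold_errors))

# Select the hyperparameter set with the lowest average error.
best_alpha = min(mean_errors, key=mean_errors.get)
```

To validate on true calendar months instead, the index pairs could be built by hand from the date column, keeping the same rule that training rows always precede validation rows.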

Thanks for pointing this out. We will make the necessary changes.

Regards,
Satyapriya