Why are we still able to use decision trees with time-series data? When using decision trees, how do we know that the temporal nature of our data will remain intact?
Why do we use regular K-fold CV for time-series data? Wouldn't this destroy the time ordering of the data?
1) For decision trees, it is advisable to train and test the model on stationary features. It is difficult to tell when the nature of the data changes, but a good practice is to retrain and test the model at fixed intervals on newer data. More on this is covered in Section 8: Challenges in Live Trading of Decision Trees in the Trading course.
2) If you want to maintain the order of the data, you can use the code below. Here we use KFold without shuffling the data. I agree that this approach is closer to a realistic setup, and we will add it to the notebook.
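from sklearn.model_selection import KFold, cross_val_score
# Keep rows in their original time order: contiguous folds, no shuffling
kf = KFold(n_splits=3, shuffle=False)
# random_forest, X and y are assumed to be defined earlier in the notebook
scores = cross_val_score(random_forest, X, y, cv=kf)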
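One caveat worth adding: even without shuffling, KFold trains on rows that come after the test fold for every fold except the last one. If you want each fold to train only on the past, scikit-learn's TimeSeriesSplit is one option. Below is a minimal sketch, not the notebook's code. The synthetic X and y and the RandomForestClassifier settings are placeholders I made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic placeholder data (hypothetical), rows ordered oldest to newest
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (rng.random(120) > 0.5).astype(int)

# Expanding-window splits: each test block follows its training block in time
tscv = TimeSeriesSplit(n_splits=3)

# Collect (last training row, first test row) for each fold to inspect the ordering
fold_bounds = [(train_idx.max(), test_idx.min()) for train_idx, test_idx in tscv.split(X)]

# Placeholder model settings, chosen only to keep the example fast
random_forest = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(random_forest, X, y, cv=tscv)
```

In every fold the last training index is smaller than the first test index, so the model is never scored on data older than what it was trained on.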
Hello - thank you for the reply. That makes more sense.
I wanted to follow up on the first point about decision trees. Why do the features for a decision tree have to be stationary? From my understanding, decision trees are non-parametric and do not assume any particular form for the underlying data. If that is so, why is it advisable to pass stationary features to train and test our model?
That's a good question. Let me take an example to explain the requirement of stationarity. Suppose you pass the close price of Apple stock to train your model. A model trained 3-4 years ago would have built its decision tree from the close prices prevailing at that time. For example, it might have learned the rule: buy if the close price < $100, else sell. If you apply this rule to Apple's current close price of $250+, it no longer makes sense, because the price never falls below the threshold.
Hence the requirement that the feature be stationary. One way to make the feature stationary in this case is to use the percentage change in the close price instead of the price itself. The model then remains relevant across time periods.
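To make this concrete, here is a small sketch using pandas. The prices are made-up numbers for two hypothetical periods, not real Apple quotes, and the variable names are my own.

```python
import pandas as pd

# Hypothetical close prices from two periods (illustrative numbers, not real quotes)
early = pd.Series([100.0, 102.0, 99.96])
late = pd.Series([250.0, 255.0, 249.9])

# A rule learned on the price level ("close < 100") fires in the early period only
level_rule_early = bool((early < 100).any())
level_rule_late = bool((late < 100).any())

# Percentage change puts both periods on the same scale
early_ret = early.pct_change().dropna()
late_ret = late.pct_change().dropna()

print(level_rule_early, level_rule_late)
print(early_ret.round(4).tolist(), late_ret.round(4).tolist())
```

The price levels of the two periods do not overlap at all, but their percentage changes are the same (+2%, then -2%), so a split learned on returns in one period still applies in the other.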