What are the ways to preprocess time-series data?

There are many ways to preprocess time-series data: we can model unprocessed prices directly, or take the first difference, daily returns, the difference to a rolling mean, or a rolling z-score. Many of these tricks lead to stationarity, so which one should we choose? Is there a general theory or method, or should we just backtest them all and pick the best one?

Hi Jintong,



You are correct in pointing out that there are different methods to convert raw price data into a stationary series. There is no fixed theory or method that fits all cases. With the various tools at your disposal, you need to find the one that best fits your needs. Extensive backtesting certainly helps and puts you in a position to choose the optimal method for a particular selection of assets.



Hope this helps!

If you can model unprocessed data directly, almost always choose to do so. Any processing loses information, and usually that is undesirable. That is why it is preferable to model pairs trading through cointegration, rather than through a simple regression on the returns (stationary data derived from the input prices) of the assets.
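
As a minimal sketch of the cointegration point, using the Engle-Granger test from statsmodels (the two price paths below are simulated stand-ins, not real assets):

```python
# Test a pair for cointegration before modelling the spread at the
# *price* level. Simulated data: a shared random walk plus stationary
# idiosyncratic noise for each asset.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import coint

rng = np.random.default_rng(0)
common = np.cumsum(rng.normal(0, 1.0, 1000))
price_a = pd.Series(100 + common + rng.normal(0, 0.5, 1000))
price_b = pd.Series(50 + 0.8 * common + rng.normal(0, 0.5, 1000))

# Engle-Granger: the null hypothesis is "no cointegration".
t_stat, p_value, _ = coint(price_a, price_b)
print(f"cointegration p-value: {p_value:.4f}")  # small p => cointegrated

# If cointegrated, the hedge-ratio spread is itself stationary and can be
# modelled directly, retaining price-level information that a plain
# regression on returns would throw away.
hedge_ratio = np.polyfit(price_b, price_a, 1)[0]
spread = price_a - hedge_ratio * price_b
```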

Unfortunately, most models cannot handle unprocessed data, be they statistical or their ML/AI counterparts. Most models rely on inferring, fitting, or learning a probability distribution on the input(s), and a non-stationary input (which is almost always the case for unprocessed prices) makes them go haywire by definition.
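
A quick illustration of that point, assuming statsmodels is available: an augmented Dickey-Fuller test typically fails to reject non-stationarity on a simulated price path, but does reject it on the returns of the same path.

```python
# ADF test on a simulated geometric random walk vs its returns.
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(42)
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 1000))))

p_price = adfuller(price)[1]                          # unit-root null usually NOT rejected
p_returns = adfuller(price.pct_change().dropna())[1]  # usually rejected

print(f"ADF p-value on prices:  {p_price:.4f}")
print(f"ADF p-value on returns: {p_returns:.4f}")
```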

So that is why we do the processing. The first type of processing is for stationarity. Here you want the transformation with the minimal information loss, e.g. differencing (or returns), or even fractional differencing if you are really particular. If you are differencing against rolling means, you are losing information in two steps: once during the rolling-mean smoothing, and once again during the differencing. Usually you will not want that (e.g. when you are trying to predict daily or minute returns).
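
For the fractional-differencing case, here is a minimal fixed-window sketch using the standard weight recursion w_0 = 1, w_k = -w_{k-1}(d - k + 1)/k; the order d and window length in the usage comment are illustrative choices, not recommendations.

```python
# Fixed-window fractional differencing sketch.
import numpy as np
import pandas as pd

def frac_diff(series: pd.Series, d: float, window: int = 100) -> pd.Series:
    # Pre-compute the truncated binomial weights for order d.
    weights = [1.0]
    for k in range(1, window):
        weights.append(-weights[-1] * (d - k + 1) / k)
    weights = np.array(weights[::-1])  # oldest observation first

    values = series.to_numpy(dtype=float)
    out = np.full(len(values), np.nan)
    for i in range(window - 1, len(values)):
        out[i] = weights @ values[i - window + 1 : i + 1]
    return pd.Series(out, index=series.index)

# d between 0 and 1 trades off stationarity against memory retention:
# d = 1 recovers plain first differencing, d = 0 leaves the series as-is.
# Example (illustrative order): fd = frac_diff(price_series, d=0.4)
```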

The second type of processing/filtering is when you actually want to reduce information, or more precisely, to capture long-term trends by filtering out the high-frequency components of a time series. This is very common in econometrics (e.g. the Hodrick–Prescott filter). If that is your objective (filtering out high frequencies, e.g. when you are trying to predict monthly or quarterly price moves), then differencing against a rolling mean makes more sense.
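
If the HP filter is what you are after, statsmodels ships an implementation; a short sketch on simulated data (lamb = 1600 is the conventional smoothing value for quarterly series; larger values, e.g. 129600, are customary for monthly data):

```python
# Hodrick-Prescott trend/cycle decomposition via statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.filters.hp_filter import hpfilter

rng = np.random.default_rng(1)
price = pd.Series(100 + np.cumsum(rng.normal(0.05, 1.0, 400)))

cycle, trend = hpfilter(price, lamb=1600)
# `trend` keeps the low-frequency component; `cycle` is what was filtered
# out. Differencing against `trend` (or a rolling mean) is the deliberate
# information-reduction case described above.
```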

There is a third reason to process/filter data: to enrich your model inputs, i.e. feature engineering. For such cases, you would probably want to apply all the methods you mentioned and feed the results into your model (probably a supervised ML model). Here you can even get lucky and capture information at the price level (e.g. via topological data analysis or advanced pattern/motif analysis), which extracts higher-order features without impairing a probability model with non-stationarity.
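
A sketch of that feature-engineering setup, computing every transform from the original question as a column (the 20-period window is an illustrative choice):

```python
# Build a feature matrix from the transforms listed in the question and
# let a downstream supervised model weigh them.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
price = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))

roll_mean = price.rolling(20).mean()
roll_std = price.rolling(20).std()

features = pd.DataFrame({
    "diff_1": price.diff(),                       # first difference
    "ret_1": price.pct_change(),                  # daily return
    "diff_roll_mean": price - roll_mean,          # difference to rolling mean
    "roll_zscore": (price - roll_mean) / roll_std,  # rolling z-score
}).dropna()
```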

And most importantly, avoid choosing your method based on "extensive backtesting". It is a sure-fire way to get into the data-fitting/p-hacking trap, especially for financial time series.