The general recommendation that I have seen for experimenting with BlueShift code (that comes packaged with the course material) is to tweak the input parameters and see what works.
Should we not assess whether the results are statistically significant, to rule out favourable results occurring by chance?
As far as momentum strategies are concerned, I have seen the use of Pearson's method to statistically establish whether lookback returns are correlated with holding-period returns in your backtests.
Why is this test (or any other proving the statistical significance of your test results) not emphasised for use with every strategy?
Am I missing something here?
Hi Neel,
The recommendation to tweak the input parameters is to make you familiar with the parameters in the code. Also, if any parameter optimisation is required, it should be done on the training dataset after dividing the entire data into training and testing sets.
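A minimal sketch of that split (the 70/30 ratio and the synthetic price series here are just illustrative assumptions, not a rule from the course):

```python
import numpy as np

# Synthetic daily closing prices standing in for real historical data
prices = 100 + np.cumsum(np.random.default_rng(0).normal(0, 1, 1000))

# Optimize parameters only on the training slice; judge the result on the held-out test slice
split = int(len(prices) * 0.7)
train, test = prices[:split], prices[split:]
```

Any parameter chosen on `train` should then be evaluated once on `test`, not re-tuned on it.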
The statistical significance can be checked, but one needs to be very careful and avoid p-hacking while doing so. For more on this, the following paper is a good place to start.
https://www.cmegroup.com/education/files/backtesting.pdf
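As a simple starting point (before worrying about the multiple-testing corrections the paper discusses), a one-sample t-test on the strategy's daily returns checks whether the mean return is distinguishable from zero. The returns series below is synthetic, just to show the mechanics:

```python
import numpy as np
from scipy import stats

# Synthetic daily strategy returns with a small positive drift (illustrative only)
rng = np.random.default_rng(1)
strategy_returns = rng.normal(0.001, 0.01, 500)

# Null hypothesis: the mean daily return is zero
t_stat, p_value = stats.ttest_1samp(strategy_returns, 0)
```

Remember that running this test on many parameter variants and keeping the best p-value is exactly the p-hacking the paper warns about.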
Hope this helps!
Thanks,
Akshay
Hi Akshay,
Thanks for getting back. I really won't be able to grasp what's in the paper very well…
I feel that checking the confidence level using a hypothesis test will give more confidence in the results.
In plain English what does p-hacking mean?
What do I need to take care of?
Thanks,
Neel
In strategy backtesting, p-hacking refers to a situation where you optimize your strategy parameters until you get an impressive performance result, but when you go live with your strategy, it fails badly. This situation is also called backtesting overfitting.
You have to take care not to overfit your strategy to the data you give as input. This is because when you overfit, you might see tremendous profitability with your strategy. You might feel confident, and when you start real trading (and maybe use leverage), you would probably lose a lot of money. Backtesting overfitting might lead you to find the best parameters and the best patterns in your historical data. But the future doesn't always repeat the past, so the strategy might not perform as well in real trading as it looked on historical data.
Marcos López de Prado in his book proposes some alternatives to avoid backtesting overfitting:
- Create models for a stock universe instead of a single security. This will allow you to reduce the probability of finding a false good strategy.
- Apply a machine learning bagging algorithm. Chapter 6 of his book explains why it's a better option.
- Record every backtest you make so its probability of backtesting overfitting can be estimated. This will help to calculate a deflated Sharpe ratio. Please refer to this link and others to comprehend the computation.
- Instead of backtesting your model on a single dataset, which is your historical data, you can run a Monte Carlo simulation creating a huge number of random future prices for your stock and backtest your strategy on both the historical and simulated data. This will put your strategy through different scenarios and reduce p-hacking.
These are some of the recommendations López de Prado makes in his book.
About bagging, please refer to our blog article to learn more.
I hope this clarifies your doubt,
Feel free to respond with some other questions in case you have them,
We'll be more than happy to help you,
Thanks and regards,
Jose Carlos
Thanks Jose Carlos for sharing your insights on backtest overfitting.
My thinking has been to get to a stage where I develop confidence on my strategies based on only statistical techniques to start with.
Once I get there I can start with optimizing my algorithms by using ML. Do you think this is the right approach to go with?
I had a quick look at your technical note on bagging. There are a few inputs needed (predictor variables) for which I don't have a concrete way to arrive at values. I think more reading may be needed.
I really want to try out the monte carlo simulations that you suggested. Your comment -
"you can make a monte carlo simulation creating a huge number of future random prices for your stock".
Want to clarify from you - does monte carlo simulation generate random future prices or would generate likely outcome based on historical data?
Does the Python library on its own derive what kind of probability distribution the input dataset has and accordingly derive the future dataset, or does the user need to provide such inputs?
Hello Neel,
Glad to know you have a clear path to move forward in your journey on algorithmic trading.
Let's try to answer your doubts and queries:
- "My thinking has been to get to a stage where I develop confidence on my strategies based on only statistical techniques to start with.
Once I get there I can start with optimizing my algorithms by using ML. Do you think this is the right approach to go with?"
ML is not the only way to optimize your strategy parameters. Take, for example, a moving average strategy, which uses a parameter called "window". This window can be 20. This number 20 will then be used to calculate the 20-day moving average. This moving average can then be used to get buy and sell signals to trade any stock. So how can we optimize this strategy? Well, we can change 20 to any number we'd like to use. We can have a range between 2 and 200 for the "window" period and try to optimize, e.g., the Sharpe ratio of the strategy while changing the window period. The window period that gives you the highest Sharpe ratio will be the optimized parameter of your, now, optimized strategy.
Here is a sketch of this in Python (assuming "prices" is a pandas Series of daily closing prices):
import numpy as np
import pandas as pd

sharpe_ratios = {}
for i in range(2, 200):
    ma = prices.rolling(i).mean()  # i-day moving average
    signal = np.where(prices > ma, 1, -1)  # long above the MA, short below
    strategy_returns = pd.Series(signal, index=prices.index).shift(1) * prices.pct_change()
    # Annualized Sharpe ratio, assuming 252 trading days and a zero risk-free rate
    sharpe_ratios[i] = strategy_returns.mean() / strategy_returns.std() * np.sqrt(252)

best_window = max(sharpe_ratios, key=sharpe_ratios.get)  # the optimized parameter
- "Want to clarify from you - does monte carlo simulation generate random future prices or would generate likely outcome based on historical data?"
What you can do is get the mean and standard deviation of the returns of the stock you want to invest in and create, for example, 10,000 simulated stock price paths based on that mean and standard deviation. You then backtest your strategy on each of these simulated price paths. If you think the average Sharpe ratio is good enough, then you can say your strategy is good. Notice that the historical data is used only to calculate the mean and standard deviation; these two inputs are then used to generate the simulated stock prices. Don't forget to use the price returns to get the mean and standard deviation. Don't get the two metrics from stock prices; get them from stock price returns.
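A minimal sketch of that simulation (the price series, seed, and one-year horizon are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic historical prices standing in for real data
prices = 100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, 750)))

# Get the two moments from returns, not from prices
log_returns = np.diff(np.log(prices))
mu, sigma = log_returns.mean(), log_returns.std()

# 10,000 simulated one-year price paths drawn from those two moments
n_paths, horizon = 10_000, 252
sim_returns = rng.normal(mu, sigma, size=(n_paths, horizon))
sim_prices = prices[-1] * np.exp(np.cumsum(sim_returns, axis=1))
```

Each row of `sim_prices` is one simulated path on which the strategy can be backtested.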
You can reference this article to create this monte carlo simulation.
I hope these comments answer your queries,
If you have any more doubts,
Please let us know,
Thanks
Jose Carlos
@Neel, yes, your approach is pretty solid. Statistical modelling, and understanding these statistical models and their behaviour, develops a solid foundation in quant finance, upon which you can take on your ML journey.
For the Monte Carlo simulation, it is really up to you. The basic Monte Carlo used in finance is usually some version of a Markov chain (as in Markov chain Monte Carlo, MCMC) - e.g. a random walk. Of course, if your strategy is based on predicting future prices and you test it on random-walk data, it should not perform (if it does, you should be very suspicious). Generating Monte Carlo data for backtesting alpha strategies is tricky (compared to, say, computing portfolio risk, which is relatively easier). You need to capture the behaviour of the market that defines your alpha in the price evolution. For example, for a momentum strategy, it is usually returns autocorrelation. You can estimate an AR model on past data, change the autocorrelation assumption, generate price data, and then backtest on that generated data. If this is not very clear to you, perhaps you can read a bit more about momentum, ARMA models and Markov chain Monte Carlo, and circle back here for more clarification.
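A minimal sketch of that idea, generating a price path from an AR(1) returns process (the autocorrelation value, volatility, and length are illustrative assumptions, not estimates from real data):

```python
import numpy as np

rng = np.random.default_rng(7)

phi = 0.15        # assumed returns autocorrelation - vary this to stress the strategy
sigma_eps = 0.01  # innovation volatility, an illustrative assumption
n_days = 1000

# AR(1) return process: r_t = phi * r_{t-1} + eps_t
eps = rng.normal(0, sigma_eps, n_days)
returns = np.zeros(n_days)
for t in range(1, n_days):
    returns[t] = phi * returns[t - 1] + eps[t]

# Price path to backtest the momentum strategy against
sim_prices = 100 * np.exp(np.cumsum(returns))
```

Setting `phi = 0` gives a pure random walk, on which a momentum strategy should show no edge.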
@Jose, your example of finding the optimal window size is a classic example of p-hacking!
Hello Prodipta,
Yes! you're completely right! The example I gave is a common way of p-hacking. I forgot to mention that. Thanks for clarifying this important point to Neel.
Machine learning algorithms applied to historical data to find hidden patterns can also amount to p-hacking if you backtest multiple times on the same historical data until you find an accuracy that gives you, e.g., a good Sharpe ratio. We should also be careful while applying ML algos in backtesting.
Monte Carlo simulation is one of the ways to reduce overfitting. There are many ways; while backtesting multiple times throughout your journey in algo trading, you will learn the intricacies and get better at producing a good backtest.
According to López de Prado (2018), many of the research papers published in prestigious journals that concentrate on backtesting suffer from some flaw in their procedure. Luo et al. (2014), in their study "Seven Sins of Quantitative Investing", highlight these usual suspects (summary by López de Prado, 2018):
- Survivorship bias: Using as investment universe the current one, hence ignoring that some companies went bankrupt and securities were delisted along the way.
- Look-ahead bias: Using information that was not public at the moment the simulated decision would have been made. Be certain about the timestamp for each data point. Take into account release dates, distribution delays, and backfill corrections.
- Storytelling: Making up a story ex-post to justify some random pattern.
- Data mining and data snooping: Training the model on the testing set.
- Transaction costs: Simulating transaction costs is hard because the only way to be certain about that cost would have been to interact with the trading book (i.e., to do the actual trade).
- Outliers: Basing a strategy on a few extreme outcomes that may never happen again as observed in the past.
- Shorting: Taking a short position on cash products requires finding a lender. The cost of lending and the amount available is generally unknown, and depends on relations, inventory, relative demand, etc.
Besides, while learning, you will find yourself with common techniques like parameter optimization (like the example I gave), walk-forward optimization, machine learning algorithms, stationarity, random walk, etc. So López de Prado (2018) and Chan (2021) books will help you in your learning. Besides, Quantra courses are a good resource from which you can learn strategies properly before you start doing some backtesting with them.
Neel, if you have more doubts, please let us know!
Thanks Prodipta, thanks Neel,
Regards,
Jose Carlos
References:
Chan, E. (2021). Quantitative Trading: How to Build Your Own Algorithmic Trading Business. John Wiley & Sons.
López de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons.
Hi Prodipta and Jose Carlos,
The original question for which I am looking for an answer is what tools/methods I should use to get more confidence on backtest results.
I was trying to see how I can use null hypothesis testing to verify backtest results.
@Prodipta, expanding a bit on what I think you were referring to :
Hypothesis testing can't be applied to stock price time series data since it is not stationary.
It is a well-established fact that, more often than not, the first difference of the stock price will make the series stationary, i.e. the series is integrated of order 1 (I(1)); this corresponds to the stock price returns series.
Next on this series I will need to find auto-correlation coefficients of the lagged terms that affect future prices.
Once I derive the equation stitching together the instantaneous values with lagged values, future price can be forecasted.
The forecasted price returns series can be compared with the actual series in the same holding period.
The residuals of the price returns (i.e. the differences between the actual values and those from the model) could be passed to the ADF function.
The ADF function internally uses null hypothesis testing to determine whether the residuals are stationary or not.
If the residuals are stationary, it would mean that the actual values hover around the forecasted values in either direction, which in turn would mean the AR model is pretty accurate.
For all the above tasks there are python libraries.
We can repeat the above steps on a rolling window.
Holding period = 1 week
Lookback period = 3 months
I can move back the holding-period and lookback-period windows in small steps, say 1 day.
I can then evaluate for say past 10 days and check if the algorithm performs according to my set criteria.
It could be like - 75% of the time the AR model accuracy should be good (p-value < 0.1), and
within that 75% my drawdown should not be greater than x% and profit should have been at least y% within the holding period.
From a theoretical standpoint, does the above sound good?
Most importantly from a practical real world standpoint will I really get positive outcomes or even be able to short list a reasonably big universe of stocks based solely on the above?
Fundamental factors could then be applied on the universe and a shortlist for actual deployment could be arrived at.
@Jose Carlos,
I have gone through the paper on Monte Carlo simulation. My understanding is that, in essence, the simulation will produce lots of datasets with a similar distribution, i.e. the same mean and standard deviation as the lookback data.
Can each such dataset be considered equivalent to the forecasted data derived from the AR model that Prodipta was referring to, and be treated the same way to conclude whether the backtest results are statistically significant?
Thanks for your valuable suggestions.
-Neel
Hello Neel,
So cool you are asking interesting questions,
I'll try to answer them,
- You said: "The Hypothesis testing can't be applied on stocks prices time series data since that is not stationary. It is well established fact that most often than not the first difference in stock price will make the series stationary ie… stationary at level 1 (I(1)) which is the stock prices returns series."
Answer: Price series usually aren't stationary. Some price series are I(1), some I(2), etc. The formal way to find out the order of integration is by applying a unit root test to the price series.
a) First you apply a unit root test to the price series. If you reject the random walk hypothesis, then you say the price series is I(0) or stationary. In case you cannot reject it, you have to apply a unit root test for the first difference of the price series.
b) While applying a unit root test to the first difference, if you reject the random walk hypothesis, then you say the price series is I(1) and isn't stationary. If you cannot reject it, then you have to apply a unit root test for the second difference of the price series.
c) While applying a unit root test to the second difference, if you reject the random walk hypothesis, then you say the price series is I(2) and isn't stationary. If you cannot reject it, then you have to apply a unit root test for the third difference of the price series.
d) You continue to apply the unit root test until you find the correct order of integration of the price series.
- You said: "Next on this series I will need to find auto-correlation coefficients of the lagged terms that affect future prices."
Answer: I don't understand what you meant here. Do you mean an ARMA model?
The ARMA model is always applied to stationary series. If you have a price series that is I(d), then you need to use the "d" difference of the price series as the correct input to use it for the ARMA model.
- "Once I derive the equation stitching together the instantaneous values with lagged values, future price can be forecasted."
Answer: You can forecast the price series with an ARMA model.
a) If your price series is I(1), then you can create an ARIMA(p,1,q) model.
b) If your price series is I(2), then you can create an ARIMA(p,2,q) model, and so on.
- "The forecasted price retruns series can be compared with the actual series in the same holding period."
Answer: Yes you can compare actual vs forecasted returns.
- "The residuals of the price returns (ie… differences between actual and that from the model) could be passed to ADF function. "
Answer: I think you have two misunderstandings here.
a) One thing is the difference between the in-sample predicted values and the actual values of the stationary series, and
b) Another thing is the difference between the out-of-sample forecasted values and the actual values of the stationary series.
c) If you have created an ARMA model based on a stationary process, then there is no need to apply a unit root test to the residuals, since the process is stationary. This assumes, e.g., that you apply an ARMA model to the first difference of an I(1) process. The forecasted values, the actual values and the residuals of the first difference of the price series are all stationary by nature, since you correctly applied the ARMA model to a stationary process.
- "The ADF function internally uses NULL hypothesis testing to determine if the residuals are stationary or not."
Answer: Unit root tests are applied to asset prices to check for stationarity. Practitioners do apply unit root tests to model residuals, but typically when the model in question is a regression of one asset price series "A" on another asset price series "B" (i.e., a cointegration test).
- "If the residual are stationary it would mean that the actual values hover around the forecasted values in either direction which in turn would mean the AR model is pretty accurate."
Answer: This is a misunderstanding. The best way to find out whether an AR, MA or ARMA model is a good model is to apply the Box-Jenkins methodology. First you fit the best ARMA model and then forecast the price series with this best model. Again, there is no need to apply a unit root test, since the process you work on for the ARMA model is already stationary.
In case you want to compare different models' forecasted values, then you need to use other metrics, like the Mean Squared Error or the Mean Absolute Error, applied to the residual series obtained as the difference between the forecasted values from each model and the actual values.
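Those two error metrics take one line each with numpy; the actual and forecasted return values below are made-up numbers just to show the computation:

```python
import numpy as np

# Illustrative actual vs forecasted return values
actual = np.array([0.010, -0.004, 0.007, 0.002])
forecast = np.array([0.008, -0.001, 0.005, 0.004])

mse = np.mean((actual - forecast) ** 2)   # Mean Squared Error
mae = np.mean(np.abs(actual - forecast))  # Mean Absolute Error
```

The model whose forecasts give the lowest MSE (or MAE) on the out-of-sample data is the one to prefer.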
Hi Jose Carlos,
Thanks for going through my answers in detail and providing your feedback on it.
First I want to correct a statement I made: "Hypothesis testing can't be applied to stock price time series data since it is not stationary." Actually, hypothesis testing can be applied to any dataset, but the ideal one is a dataset that is i.i.d. (independent and identically distributed random variables), with which you will get the best results.
Anyway, in the context of what is being discussed, I wanted to state that there is a need to make data stationary because many useful analytical tools, statistical tests and models rely on it, like the ARMA model I was referring to in my post.
About comparing the time series, I was referring to the comparison between the forecasted series and the out-of-sample (i.e. test) dataset. I get your point that finding the Mean Squared Error and Mean Absolute Error is the best way to compare the two series. Is this comparison good enough, or do I need to evaluate other parameters also?
In your experience, what values of these should be considered a good fit?
I am thinking of using the OLS() method of the Python statsmodels package with the forecasted series as the dependent and the test series as the independent variable. I presume I should be able to get all that is required for assessing how good the fit is from the resulting stats model properties.
Also, on the Monte Carlo simulation - can I consider the datasets produced by the simulation as equivalent to those produced by the ARMA model? I can then apply MSE etc. to judge whether the simulation is good and, if it is, use it for assessing the performance of the strategy during the holding period?
Thanks,
Neel
Hello Neel,
Let's try to answer per each comment again:
1) First I want to correct a statement I made - :"The Hypothesis testing can't be applied on stocks prices time series data since that is not stationary." Actually hypothesis testing can be applied on any dataset but the ideal one is a data set that is iid (random variables) using which you will get best results.
Answer: Any descriptive or inferential statistics test can only be applied to series which are stationary. For example, the Jarque-Bera test (to check the normal distribution of the series), a simple t-test, an F-test, a Ljung-Box test, etc., can't be applied to non-stationary data. Unit root tests are a special case which can be applied to non-stationary data.
2) Anyways in the context of what is being disucssed I wanted to just state that there is a need to make data stationary because many useful analytical tools and statistical tests and models rely on it, like the ARMA model I was referring to in my post.
Answer: Yes, the statistical tools can only work on stationary data, as explained above. Unit root tests are a special case, they can be used in non-stationary data.
3) About comparing the time series I was referring to comparison between forecasted series and out of sample (ie… test data set). I get your point that finding Mean Squared Error, Mean Absolute Error is the best way to compare the two series. Is this comaprison good enough or need to evaluate other paramaters also?
Answer: In case the series to be tested are numeric, you can use the MSE or MAE to choose the best model. In case the series to be tested is categorical (e.g., 1 for positive returns and 0 for negative returns), then you can create an accuracy ratio and compare the models with it. A more advanced method for time series is the "Model Confidence Set" created by Hansen et al. (2011). The rugarch package in R provides the test.
4) In your experinece what values of these should be considered as a good fit ??
Answer: The models with lowest MSE or MAE values should be chosen.
5) Thinking of using OLS() method of python stats package with forecasted series as dependent and test series as independent variables. I presume I should be able to get all that is required for assessing how good the fit is form the resulting stats model properties.
Answer: Wrong procedure. For in-sample fit checking in a regression model, you can check the R-squared. For out-of-sample values, the MSE or MAE are used.
6) Also on the Monte carlo simulation - can I consider data sets produced by the simualtion as equivalent to that produced by ARMA mode? I can then apply MSE etc to judge if the simulation is good and if it is good I can use it for assessing the performance of strategy during the holding period?
Answer: Monte Carlo simulations are not meant to be judged on whether they resemble the real series or not. A Monte Carlo simulation is done whenever you need to create simulated future prices to use as input to, e.g., compare the possible out-of-sample variance of different strategies and choose the strategy with the lowest out-of-sample variance.
Neel, I highly recommend reading Gujarati's book on econometrics. It's very useful for understanding inferential statistics and econometrics. It's a very basic and easy-to-read book. Many of your questions are related to this topic.
I hope my answers and the recommended book can help you with your doubts,
Thanks and regards,
Jose Carlos
Thanks again for your guidance Jose Carlos.
The book by Gujarati looks to be a great free resource for getting the basics in place!
Went through the explanation of Monte Carlo simulation in the book, and I now see what you mean with respect to the application of Monte Carlo simulations.
I have a few ideas now on using ARMA and Monte Carlo simulations in my strategy to tell which statistical model is appropriate.
I'll try out few things and circle back with you guys.
I think determining the entry and exit points is another topic for consideration.
My sense is to programmatically construct a candle scanner and apply some combination of volatility, trend reversal and such to generate entry/exit signals …
Hello Neel,
Great to know we helped you out with your doubts.
Regards,
Jose Carlos