Hi,
In the description for the project, the following is given:
"The minute frequency data for the ten tickers is stored in the capstone_data_2020_2021.bz2
pickle file."
However, when I write out the dataframe to a csv file (after reading the pickle file), it seems to me that the data is actually 15-minute data. I have pasted the first few lines of the csv file for the 'BAC' symbol. What am I missing?
PS: The data in the model solution also has the same issue…
Regards,
Srinivas
,Open,High,Low,Close,Volume
2019-01-02 09:30:00,24.07,24.09,24.01,24.06,1078894.0
2019-01-02 09:45:00,24.06,24.64,24.04,24.54,4169202.0
2019-01-02 10:00:00,24.53,24.58,24.41,24.53,2175107.0
2019-01-02 10:15:00,24.52,24.67,24.48,24.64,2953343.0
2019-01-02 10:30:00,24.64,24.83,24.63,24.76,3012152.0
2019-01-02 10:45:00,24.76,24.91,24.73,24.91,2656067.0
2019-01-02 11:00:00,24.9,25.0,24.88,24.93,3306992.0
2019-01-02 11:15:00,24.93,25.06,24.88,24.99,3081897.0
2019-01-02 11:30:00,24.99,25.02,24.9,24.93,2357494.0
2019-01-02 11:45:00,24.93,25.0,24.88,24.98,1855745.0
2019-01-02 12:00:00,24.97,24.98,24.84,24.84,1507738.0
2019-01-02 12:15:00,24.84,24.89,24.82,24.85,1797244.0
2019-01-02 12:30:00,24.85,24.86,24.78,24.83,1147466.0
2019-01-02 12:45:00,24.83,24.94,24.79,24.94,1442801.0
2019-01-02 13:00:00,24.93,24.97,24.9,24.94,946553.0
2019-01-02 13:15:00,24.94,25.03,24.94,25.01,1397477.0
2019-01-02 13:30:00,24.99,25.08,24.96,25.08,1336732.0
2019-01-02 13:45:00,25.06,25.09,25.02,25.06,1777980.0
2019-01-02 14:00:00,25.06,25.19,25.05,25.09,1918104.0
2019-01-02 14:15:00,25.1,25.11,24.63,24.99,1884455.0
2019-01-02 14:30:00,24.98,25.04,24.98,25.02,1373447.0
2019-01-02 14:45:00,25.02,25.08,24.97,25.0,1141071.0
2019-01-02 15:00:00,25.0,25.04,24.93,24.94,1265584.0
2019-01-02 15:15:00,24.95,24.98,24.63,24.9,1695508.0
2019-01-02 15:30:00,24.91,24.94,24.82,24.84,2276863.0
2019-01-02 15:45:00,24.84,24.95,24.82,24.91,2047844.0
2019-01-02 16:00:00,24.92,25.03,24.87,24.96,5664076.0
2019-01-04 09:30:00,25.09,25.23,25.02,25.17,1279558.0
2019-01-04 09:45:00,25.17,25.28,25.1,25.26,5750044.0
2019-01-04 10:00:00,25.26,25.31,25.21,25.25,5054882.0
2019-01-04 10:15:00,25.24,25.27,25.09,25.13,2649199.0
2019-01-04 10:30:00,25.12,25.34,25.05,25.29,8108064.0
2019-01-04 10:45:00,25.28,25.31,25.14,25.15,6874410.0
2019-01-04 11:00:00,25.16,25.23,25.07,25.24,3961354.0
2019-01-04 11:15:00,25.24,25.35,24.56,25.29,3396469.0
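For what it's worth, here is a quick diagnostic sketch I used to confirm the bar spacing (using just a handful of the timestamps pasted above, not the full file):

```python
import pandas as pd

# A few timestamps copied from the csv above
idx = pd.to_datetime([
    "2019-01-02 09:30:00",
    "2019-01-02 09:45:00",
    "2019-01-02 10:00:00",
    "2019-01-02 10:15:00",
])
bars = pd.DataFrame({"Close": [24.06, 24.54, 24.53, 24.64]}, index=idx)

# The most common gap between consecutive bars reveals the true frequency
gap = bars.index.to_series().diff().mode()[0]
print(gap)  # 0 days 00:15:00 -> 15-minute bars, not 1-minute
```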
Hi Srinivas,
Thanks for pointing this out. We have updated the capstone project zip files. The notebook will be updated shortly.
Thanks!
Hi Gaurav,
Thanks for addressing this. Hope you will let me know once it is updated so that I can download it again.
Regards,
Srinivas
Hi Srinivas,
Thank you for your patience. The data has been uploaded and the notebook has also been updated accordingly.
Let me know if you need any help!
Regards,
Gaurav
Hi Gaurav,
Got back to this after a short break and downloaded the updated notebook again. The data now looks fine. However, I was wondering whether the code in the solution template notebook is accurate. I am seeing issues in the implementation logic of the 'Data Sanity' section. Is this intentional, for us to find? :)
Regards,
Srinivas
Hi Srinivas,
Thanks for the feedback; we have corrected a possible bug in that function. Please let us know if you have any more feedback/queries!
Regards,
Gaurav
Hi Gaurav,
Thanks for the quick response. When can I download it?
Regards,
Srinivas
Hi Gaurav,
I downloaded it anyway and got the updates. However, I think there is still a problem in the logic. I believe the following two lines in the function:
to_delete = list(pd.to_datetime(datapoints_day.index))
return price_data[~(price_data.index.isin(to_delete))]
should be modified to:
to_delete = list(pd.to_datetime(datapoints_day.index).strftime('%Y-%m-%d'))
return price_data[~(pd.to_datetime(price_data.index).strftime('%Y-%m-%d').isin(to_delete))]
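To make the distinction concrete, here is a small self-contained sketch (toy data and a simplified version of the check, not the actual notebook function):

```python
import pandas as pd

# Hourly index spanning two days; the first day has only 2 bars (incomplete)
idx = pd.to_datetime([
    "2018-01-02 15:00:00", "2018-01-02 16:00:00",
    "2018-01-03 10:00:00", "2018-01-03 11:00:00", "2018-01-03 12:00:00",
])
price_data = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# Suppose days with fewer than 3 bars should be dropped entirely
counts = price_data.groupby(price_data.index.strftime("%Y-%m-%d")).size()
bad_days = counts[counts < 3].index

# Comparing at the *date* level removes every row of an incomplete day;
# comparing raw timestamps would only remove exact timestamp matches
cleaned = price_data[~price_data.index.strftime("%Y-%m-%d").isin(bad_days)]
print(cleaned)
```

Only the three rows of 2018-01-03 survive; both rows of the incomplete first day are dropped.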
Please confirm if this is correct and release a new version for the same if you agree.
Regards,
Srinivas
Hi Srinivas,
There is no binding reason to use the strftime function. The current code can handle the index comparison, as can be seen in the example below:
The dummy False values indicate which rows are being dropped in the sample output.
Hope this helps!
Thanks,
Gaurav
Hi Gaurav,
I think there is some disconnect. I modify the code to something like below:
for asset, asset_data in resampled_asset_data.items():
    print(asset)
    print(asset_data.head(3))
    asset_data = asset_data[2:]
    print(asset_data.head(3))
    resampled_asset_data[asset] = sanity_check(asset_data, asset)
    print(resampled_asset_data[asset].head(3))
Here, I am dropping the first 2 rows for each ticker. That is, for the first day, only 5 hourly timestamps would be present. So, if the logic in the function works correctly, it should drop the first day, right? But as you can see in the output below, resampled_asset_data still contains the first day (even though it does not have the required number of timestamps, which is 7):
BAC
                      Open   High    Low  Close      Volume
2018-01-02 10:00:00  29.74  29.80  29.61  29.66  10626570.0
2018-01-02 11:00:00  29.66  29.75  29.64  29.73   9047821.0
2018-01-02 12:00:00  29.73  29.77  29.63  29.65   7166519.0
                      Open   High    Low  Close     Volume
2018-01-02 12:00:00  29.73  29.77  29.63  29.65  7166519.0
2018-01-02 13:00:00  29.67  29.74  29.65  29.70  3924344.0
2018-01-02 14:00:00  29.71  29.94  29.69  29.73  3333392.0
Removing 1 data points.
                      Open   High    Low  Close     Volume
2018-01-02 12:00:00  29.73  29.77  29.63  29.65  7166519.0
2018-01-02 13:00:00  29.67  29.74  29.65  29.70  3924344.0
2018-01-02 14:00:00  29.71  29.94  29.69  29.73  3333392.0
Adding the strftime() in the function as per my previous post overcomes this issue. Do you agree?
Regards,
Srinivas
Hi Srinivas,
Your reasoning is absolutely correct; after checking the code, we have updated the capstone solution accordingly. Thanks for your valuable feedback!
Regards,
Gaurav Singh
Thanks Gaurav.
One more issue in the Screener section:
# Apply the filtering criteria for each asset
for asset in tickers:
    # Fetch the current asset data
    asset_data = multi_asset_data[asset][:split]
Shouldn't you be using the data after the sanity check here? The above statement is using the 1-minute raw data.
BTW, once you change this to use the data after the sanity check, the screener filters out everything with the current thresholds. That is, there are no tickers left for the next stage of processing.
Regards,
Srinivas
Hi Gaurav,
The last step of the Performance analysis is giving errors. Can you please help?
Regards,
Srinivas
----
Start date           2019-03-22
End date             2021-02-11
Total months         33
                     Backtest
Annual return        2.2%
Cumulative returns   6.2%
Annual volatility    11.9%
Sharpe ratio         0.24
Calmar ratio         0.09
Stability            0.10
Max drawdown         -24.4%
Omega ratio          1.05
Sortino ratio        0.35
Skew                 0.07
Kurtosis             3.72
Tail ratio           0.98
Daily value at risk  -1.5%
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-299af769327e> in <module>
      3 """
      4 # Pass the daily return to pyfolio
----> 5 pf.create_simple_tear_sheet(
      6     trade_returns['Portfolio_Returns'].resample('1D').sum())

~/anaconda3/lib/python3.8/site-packages/pyfolio/plotting.py in call_w_context(*args, **kwargs)
     50     if set_context:
     51         with plotting_context(), axes_style():
---> 52             return func(*args, **kwargs)
     53     else:
     54         return func(*args, **kwargs)

~/anaconda3/lib/python3.8/site-packages/pyfolio/tears.py in create_simple_tear_sheet(returns, positions, transactions, benchmark_rets, slippage, estimate_intraday, live_start_date, turnover_denom, header_rows)
    378     i += 1
    379
--> 380     plotting.plot_rolling_returns(returns,
    381                                   factor_returns=benchmark_rets,
    382                                   live_start_date=live_start_date,

~/anaconda3/lib/python3.8/site-packages/pyfolio/plotting.py in plot_rolling_returns(returns, factor_returns, live_start_date, logy, cone_std, legend_loc, volatility_match, cone_function, ax, **kwargs)
    805     oos_cum_returns = pd.Series([])
    806
--> 807     is_cum_returns.plot(lw=3, color='forestgreen', alpha=0.6,
    808                         label='Backtest', ax=ax, **kwargs)
    809

~/anaconda3/lib/python3.8/site-packages/pandas/plotting/_core.py in __call__(self, kind, ax, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, label, secondary_y, **kwds)
   2732                  yerr=None, xerr=None,
   2733                  label=None, secondary_y=False, **kwds):
-> 2734         return plot_series(self._data, kind=kind, ax=ax, figsize=figsize,
   2735                            use_index=use_index, title=title, grid=grid,
   2736                            legend=legend, style=style, logx=logx, logy=logy,

~/anaconda3/lib/python3.8/site-packages/pandas/plotting/_core.py in plot_series(data, kind, ax, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, label, secondary_y, **kwds)
   1992         ax = _gca()
   1993         ax = MPLPlot._get_ax_layer(ax)
-> 1994     return _plot(data, kind=kind, ax=ax,
   1995                  figsize=figsize, use_index=use_index, title=title,
   1996                  grid=grid, legend=legend,

~/anaconda3/lib/python3.8/site-packages/pandas/plotting/_core.py in _plot(data, x, y, subplots, ax, kind, **kwds)
   1802     plot_obj = klass(data, subplots=subplots, ax=ax, kind=kind, **kwds)
   1803
-> 1804     plot_obj.generate()
   1805     plot_obj.draw()
   1806     return plot_obj.result

~/anaconda3/lib/python3.8/site-packages/pandas/plotting/_core.py in generate(self)
    256     def generate(self):
    257         self._args_adjust()
--> 258         self._compute_plot_data()
    259         self._setup_subplots()
    260         self._make_plot()

~/anaconda3/lib/python3.8/site-packages/pandas/plotting/_core.py in _compute_plot_data(self)
    358         # with ``dtype == object``
    359         data = data._convert(datetime=True, timedelta=True)
--> 360         numeric_data = data.select_dtypes(include=[np.number,
    361                                                    "datetime",
    362                                                    "datetimetz",

~/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in select_dtypes(s
Hi Srinivas,
The resampled data was being used to place the trades, whereas the minute data was being used to filter which stocks satisfy the screener criteria. That said, thanks for pointing out the error in the code: the solution intended to use hourly candles for screening, but the minute data was being used instead. The same has been corrected on the portal.
As for the second error, the pyfolio one, I believe that is due to an incorrect package version, or a modified notebook code file on your system. You can refer to this blog for instructions to set up the Python environment on your local system.
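For reference, here is a minimal sketch of building hourly candles from 1-minute bars (synthetic data; the column names are assumed to match the notebook's OHLCV layout):

```python
import numpy as np
import pandas as pd

# Two hours of synthetic 1-minute bars for illustration
idx = pd.date_range("2018-01-02 09:30", periods=120, freq="1min")
rng = np.random.default_rng(0)
close = 29.7 + rng.normal(0, 0.02, 120).cumsum()
minute_data = pd.DataFrame({
    "Open": close, "High": close + 0.01,
    "Low": close - 0.01, "Close": close,
    "Volume": rng.integers(1000, 5000, 120),
}, index=idx)

# Standard OHLCV aggregation from 1-minute to hourly candles;
# origin="start" aligns the bins to the 09:30 session open
hourly = minute_data.resample("60min", origin="start").agg({
    "Open": "first", "High": "max",
    "Low": "min", "Close": "last",
    "Volume": "sum",
})
print(hourly.shape)  # (2, 5)
```

The screener can then be applied to `hourly` rather than to the raw minute bars.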
Hope this helps!
Thanks,
Gaurav
Hi Gaurav,
Thanks for the updated NB. I have downloaded it.
I am getting the pyfolio error on this one too. I see that the Python version now recommended is 3.9.5. I upgraded to this version; however, the same error still appears. I reviewed the blog you mentioned but did not find any major issues. Following are some of the package versions that I am using…
print('Versions:')
print(f'Numpy: {np.__version__}')
print(f'Pandas: {pd.__version__}')
print(f'Talib: {ta.__version__}')
print(f'PyFolio: {pf.__version__}')
Versions:
Numpy: 1.20.0
Pandas: 0.23.4
Talib: 0.4.20
PyFolio: 0.9.2

What am I missing? Below is the full error log (new) ....
Regards,
Srinivas
-------------
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-f2f833484768> in <module>
      1 # Pass the daily return to pyfolio
----> 2 pf.create_simple_tear_sheet(
      3     trade_returns['Portfolio_Returns'].resample('1D').sum())

~/anaconda3/lib/python3.8/site-packages/pyfolio/plotting.py in call_w_context(*args, **kwargs)
     50     if set_context:
     51         with plotting_context(), axes_style():
---> 52             return func(*args, **kwargs)
     53     else:
     54         return func(*args, **kwargs)

~/anaconda3/lib/python3.8/site-packages/pyfolio/tears.py in create_simple_tear_sheet(returns, positions, transactions, benchmark_rets, slippage, estimate_intraday, live_start_date, turnover_denom, header_rows)
    378     i += 1
    379
--> 380     plotting.plot_rolling_returns(returns,
    381                                   factor_returns=benchmark_rets,
    382                                   live_start_date=live_start_date,

~/anaconda3/lib/python3.8/site-packages/pyfolio/plotting.py in plot_rolling_returns(returns, factor_returns, live_start_date, logy, cone_std, legend_loc, volatility_match, cone_function, ax, **kwargs)
    805     oos_cum_returns = pd.Series()
    806
--> 807     is_cum_returns.plot(lw=3, color='forestgreen', alpha=0.6,
    808                         label='Backtest', ax=ax, **kwargs)
    809

~/anaconda3/lib/python3.8/site-packages/pandas/plotting/_core.py in __call__(self, kind, ax, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, label, secondary_y, **kwds)
   2732                  yerr=None, xerr=None,
   2733                  label=None, secondary_y=False, **kwds):
-> 2734         return plot_series(self._data, kind=kind, ax=ax, figsize=figsize,
   2735                            use_index=use_index, title=title, grid=grid,
   2736                            legend=legend, style=style, logx=logx, logy=logy,

~/anaconda3/lib/python3.8/site-packages/pandas/plotting/_core.py in plot_series(data, kind, ax, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, label, secondary_y, **kwds)
   1992         ax = _gca()
   1993         ax = MPLPlot._get_ax_layer(ax)
-> 1994     return _plot(data, kind=kind, ax=ax,
   1995                  figsize=figsize, use_index=use_index, title=title,
   1996                  grid=grid, legend=legend,

~/anaconda3/lib/python3.8/site-packages/pandas/plotting/_core.py in _plot(data, x, y, subplots, ax, kind, **kwds)
   1802     plot_obj = klass(data, subplots=subplots, ax=ax, kind=kind, **kwds)
   1803
-> 1804     plot_obj.generate()
   1805     plot_obj.draw()
   1806     return plot_obj.result

~/anaconda3/lib/python3.8/site-packages/pandas/plotting/_core.py in generate(self)
    256     def generate(self):
    257         self._args_adjust()
--> 258         self._compute_plot_data()
    259         self._setup_subplots()
    260         self._make_plot()

~/anaconda3/lib/python3.8/site-packages/pandas/plotting/_core.py in _compute_plot_data(self)
    358         # with ``dtype == object``
    359         data = data._convert(datetime=True, timedelta=True)
--> 360         numeric_data = data.select_dtypes(include=[np.number,
    361                                                    "dat
Hello Srinivas,
Your package versions seem incorrect. The requirements file has pandas==1.2.4, while you are on 0.23.4.
You can refer to this section of the blog. The requirement file is shared as a link in the same section (within the slides). Please set the environment as per the blog for the code to run properly.
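As a quick way to verify your environment, you could run something like the following (the expected-version dictionary here is illustrative; take the actual pins from the requirements file):

```python
import importlib

# Illustrative pins; replace with the versions from the course requirements file
expected = {"pandas": "1.2.4"}

for pkg, want in expected.items():
    # Import the package and read its reported version
    have = importlib.import_module(pkg).__version__
    status = "OK" if have == want else f"MISMATCH (need {want})"
    print(f"{pkg}: {have} {status}")
```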
Hope this helps!