RL question

Hi everyone!

Recently I have been working with an A2C agent, but I have a couple of questions that I can't find an answer to. I hope someone can give me a hand here.



The way I trained the agent was by giving it a training dataset and running it for a lot of epochs. At the end of every episode, the agent's learn function was executed.



When the agent is done with all the epochs and (hopefully) finally converges, how should I test it on validation data? The way I think it should be done is by running the agent for only 1 epoch and seeing the results, but should I execute the agent's learn function on every candle/trade that the agent makes?

Like, every time the state changes, should the learning function be executed, or only at the end of the validation dataset? Should this be the same behaviour when live/demo trading?



Also, has anyone had any experience with RL in trading? Especially with A2C or DDPG? I would appreciate any feedback, comments, advice and knowledge anyone could share.



Thank you! :slight_smile:

Hello Mario,



I will address your queries sequentially.



"The way I trained the agent was by giving the agent a train dataset and run it for a lot of epochs. At the end of every episode, the agent's learn function was executed." - So, training data must be the prices, right? What was your reward function?



"When the agent is done with all the epochs and (hopefully) finally converges, how should I test it in validation data?" - so in a live situation, there will only one run on the online live data. Testing, however, can be done on multiple sets to help generate summary statistics for the reward, accuracy, profitability etc any sort of metric you're using to evaluate the agent on.

 

 "The way I think it should be is by running the agent for only 1 epoch and see the results, but should I execute the agent's learn function on every candle/trade that the agent makes?" - This will depend on how you define the MDPs state. For example, if you're using multiple stocks the state will be a vector of all the stock prices at a given timestep.



"Like, everytime the state changes the learning function should be executed or just at the end of the validation dataset? Should this be the same behaviour when live/demo trading?" - So, there are two things that will happen. One is the agent training where the best action to be taken for each state will be learnt using the reward function. If an action is more rewarding for a particular state the probability of the agent taking the action will increase over other actions. That is the learning in RL and this framework is called the Markov Decision Process.



The other thing is that when the agent has learnt the best action for each state, you can validate it by, say, paper trading it on live data.



"Also, has anyone had any experience with RL in trading? Especially with A2C or DDPG?" - Yes, please go ahead if you have any more questions.

Hi Akshay, thank you for your answers.

So, training data must be the prices, right? What was your reward function?

The training data wasn't exactly prices. I gave the agent a couple of indicators, because I think that if I give the agent, for example, just 2 SMAs, it will learn to trade the SMA crossover in some way. Please correct me if I'm wrong or my approach is not correct. That said, raw prices are not an input.

The reward function was the difference between the value of a state before taking an action and the value of the state after taking an action:
      
    reward = prev_val - current_val,

where prev_val combines the floating PnL in pips, the total accumulated pips based only on closed positions, and the total number of closed trades. These components are multiplied by factors to give more emphasis to some parts of the reward (for example, prev_val = acum_pips*1.5 + total_pos_open*0.001 + floating_pnl_pips). The same applies to current_val, but the values of total pips, total trades and floating PnL change with each action the agent takes. Once again, please correct me if I'm wrong or my approach is not correct.
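
For concreteness, a minimal sketch of how that reward could be computed, following my own reading of the description above; the variable names mirror the example formula and the numbers are made up:

    # Sketch of the reward described above; the names (acum_pips, total_pos_open,
    # floating_pnl_pips) follow the example formula and are only illustrative.
    def state_value(acum_pips, total_pos_open, floating_pnl_pips):
        # Weighted combination of accumulated pips, trade count and floating PnL
        return acum_pips * 1.5 + total_pos_open * 0.001 + floating_pnl_pips

    # Value of the state before vs. after the action (made-up numbers)
    prev_val = state_value(acum_pips=120.0, total_pos_open=1, floating_pnl_pips=-5.0)
    current_val = state_value(acum_pips=132.0, total_pos_open=0, floating_pnl_pips=0.0)

    # Reward exactly as defined in the post: value before minus value after
    reward = prev_val - current_val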

So in a live situation there will be only one run on the online live data. Testing, however, can be done on multiple sets to help generate summary statistics for the reward, accuracy, profitability, etc., or any other metric you're using to evaluate the agent. This will depend on how you define the MDP's state. For example, if you're using multiple stocks, the state will be a vector of all the stock prices at a given timestep.


Exactly. That's the reason I have my doubts about how to correctly execute the agent's learn function. For example, I'm trying to do this only for EURUSD. The reason is that I figure I should first learn to do it on 1 instrument, and the rest should be similar (although I have already coded a multi-stock environment).

The way I defined (or how I imagine) a state is as follows:
- the last 10 values of the indicators for the respective candles (features calculated from prices, basically)
- number of open buy trades (always limited to 1)
- number of open sell trades (always limited to 1)
- total cash in hand at that candle (that is, the cash produced by all closed trades), floating PnL (produced by currently open trades) and total closed trades. In this case these are not multiplied by anything.

Each candle the state is different because the values of the features change and, if there is any open trade, the observed PnL changes as well. There are also 6 possible actions (a code sketch of the state and action layout follows this list):
- buy
- sell
- close buy
- close sell
- close both
- do nothing
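
A sketch of how that state vector and action space could be laid out in code; the function and argument names are illustrative assumptions, not my actual implementation:

    from enum import IntEnum
    import numpy as np

    class Action(IntEnum):
        BUY = 0
        SELL = 1
        CLOSE_BUY = 2
        CLOSE_SELL = 3
        CLOSE_BOTH = 4
        DO_NOTHING = 5

    def build_state(indicator_window, open_buys, open_sells,
                    cash, floating_pnl_pips, total_closed_trades):
        """Flatten the last 10 candles of indicator values plus the account
        information into a single observation vector."""
        return np.concatenate([
            np.asarray(indicator_window, dtype=np.float32).ravel(),   # 10 candles x n_indicators
            [open_buys, open_sells],                                   # each limited to 1
            [cash, floating_pnl_pips, total_closed_trades],            # unscaled account features
        ]).astype(np.float32)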

So, that's exactly what a human trader would see on a chart, right? That's the idea behind doing all this with an A2C agent. Please, if I'm wrong or my approach is not correct, let me know!
 

So, there are two things that will happen. One is the agent training, where the best action to be taken for each state will be learnt using the reward function. If an action is more rewarding for a particular state, the probability of the agent taking that action will increase over other actions. That is the learning in RL, and this framework is called the Markov Decision Process. The other thing is that when the agent has learnt the best action for each state, you can validate it by, say, paper trading it on live data.


Perfect, I understand this. But based on how the state is defined, which would be the correct approach? (A sketch of what the first two options look like in code follows the list below.)

1) In the training data, execute the agent's learn function at the end of each episode. Then, in validation/demo/live trading, execute the learn function each time the state changes (every state change means an action has been performed).
2) In the training data, execute the learn function each time the state changes. Then do the same in validation/demo/live trading.
3) In the training data, execute the learn function every time the memory buffer is full. Do the same in validation/demo/live trading. If this is the case, what would be a suitable memory/batch size, considering a window of the last 10 candles and 1000-tick candles?

(continuation of last post)

4) In the training data, execute the learn function at the end of each episode. Then, in validation/demo/live trading, execute the learn function every time the memory buffer is full.

5) Any other approach for how training/validation should be done.
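
To make options 1 and 2 concrete, here is a sketch of the two loop structures, again assuming placeholder names (agent with choose_action/remember/learn methods, TradingEnv, train_data, n_epochs); in both cases the validation pass is usually a single run without calling learn, so the measured performance reflects the frozen policy:

    def run_episode(agent, env, learn_every_step=False):
        state = env.reset()
        done = False
        while not done:
            action = agent.choose_action(state)
            next_state, reward, done, info = env.step(action)
            agent.remember(state, action, reward, next_state, done)
            if learn_every_step:
                agent.learn()          # option 2: update on every state change
            state = next_state
        if not learn_every_step:
            agent.learn()              # option 1: one update at the end of the episode

    # Training: many passes (epochs) over the training data
    for epoch in range(n_epochs):
        run_episode(agent, TradingEnv(train_data), learn_every_step=False)

    # Validation/demo: typically one pass with no learn() call at all,
    # e.g. reusing the evaluate() helper from the earlier sketch.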

Yes, please go ahead if you have any more questions.



Last but not least, if I'm not mistaken you said that you have already done RL for trading? What was the performance? Was it better than a regular strategy/neural network? Was it A2C or DDPG? Could you give me some tips and/or tell me about your mistakes and the things I should focus on more? Was it with neural networks or other types of ML?

I'm curious to learn all this because training is all good, but I'm having a hard time finding an example that was actually used for trading: how to handle the environment when it comes to live trading, how to handle the agent, etc. If you have any example that could be shared, I would be extremely grateful. I'm not looking for an agent that you currently use; I understand the hard work that this involves. But maybe the first agent you used for this, which you don't use anymore? I want to get one good example of how to take this from the lab to a trading environment.

Thank you in advance for your answers!!
 

Hello Mario,



We will get a call scheduled for this query. Please acknowledge.

Hello Akshay,



Sure! No problem :).



How do we schedule a call? Shall I give you my skype or something?



Thank you :slight_smile:

 

Hello Mario,



We have scheduled a call. Please acknowledge.



Thanks. 

Hello Mario,



If you want to schedule a call, please send your Skype or Hangouts ID to quantra@quantinsti.com.