Need to explain the where part of the variables

Vivek_Datar · May 28, 2026, 6:00am

Course Name: Introduction to Machine Learning for Trading, Section No: 6, Unit No: 7, Unit type: Document

Exploration Mechanism
The Reinforcement Learning problem can be explored as follows:
● State transition function P(X(t)|X(t-1),A(t))
● Observation (output) function P(Y(t) | X(t), A(t))
● Reward function E(R(t) | X(t), A(t))
● State transition function: S(t) = f (S(t-1), Y(t), R(t), A(t))
● Policy/output function: A(t) = pi(S(t))

Where P(X(t)) = ?
S(t) = ?
S(t-1) = ?
A(t) = ?
Y(t) = ?
R(t) = ?

Rekhit_Pachanekar · May 29, 2026, 10:00am

Hi,

To keep the explanation simple, we can define the symbols as follows:

X(t) = The state of the environment at time t
This is the actual, underlying state of the world at time step t. In trading, X(t) represents the complete state of the market (every order book level, every participant’s position, every piece of news). The agent usually cannot see all of this directly.
X(t-1) = The environment state at the previous time step. The environment evolves from X(t-1) to X(t) depending on the action A(t) the agent takes.
A(t) = The action taken by the agent at time t
This is what the agent decides to do. In trading, A(t) could be “buy 1 share,” “sell 2 shares,” or “hold.”
Y(t) = The observation the agent receives at time t
This is what the agent actually sees, which may be incomplete or noisy compared to the true state X(t). For example, you as a trader observe prices, volume, and a few indicators (Y(t)), but you don’t see every other trader’s intentions (X(t)).
R(t) = The reward received at time t
This is the feedback signal that tells the agent how well it is doing. In trading, it could be the PnL from the action taken, or whatever reward condition you have defined.
S(t) = The agent’s internal state at time t
This is the agent’s own summary or belief about what’s going on, built up from its history of observations, rewards, and actions.
S(t-1) = The agent’s internal state at the previous time step

P(…) = Probability distribution
The capital P means “probability of.” So P(X(t) | X(t-1), A(t)) reads as “the probability of the environment transitioning to state X(t), given that it was in state X(t-1) and the agent took action A(t).”
E(…) = Expected value
E(R(t) | X(t), A(t)) reads as “the expected reward at time t, given the environment state and the action taken.”
pi (π) = The policy
This is the agent’s strategy. A(t) = π(S(t)) means “the action is chosen by applying the policy function to the agent’s current internal state.”
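To make P(…), E(…), and π a bit more concrete, here is a minimal Python sketch. Everything in it is a made-up assumption for illustration: the two market regimes ("bull" and "bear"), the probabilities, the reward outcomes, and the names TRANSITION, REWARD_OUTCOMES, transition_prob, expected_reward, and policy. It is not taken from the course material.

```python
# Hypothetical two-regime market; all numbers below are invented for illustration.
STATES = ["bull", "bear"]
ACTIONS = ["buy", "hold", "sell"]

# P(X(t) | X(t-1), A(t)): one row per previous state.
# Simplifying assumption: the agent's action does not move the market,
# so the same row applies to every action.
TRANSITION = {
    "bull": {"bull": 0.9, "bear": 0.1},
    "bear": {"bull": 0.2, "bear": 0.8},
}

def transition_prob(x_next, x_prev, action):
    """Probability that the environment moves from x_prev to x_next."""
    return TRANSITION[x_prev][x_next]

# Possible rewards and their probabilities for each (state, action) pair,
# stored as a list of (reward, probability) tuples.
REWARD_OUTCOMES = {
    ("bull", "buy"):  [(2.0, 0.75), (-1.0, 0.25)],
    ("bull", "hold"): [(0.0, 1.0)],
    ("bull", "sell"): [(-2.0, 0.75), (1.0, 0.25)],
    ("bear", "buy"):  [(-2.0, 0.75), (1.0, 0.25)],
    ("bear", "hold"): [(0.0, 1.0)],
    ("bear", "sell"): [(2.0, 0.75), (-1.0, 0.25)],
}

def expected_reward(x, action):
    """E(R(t) | X(t), A(t)): probability-weighted average of the rewards."""
    return sum(r * p for r, p in REWARD_OUTCOMES[(x, action)])

def policy(s):
    """pi(S(t)): here S(t) is assumed to be a single number, for example
    the agent's running estimate of recent returns."""
    if s > 0:
        return "buy"
    if s < 0:
        return "sell"
    return "hold"

print(transition_prob("bear", "bull", "buy"))  # chance of bull -> bear
print(expected_reward("bull", "buy"))          # average reward for buying in a bull regime
print(policy(0.5))                             # action chosen for a positive estimate
```

Note that policy takes the agent's internal state S(t), not the true state X(t), because the agent never sees X(t) directly.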
How It All Fits Together
Here’s the flow in plain English:

The environment is in some true state X(t-1).
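The agent, based on its internal state S(t-1), picks an action A(t) using its policy π. The formula above writes this as A(t) = π(S(t)); in practice the policy uses the most recent internal state available when the action is chosen.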
The environment transitions to a new state X(t) based on X(t-1) and A(t).
The agent receives an observation Y(t) and a reward R(t) from the environment.
The agent updates its internal state to S(t) using Y(t), R(t), A(t), and its previous state S(t-1).
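The five steps above can be sketched as a small simulation loop. This is only an illustration under stated assumptions: ToyMarket, Agent, and run are hypothetical names, the regime-switching probability and return sizes are invented, and the internal state S(t) is assumed to be an exponential moving average of the observed returns. A real trading agent would use a richer state and a learned policy.

```python
import random

class ToyMarket:
    """Environment with a hidden regime X(t). The agent never reads self.x."""

    def __init__(self, rng):
        self.rng = rng
        self.x = "bull"  # X(0), an arbitrary starting regime

    def step(self, action):
        # X(t-1) -> X(t): the regime flips with probability 0.1
        # (assumed; this toy market ignores the agent's action).
        if self.rng.random() > 0.9:
            self.x = "bear" if self.x == "bull" else "bull"
        drift = 0.01 if self.x == "bull" else -0.01
        y = drift + self.rng.gauss(0.0, 0.02)  # Y(t): noisy one-step return
        position = {"buy": 1, "hold": 0, "sell": -1}[action]
        r = position * y                       # R(t): PnL of the position
        return y, r

class Agent:
    def __init__(self):
        self.s = 0.0  # S(0): internal state, an average of observed returns

    def policy(self):
        # A(t) from the latest internal state available, S(t-1)
        if self.s > 0:
            return "buy"
        if self.s < 0:
            return "sell"
        return "hold"

    def update(self, y, r, action):
        # S(t) = f(S(t-1), Y(t), R(t), A(t)); this simple f only uses Y(t)
        self.s = 0.9 * self.s + 0.1 * y

def run(steps, seed=0):
    rng = random.Random(seed)
    env, agent = ToyMarket(rng), Agent()
    history = []
    for t in range(1, steps + 1):
        a = agent.policy()     # agent picks A(t)
        y, r = env.step(a)     # environment moves to X(t), emits Y(t) and R(t)
        agent.update(y, r, a)  # agent updates S(t-1) to S(t)
        history.append((t, a, y, r, agent.s))
    return history

for t, a, y, r, s in run(5):
    print(t, a, round(y, 4), round(r, 4), round(s, 4))
```

The hidden regime self.x plays the role of X(t), while the agent only ever works with y, r, and its own s, which mirrors the gap between what the market is actually doing and what a trader can observe.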