Reinforcement Learning for Stock Trading Agent

Kai Jun Eer
4 min read · Sep 14, 2020

The use of reinforcement learning (RL) to build a stock trading agent has been a widely explored topic. Some argue that an RL agent would not be able to reliably make good decisions in the stock market, as the market is highly volatile in nature and depends on multiple external factors such as behavioural finance and market manipulation. With that in mind, RL in trading could only be classified as a semi-Markov Decision Process: the outcome does not depend solely on the previous state and your action, but also on the behaviour of other traders.

In this article, I would like to walk through a high-level implementation of a deep reinforcement learning trading agent. While the reliability of RL agents is very much debatable, I believe that carrying out an actual project helps us understand the core concepts of RL and its applications.

The core idea of reinforcement learning is the interaction between an agent and an environment. At each time step, the environment sends the current state to the agent. The agent decides its action based on the state given. The environment then provides a reward to the agent based on its action, to either encourage or discourage that particular action. The time step then advances and the cycle repeats.

Core concept of reinforcement learning
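To make that loop concrete, here is a minimal Python sketch of the cycle for a single stock. The toy price series, the placeholder random policy and the buy/hold/sell bookkeeping are my own assumptions for illustration, not the actual implementation:

```python
import random

# Hypothetical agent-environment loop for a single stock. The environment
# serves prices one day at a time and rewards the profit/loss realised by
# the chosen action.
prices = [100.0, 101.5, 99.8, 102.3, 103.1]   # toy price series
ACTIONS = ["buy", "hold", "sell"]

position_price = None              # entry price if a stock is currently held
for t in range(len(prices) - 1):
    state = (prices[t], position_price)         # a very simple state
    action = random.choice(ACTIONS)             # placeholder policy
    reward = 0.0
    if action == "buy" and position_price is None:
        position_price = prices[t]
    elif action == "sell" and position_price is not None:
        reward = prices[t] - position_price     # realised profit or loss
        position_price = None
    print(f"t={t} action={action} reward={reward:+.2f}")
```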

In the context of a trading agent, the environment will be the trading environment. At each time step, the environment passes the current state to the agent. Think about what would be a good representation of the current state. In my implementation, the state is the predicted future price, how long the stock has been held, and the profit/loss if I were to sell the stock today. Depending on what features are important for your trading agent's decisions, you could add those features to your state. For example, the current budget would be a good feature if you want the agent to manage the risk of your overall portfolio. However, in my case I wanted to experiment with a simple setting, so the trading agent only trades one stock and can only hold a fixed number of shares at any time.
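As an illustration only, here is one way such a three-feature state could be assembled; the function name `build_state` and its arguments are my own assumptions rather than the original code:

```python
import numpy as np

def build_state(predicted_price, days_held, entry_price, current_price):
    """Assemble the three-feature state described above.

    predicted_price : model forecast of the next price
    days_held       : how long the stock has been held (0 if not held)
    entry_price     : price paid when the stock was bought (0 if not held)
    current_price   : today's price, used for the unrealised profit/loss
    """
    unrealised_pnl = (current_price - entry_price) if days_held > 0 else 0.0
    return np.array([predicted_price, days_held, unrealised_pnl],
                    dtype=np.float32)

# Example: a stock bought at 100 and held for 3 days, now trading at 104,
# with the model forecasting 106 for tomorrow.
state = build_state(predicted_price=106.0, days_held=3,
                    entry_price=100.0, current_price=104.0)
print(state)   # [106.   3.   4.]
```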

My trading agent has a discrete action space, where the permitted actions are only buy, hold or sell. I utilised a neural network to estimate the action values at each state. The trading agent then chooses its action using the epsilon-greedy policy (i.e. choose the action with the highest action value most of the time, except for a small probability of picking a random action to encourage exploration). The reward returned by the environment is the profit earned on that particular day based on the action chosen. If a loss is made, a negative reward is returned.
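A minimal sketch of epsilon-greedy selection over the three discrete actions might look like this (the default epsilon of 0.1 is an assumed value, not one taken from the original implementation):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index given the predicted action values.

    With probability epsilon a random action is chosen (exploration);
    otherwise the action with the highest predicted value is taken.
    """
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore
    return int(np.argmax(q_values))               # exploit

# Example: the network currently rates "hold" (index 1) highest.
q = np.array([0.2, 0.9, -0.1])    # values for [buy, hold, sell]
action = epsilon_greedy(q, epsilon=0.1)
```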

I utilised an LSTM model for future stock price prediction. The prediction, together with the other features, is fed into the neural network as the state. The neural network then predicts the action value for each action.
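The article does not specify the exact architectures, so the following Keras sketch is only an assumption: a small LSTM that maps a 30-day window of prices to the next price, and a small fully connected Q-network that maps the three-feature state to one value per action:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

LOOKBACK = 30   # assumed window of past closing prices fed to the LSTM

# Price-prediction model: a window of past prices -> next price.
price_model = Sequential([
    LSTM(32, input_shape=(LOOKBACK, 1)),
    Dense(1),
])
price_model.compile(optimizer="adam", loss="mse")

# Q-network: the 3-feature state -> one value per action (buy, hold, sell).
q_network = Sequential([
    Dense(32, activation="relu", input_shape=(3,)),
    Dense(32, activation="relu"),
    Dense(3),
])
q_network.compile(optimizer="adam", loss="mse")
```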

At each time step, the agent needs to update its neural network weights to better predict the action values in subsequent time steps. In my implementation, I utilised the deep Q-learning algorithm to update the weights.

Update neural network architecture
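Building on the Q-network sketched above, a single deep Q-learning update for one transition could look like the following; the discount factor of 0.95 and the function names are assumptions for illustration:

```python
import numpy as np

GAMMA = 0.95   # discount factor (assumed value)

def q_learning_target(reward, next_state, done, q_network):
    """Bellman target used to update the Q-network for one transition."""
    if done:
        return reward
    next_q = q_network.predict(next_state[None, :], verbose=0)[0]
    return reward + GAMMA * np.max(next_q)

def train_on_transition(q_network, state, action, reward, next_state, done):
    """One deep Q-learning step: nudge the predicted value of the taken
    action towards the Bellman target, leaving the other actions unchanged."""
    target_q = q_network.predict(state[None, :], verbose=0)[0]
    target_q[action] = q_learning_target(reward, next_state, done, q_network)
    q_network.fit(state[None, :], target_q[None, :], epochs=1, verbose=0)
```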

To train the agent, simply run it over the data for many time steps, then repeat. However, one risk of using the deep Q-learning algorithm is that there is no guarantee the neural network will converge. Since the training targets are themselves predicted action values rather than actual values, the model might fail to settle into a good local minimum, so weight initialisation plays an important role here.
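Tying the earlier sketches together, a training loop over multiple episodes might look like this. The `env` object is a hypothetical trading environment with a gym-style `reset`/`step` interface, and the episode count and epsilon decay schedule are assumed values:

```python
EPISODES = 50           # assumed number of passes over the price history
EPSILON_DECAY = 0.995   # assumed decay schedule for exploration
epsilon = 1.0

for episode in range(EPISODES):
    state = env.reset()                  # hypothetical trading environment
    done = False
    total_profit = 0.0
    while not done:
        q_values = q_network.predict(state[None, :], verbose=0)[0]
        action = epsilon_greedy(q_values, epsilon)
        next_state, reward, done = env.step(action)
        train_on_transition(q_network, state, action, reward, next_state, done)
        total_profit += reward
        state = next_state
    epsilon = max(0.01, epsilon * EPSILON_DECAY)   # explore less over time
    print(f"episode {episode}: profit {total_profit:+.2f}, epsilon {epsilon:.2f}")
```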

This article is only a high-level overview of building a reinforcement learning trading agent. I hope it helps spark your own thoughts and ideas to make it better. There is indeed much room for improvement; for example, the action space could be made continuous. In that case, the trading agent would not merely carry out basic actions such as buy, hold or sell, but could also determine the amount of stock to buy or sell. In an even more advanced implementation, the trading agent would also choose which stock to buy or sell and manage the portfolio on its own. However, with a continuous action space, different algorithms such as Proximal Policy Optimisation (PPO) or Actor-Critic methods will be needed.

Do let me know your thoughts on this article. Feel free to leave a comment. Happy exploring!

