Reinforcement Learning 101: Best Introduction for Beginners


Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.

I know this definition can be intimidating at first, so let me explain it with a real-world example:

  • Imagine putting your little brother in front of a video game he never played, giving him a controller, and leaving him alone.
  • Your brother will interact with the environment (the video game) by pressing the right button (action). He gets a coin: that’s a +1 reward. It’s positive, so he just understood that in this game he must collect coins.
  • But then he presses the right button again and touches an enemy. He just died, so that’s a -1 reward.

By interacting with his environment through trial and error, your little brother understands that he needs to get coins in this environment but avoid the enemies.

Without any supervision, the child will get better and better at playing the game.

That’s how humans and animals learn: through interaction. Reinforcement Learning is just a computational approach to learning from actions.

The reward hypothesis: the central idea of Reinforcement Learning

⇒ Why is the goal of the agent to maximize the expected return?

Because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (expected cumulative reward).

That’s why in Reinforcement Learning, to have the best behavior, we aim to learn to take actions that maximize the expected cumulative reward.

The RL Process

  • Our Agent receives state S0 from the Environment: we receive the first frame of our game (Environment).
  • Based on that state S0, the Agent takes action A0: our Agent will move to the right.
  • The environment goes to a new state S1: a new frame.
  • The environment gives some reward R1 to the Agent: we’re not dead (Positive Reward +1).

This RL loop outputs a sequence of state, action, reward and next state.
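
To make this loop concrete, here is a minimal sketch of how the interaction is usually written in code. It assumes a Gymnasium-style environment (the gymnasium package and the CartPole-v1 environment are just convenient stand-ins), and the agent here simply picks random actions:

    import gymnasium as gym   # assumed dependency; any Gym-style environment works

    env = gym.make("CartPole-v1")            # the Environment
    state, info = env.reset()                # the Agent receives the first state S0

    for t in range(100):
        action = env.action_space.sample()   # the Agent takes an action (random here)
        next_state, reward, terminated, truncated, info = env.step(action)
        # (state, action, reward, next_state) is one step of the RL loop
        state = next_state
        if terminated or truncated:          # the episode ended, so start a new one
            state, info = env.reset()

    env.close()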

Now it’s time to understand the common jargon used in the Reinforcement Learning world, so pay close attention to this part.

Observation/State Space

Observations/States are the information our agent gets from the environment. In the case of a video game, it can be a frame (a screenshot). In the case of a trading agent, it can be the value of a certain stock, etc.

There is a differentiation to make between observation and state, however:

  • State s: a complete description of the state of the world (there is no hidden information), which we get in a fully observed environment.
In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.
  • Observation o: a partial description of the state, which we get in a partially observed environment.
In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.


Action Space

The Action space is the set of all possible actions in an environment.

The actions can come from a discrete or continuous space:

  • Discrete space: the number of possible actions is finite.
In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).


  • Continuous space: the number of possible actions is infinite.
A self-driving car agent has an infinite number of possible actions since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°… (both kinds of space are sketched in code below).
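
In code, these two kinds of action spaces are often declared like the following sketch, which uses Gymnasium’s spaces module; the exact numbers and bounds are only illustrative:

    import numpy as np
    from gymnasium import spaces

    # Discrete space: a finite number of actions (e.g. 4 moves in Super Mario Bros)
    discrete_actions = spaces.Discrete(4)    # the actions are the integers 0, 1, 2, 3

    # Continuous space: infinitely many actions (e.g. a steering angle in degrees)
    continuous_actions = spaces.Box(low=np.array([-30.0]), high=np.array([30.0]))

    print(discrete_actions.sample())      # e.g. 2
    print(continuous_actions.sample())    # e.g. [21.1]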

Rewards and Discounting

The reward is fundamental in RL because it’s the only feedback for the agent. Thanks to it, our agent knows if the action taken was good or not.

The cumulative reward at each time step t equals the sum of all the rewards in the sequence, which we can write as:

R(τ) = r_{t+1} + r_{t+2} + r_{t+3} + …

However, in reality, we can’t just add them like that. The rewards that come sooner (at the beginning of the game) are more likely to happen since they are more predictable than the long-term future reward.

Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse’s goal is to eat the maximum amount of cheese before being eaten by the cat.

In this setup, it’s more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).

Consequently, the reward near the cat, even if it is bigger (more cheese), will be more discounted since we’re not really sure we’ll be able to eat it.

To discount the rewards, we proceed like this:

  1. We define a discount rate called gamma. It must be between 0 and 1. Most of the time between 0.95 and 0.99.
  • The larger the gamma, the smaller the discount. This means our agent cares more about the long-term reward.
  • On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).

  2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen (a small example of the computation follows below).
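
Putting the two steps together, the discounted cumulative reward is R(τ) = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + … Here is a tiny sketch of that computation, with made-up reward values:

    # Toy example of discounting, assuming gamma = 0.95 and invented rewards
    gamma = 0.95
    rewards = [1, 0, 0, 1, 10]   # rewards r_{t+1}, r_{t+2}, ... along one trajectory

    # Each reward is multiplied by gamma^k, so later rewards (like the cheese
    # near the cat) contribute less to the total.
    discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
    print(discounted_return)     # 1 + 0 + 0 + 0.857... + 8.145... ≈ 10.0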

Types of Tasks

A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuing.

Episodic task

In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States.

For instance, think about Super Mario Bros: an episode begins when a new Mario level is launched and ends when you’re killed or you reach the end of the level.


Continuing tasks

These are tasks that continue forever (no terminal state). In this case, the agent must learn how to choose the best actions and simultaneously interact with the environment.

For instance, consider an agent that does automated stock trading. For this task, there is no starting point or terminal state; the agent keeps running until we decide to stop it.

The Exploration/Exploitation trade-off

Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.

  • Exploration is exploring the environment by trying random actions in order to find more information about the environment.
  • Exploitation is exploiting known information to maximize the reward.

Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.

Let’s take an example:

In this game, our mouse can have an infinite amount of small cheese (+1 each). But at the top of the maze, there is a gigantic sum of cheese (+1000).

However, if we only focus on exploitation, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small.

But if our agent does a little bit of exploration, it can discover the big reward (the pile of big cheese).

This is what we call the exploration/exploitation trade-off. We need to balance how much we explore the environment and how much we exploit what we know about the environment.

Therefore, we must define a rule that helps to handle this trade-off. We’ll see the different ways to handle it in the future units.
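
As a small preview of one such rule, here is a sketch of epsilon-greedy, a very common way to balance the trade-off: with a small probability epsilon the agent explores (picks a random action), and otherwise it exploits what it already knows. The action values below are hypothetical:

    import random

    def epsilon_greedy(action_values, epsilon=0.1):
        """Explore with probability epsilon, otherwise exploit the best known action."""
        if random.random() < epsilon:
            return random.choice(list(action_values))       # exploration: random action
        return max(action_values, key=action_values.get)    # exploitation: best estimate

    # Hypothetical estimates of how much reward each action brings
    action_values = {"left": 0.2, "right": 1.0, "up": 0.5, "down": 0.1}
    print(epsilon_greedy(action_values))   # usually "right", occasionally a random action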

If it’s still confusing, think of a real problem: choosing a restaurant.

  • Exploitation: You go to the same restaurant that you know is good every day, and take the risk of missing another, better restaurant.
  • Exploration: You try restaurants you have never been to before, with the risk of a bad experience but the possible opportunity of a fantastic one.

Two main approaches for solving RL problems

Now that we learned the RL framework, how do we solve the RL problem?

In other words, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?

The Policy π: the agent’s brain

The Policy π is the brain of our Agent: it’s the function that tells us what action to take given the state we are in. So it defines the agent’s behavior at a given time.

This Policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes expected return when the agent acts according to it. We find this π* through training.
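
Written in symbols (reusing R(τ), the cumulative reward along a trajectory τ, from the discounting section), this goal is usually stated as:

    π* = argmax_π E_{τ ∼ π} [ R(τ) ]

i.e. the optimal policy is the one whose trajectories have the highest expected return.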

There are two approaches to train our agent to find this optimal policy π*:

  • Directly, by teaching the agent to learn which action to take, given the current state: Policy-Based Methods.
  • Indirectly, by teaching the agent to learn which state is more valuable, so it can take the action that leads to the more valuable states: Value-Based Methods.

Policy-Based Methods

In Policy-Based methods, we learn a policy function directly.

This function will define a mapping from each state to the best corresponding action. Alternatively, it could define a probability distribution over the set of possible actions at that state.

We have two types of policies:

  • Deterministic: a policy at a given state will always return the same action.
action = policy(state)
  • Stochastic: outputs a probability distribution over actions.
policy(action | state) = probability distribution over the set of actions given the current state
Given an initial state, our stochastic policy will output a probability distribution over the possible actions at that state (both types of policy are sketched in code below).
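
To make the two types concrete, here is a small sketch; the states, actions, and probabilities are invented purely for illustration:

    import random

    # Deterministic policy: the same state always maps to the same action
    deterministic_policy = {"s0": "right", "s1": "up"}
    action = deterministic_policy["s0"]                 # always "right"

    # Stochastic policy: each state maps to a probability distribution over actions
    stochastic_policy = {"s0": {"left": 0.1, "right": 0.7, "up": 0.2}}
    dist = stochastic_policy["s0"]
    action = random.choices(list(dist), weights=list(dist.values()), k=1)[0]  # sampled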

Value-based methods

In value-based methods, instead of learning a policy function, we learn a value function that maps a state to the expected value of being at that state.

The value of a state is the expected discounted return the agent can get if it starts in that state, and then acts according to our policy.

“Act according to our policy” just means that our policy is “going to the state with the highest value”.

Imagine a small maze where the value function has assigned a value to each possible state (tile).

Thanks to our value function, at each step our policy will select the reachable state with the biggest value: -7, then -6, then -5 (and so on) until it reaches the goal.
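
As a toy sketch of this idea, imagine the maze states laid out in a row, each with a made-up value roughly matching the -7, -6, -5 progression above; the policy simply moves to the reachable state with the highest value:

    # Hypothetical state values (roughly "negative number of steps to the goal")
    state_values = {"A": -7, "B": -6, "C": -5, "D": -4, "goal": 0}
    neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "goal"]}

    def greedy_policy(state):
        """Move to the neighboring state with the highest value."""
        return max(neighbors[state], key=lambda s: state_values[s])

    state = "A"
    while state != "goal":
        state = greedy_policy(state)
        print(state, state_values[state])   # B -6, C -5, D -4, goal 0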

I understand that some of this might have been a bit confusing, but I don’t want you to feel overwhelmed. Let’s clear things up together! Here’s a simple breakdown of everything we’ve talked about.

  • Reinforcement Learning is all about maximizing the cumulative reward/expected return by optimizing the Policy π (the agent’s brain).
  • You can also think of it like this: when training neural networks, our goal is to minimize the loss function by optimizing the weights; here, our goal is to maximize the expected return by optimizing the policy.
  • To optimize the policy, we have two options: policy-based methods, which optimize the policy directly, and value-based methods, which help optimize the policy indirectly.
  • Ultimately, all of reinforcement learning revolves around agents, states, actions, and rewards.

By now, I hope you’ve grasped the essence of reinforcement learning and even picked up on some of the jargon commonly used in this field. If you’re eager to dive deeper into the world of reinforcement learning, here are some resources to help you along the way 👇

Books:

These two are my personal favorite books: the first explains reinforcement learning theory, while the second teaches you to code the algorithms from scratch to solidify your knowledge.

An additional course you could take (not a book): the Deep Reinforcement Learning course by the Hugging Face community.

Note: This blog is highly inspired by the introduction chapter of the above course. If you liked this blog, I highly recommend taking that course.

YouTube Playlists:

Note: I haven’t watched many YouTube videos on reinforcement learning; these are the best playlists I’ve come across so far. You can also search for other tutorials yourself.

If you’ve made it to the end and enjoyed reading this blog, you can show your appreciation by giving it a round of applause. And of course, you can follow me for more blogs on reinforcement learning (or Deep Learning in general).
