@JoaoLages
Last active January 14, 2025 12:14
Reinforcement Learning from Human Feedback (RLHF) - a simplified explanation

Maybe you've heard about this technique but you haven't completely understood it, especially the PPO part. This explanation might help.

We will focus on text-to-text language models 📝, such as GPT-3, BLOOM, and T5. Models like BERT, which are encoder-only, are not addressed.

Reinforcement Learning from Human Feedback (RLHF) has been successfully applied in ChatGPT, hence its major increase in popularity. 📈

RLHF is especially useful in two scenarios 🌟:

  • You can’t create a good loss function
    • Example: how do you calculate a metric to measure if the model’s output was funny?
  • You want to train with production data, but you can’t easily label your production data
    • Example: how do you get labeled production data from ChatGPT? Someone needs to write the correct answer that ChatGPT should have answered

RLHF algorithm ⚙️:

  1. Pretraining a language model (LM)
  2. Training a reward model
  3. Fine-tuning the LM with RL

1 - Pretraining a language model (LM)

In this step, you need to either train one language model from scratch or just use a pretrained one like GPT-3.

Once you have that pretrained language model, you can also do an extra optional step, called Supervised Fine-Tuning (SFT). This is nothing more than getting some human-labeled (input, output) text pairs and fine-tuning the language model you have. SFT is considered a high-quality initialization for RLHF.

At the end of this step, we end up with our trained LM which is our main model, and the one we want to train further with RLHF.
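
To make the SFT objective a bit more concrete, here is a minimal sketch of the loss for one (input, output) pair: the average negative log-likelihood of the human-written output tokens. The tiny vocabulary, the `toy_next_token_probs` lookup and all its numbers are hypothetical stand-ins for a real LM forward pass; a real setup would compute this with a deep-learning framework and minimize it with gradient descent.

```python
import math

# Hypothetical 4-token vocabulary; a real LM has tens of thousands of tokens.
VOCAB = ["<eos>", "hello", "world", "!"]

def toy_next_token_probs(context):
    """Stand-in for an LM forward pass: returns P(next token | context).

    Assumption: a fixed lookup keyed on the last token, purely for illustration.
    """
    table = {
        "hello": [0.05, 0.05, 0.80, 0.10],
        "world": [0.10, 0.05, 0.05, 0.80],
        "!":     [0.85, 0.05, 0.05, 0.05],
    }
    return table.get(context[-1], [0.25, 0.25, 0.25, 0.25])

def sft_loss(input_tokens, target_tokens):
    """Average negative log-likelihood of the human-written target tokens."""
    context = list(input_tokens)
    nll = 0.0
    for token in target_tokens:
        probs = toy_next_token_probs(context)
        nll -= math.log(probs[VOCAB.index(token)])
        context.append(token)  # teacher forcing: feed the true token back in
    return nll / len(target_tokens)

loss = sft_loss(["hello"], ["world", "!", "<eos>"])
print(round(loss, 4))
```

The lower this number, the more probability the model puts on the labeled output.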

image

Figure 1: Our pretrained language model.

2 - Training a reward model

In this step, we are interested in collecting a dataset of (input text, output text, reward) triplets.

In Figure 2, there's a representation of the data collection pipeline: take input text data (production data is better, if you have it), pass it through your model, and have a human assign a reward to the generated output text.

image

Figure 2: Pipeline to collect data for reward model training.

The reward is usually an integer between 0 and 5, but it can be a simple 0/1 in a 👍/👎 experience.

image

Figure 3: Simple 👍/👎 reward collection in ChatGPT.

image

Figure 4: A more complete reward collection experience: the model outputs two texts and the human has to choose which one was better, and also give an overall rating with comments.

With this new dataset, we will train another language model to receive the (input, output) text and return a reward scalar! This will be our reward model.

The main objective here is to use the reward model to mimic the human's reward labeling and therefore be able to do RLHF training offline, without the human in the loop.
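
As a rough sketch of what "training a reward model" means, the snippet below fits a model on (input text, output text, reward) triplets so that it predicts the human reward. Everything here is an assumption made for illustration: the four triplets are made up, and a linear model over bag-of-words counts replaces the language model that you would use in practice.

```python
# Hypothetical (input text, output text, human reward) triplets.
DATA = [
    ("tell a joke", "why did the chicken cross the road", 1.0),
    ("tell a joke", "i do not know any jokes", 0.0),
    ("say hi", "hello there friend", 1.0),
    ("say hi", "asdf qwer zxcv", 0.0),
]

def featurize(input_text, output_text, vocab):
    """Bag-of-words counts over the concatenated (input, output) text."""
    words = (input_text + " " + output_text).split()
    return [float(words.count(w)) for w in vocab]

def predict(weights, bias, features):
    """Linear model: one scalar reward per (input, output) pair."""
    return bias + sum(w * x for w, x in zip(weights, features))

def train_reward_model(data, lr=0.05, epochs=500):
    """Fit the linear model with SGD on the squared error to the human reward."""
    vocab = sorted({w for inp, out, _ in data for w in (inp + " " + out).split()})
    weights, bias = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for inp, out, reward in data:
            x = featurize(inp, out, vocab)
            err = predict(weights, bias, x) - reward
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return vocab, weights, bias

vocab, weights, bias = train_reward_model(DATA)

def reward_model(input_text, output_text):
    """The trained model now stands in for the human labeler."""
    return predict(weights, bias, featurize(input_text, output_text, vocab))

good = reward_model("tell a joke", "why did the chicken cross the road")
bad = reward_model("tell a joke", "i do not know any jokes")
print(good, bad)
```

Once trained, `reward_model` can score new (input, output) pairs without a human in the loop, which is the whole point of this step.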

image

Figure 5: The trained reward model, that will mimic the rewards given by humans.

3 - Fine-tuning the LM with RL

It's in this step that the magic really happens and RL comes into play.

The objective of this step is to use the rewards given by the reward model to train the main model, your trained LM. However, since the reward will not be differentiable, we will need to use RL to be able to construct a loss that we can backpropagate to the LM.

image

Figure 6: Fine-tuning the main LM using the reward model and the PPO loss calculation.

At the beginning of the pipeline, we will make an exact copy of our LM and freeze its trainable weights. This copy of the model will help to prevent the trainable LM from completely changing its weights and starting to output gibberish text to fool the reward model.

That is why we calculate the KL divergence loss between text output probabilities of both the frozen and non-frozen LM.

This KL loss is subtracted from the reward that is produced by the reward model, so that a larger divergence from the frozen LM lowers the total reward. Actually, if you are training your model while in production (online learning), you can replace this reward model with the human reward score directly. 💡
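
Here is a small sketch of how the KL term and the reward could be combined, assuming the KL acts as a penalty (R = reward - beta * KL). The probability tables, the reward score and the weight `beta` are all hypothetical numbers chosen for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions at two generation steps.
trainable_probs = [[0.70, 0.20, 0.10], [0.10, 0.60, 0.30]]
frozen_probs = [[0.50, 0.30, 0.20], [0.10, 0.50, 0.40]]

# Average the per-step KL between the trainable LM and the frozen copy.
kl = sum(
    kl_divergence(p, q) for p, q in zip(trainable_probs, frozen_probs)
) / len(trainable_probs)

reward_model_score = 0.9  # hypothetical scalar from the reward model
beta = 0.2                # hypothetical weight of the KL penalty

# Assumption: the KL term is a penalty, so it is subtracted from the reward.
total_reward = reward_model_score - beta * kl
print(kl, total_reward)
```

If the trainable LM stays identical to the frozen copy, the KL is 0 and the total reward is just the reward model's score.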

Having your reward and KL loss, we can now apply RL to make the reward loss differentiable.

Why isn't the reward differentiable? Because it was calculated with a reward model that received text as input. This text is obtained by decoding the output log probabilities of the LM. This decoding process is non-differentiable.
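
A tiny experiment can illustrate this. In the sketch below, a hypothetical `toy_reward` only sees the decoded token (greedy decoding is assumed), so nudging the logits a little does not change the reward at all, and a finite-difference "gradient" comes out as zero everywhere.

```python
def decode_greedy(logits):
    """Pick the index of the largest logit (argmax), as greedy decoding does."""
    return max(range(len(logits)), key=lambda i: logits[i])

def toy_reward(token_id):
    """Hypothetical reward model: it only sees the decoded token, not the logits."""
    return 1.0 if token_id == 2 else 0.0

logits = [0.1, 0.5, 2.0]
eps = 1e-4

base = toy_reward(decode_greedy(logits))

# Finite-difference "gradient" of the reward with respect to each logit.
grads = []
for i in range(len(logits)):
    bumped = list(logits)
    bumped[i] += eps
    grads.append((toy_reward(decode_greedy(bumped)) - base) / eps)

print(grads)
```

The reward is piecewise constant in the logits: it only jumps when the decoded token changes, so it carries no useful gradient signal by itself.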

To make the loss differentiable, finally Proximal Policy Optimization (PPO) comes into play! Let's zoom in.

image

Figure 7: Zoom-in on the RL Update box - PPO loss calculation.

The PPO algorithm calculates a loss (that will be used to make a small update on the LM) like this:

  1. Make "Initial probs" equal to "New probs" to initialize.
  2. Calculate a ratio between the new and initial output text probabilities.
  3. Calculate the loss with the formula loss = -min(ratio * R, clip(ratio, 0.8, 1.2) * R), where R is the reward - KL (or a weighted combination like 0.8 * reward - 0.2 * KL) previously computed, and clip(ratio, 0.8, 1.2) just bounds the ratio so that 0.8 <= ratio <= 1.2. Note that 0.8/1.2 are just commonly used hyperparameter values, simplified here. Also note that we want to maximize the reward; that's why we add the minus sign, so that minimizing the loss with gradient descent maximizes the reward.
  4. Update the weights of the LM by backpropagating the loss.
  5. Calculate the "New probs" (i.e., new output text probabilities) with the newly updated LM.
  6. Repeat from step 2 up to N times (usually, N=4).
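
The loss formula from step 3 can be sketched in a few lines. This is only an illustration of the formula, not a training loop: instead of updating a real LM, it evaluates the loss for some hypothetical "new probs" to show what the clipping does. The probabilities and R are made-up values.

```python
def clip(x, low, high):
    """Bound x to the interval [low, high]."""
    return max(low, min(x, high))

def ppo_loss(new_probs, initial_probs, R, low=0.8, high=1.2):
    """loss = -min(ratio * R, clip(ratio) * R), averaged over token steps."""
    losses = []
    for new_p, init_p in zip(new_probs, initial_probs):
        ratio = new_p / init_p
        losses.append(-min(ratio * R, clip(ratio, low, high) * R))
    return sum(losses) / len(losses)

# Hypothetical probabilities of the generated tokens under the initial LM.
initial_probs = [0.30, 0.50, 0.20]
R = 1.0

# Step 1: "New probs" equal "Initial probs", so every ratio is 1.
loss_start = ppo_loss(initial_probs, initial_probs, R)

# After a small update: every ratio is about 1.1, inside the clip range.
loss_small = ppo_loss([0.33, 0.55, 0.22], initial_probs, R)

# After a huge update: every ratio is 2.0, but the clip caps it at 1.2.
loss_large = ppo_loss([0.60, 1.00, 0.40], initial_probs, R)

print(loss_start, loss_small, loss_large)
```

With a positive R, pushing the ratio beyond 1.2 no longer lowers the loss, so there is no incentive to move further in a single round of updates.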

That's it, this is how you use RLHF in text-to-text language models!

Things can get more complicated because there are also other losses that you can add to this base loss that I presented, but this is the core implementation.

@JoaoLages
Author

I thought that the target output text would be the example we've got the reward from. In this case - what are the probability ratios? For which tokens?

The LLM produces some initial distribution of token log probabilities. This distribution is decoded to text that is given to the reward model. This reward model produces a single score for the input text. The reward is a float number.
In the first step of PPO a loss is calculated with that reward and the KL loss. This loss is used to update the LLM weights. The LLM then produces a slightly different distribution of token log probabilities for the same input text. We compare this new distribution with the initial one, that is our ratio. Then we calculate the loss again with PPO and repeat the process N times.

@smartparrot

When training with PPO, the prompt sentence (e.g., a sentence given as input to GPT) is the input state to the GPT policy, but what is the one-step action? Is the action the whole output sentence produced by GPT, or just one token output by the model?
Thanks

@JoaoLages
Author

JoaoLages commented Feb 20, 2023

When training with PPO, the prompt sentence (e.g., a sentence given as input to GPT) is the input state to the GPT policy, but what is the one-step action? Is the action the whole output sentence produced by GPT, or just one token output by the model? Thanks

I'm not sure about this one, but I think that:

  • The model is the agent
  • The reward model is the environment (or the human labeler if there is no reward model)
  • The actions are the output texts
  • The state is the weights of the model (not really sure about this one)

@OS-bartmatejczyk

Nice article. Please try to avoid abbreviations like probs :p

@Johnrobmiller

This explanation works for me 👍

@pacman100

Thank you! Easy to understand and concise explanation 😄

@aburkov

aburkov commented Mar 25, 2023

Thanks for this! One question: ratio = new probs/initial probs. What exactly does that mean? Is "new probs" one value or many? If one, how is it obtained? If many, how is the division "new probs/initial probs" calculated?

@JoaoLages
Author

JoaoLages commented Mar 27, 2023

Thanks for this! One question: ratio = new probs/initial probs. What exactly does that mean? Is "new probs" one value or many?

"new probs" = new probabilities obtained from the model after model update, passing it the same text as input. This is a sequence of probabilities (one probability per generated token, which corresponds to the probability of the target token of that step - we don't care about the other vocabulary probabilities).

If many, how is the division "new probs/initial probs" calculated?

The loss formula is calculated per step (per step we have a single probability number) and then we calculate the average for all steps.
If this wasn't clear, I suggest looking at how this pg_loss is calculated in the TRL library. Code never lies! 😄
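
To make the "one probability per generated token" idea concrete, here is a minimal sketch with made-up numbers. It keeps only the probability of the token that was actually generated at each step and then builds one ratio per step. Working with log probabilities and exponentiating the difference is an implementation choice assumed here; dividing the raw probabilities gives the same result.

```python
import math

# Hypothetical per-step distributions over a 3-token vocabulary,
# before and after one weight update, for the same generated text.
initial_dists = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]
new_dists = [[0.1, 0.6, 0.3], [0.5, 0.4, 0.1]]
generated_ids = [1, 0]  # token actually generated at each step

def token_logprobs(dists, token_ids):
    """Keep only the log probability of the generated token at each step."""
    return [math.log(d[t]) for d, t in zip(dists, token_ids)]

initial_lp = token_logprobs(initial_dists, generated_ids)
new_lp = token_logprobs(new_dists, generated_ids)

# One ratio per token step: exp(new log prob - initial log prob).
ratios = [math.exp(n - i) for n, i in zip(new_lp, initial_lp)]
print(ratios)
```

Each of these ratios then goes through the loss formula on its own, and the per-step losses are averaged.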

@aburkov

aburkov commented Mar 28, 2023

The loss formula is calculated per step (per step we have a single probability number) and then we calculate the average for all steps.
If this wasn't clear, I suggest looking at how this pg_loss is calculated in the TRL library. Code never lies!

Thanks! So, looking at the code, the reward from the last step is used to multiply the ratio at each step, and then the obtained values are averaged to get the loss. Am I correct?

@JoaoLages
Author

The loss formula is calculated per step (per step we have a single probability number) and then we calculate the average for all steps.
If this wasn't clear, I suggest looking at how this pg_loss is calculated in the TRL library. Code never lies!

Thanks! So, looking at the code, the reward from the last step is used to multiply the ratio at each step, and then the obtained values are averaged to get the loss. Am I correct?

Yes, the reward multiplies by the ratio at each step, as hinted in Figure 7.
I abstracted the complexity of thinking in a "list of ratios" (one ratio per token step), but in the end, this is how an autoregressive LM works: it does multiple forwards to obtain 1 token at a time and there is always an associated loss for that step. To train it you average the losses obtained in each step.

@aburkov

aburkov commented Mar 29, 2023

Got it. If I understand correctly, PPO doesn't have proof of optimality or convergence. It was only evaluated empirically, right?

@JoaoLages
Author

Got it. If I understand correctly, PPO doesn't have proof of optimality or convergence. It was only evaluated empirically, right?

Very true! I think that other techniques will come ahead soon. I recently read this paper, you may be interested in it!

@aburkov

aburkov commented Apr 1, 2023

Got it. If I understand correctly, PPO doesn't have proof of optimality or convergence. It was only evaluated empirically, right?

Very true! I think that other techniques will come ahead soon. I recently read this paper, you may be interested in it!

Nice, thanks!

@Nevermore12138

Thanks for this nice explanation! So there is no sequential decision-making in this RL training procedure, and no traditional value function learning process, which is a basic paradigm in standard RL. I think the PPO objective used here resembles a zeroth-order optimization method and has nothing to do with RL.

@Yujun-Qian

Thanks for this article, it does help me a lot.

In the formula: ratio = new probs/initial probs, my understanding is that "new probs" is function of θ, and "initial probs" is regarded as a constant, do I get it right?

and in Figure 7, it says the ratio "estimates divergence between new and old policy", but it seems to me the loss function would encourage the new policy to move further away from the old policy, (unlike the calculation of reward, where the divergence is subtracted from the original reward), so the divergence in the ratio is not intended as a penalty?

@microcoder-py

Can you explain how the probs ratios are calculated?

@JoaoLages
Author

Can you explain how the probs ratios are calculated?

I suggest looking into the code to better understand what is going on. The language model calculates text probabilities per step, and for each step, you are able to calculate a loss. In the end, you have a single dimension vector with all the losses per step - you can do this for the model before and the model after the PPO weights update, and then just divide both vectors.

@JoaoLages
Author

Thanks for this article, it does help me a lot.

Thanks! 🙏

In the formula: ratio = new probs/initial probs, my understanding is that "new probs" is function of θ, and "initial probs" is regarded as a constant, do I get it right?

Yes, I think it is fair to put it like that 👍

and in Figure 7, it says the ratio "estimates divergence between new and old policy", but it seems to me the loss function would encourage the new policy to move further away from the old policy, (unlike the calculation of reward, where the divergence is subtracted from the original reward), so the divergence in the ratio is not intended as a penalty?

Why is it moving away from the old policy? Notice that the loss has a minus - in it! If it hadn't, you'd be correct :)

@Yujun-Qian

Thanks for this article, it does help me a lot.

Thanks! 🙏

In the formula: ratio = new probs/initial probs, my understanding is that "new probs" is function of θ, and "initial probs" is regarded as a constant, do I get it right?

Yes, I think it is fair to put it like that 👍

and in Figure 7, it says the ratio "estimates divergence between new and old policy", but it seems to me the loss function would encourage the new policy to move further away from the old policy, (unlike the calculation of reward, where the divergence is subtracted from the original reward), so the divergence in the ratio is not intended as a penalty?

Why is it moving away from the old policy? Notice that the loss has a minus - in it! If it hadn't, you'd be correct :)

I noted there is a minus "-" in the loss, but as we are trying to minimize the loss, that means we are gonna maximize the "new probs/ initial probs" part, right?

I recently learned a technique called "importance sampling" (please refer to the "importance sampling" part in https://jonathan-hui.medium.com/rl-trust-region-policy-optimization-trpo-explained-a6ee04eeeee9). I think the probs ratio is a kind of importance sampling, the point is to reuse the sample data (i.e. the generated text) for N iterations (e.g. N = 4). In the N iterations, the initial probs and the reward R are fixed; and after N iterations, we need to resample the data (regenerate the text) to prevent the variance from exploding, and then we can reuse the sample data for the next N iterations.
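
A small sketch of the importance sampling idea, with made-up numbers: samples are drawn only from an "old" policy, and reweighting each one by the probability ratio new/old gives an estimate of the expected reward under the "new" policy. The two-action policies and the rewards are hypothetical.

```python
import random

random.seed(0)

old_policy = {"a": 0.5, "b": 0.5}  # distribution the samples come from
new_policy = {"a": 0.8, "b": 0.2}  # distribution we want the expectation under
reward = {"a": 1.0, "b": 0.0}      # hypothetical reward per action

# Exact expected reward under the new policy, for comparison.
exact = sum(new_policy[x] * reward[x] for x in new_policy)

# Draw samples from the OLD policy only.
samples = random.choices(
    list(old_policy), weights=list(old_policy.values()), k=20000
)

# Reweight each sample by the probability ratio new / old.
estimate = sum(
    new_policy[x] / old_policy[x] * reward[x] for x in samples
) / len(samples)

print(exact, estimate)
```

The further the new policy drifts from the old one, the larger the ratios get and the noisier this estimate becomes, which is one reason to resample after a few updates.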

@JoaoLages
Author

I noted there is a minus "-" in the loss, but as we are trying to minimize the loss, that means we are gonna maximize the "new probs/ initial probs" part, right?

Right, sorry! 😴 😪
I think you're right, but that's the way I see it:
1 - for each reward R that you get from the reward model, you want to backpropagate that reward to update the model weights, but the reward is not differentiable
2 - So, the best that you can do is to use this reward as a scalar to increase/decrease how much the model weights change
3 - The ratio between the new and initial probabilities is differentiable and can be multiplied by this reward. So it is true: if the reward is big and the divergence between the probabilities is also big, the model's new weights will move away from the old ones (exactly as you said)
4 - However, the model can learn to fool the reward model by outputting gibberish that gets high rewards. So the KL loss was introduced so that the model does not move too far away from its initial state - empirically this works and that's why it is used afaik

@jamesharrisivi

jamesharrisivi commented Aug 25, 2023

@JoaoLages Why do you need the advantage (ratio)? If you just had the new probabilities and not the initial ones, isn't that still differentiable?

$\nabla_\theta \log P_\theta(t_i \mid t_{0:i}) \, R$

where $t_i$ is the i-th token. If it's a good output, e.g. $t_0, \dots, t_m$ is good, then this will still encourage the model $P_\theta$ to assign higher probabilities to this output.

Is the ratio to normalize it, rather than make it differentiable?

@JoaoLages
Author

Is the ratio to normalize it, rather than make it differentiable?

No, R is really just a non-differentiable constant, given by the reward system (a reward model, a human in the loop, etc).

@JoaoLages Why do you need the advantage (ratio)? If you just had the new probabilities and not the initial ones, isn't that still differentiable?

Good question!
Afaik, it is still differentiable. My best guess is that this ratio is just standard in RL, and it is another good way to make sure that the new model weights do not diverge a lot from the previous ones (note that we are still doing the KL loss exactly for that reason).
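
One way to check how the two formulations relate is numerically. The sketch below uses a hypothetical 3-token softmax policy and finite differences to compare the gradient of ratio * R with the gradient of log-prob * R at the point where the new weights still equal the initial ones (ratio = 1). There the two gradients coincide; they only start to differ once the weights have moved, which is when the ratio and its clipping begin to matter.

```python
import math

def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    g = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        g.append((f(up) - f(down)) / (2 * eps))
    return g

theta_old = [0.2, -0.4, 1.0]  # hypothetical logits before the update
token = 2                     # index of the generated token
R = 0.7                       # reward, treated as a constant

p_old = softmax(theta_old)[token]  # "initial prob", a constant

def ratio_objective(theta):
    return softmax(theta)[token] / p_old * R

def logprob_objective(theta):
    return math.log(softmax(theta)[token]) * R

g_ratio = grad(ratio_objective, theta_old)
g_logprob = grad(logprob_objective, theta_old)
print(g_ratio, g_logprob)
```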

@YizeMinimax

Thank you! Very nice blog.

Should we minus the KL divergence loss when defining the reward? Since we want it to be small.

Another question is, should the reward be both negative and positive? Or we will always be encouraging the model to improve the probability of the generated text sequence.

@JoaoLages
Author

Thank you! Very nice blog.

Thanks for the kind words! 🙏

Should we minus the KL divergence loss when defining the reward? Since we want it to be small.

No, that way we will be pushing the KL divergence to be higher.
Imagine: if the KL loss is 5, you add a minus and make it -5. But now your optimization algorithm (gradient descent) is trying to minimize that loss, so it will try to make it even lower than -5, which is not what you want. You want the KL loss to be pushed to 0.

Another question is, should the reward be both negative and positive? Or we will always be encouraging the model to improve the probability of the generated text sequence.

The reward is usually positive, as I state in the article, but can also be negative. I don't know how well it works in practice though, choosing the right rewards is very tricky in RL.

@chrishzhao

This is a great article! If R in PPO is fixed, may I know how the back-propagation of the loss can help push the probability distribution in the direction that maximizes R? What is the intuition behind it?

@JoaoLages
Author

This is a great article!

Thank you!

If R in PPO is fixed, may I know how the back-propagation of the loss can help push the probability distribution in the direction that maximizes R? What is the intuition behind it?

What do you mean by fixed R? R changes every PPO step as you can see in Figure 6. The intuition is that this reward will tell the model if it is updating the weights correctly or not.

@junsukha

junsukha commented Sep 23, 2024

Thank you! Very nice blog.

Thanks for the kind words! 🙏

Should we minus the KL divergence loss when defining the reward? Since we want it to be small.

No, that way we will be pushing the KL divergence to be higher. Imagine: if the KL loss is 5, you add a minus and make it -5. But now your optimization algorithm (gradient descent) is trying to minimize that loss, so it will try to make it even lower than -5, which is not what you want. You want the KL loss to be pushed to 0.

Another question is, should the reward be both negative and positive? Or we will always be encouraging the model to improve the probability of the generated text sequence.

The reward is usually positive, as I state in the article, but can also be negative. I don't know how well it works in practice though, choosing the right rewards is very tricky in RL.

Shouldn't the total reward R = reward - KL ? We want KL as small as possible. So we negate it when we compute the total reward R. Likewise, we can set the loss = -R. In this case, the loss = -reward + KL. We want our loss as small as possible, i.e, bigger reward (as there's the negative sign at the front) and smaller KL.

Thanks for the concise explanation btw.

@JoaoLages
Author

Shouldn't the total reward R = reward - KL ? We want KL as small as possible. So we negate it when we compute the total reward R. Likewise, we can set the loss = -R. In this case, the loss = -reward + KL. We want our loss as small as possible, i.e, bigger reward (as there's the negative sign at the front) and smaller KL.

Correct, you're right 👍
