Comments on: What is reinforcement learning from human feedback (RLHF)?
https://bdtechtalks.com/2023/01/16/what-is-rlhf/

By: Eric (Thu, 27 Jul 2023 21:48:58 +0000)
https://bdtechtalks.com/2023/01/16/what-is-rlhf/comment-page-1/#comment-36374

When training an LLM with RLHF, if the model generates a sentence of 16 tokens (one token per time step) and receives a single reward at the end, how does RLHF allocate credit among the 16 tokens? And how does that compare to simply scoring the whole sentence at the end and applying the loss in one shot?
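One way to see the credit-assignment question the comment raises: in standard RL (including PPO-style RLHF fine-tuning), a reward that arrives only at the final token is spread backward over earlier tokens via discounted returns. The sketch below is illustrative only — the numbers and the `gamma` value are assumptions, not details from the article.

```python
# Sketch: spreading a single end-of-sequence reward across 16 tokens
# using discounted returns-to-go, the basic credit-assignment mechanism
# underlying per-token advantages in PPO-style RLHF. Illustrative only.

def discounted_returns(rewards, gamma=0.99):
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 16-token sentence: the reward model scores only the finished sentence,
# so the per-token reward is zero everywhere except the last token.
per_token_rewards = [0.0] * 15 + [1.0]
credits = discounted_returns(per_token_rewards)
# The last token receives the full reward (1.0); each earlier token
# receives a geometrically discounted share (token 0 gets 0.99**15).
```

Compared with "one-shot" sentence-level scoring, this gives every token its own learning signal, at the cost of relying on the discount (and, in practice, a learned value function) to apportion blame.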
