
@ Jonathan
2025-05-17 15:20:56
https://arxiv.org/abs/2505.06120
Microsoft wrote a paper about how much LLMs degrade in multi-turn convos. I’m glad my hunch was finally confirmed. I’ve felt for a while that resetting the context and tweaking the first prompt is often far more effective than clarifying with a follow-up prompt when the LLM misunderstands or fails to solve the problem on the first attempt.
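For anyone who wants to try it, here’s a minimal sketch of that reset-and-retry pattern, assuming the OpenAI Python client as the chat API (the model name and prompts are placeholders, not anything from the paper):

```python
# Reset-and-retry sketch: instead of appending a clarifying follow-up to a
# failed conversation, start a fresh single-turn conversation with a
# revised first prompt that folds the clarification in.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # Fresh context every call: no accumulated multi-turn history.
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; substitute whatever model you use
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# First attempt misses the mark...
first = ask("Summarize the attached report.")

# ...so rather than replying "no, I meant the financial risks" in the same
# conversation, rewrite the original prompt and start over:
revised = ask(
    "Summarize the attached report. Focus on the financial risks, "
    "not the methodology, and keep it under 200 words."
)
```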
My guess was that there just aren’t enough examples of multi-turn convos in the RLHF training dataset, especially of LLMs recovering from a mistake, but I hadn’t considered how benchmarks play into it. Because the format of most benchmarks is just “question->answer” with no follow-ups, an LLM can test well and still fall apart the moment the user asks a second question.