
How to analyze and fix errors in LLM applications

Large language models (LLMs) have created a new paradigm for machine learning applications. On the one hand, you have a machine learning model that you can customize for your own needs and tasks. On the other hand, you don’t have access to the model’s weights and hyperparameters. You control the model’s behavior by tweaking your prompts and the information you give to the model.

This creates a dilemma for people who are used to developing classic machine learning applications. Without a systematic approach for analyzing errors and making corrections, you can get caught up in a confusing web of making random changes to your prompt without being able to measure the impact of each modification.

I use this four-stage process to systematically understand and fix errors in LLM applications. 

Stage 1: Preparation

Before fixing errors, you should be able to measure them. In this stage, you will formulate the target task in a way that allows you to track the performance of the model.

1- Create a dataset: Create 50-100 examples that represent the target task: the kinds of requests the application’s users will send and the expected responses. Each example should include the request and the expected response. If you expect the response to contain additional information such as reasoning chains, make sure to include it in your examples. If the response must be in a specific format such as JSON or key-value pairs, make sure all your examples are well-formatted.
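For illustration, here is one way such a dataset might be stored as a JSONL file; the field names (request, reasoning, response) and the arithmetic examples are an assumed convention for this sketch, not a requirement.

```python
import json

# A few hypothetical examples for a question-answering task.
# Each record holds the user request, the expected response, and
# (optionally) the reasoning chain you expect the model to produce.
dataset = [
    {
        "request": "What is 15% of 240?",
        "reasoning": "15% of 240 = 0.15 * 240 = 36",
        "response": "36",
    },
    {
        "request": "Convert 5 kilometers to miles.",
        "reasoning": "1 km is about 0.6214 miles, so 5 km is about 3.11 miles.",
        "response": "3.11 miles",
    },
]

# Store one JSON object per line so the file is easy to append to and stream.
with open("eval_dataset.jsonl", "w") as f:
    for example in dataset:
        f.write(json.dumps(example) + "\n")
```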

2- Develop an evaluation method: You need to figure out a method to compare the responses of the model to the ground truth in your dataset. For numerical tasks and question-answering, evaluation will be as easy as creating a function that compares the responses.

For generative tasks such as summarization, you will need more elaborate methods, such as separate LLM prompts. For example, I usually start by manually reviewing and correcting a few model responses and building a small set of right and wrong answers. Then I create a separate LLM-as-a-Judge prompt that uses these examples to grade new model answers.
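The sketch below illustrates both approaches; `call_model` stands in for whatever function you use to query your LLM provider, and the judge prompt wording is only an assumption, not a fixed recipe.

```python
def exact_match(model_response: str, ground_truth: str) -> bool:
    """Simple comparison for numerical or short-answer tasks."""
    return model_response.strip().lower() == ground_truth.strip().lower()


JUDGE_PROMPT = """You are grading a model-written summary against a reference.
Reference summary:
{reference}

Candidate summary:
{candidate}

Here are examples of answers I graded myself:
{graded_examples}

Reply with a single word: PASS or FAIL."""


def judge_summary(candidate: str, reference: str, graded_examples: str, call_model) -> bool:
    """LLM-as-a-Judge evaluation for generative tasks such as summarization.

    `call_model` is a placeholder for your own function that sends a prompt
    to an LLM and returns its text response.
    """
    verdict = call_model(JUDGE_PROMPT.format(
        reference=reference,
        candidate=candidate,
        graded_examples=graded_examples,
    ))
    return verdict.strip().upper().startswith("PASS")
```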

3- Specify target acceptance criteria: Not all tasks require perfect outputs. In many applications, you will always have a human in the loop and are only using the LLM as an amplifier to carry out part of the legwork. In such cases, specify an accuracy level that will make the LLM application acceptable for launch. You can then gather extra data to refine your prompts and gradually increase their accuracy. For example, I was helping a team that was translating specialized text that required heavy prompt engineering. We started with a simple prompt that increased their speed by 25%. Although not perfect, the application was very useful and freed up valuable time. Gradually, we refined the prompt with feedback and examples until it did about 80% of the work for them.

Stage 2: Evaluation

The goal of this stage is to identify and classify the errors that the model makes in a way that can be addressed methodically.

1- Track errors on the dataset: Run your prompt on the dataset, compare the model’s response to the ground truth, and separate the examples on which the model makes errors. For this step, you will need to use the evaluation function you developed in the previous stage.
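Assuming the dataset and evaluation function sketched in Stage 1, error tracking can be a simple loop like the one below; `run_prompt` is a placeholder for whatever function sends a request through your application’s prompt.

```python
import json


def collect_errors(dataset_path: str, run_prompt, evaluate) -> list[dict]:
    """Run the prompt on every example and keep the ones the model gets wrong.

    `run_prompt` sends a request through your application's prompt and returns
    the model's response; `evaluate` is the comparison function from Stage 1.
    """
    errors = []
    with open(dataset_path) as f:
        for line in f:
            example = json.loads(line)
            response = run_prompt(example["request"])
            if not evaluate(response, example["response"]):
                errors.append({
                    "request": example["request"],
                    "expected": example["response"],
                    "got": response,
                })
    return errors
```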

2- Classify errors: Create a spreadsheet with the examples on which the model made errors, the model’s responses, and the correct responses. Classify the errors into a few common causes and categories. Some examples of causes can be “lack of knowledge,” “incorrect reasoning,” “wrong calculation,” and “incorrect output format.” (Tip: You can use frontier models to help you find patterns in the errors. Provide the model with the correct and incorrect answers and instruct it to classify them into a few categories. It might not provide perfect results, but it will give you a good starting point.)
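One way to get that starting point is to pack the failing cases into a classification prompt, as in this sketch; the prompt wording, category names, and helper names are illustrative only.

```python
CLASSIFY_PROMPT = """Below are cases where a model answered incorrectly.
For each case, assign one error category such as "lack of knowledge",
"incorrect reasoning", "wrong calculation", or "incorrect output format",
and briefly justify your choice.

{cases}

Return one line per case in the format: case_number: category"""


def draft_error_categories(errors: list[dict], call_model) -> str:
    """Ask a frontier model for a first-pass classification of the errors.

    The result is a starting point for the spreadsheet, not a final label set.
    """
    cases = "\n\n".join(
        f"Case {i}:\nRequest: {e['request']}\n"
        f"Expected: {e['expected']}\nModel answer: {e['got']}"
        for i, e in enumerate(errors, start=1)
    )
    return call_model(CLASSIFY_PROMPT.format(cases=cases))
```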

Stage 3: Correction

The goal in this stage is to modify the prompt to correct the common errors found in the previous stage. At each step of this stage, make a single modification to your prompt and rerun the examples where the model made errors. If the errors are not solved, move on to the next step and try a more complex solution.

1- Correct your prompt: Based on the error categories you found in the previous stage, make corrections to your prompt. Start with very simple modifications such as adding or changing instructions (e.g., “Only output the answer without extra details,” “Only output JSON,” “Think step-by-step and write your reasoning before responding to the question”). 
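A small sketch of how one might test a single instruction change against only the failing examples; `call_model` is again a placeholder, and the instruction text is just one of the examples mentioned above.

```python
CANDIDATE_INSTRUCTION = "Only output the answer without extra details."


def test_single_change(errors: list[dict], instruction: str, call_model, evaluate) -> list[dict]:
    """Return the examples that still fail after adding one instruction."""
    still_failing = []
    for e in errors:
        # One modification at a time: prepend a single corrective instruction.
        prompt = f"{instruction}\n\nQuestion: {e['request']}"
        response = call_model(prompt)
        if not evaluate(response, e["expected"]):
            still_failing.append(e)
    return still_failing
```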

2- Add knowledge to your prompt: Sometimes, the problem is that the model doesn’t have the base knowledge about the task. Create a “knowledge” section in your prompt where you can include any facts or extra information that can help the model.

This section can include anything that is relevant to your task, including documentation, code, and articles. (Don’t worry about prompt length at this stage; focus only on error analysis. You can optimize your prompt for cost and scale later.)
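As a rough illustration, a knowledge section can be a dedicated block in the prompt template; the section headers and the refund example below are purely hypothetical.

```python
PROMPT_WITH_KNOWLEDGE = """You answer questions about our product.

# Knowledge
{knowledge}

# Question
{question}

Answer using only the information in the Knowledge section."""

# `knowledge` can hold pasted documentation, code snippets, or article excerpts.
prompt = PROMPT_WITH_KNOWLEDGE.format(
    knowledge="Refunds are processed within 5 business days of approval.",
    question="How long does a refund take?",
)
```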

3- Use few-shot examples: If simple instructions and extra knowledge don’t solve the problem, try adding few-shot examples to the prompt. Add an “examples” section to your prompt where you include question-answer pairs and demonstrate the way the model should solve the problem. Start with two or three examples to keep the prompt short. Gradually add more examples if the errors are not resolved. As today’s frontier models support very long prompts, you can add hundreds of examples. I’ve personally experimented with Claude, Gemini, and ChatGPT, and with the right in-context example formats, you can get very impressive results.
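Here is one way to assemble such an examples section from the Stage 1 dataset; the field names match the earlier sketches and are assumptions, not requirements.

```python
def build_few_shot_prompt(task_instruction: str, examples: list[dict], question: str) -> str:
    """Assemble a prompt with an examples section followed by the new question.

    Start with two or three examples and grow the list only if errors persist.
    """
    shots = "\n\n".join(
        f"Question: {e['request']}\nAnswer: {e['response']}" for e in examples
    )
    return (
        f"{task_instruction}\n\n"
        f"# Examples\n{shots}\n\n"
        f"Question: {question}\nAnswer:"
    )
```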

4- Break down your prompt into several steps: Sometimes, you’re asking too much in a single prompt. If your task can be broken down into individual steps, try creating a separate prompt for each step. When asked to do a single task, the model is much more likely to perform it well. You can then create a prompt pipeline that feeds the output of each step as the input for the next prompt. More advanced pipelines might include extra logic that uses different prompt sequences. For example, the first step might be a prompt that determines whether the request requires external information. If it does, the request will be channeled to a RAG prompt. Otherwise, it will be passed on to another prompt that uses the model’s in-memory knowledge.
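A minimal sketch of such a routing pipeline; `call_model` and `retrieve_documents` are placeholders for your own model client and retrieval step, and the prompt wording is only an assumption.

```python
ROUTER_PROMPT = (
    "Does answering the following request require external documents? "
    "Reply with exactly YES or NO.\n\nRequest: {request}"
)

RAG_PROMPT = (
    "Use the documents below to answer.\n\n"
    "Documents:\n{documents}\n\nRequest: {request}"
)

DIRECT_PROMPT = "Answer the request using your own knowledge.\n\nRequest: {request}"


def answer(request: str, call_model, retrieve_documents) -> str:
    """Route the request to a RAG prompt or a direct prompt.

    `call_model` queries the LLM; `retrieve_documents` is your retrieval step.
    """
    needs_retrieval = call_model(ROUTER_PROMPT.format(request=request)).strip().upper()
    if needs_retrieval.startswith("YES"):
        documents = retrieve_documents(request)
        return call_model(RAG_PROMPT.format(documents=documents, request=request))
    return call_model(DIRECT_PROMPT.format(request=request))
```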

Stage 4: Finalization

The goal of this stage is to make sure that your corrections don’t cause other problems and don’t break the model’s general abilities on the target task.

1- Re-evaluate the entire dataset: Once you correct the model’s errors to an acceptable level, rerun all your examples through the corrected prompt to make sure everything is working fine. If you encounter new errors, repeat the evaluation and correction stages.

2- Keep a separate validation dataset: To make sure your prompt doesn’t overfit to your dataset, keep a holdout set for a final evaluation. This will make sure that your prompts generalize beyond the training examples. This is similar to classic machine learning, where you have separate test and validation sets. If the model performs poorly on your validation set, you will have to repeat the evaluation and correction stages for the errors. After that, you’ll need to create new validation examples. (Hint: You can use frontier models to generate new examples for you by providing the model with a few previous examples and asking it to generate diverse but similar ones.)
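A simple way to carve out a holdout set from the Stage 1 dataset, as a sketch; the file name and split ratio are assumptions.

```python
import json
import random


def split_dataset(path: str, holdout_fraction: float = 0.2, seed: int = 42):
    """Split the examples into a working set and a holdout validation set."""
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    random.Random(seed).shuffle(examples)
    cutoff = int(len(examples) * (1 - holdout_fraction))
    return examples[:cutoff], examples[cutoff:]


# Use the working set for error analysis and prompt iteration,
# and touch the holdout set only for the final evaluation.
working_set, holdout_set = split_dataset("eval_dataset.jsonl")
```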

As you can see, the basic principles of ML error analysis also apply to LLM applications. You just need to think about your problem and solution from the right perspective.

If you’re interested in learning to create LLM applications, GoPractice has a fantastic GenAI Simulator course that gives you the perfect framework to think about generative AI and what kinds of problems you can solve with it. If you want to learn more about ML product management in general, you can try their broader AI/ML Simulator for PM course. I highly recommend both courses. I will also be mentoring the next cohort of the AI/ML Simulator course, scheduled for October 2024. Sign up here to take part in the course.
