Workflow: Evaluator-Optimizer

Process overview

In the evaluator-optimizer workflow, one LLM call generates a response while another evaluates it and provides feedback, in a loop.

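  1. The user submits a task.
  2. The LLM generates a first-round response.
  3. A second LLM call evaluates that response.
    1. If the evaluation result is PASS, return the response as the final result.
  4. Otherwise, add all previous attempts (the memory) and the evaluation feedback to the context.
  5. Run the next iteration of the loop with the new context.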
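
The steps above can be sketched without any LLM dependency. This is a minimal illustration, not the implementation shown later: the class name, the `Verdict` record, the `refine` method, and the `maxRounds` cap are all hypothetical, and the generator and evaluator are plain functions standing in for the two LLM calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical names throughout; generator and evaluator stand in for LLM calls.
final class EvaluatorOptimizerSketch {

    /** Evaluator result: whether the response passed, plus free-text feedback. */
    record Verdict(boolean pass, String feedback) {}

    /**
     * Runs generate -> evaluate -> refine for at most maxRounds rounds.
     * generator: (task, context) -> response; evaluator: (response, task) -> verdict.
     */
    static String refine(String task,
                         BiFunction<String, String, String> generator,
                         BiFunction<String, String, Verdict> evaluator,
                         int maxRounds) {
        List<String> memory = new ArrayList<>();
        String context = "";
        String response = "";
        for (int round = 0; round < maxRounds; round++) {
            response = generator.apply(task, context);          // step 2: generate
            memory.add(response);
            Verdict verdict = evaluator.apply(response, task);  // step 3: evaluate
            if (verdict.pass()) {
                return response;                                // step 3.1: accepted
            }
            // step 4: previous attempts plus the latest feedback become the new context
            context = "Previous attempts:\n- " + String.join("\n- ", memory)
                    + "\nFeedback: " + verdict.feedback();
        }
        return response; // cap reached: return the last attempt
    }

    public static void main(String[] args) {
        // Stub generator: adds a Javadoc comment only once the feedback asks for it.
        BiFunction<String, String, String> generator = (task, ctx) ->
                ctx.contains("javadoc") ? "/** doc */ class A {}" : "class A {}";
        // Stub evaluator: passes only responses that start with a Javadoc comment.
        BiFunction<String, String, Verdict> evaluator = (resp, task) ->
                resp.startsWith("/**") ? new Verdict(true, "") : new Verdict(false, "add javadoc");
        System.out.println(refine("write class A", generator, evaluator, 3));
    }
}
```

The iteration cap is a design choice added here so that the loop ends even if the evaluator never returns a pass.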

Code

loop

private RefinedResponse loop(String task, String context, List<String> memory,
        List<Generation> chainOfThought) {

    Generation generation = generate(task, context);
    memory.add(generation.response());
    chainOfThought.add(generation);

    EvaluationResponse evaluationResponse = evalute(generation.response(), task);

    if (evaluationResponse.evaluation().equals(EvaluationResponse.Evaluation.PASS)) {
        // Solution is accepted!
        return new RefinedResponse(generation.response(), chainOfThought);
    }

    // Accumulate new context including the last and the previous attempts and
    // the latest feedback.
    StringBuilder newContext = new StringBuilder();
    newContext.append("Previous attempts:");
    for (String m : memory) {
        newContext.append("\n- ").append(m);
    }
    newContext.append("\nFeedback: ").append(evaluationResponse.feedback());

    // Note: the recursion has no iteration cap; it ends only when the evaluator returns PASS.
    return loop(task, newContext.toString(), memory, chainOfThought);
}
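
The loop code above relies on three types that are not shown. Judging from the accessors it calls (`response()`, `thoughts()`, `evaluation()`, `feedback()`) and the `EvaluationResponse.Evaluation.PASS` constant, they could be declared as Java records roughly like this. This is a sketch inferred from usage: the component name `solution` and the Javadoc wording are assumptions.

```java
import java.util.List;

/** One generator turn: the model's reasoning plus its proposed solution. */
record Generation(String thoughts, String response) {}

/** One evaluator turn: a verdict and free-text feedback. */
record EvaluationResponse(Evaluation evaluation, String feedback) {
    /** The three verdicts named in the evaluator prompt. */
    enum Evaluation { PASS, NEEDS_IMPROVEMENT, FAIL }
}

/** Final result of the loop; the component name "solution" is an assumption. */
record RefinedResponse(String solution, List<Generation> chainOfThought) {}
```

Records fit here because each value is an immutable data carrier, and the generated accessors match the call style used in `loop`.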

generate

private Generation generate(String task, String context) {
    Generation generationResponse = chatClient.prompt()
            .user(u -> u.text("{prompt}\n{context}\nTask: {task}")
                    .param("prompt", this.generatorPrompt)
                    .param("context", context)
                    .param("task", task))
            .call()
            .entity(Generation.class);

    System.out.println(String.format("\n=== GENERATOR OUTPUT ===\nTHOUGHTS: %s\n\nRESPONSE:\n %s\n",
            generationResponse.thoughts(), generationResponse.response()));
    return generationResponse;
}

evalute

private EvaluationResponse evalute(String content, String task) {

    EvaluationResponse evaluationResponse = chatClient.prompt()
            .user(u -> u.text("{prompt}\nOriginal task: {task}\nContent to evaluate: {content}")
                    .param("prompt", this.evaluatorPrompt)
                    .param("task", task)
                    .param("content", content))
            .call()
            .entity(EvaluationResponse.class);

    System.out.println(String.format("\n=== EVALUATOR OUTPUT ===\nEVALUATION: %s\n\nFEEDBACK: %s\n",
            evaluationResponse.evaluation(), evaluationResponse.feedback()));
    return evaluationResponse;
}

Generator prompt

"""
Your goal is to complete the task based on the input. If there is feedback
from your previous generations, you should reflect on it to improve your solution.

CRITICAL: Your response must be a SINGLE LINE of valid JSON with NO LINE BREAKS except those explicitly escaped with \\n.
Here is the exact format to follow, including all quotes and braces:

{"thoughts":"Brief description here","response":"public class Example {\\n // Code here\\n}"}

Rules for the response field:
1. ALL line breaks must use \\n
2. ALL quotes must use \\"
3. ALL backslashes must be doubled: \\
4. NO actual line breaks or formatting - everything on one line
5. NO tabs or special characters
6. Java code must be complete and properly escaped

Example of properly formatted response:
{"thoughts":"Implementing counter","response":"public class Counter {\\n private int count;\\n public Counter() {\\n count = 0;\\n }\\n public void increment() {\\n count++;\\n }\\n}"}

Follow this format EXACTLY - your response must be valid JSON on a single line.
"""

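The generator prompt is written as a Java text block, so `\\n` in the source reaches the model as the two characters `\n`. To make the escaping rules concrete, the following sketch builds a compliant single-line response from raw multi-line code. The class and method names are hypothetical, and replacing tabs with spaces is only one possible reading of rule 5.

```java
// Hypothetical helper illustrating the single-line JSON format the prompt asks for.
final class JsonLineEscaper {

    /** Escapes backslashes, double quotes, and line breaks so the text fits in a JSON string. */
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '\\' -> sb.append("\\\\"); // rule 3: double every backslash
                case '"' -> sb.append("\\\"");  // rule 2: escape quotes
                case '\n' -> sb.append("\\n");  // rule 1: line breaks become \n
                case '\r' -> { }                // drop carriage returns
                case '\t' -> sb.append("    "); // rule 5: no tabs (assumed: use spaces)
                default -> sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds the {"thoughts":...,"response":...} object on a single line. */
    static String toJsonLine(String thoughts, String response) {
        return "{\"thoughts\":\"" + escape(thoughts)
                + "\",\"response\":\"" + escape(response) + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(toJsonLine("Implementing counter",
                "public class Counter {\n    private int count;\n}"));
    }
}
```
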
Evaluator prompt

"""
Evaluate this code implementation for correctness, time complexity, and best practices.
Ensure the code has proper Javadoc documentation.
Respond with EXACTLY this JSON format on a single line:

{"evaluation":"PASS, NEEDS_IMPROVEMENT, or FAIL", "feedback":"Your feedback here"}

The evaluation field must be one of: "PASS", "NEEDS_IMPROVEMENT", "FAIL"
Use "PASS" only if all criteria are met with no improvements needed.
"""

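The evaluator prompt shows the placeholder value "PASS, NEEDS_IMPROVEMENT, or FAIL" in its format line, and a model could echo that placeholder or vary the casing. One defensive option, sketched below with hypothetical names, is to map the raw verdict string to the enum and treat anything unrecognized as NEEDS_IMPROVEMENT so that the loop keeps refining rather than accepting. This is an assumption about how one might harden the parsing, not something the code above does.

```java
import java.util.Locale;

// Hypothetical helper; the enum mirrors the three verdicts named in the prompt.
final class VerdictParser {

    enum Evaluation { PASS, NEEDS_IMPROVEMENT, FAIL }

    /** Maps a raw verdict string to an enum value; unknown input is never treated as PASS. */
    static Evaluation parse(String raw) {
        if (raw == null) {
            return Evaluation.NEEDS_IMPROVEMENT;
        }
        try {
            return Evaluation.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // e.g. the echoed placeholder "PASS, NEEDS_IMPROVEMENT, or FAIL"
            return Evaluation.NEEDS_IMPROVEMENT;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(" pass "));
        System.out.println(parse("PASS, NEEDS_IMPROVEMENT, or FAIL"));
    }
}
```
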
When to use

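This workflow fits when there are clear evaluation criteria and iterative refinement provides measurable value. Two signs of a good fit are:

  1. LLM responses improve demonstrably when a human articulates feedback.
  2. The LLM can provide that kind of feedback itself.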

Original text:

When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.

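  • Complex search tasks that need multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searching is warranted
  • Self-refinement of generated code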