
LLM Inference-Time Compute Explained: Four Ways to Improve the Reasoning Ability of Large Models

2026-02-06 20:55:58  Source: deephub


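An interesting trend emerged in the LLM field in 2025: rather than pushing ever harder on model training, spend more effort at inference time. This is what is called inference-time compute (Test-Time / Inference-Time Compute): investing more computation during inference, including more tokens, more attempts, and deeper search, without changing the model weights.

The ARC-AGI benchmark is a typical case. With inference-time techniques, accuracy can reach 87.5%, but at an inference cost of more than $1,000 per task. LLMs that do not use these techniques typically score below 25%.

This article covers four mainstream inference-time compute techniques: Chain-of-Thought along the depth axis, Self-Consistency along the width axis, Tree-of-Thoughts along the search axis, and Reflexion/Self-Refine along the iteration axis.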

Prerequisites: An LLM Call Wrapper

First, set up the basic infrastructure. Below are a generic LLM call interface and a helper function:

from collections import Counter
import re

# ---- LLM call wrapper ----
def llm(prompt: str, temperature: float = 0.7, max_tokens: int = 800) -> str:
    """
    Placeholder for an LLM call.
    In practice, replace it with an API call to OpenAI, Claude, or a local model.

    Args:
        prompt: the input prompt
        temperature: sampling temperature, controls output diversity
        max_tokens: maximum number of tokens to generate
    Returns:
        The text generated by the model
    """
    # Example: using the OpenAI API
    # from openai import OpenAI
    # client = OpenAI()
    # response = client.chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}],
    #     temperature=temperature,
    #     max_tokens=max_tokens
    # )
    # return response.choices[0].message.content
    raise NotImplementedError("Implement your LLM call logic here")

# ---- Helper: extract the final answer ----
def extract_final_answer(text: str) -> str:
    """
    Extract the final answer from the model output.
    Looks for the pattern "FINAL: <answer>" or "Final: <answer>".
    In real applications it is better to:
    - have the model output JSON, e.g. {"final": "..."}
    - or use parsing logic specific to the task

    Args:
        text: the full output text of the model
    Returns:
        The extracted final answer (at most 200 characters);
        falls back to the whole text when no marker is found
    """
    m = re.search(r"(FINAL|Final)\s*[:\-]\s*(.*)", text)
    return (m.group(2).strip() if m else text.strip())[:200]
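
The docstring above recommends asking the model for JSON instead of a free-form "FINAL:" marker. The following is a minimal sketch of that idea, assuming the prompt instructs the model to end its reply with a flat JSON object such as {"final": "..."}. The function name `extract_final_json` is hypothetical and not part of the original helpers.

```python
import json
import re
from typing import Optional


def extract_final_json(text: str, key: str = "final") -> Optional[str]:
    """Return the value stored under `key` in the last flat JSON object
    found in `text`, or None when no such object exists."""
    # Candidate spans are {...} without nested braces; try the last one first,
    # because the final answer is expected at the end of the reply.
    for candidate in reversed(re.findall(r"\{[^{}]*\}", text)):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue  # not valid JSON, keep looking
        if isinstance(obj, dict) and key in obj:
            return str(obj[key]).strip()
    return None


reply = 'Step 1: set up the equations. {"final": "23 chickens, 12 rabbits"}'
print(extract_final_json(reply))
```

The sketch deliberately ignores nested objects and string values that themselves contain braces; a task-specific parser would be needed for those cases.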

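Depth: Chain-of-Thought Reasoning

Chain-of-Thought (CoT) is the most basic and most widely used inference-time technique. The core idea is simple: let the model "think" for longer.

The traditional calling style expects the model to give the answer directly, but complex problems are not solved that way. CoT has the model generate detailed intermediate reasoning steps, which brings clear gains on tasks such as math, logical reasoning, and programming.

Why does it work? First, decomposition: a big problem is split into small steps, and each step is easier to get right. Second, the intermediate steps act as a kind of "external memory" that helps the model keep track of its reasoning. Third, forcing the model to show its reasoning reduces the cases where it simply "guesses" the answer. Finally, the model can check whether earlier steps were correct as it reasons.

There are several common ways to trigger CoT: zero-shot prompting appends a sentence such as "Let's think step by step"; few-shot prompting provides 2-3 examples that include reasoning steps; instruction fine-tuning trains on datasets annotated with CoT; and a system prompt defines the reasoning style in the system message.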
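
The two prompt-only triggers can be sketched as plain string builders. This is an illustration under assumptions: the "Q:/A:" layout and the function names `zero_shot_cot_prompt` and `few_shot_cot_prompt` are hypothetical choices, not a format prescribed by the article.

```python
ZERO_SHOT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: append the trigger sentence after the question."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def few_shot_cot_prompt(question: str, examples: list) -> str:
    """Few-shot CoT: prepend worked examples that include their reasoning.

    `examples` is a list of (question, reasoning, answer) tuples.
    """
    blocks = [
        f"Q: {q}\nA: {reasoning}\nFINAL: {answer}"
        for q, reasoning, answer in examples
    ]
    blocks.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(blocks)


demo = [("What is 1 + 1?", "1 plus 1 is 2.", "2")]
print(few_shot_cot_prompt("What is 2 + 2?", demo))
```

Ending each example with "FINAL: ..." keeps the few-shot format compatible with a marker-based answer extractor.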

def solve_with_cot(question: str) -> str:
    """
    Solve a problem with Chain-of-Thought.
    The prompt guides the model to:
    1. reason step by step
    2. show its intermediate calculations
    3. end with an explicit final answer

    Args:
        question: the question to answer
    Returns:
        The full response, containing the reasoning and the final answer
    """
    prompt = f"""You are a careful reasoner. Your task is to solve the following problem.
Instructions:
1. Break down the problem into smaller steps
2. Show your reasoning for each step
3. Double-check your calculations
4. End with a clear final answer
Format your response as:
Step 1: [your first step]
Step 2: [your second step]
...
FINAL: [your final answer]
Question: {question}
"""
    # A lower temperature gives more deterministic output
    return llm(prompt, temperature=0.2, max_tokens=900)

# Usage example
if __name__ == "__main__":
    question = "A farm has chickens and rabbits, 35 heads and 94 feet in total. How many chickens and how many rabbits are there?"
    result = solve_with_cot(question)
    print(result)
    print("\nExtracted final answer:", extract_final_answer(result))

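CoT suits problems that need multi-step computation, such as math word problems, logical reasoning, code debugging, and planning tasks. For simple factual Q&A it is somewhat wasteful, and it is not a good fit for creative writing either, where too much structure limits the output.

The limitations are equally clear. Token consumption rises: the longer the output, the higher the cost. The model can make a mistake in the reasoning chain, and the error then propagates. The output format is not always stable either and needs post-processing.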

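Width: Self-Consistency Sampling

The idea behind Self-Consistency is simple: rather than trusting a single output, generate multiple answers and pick the most consistent one.

It is a bit like collective decision-making: a single reasoning chain can go wrong, but if several independent paths all point to the same answer, that answer is very likely correct.

Why this works: a single sample may be wrong because of randomness, and multiple samples average out such errors. The correct answer can often be reached via several different paths. Different paths may capture different aspects of the problem. The degree of agreement among answers also reflects the model's "confidence".

Self-Consistency involves a few key decisions.

The first is sampling diversity, which is crucial. If every sample follows the same reasoning path, self-consistency is meaningless. A high-diversity setup uses temperature 0.7-0.9 and top_p 0.9-0.95, plus varied prompt variants. A temperature that is too low or a prompt that is too rigid will not work.

The second is the number of samples. 3-5 samples give the highest marginal return and suit cost-sensitive scenarios; 10-20 is the usual configuration; 40 or more suits scenarios that demand very high accuracy, although marginal returns are already low by then.

The third is the aggregation strategy. The most common is majority voting, which picks the answer that appears most often. Weighted voting, which weights each vote by a confidence score, is also possible, as is clustering similar answers before voting.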
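
The diminishing returns from more samples can be illustrated with a simple binomial model. This is an idealized sketch, not a result from the article: it assumes only two possible answers and samples that are independent and each correct with the same probability p, which real LLM samples are not. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is
    correct, when each sample is correct with probability p."""
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


for n in (1, 3, 5, 11, 21, 41):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions, voting helps only when p is above 0.5; with p below 0.5 the majority becomes more likely to be wrong as n grows, which matches the intuition that voting cannot rescue a model that is systematically biased toward a wrong answer.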

def solve_with_self_consistency(
    question: str,
    n: int = 10,
    temperature: float = 0.8
) -> dict:
    """
    Solve a problem with Self-Consistency.
    Generates several diverse answers with high-temperature sampling,
    then picks the most consistent one by majority vote.

    Args:
        question: the question to answer
        n: number of samples, 10-20 recommended
        temperature: sampling temperature, 0.7-0.9 recommended for diversity
    Returns:
        A dict with the following keys:
        - final: the final answer (the one with the most votes)
        - votes: number of votes for that answer
        - confidence: votes / total samples
        - all_finals: list of all extracted answers
        - vote_distribution: the full vote distribution
        - samples: all raw outputs (for debugging)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    prompt_template = """Solve this problem step by step.
Show your reasoning, then end with 'FINAL: ...'
Question: {question}"""
    samples = []
    for _ in range(n):
        out = llm(
            prompt_template.format(question=question),
            temperature=temperature,  # high temperature for diversity
            max_tokens=900
        )
        samples.append(out)
    # Extract all final answers
    finals = [extract_final_answer(s) for s in samples]
    # Count the votes
    vote_counter = Counter(finals)
    winner = vote_counter.most_common(1)[0]
    return {
        "final": winner[0],
        "votes": winner[1],
        "confidence": winner[1] / n,
        "all_finals": finals,
        "vote_distribution": dict(vote_counter),
        "samples": samples
    }

def solve_with_weighted_consistency(
    question: str,
    n: int = 10,
    score_fn=None
) -> dict:
    """
    Weighted Self-Consistency.
    Instead of plain majority voting, each vote is weighted by a quality score.

    Args:
        question: the question to answer
        n: number of samples
        score_fn: scoring function that takes (question, answer) and
            returns a score between 0 and 1; every vote weighs 1.0 if omitted
    Returns:
        A dict with the weighted voting result
    """
    samples = []
    for _ in range(n):
        out = llm(
            f"Solve step by step. End with 'FINAL: ...'\n\nQ: {question}",
            temperature=0.8,
            max_tokens=900
        )
        samples.append(out)
    finals = [extract_final_answer(s) for s in samples]
    # Weighted voting
    weighted_votes = {}
    for final, sample in zip(finals, samples):
        weight = score_fn(question, sample) if score_fn else 1.0
        weighted_votes[final] = weighted_votes.get(final, 0) + weight
    winner = max(weighted_votes.items(), key=lambda x: x[1])
    return {
        "final": winner[0],
        "weighted_score": winner[1],
        "weighted_distribution": weighted_votes,
        "all_finals": finals
    }

# Usage example
if __name__ == "__main__":
    question = "If today is Wednesday, what day of the week will it be 100 days from now?"
    result = solve_with_self_consistency(question, n=10)
    print(f"Final answer: {result['final']}")
    print(f"Votes: {result['votes']}/{len(result['all_finals'])}")
    print(f"Confidence: {result['confidence']:.1%}")
    print(f"Vote distribution: {result['vote_distribution']}")

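Self-Consistency suits problems with a definite answer (math, programming, factual Q&A), problems with a limited answer space (multiple choice, yes/no questions), and production scenarios that need high reliability. For open-ended questions the answer space is too large: every answer differs, so voting is meaningless. Creative tasks have no "correct" answer to vote on, so it does not apply there either.

Limitations: cost grows linearly, so N samples cost N times as much. If the model is systematically biased toward a wrong answer, voting cannot rescue it. Different phrasings of the same answer may be counted as different answers, so answer normalization is a real chore.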
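
The normalization problem can be reduced with a small canonicalization step before counting votes. The sketch below is one possible set of rules (case, surrounding punctuation, whitespace, thousands separators, integer-valued floats); both the rules and the name `normalize_answer` are assumptions, and real tasks usually need task-specific handling such as units or fractions.

```python
import re
from collections import Counter


def normalize_answer(ans: str) -> str:
    """Map superficially different spellings of an answer to one form."""
    s = ans.strip().lower()
    s = s.rstrip(".!。 ")          # drop trailing punctuation
    s = re.sub(r"\s+", " ", s)      # collapse runs of whitespace
    try:
        value = float(s.replace(",", ""))  # "1,000" -> 1000.0
    except ValueError:
        return s                    # not a number: keep the cleaned text
    return str(int(value)) if value.is_integer() else str(value)


finals = ["Friday", "friday.", " FRIDAY ", "Thursday"]
votes = Counter(normalize_answer(f) for f in finals)
print(votes.most_common(1)[0])
```

Without normalization the four strings above would split into four different "answers"; with it, three of them collapse into a single vote bucket.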

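Search: Tree-of-Thoughts Exploration

Tree-of-Thoughts (ToT) treats reasoning as a search problem. Each node is a "thought state", that is, a partial reasoning result; each edge is a "thought step", that is, a reasoning action; the goal is to find a path that leads to the correct answer.

Unlike linear CoT, ToT allows branching (exploring several possible next steps from one state), backtracking (abandoning hopeless branches and returning to an earlier state), and evaluation (judging how close the current state is to the goal).

Why is it effective? Linear reasoning cannot recover once it makes a mistake, whereas ToT can backtrack. Some problems are naturally tree-shaped, such as games and planning. An evaluation function guides the search and avoids blind exploration. Only promising branches are explored in depth, so tokens are used more efficiently.

There are several search strategies to choose from. BFS (breadth-first) explores layer by layer; it does not miss shallow solutions but uses a lot of memory. DFS (depth-first) follows one path to the end; it is memory-efficient but can get stuck in dead ends. Beam search keeps the top-k states at each layer; it balances efficiency and coverage but may lose the optimal solution. A* is guided by a heuristic function; it is efficient, and optimal when the heuristic is admissible, but it needs a good heuristic. MCTS (Monte Carlo Tree Search) can handle large search spaces but needs a large number of simulations.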
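
The trade-off of beam search, cheaper but able to lose the optimal solution, can be seen on a toy problem with no LLM involved. In this sketch, states are integers, each step either adds 3 or doubles the value, and the score is the negative distance to a target of 20. The problem, the scoring rule, and the function name `beam_search` are all illustrative assumptions.

```python
import heapq
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")


def beam_search(start: S, expand: Callable[[S], Iterable[S]],
                score: Callable[[S], float], beam: int, depth: int) -> S:
    """Keep the `beam` best states per layer; return the best state seen."""
    frontier = [start]
    best = start
    for _ in range(depth):
        candidates = [nxt for state in frontier for nxt in expand(state)]
        if not candidates:
            break
        # nlargest returns the top states sorted from best to worst
        frontier = heapq.nlargest(beam, candidates, key=score)
        if score(frontier[0]) > score(best):
            best = frontier[0]
    return best


TARGET = 20
expand = lambda x: [x + 3, x * 2]
score = lambda x: -abs(TARGET - x)

print(beam_search(1, expand, score, beam=2, depth=4))  # narrow beam
print(beam_search(1, expand, score, beam=4, depth=4))  # wider beam
```

The exact path 1, 2, 5, 10, 20 passes through states that look unpromising early on, so a narrow beam prunes it and ends at 19, while the wider beam keeps it alive and reaches 20.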

def tot_bfs(
    question: str,
    max_depth: int = 4,
    beam: int = 3,
    branch: int = 4,
    external_evaluator=None
) -> dict:
    """
    Tree-of-Thoughts with a BFS strategy (layer by layer, with beam pruning).
    Workflow:
    1. Start from the empty state
    2. For each state in the current frontier, generate several possible next steps
    3. Evaluate all new states
    4. Keep the `beam` highest-scoring states as the new frontier
    5. Repeat until the maximum depth is reached
    6. Generate the final answer from the best state

    Args:
        question: the question to answer
        max_depth: maximum search depth
        beam: number of states kept per layer (beam width)
        branch: number of branches expanded per state
        external_evaluator: optional external scoring function that
            takes (question, state) and returns a score
    Returns:
        A dict with the following keys:
        - final_text: the full final response
        - final_answer: the extracted final answer
        - best_state: the best reasoning state
        - best_score: the score of the best state
        - search_tree: a record of the search (for visualization)
    """
    def propose_next_steps(state: str) -> list:
        """Given the current reasoning state, generate several possible next steps."""
        prompt = f"""You are exploring different ways to solve a problem.
Question: {question}
Current reasoning state:
{state if state else "(Starting from scratch)"}
Propose {branch} different possible next steps to continue the reasoning.
Each step should be a distinct approach or calculation.
Return as a numbered list:
1. [first possible step]
2. [second possible step]
...
"""
        raw = llm(prompt, temperature=0.9, max_tokens=400)
        # Parse the numbered list
        steps = []
        for line in raw.splitlines():
            line = line.strip()
            if line and line[0].isdigit():
                # Strip the numbering prefix
                step = line.split(".", 1)[-1].strip()
                if step:
                    steps.append(step)
        return steps[:branch] if steps else [raw.strip()]

    def llm_score_state(state: str) -> float:
        """
        Score how promising a reasoning state is.
        Note: in real applications an external evaluator (unit tests,
        rule checks) is usually more reliable than LLM self-evaluation.
        """
        if external_evaluator:
            return external_evaluator(question, state)
        prompt = f"""Evaluate how promising this partial solution is.
Question: {question}
Current reasoning state:
{state}
Consider:
1. Is the reasoning logical and correct so far?
2. Is it making progress toward a solution?
3. Are there obvious errors or dead ends?
Rate from 0 to 10 (10 = very promising, likely to lead to correct answer).
Output only a number.
"""
        s = llm(prompt, temperature=0.0, max_tokens=10).strip()
        # Non-capturing group: with a capturing group, re.findall would
        # return only the fractional part (or an empty string)
        m = re.search(r"\d+(?:\.\d+)?", s)
        return float(m.group()) if m else 5.0  # default to a middling score

    # Initialization
    frontier = [""]  # the initial state is empty
    best_state = ""
    best_score = -1.0
    search_tree = []  # record of the search process
    for depth in range(max_depth):
        candidates = []
        depth_record = {"depth": depth, "states": []}
        for state in frontier:
            next_steps = propose_next_steps(state)
            for step in next_steps:
                # Build the new state
                new_state = (state + "\n" + step).strip()
                # Score the new state
                score = llm_score_state(new_state)
                candidates.append((score, new_state))
                depth_record["states"].append({
                    "state": (new_state[:200] + "...") if len(new_state) > 200 else new_state,
                    "score": score
                })
        search_tree.append(depth_record)
        # Sort and keep the top-k
        candidates.sort(reverse=True, key=lambda x: x[0])
        frontier = [s for _, s in candidates[:beam]]
        # Update the best state
        if candidates and candidates[0][0] > best_score:
            best_score, best_state = candidates[0]
    # Generate the final answer from the best state
    final_prompt = f"""Based on the reasoning below, produce the final answer.
Question: {question}
Reasoning:
{best_state}
Provide a clear, concise final answer.
End with: FINAL: [your final answer]
"""
    final = llm(final_prompt, temperature=0.2, max_tokens=400)
    return {
        "final_text": final,
        "final_answer": extract_final_answer(final),
        "best_state": best_state,
        "best_score": best_score,
        "search_tree": search_tree
    }

def tot_dfs(
    question: str,
    max_depth: int = 5,
    branch: int = 3,
    threshold: float = 3.0
) -> dict:
    """
    Tree-of-Thoughts with a DFS strategy.
    Explores the solution space depth-first and prunes a branch
    when its score falls below the threshold.

    Args:
        question: the question to answer
        max_depth: maximum search depth
        branch: number of branches expanded per state
        threshold: pruning threshold; branches scoring below it are abandoned
    Returns:
        A dict with the final answer and search statistics
    """
    best_result = {"state": "", "score": -1.0}
    visited_count = [0]  # a list, so the nested function can modify it

    def propose_steps(state: str) -> list:
        prompt = f"""Propose {branch} next reasoning steps.
Question: {question}
Current state:
{state if state else "(empty)"}
Return as numbered list."""
        raw = llm(prompt, temperature=0.9, max_tokens=300)
        steps = [l.split(".", 1)[-1].strip()
                 for l in raw.splitlines()
                 if l.strip()[:1].isdigit()]
        return steps[:branch] if steps else [raw.strip()]

    def score_state(state: str) -> float:
        prompt = f"""Rate this partial solution 0-10.
Question: {question}
State: {state}
Output only a number."""
        s = llm(prompt, temperature=0.0, max_tokens=10).strip()
        # Non-capturing group so that the whole number is matched
        m = re.search(r"\d+(?:\.\d+)?", s)
        return float(m.group()) if m else 5.0

    def dfs(state: str, depth: int):
        visited_count[0] += 1
        if depth >= max_depth:
            score = score_state(state)
            if score > best_result["score"]:
                best_result["state"] = state
                best_result["score"] = score
            return
        for step in propose_steps(state):
            new_state = (state + "\n" + step).strip()
            score = score_state(new_state)
            # Pruning: skip low-scoring branches
            if score < threshold:
                continue
            if score > best_result["score"]:
                best_result["state"] = new_state
                best_result["score"] = score
            dfs(new_state, depth + 1)

    dfs("", 0)
    # Generate the final answer
    final = llm(
        f"""Produce final answer based on:
Question: {question}
Reasoning: {best_result['state']}
End with FINAL: ...""",
        temperature=0.2
    )
    return {
        "final_text": final,
        "final_answer": extract_final_answer(final),
        "best_state": best_result["state"],
        "best_score": best_result["score"],
        "states_visited": visited_count[0]
    }

# Usage example
if __name__ == "__main__":
    question = "Using the numbers 1, 5, 6, 7 (each exactly once) with addition, subtraction, multiplication, and division, make 24."
    result = tot_bfs(question, max_depth=3, beam=2, branch=3)
    print("=== BFS Tree-of-Thoughts ===")
    print(f"Best reasoning path:\n{result['best_state']}")
    print(f"\nBest score: {result['best_score']}")
    print(f"\nFinal answer: {result['final_answer']}")

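ToT suits combinatorial problems (the 24 game, Sudoku), planning tasks, game-playing problems (chess, Go), and brainstorming, that is, scenarios that require exploring different directions. When the answer space is extremely large, heuristic pruning may be needed as well. Simple problems do not need it; plain CoT is enough.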
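
For the 24 game, an exact checker can replace the LLM's own scoring of a state. The sketch below answers one question: can the numbers still on the table be combined into 24? It uses exact fractions to avoid floating-point error. The names `can_reach` and `score_remaining` are hypothetical, and parsing a reasoning state into its remaining numbers is left out.

```python
from fractions import Fraction
from itertools import combinations


def _search(nums: list, target: Fraction) -> bool:
    """Try every pair and every operation until one number is left."""
    if len(nums) == 1:
        return nums[0] == target
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [x for k, x in enumerate(nums) if k not in (i, j)]
        results = {a + b, a - b, b - a, a * b}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        if any(_search(rest + [r], target) for r in results):
            return True
    return False


def can_reach(numbers, target: int = 24) -> bool:
    """True if `numbers` (each used exactly once) can be combined into `target`."""
    return _search([Fraction(n) for n in numbers], Fraction(target))


def score_remaining(numbers) -> float:
    """Score on the 0-10 scale used by the ToT evaluators: 10 if still solvable."""
    return 10.0 if can_reach(numbers) else 0.0


print(can_reach([1, 5, 6, 7]))  # e.g. 6 * 5 - 7 + 1
print(can_reach([1, 1, 1, 1]))
```

A checker like this never mis-scores a dead branch, which is the property that LLM self-evaluation cannot guarantee.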

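Limitations: the compute cost is high, because expanding and evaluating nodes takes a large number of LLM calls. LLM self-evaluation is not very reliable, and the quality of the evaluation function directly determines the result. The implementation is considerably more complex than the other methods. Some problems also have no obvious tree structure at all, in which case ToT is a poor fit.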
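
A rough way to budget for ToT is to count LLM calls before running it. The sketch below does this for a layer-by-layer search with a beam, such as `tot_bfs`, under stated assumptions: every proposal call returns exactly `branch` steps, every new state is scored by one LLM call (no external evaluator), and one extra call produces the final answer. The function name `estimate_tot_bfs_calls` is hypothetical.

```python
def estimate_tot_bfs_calls(max_depth: int, beam: int, branch: int) -> int:
    """Estimated number of LLM calls for a beam-pruned, layer-by-layer ToT."""
    calls = 0
    frontier = 1  # the search starts from a single empty state
    for _ in range(max_depth):
        # per frontier state: 1 proposal call + `branch` scoring calls
        calls += frontier * (1 + branch)
        frontier = min(beam, frontier * branch)
    return calls + 1  # plus the final-answer call


print(estimate_tot_bfs_calls(max_depth=3, beam=2, branch=3))
print(estimate_tot_bfs_calls(max_depth=4, beam=3, branch=4))
```

With an external evaluator the scoring calls disappear from the LLM bill, which is one more reason to prefer one when it is available.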

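Iteration: Reflection and Self-Improvement

Reflexion and Self-Refine use the classic "generate, evaluate, improve" loop: the model first produces an initial answer, revises it after receiving feedback, and repeats until the result is satisfactory or the maximum number of rounds is reached.

Is this not how humans learn as well? Few things are done right the first time; improvement comes through repeated feedback.

There is one important pitfall, though: "self-correction" without reliable external feedback can backfire.

Research shows that when a model "self-corrects" based only on its own judgment, it may change a correct answer into a wrong one, may be overconfident in a wrong judgment, and may waste tokens on ineffective edits.

The best practice is therefore to use external feedback sources whenever possible. Code execution (unit tests, error messages) and rule checks (format validation, constraint checks) are the most reliable. Tool calls (calculator, search engine) and human feedback are also good. Cross-validation by another LLM is barely usable. Self-evaluation by the same LLM works worst, because it lacks any external reference.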
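
A rule check is the cheapest kind of external feedback. The sketch below builds an evaluator that returns a (score, feedback) pair from a few format rules: the answer must be a JSON object, contain certain keys, and stay under a length limit. The specific rules and the name `make_format_evaluator` are illustrative assumptions.

```python
import json


def make_format_evaluator(required_keys: list, max_chars: int = 500):
    """Create an evaluator that checks an answer against simple format rules."""
    def evaluator(answer_text: str) -> tuple:
        try:
            obj = json.loads(answer_text)
        except json.JSONDecodeError as exc:
            return 0.0, f"Output is not valid JSON: {exc.msg}"
        if not isinstance(obj, dict):
            return 0.0, "Top-level JSON value must be an object."
        # One check per required key, plus one check for the length limit
        problems = [f"Missing key: {k}" for k in required_keys if k not in obj]
        if len(answer_text) > max_chars:
            problems.append(f"Output longer than {max_chars} characters.")
        total_checks = len(required_keys) + 1
        score = (total_checks - len(problems)) / total_checks
        feedback = "\n".join(problems) if problems else "All checks passed."
        return score, feedback
    return evaluator


check = make_format_evaluator(["final", "steps"])
print(check('{"final": "42"}'))
```

Because the feedback names the exact rule that failed, the model gets something concrete to fix, unlike a vague self-critique.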

def self_refine(
    question: str,
    score_fn,
    rounds: int = 3,
    improvement_threshold: float = 0.1
) -> dict:
    """
    Iteratively improve an answer with Self-Refine.
    Core loop: generate -> evaluate -> improve based on feedback -> repeat

    Args:
        question: the question to answer
        score_fn: evaluation function with the signature
            score_fn(answer_text) -> (score: float, feedback: str)
            - score: a value between 0.0 and 1.0
            - feedback: concrete suggestions for improvement
            An external evaluator is strongly recommended!
        rounds: maximum number of refinement rounds
        improvement_threshold: minimum expected gain per round; when the
            gain is smaller and the score has dropped, the previous answer
            is restored and the loop stops
    Returns:
        A dict with the following keys:
        - final: the final answer text
        - final_answer: the extracted final answer
        - final_score: the final score
        - history: the full refinement history
        - rounds_used: the number of rounds actually used
    """
    # Generate the initial answer
    initial_prompt = f"""Provide a thoughtful answer to this question.
Show your reasoning and end with FINAL: ...
Question: {question}
"""
    answer = llm(initial_prompt, temperature=0.4)
    history = []
    prev_score = -float('inf')
    for round_num in range(rounds):
        # Evaluate the current answer
        score, feedback = score_fn(answer)
        history.append({
            "round": round_num + 1,
            "answer": answer,
            "score": score,
            "feedback": feedback
        })
        # Check whether the improvement is large enough
        if round_num > 0 and (score - prev_score) < improvement_threshold:
            if score < prev_score:
                # Regressed: restore the previous answer and stop
                answer = history[-2]["answer"]
                break
            # Otherwise continue: the gain is small, but nothing got worse
        # Stop early if the score is already very high
        if score >= 0.95:
            break
        prev_score = score
        # Improve the answer based on the feedback
        refine_prompt = f"""Improve your answer based on the feedback below.
Question: {question}
Your current answer:
{answer}
Feedback (score: {score:.2f}/1.00):
{feedback}
Instructions:
1. Keep what is correct in your current answer
2. Fix the issues mentioned in the feedback
3. Make sure not to introduce new errors
4. End with FINAL: ...
Improved answer:
"""
        answer = llm(refine_prompt, temperature=0.3)
    # Final evaluation (the last refined answer is not yet in history)
    final_score, final_feedback = score_fn(answer)
    return {
        "final": answer,
        "final_answer": extract_final_answer(answer),
        "final_score": final_score,
        "history": history,
        "rounds_used": len(history)
    }

# ---- Example evaluation functions ----
def make_code_evaluator(test_cases: list):
    """
    Create a code evaluation function.

    Args:
        test_cases: list of test cases, each an (input, expected_output) pair
    Returns:
        The evaluation function
    """
    def evaluator(code_answer: str) -> tuple:
        # Extract the code block
        code_match = re.search(r"```python\n(.*?)```", code_answer, re.DOTALL)
        if not code_match:
            return 0.0, "No Python code block found. Please wrap your code in ```python ... ```"
        code = code_match.group(1)
        # DANGER: exec runs untrusted code. Use a real sandbox in production!
        # A single namespace with normal builtins is used so that solve can
        # call itself recursively and use functions such as range.
        namespace = {}
        try:
            exec(code, namespace)
        except Exception as e:
            return 0.0, f"Code failed to run: {e}"
        # The code is assumed to define a solve function
        solve = namespace.get("solve")
        if not callable(solve):
            return 0.0, "No 'solve' function found in your code."
        passed = 0
        failed_cases = []
        for inp, expected in test_cases:
            try:
                result = solve(inp)
                if result == expected:
                    passed += 1
                else:
                    failed_cases.append(f"Input: {inp}, Expected: {expected}, Got: {result}")
            except Exception as e:
                failed_cases.append(f"Input: {inp}, Error: {str(e)}")
        score = passed / len(test_cases)
        if failed_cases:
            feedback = "Failed test cases:\n" + "\n".join(failed_cases[:3])  # show at most 3
            if len(failed_cases) > 3:
                feedback += f"\n... and {len(failed_cases) - 3} more failures"
        else:
            feedback = "All test cases passed!"
        return score, feedback
    return evaluator

def make_math_evaluator(correct_answer):
    """
    Create an evaluation function for math answers.

    Args:
        correct_answer: the correct answer
    Returns:
        The evaluation function
    """
    def evaluator(answer_text: str) -> tuple:
        extracted = extract_final_answer(answer_text)
        # Try a numeric comparison first
        try:
            extracted_num = float(re.findall(r"-?\d+\.?\d*", extracted)[0])
            correct_num = float(correct_answer)
            if abs(extracted_num - correct_num) < 0.01:
                return 1.0, "Correct!"
            else:
                return 0.0, f"Incorrect. Your answer: {extracted_num}, Expected: {correct_num}"
        except (IndexError, ValueError, TypeError):
            pass  # no number found: fall back to string comparison
        # String comparison
        if extracted.lower().strip() == str(correct_answer).lower().strip():
            return 1.0, "Correct!"
        else:
            return 0.0, f"Incorrect. Your answer: {extracted}, Expected: {correct_answer}"
    return evaluator

def make_llm_evaluator(criteria: str):
    """
    Create an LLM-based evaluation function
    (not recommended as the only source of evaluation).

    Args:
        criteria: description of the evaluation criteria
    Returns:
        The evaluation function
    """
    def evaluator(answer_text: str) -> tuple:
        prompt = f"""Evaluate this answer based on the following criteria:
Criteria: {criteria}
Answer to evaluate:
{answer_text}
Provide:
1. A score from 0.0 to 1.0
2. Specific feedback on what's wrong and how to improve
Format:
SCORE: [number]
FEEDBACK: [your feedback]
"""
        response = llm(prompt, temperature=0.0)
        try:
            score = float(re.search(r"SCORE:\s*([\d.]+)", response).group(1))
            score = min(1.0, max(0.0, score))
        except (AttributeError, ValueError):
            score = 0.5  # no parsable score: fall back to a neutral value
        try:
            feedback = re.search(r"FEEDBACK:\s*(.+)", response, re.DOTALL).group(1).strip()
        except AttributeError:
            feedback = response
        return score, feedback
    return evaluator

# Usage example
if __name__ == "__main__":
    # Example 1: a coding task
    question = "Write a function solve(n) that returns the factorial of n."
    test_cases = [
        (0, 1),
        (1, 1),
        (5, 120),
        (10, 3628800)
    ]
    result = self_refine(
        question=question,
        score_fn=make_code_evaluator(test_cases),
        rounds=3
    )
    print("=== Self-Refine for Code ===")
    print(f"Final score: {result['final_score']:.2%}")
    print(f"Rounds used: {result['rounds_used']}")
    print("\nRefinement history:")
    for h in result['history']:
        print(f"  Round {h['round']}: score={h['score']:.2f}")
    # Example 2: a math task
    question = "Compute 17 * 23 + 45 - 12"
    correct = 17 * 23 + 45 - 12
    result = self_refine(
        question=question,
        score_fn=make_math_evaluator(correct),
        rounds=2
    )
    print("\n=== Self-Refine for Math ===")
    print(f"Final answer: {result['final_answer']}")
    print(f"Correct answer: {correct}")
    print(f"Final score: {result['final_score']:.2%}")

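Self-Refine suits code generation (unit tests serve as external feedback), formatting tasks (there is a clear specification to check against), constraint satisfaction problems (whether the constraints hold can be verified), and fact checking (verifiable through retrieval). Subjective tasks need human feedback or cross-validation by multiple models. Do not use it when there is no feedback source: pure LLM self-evaluation is unreliable.

Limitations: feedback quality sets the ceiling, and garbage feedback only leads to garbage improvements. The model sometimes flips back and forth between versions, which shows up as oscillation. Every iteration consumes tokens, so the cost adds up. Convergence is not guaranteed either, since the model may simply be unable to use the feedback to make a real improvement.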
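
One simple guard against oscillation is to stop refining as soon as a new answer repeats a recent one. The sketch below compares the latest answer with the few before it after collapsing whitespace. The window size and the name `detect_oscillation` are assumptions; a fuzzier similarity measure may be needed when the model rephrases instead of repeating verbatim.

```python
def detect_oscillation(answers: list, window: int = 4) -> bool:
    """True if the latest answer equals an earlier one among the last
    `window` answers (compared after collapsing whitespace)."""
    if len(answers) < 2:
        return False
    canon = lambda text: " ".join(text.split())
    latest = canon(answers[-1])
    recent = answers[-window:-1]  # the answers just before the latest one
    return any(canon(prev) == latest for prev in recent)


versions = ["use a loop", "use recursion", "use  a loop"]
print(detect_oscillation(versions))
```

In a refinement loop, this check would run on the list of answers produced so far, right after each new answer is generated.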

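Technique Comparison and Selection Guide

Each of the four techniques has its own profile.

CoT: thinks deeper; low token consumption; a single LLM call; simple to implement; no external feedback needed; suited to reasoning-chain problems.

Self-Consistency: samples wider; medium token consumption; N LLM calls; also simple to implement; no external feedback needed; suited to problems with a definite answer.

ToT: explores more; high token consumption; the number of LLM calls grows with branch count times depth (and with the beam width when several states are kept per layer); complex to implement; external feedback optional but recommended; suited to combinatorial and planning problems.

Self-Refine: improves better; medium token consumption; about two LLM calls per round (one to generate, one to evaluate when the evaluator is itself an LLM); medium implementation complexity; external feedback strongly recommended; suited to problems that can be improved iteratively.

The selection logic is as follows. If the task needs step-by-step reasoning, try CoT first, and add Self-Consistency if the results are unstable. If there is a definite answer and reliability matters, go straight to Self-Consistency. Use ToT for combinatorial or search problems. Use Self-Refine when an external feedback source is available. When unsure, start with CoT and decide based on the results.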
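
The selection rules can be written down as a small decision helper. This is only an illustrative encoding of the guidance above, with hypothetical flag names and a hypothetical function name `choose_techniques`; real choices also depend on budget and on measured results.

```python
def choose_techniques(
    *,
    needs_multistep_reasoning: bool = False,
    has_definite_answer: bool = False,
    needs_high_reliability: bool = False,
    is_search_problem: bool = False,
    has_external_feedback: bool = False,
) -> list:
    """Return the techniques suggested by the selection rules, in order."""
    picks = []
    if needs_multistep_reasoning:
        picks.append("CoT")
    if has_definite_answer and needs_high_reliability:
        picks.append("Self-Consistency")
    if is_search_problem:
        picks.append("ToT")
    if has_external_feedback:
        picks.append("Self-Refine")
    # When nothing matches, start with CoT and look at the results
    return picks or ["CoT"]


print(choose_techniques(is_search_problem=True, has_external_feedback=True))
```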

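These techniques can be combined. CoT plus SC uses CoT for every sample and then takes a majority vote. ToT plus SC has ToT generate several final answers and uses SC to choose among them. ToT plus SR uses SR to iteratively improve the best ToT result. Complex tasks may need several techniques chained into a pipeline.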

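def combined_approach(question: str, score_fn) -> str:
    """
    Combine several inference-time techniques.
    Pipeline:
    1. Explore the solution space with ToT
    2. Use Self-Consistency to choose among several ToT results
    3. Use Self-Refine to iteratively improve the final answer
    """
    # Stage 1: ToT exploration (run 3 times)
    tot_results = []
    for _ in range(3):
        result = tot_bfs(question, max_depth=3, beam=2, branch=3)
        tot_results.append(result['final_answer'])
    # Stage 2: Self-Consistency selection
    vote = Counter(tot_results).most_common(1)[0][0]
    # Stage 3: Self-Refine. self_refine generates its own first draft, so the
    # voted answer is handed over as a candidate inside the question text.
    final_result = self_refine(
        question=f"{question}\n\nCandidate answer to verify and improve: {vote}",
        score_fn=score_fn,
        rounds=2
    )
    return final_result['final_answer']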

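Practical Advice

Do not start with the most complex technique. The recommended order is: ask directly first as a baseline, then add a CoT prompt, then add Self-Consistency, and only then consider ToT or Self-Refine.

For Self-Refine and ToT, evaluator quality directly determines the result. Spending time on building a good evaluator matters more than tuning parameters.

Inference-time techniques can improve performance substantially, but the cost rises substantially too. Set an upper limit on the token budget, record the actual consumption of each task, and adjust the investment according to how important the task is.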
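
A token budget with per-task accounting can be sketched in a few lines. The class and function names below (`TokenBudget`, `BudgetExceeded`, `rough_token_count`) are hypothetical, and the four-characters-per-token estimate is only a rough rule of thumb for English text; actual counts should come from the provider's usage data or tokenizer.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push usage past the configured limit."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0
        self.per_task = {}  # task name -> tokens consumed so far

    def charge(self, task: str, tokens: int) -> None:
        """Record `tokens` for `task`, or raise if the limit would be exceeded."""
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"{task}: {tokens} tokens would exceed the limit of {self.limit}"
            )
        self.used += tokens
        self.per_task[task] = self.per_task.get(task, 0) + tokens

    @property
    def remaining(self) -> int:
        return self.limit - self.used


def rough_token_count(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


budget = TokenBudget(limit=100)
budget.charge("cot", 60)
print(budget.remaining, budget.per_task)
```

Checking the budget before each LLM call lets a pipeline degrade gracefully, for example by falling back from ToT to plain CoT when little budget is left.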

Before deploying to production, run A/B tests to find the best trade-off between performance and cost.

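Summary

Inference-time compute represents a new paradigm for unlocking LLM capability. By investing more computation at inference time, the same model can improve noticeably without retraining. The four techniques covered in this article, CoT, Self-Consistency, Tree-of-Thoughts, and Self-Refine, each have their own characteristics and use cases. Understanding their principles and limitations, and choosing the right technique or combination, is a key skill in LLM application development.

As the field develops, more innovative inference-time techniques will appear. The core principle will not change, though: give the model more room to "think" so that it can show its real reasoning ability.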

https://avoid.overfit.cn/post/2bb5bb4e569a4687a272dc6e9fe6809a
