Lessons from DeepSeek-R1: Test-Time Compute Ends the Blind Faith in Parameter Stacking

2025-12-13 22:45:48  Source: deephub

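For the past two years, the whole industry seemed caught in a parameter race, and every model launch told the same story: "We stacked more GPUs and used more data, and the model now has 175 billion parameters instead of the previous 100 billion."

This habit of mind led people to believe that intelligence could only be "baked in" during training, and that once a model shipped, its capability ceiling was welded shut.

In 2025, that assumption broke down.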

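First, DeepSeek-R1 showed that an open-weights model can display striking reasoning ability when it is given time to think. Then OpenAI o3 arrived and swept the major benchmarks by spending minutes rather than milliseconds on a single problem.

It suddenly became clear that we had been optimizing the wrong variable. The breakthrough lies not in making models bigger, but in teaching them to pause, think, and verify before producing an answer.

This is Test-Time Compute, the most important architecture-level paradigm shift in data science since the Transformer.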

Inference-Side Scaling Laws: A Deeper Impact than GPT-4

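We used to treat the Chinchilla scaling laws as gospel and assumed that performance was strictly bounded by the training budget. Newer research suggests that inference scaling (compute spent after training) follows its own power-law curve, and often a steeper one.

Several key results point to this trend:

arXiv:2408.03314 reports that optimizing an LLM's test-time compute is often more effective than simply scaling parameters. A small model that is allowed to "think" for 10 seconds can outperform a model 14 times larger that answers instantly.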

Real-world numbers back this up. DeepSeek-R1 was released in January 2025. Its pure reinforcement learning variant raised its score on the AIME math benchmark from 15.6% to 71.0% simply by learning to self-verify, and to 86.7% once majority voting was added. In April, OpenAI o3 reached 96.7% on AIME and 25.2% on Frontier Math, at a cost of more than $1.00 per complex task.
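Majority voting itself is a small amount of code: sample several answers to the same question and keep the most frequent one. The sketch below assumes the final answers have already been extracted from each reasoning path; the sample values are made up for illustration and are not taken from DeepSeek-R1.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical final answers extracted from 5 sampled reasoning paths.
sampled = ["42", "17", "42", "42", "8"]
winner, share = majority_vote(sampled)
print(f"majority answer: {winner} ({share:.0%} of votes)")
```

Voting needs no verifier at all, which makes it the cheapest form of test-time compute, but it only helps when the correct answer is also the most common one.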

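The conclusion is clear: the return on compute spent at inference time is overtaking the return on compute spent in training.

The New "Thinking" Landscape

By the end of 2025 OpenAI was no longer the only player, and the technical approaches had split into three paths.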

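A dose of cold water is needed here. Google's Gemini 2.5 Flash Thinking shows a transparent reasoning process, but when I asked it to count the R's in "strawberry", it confidently laid out its logic and concluded: two. Showing the process does not make the result correct. Transparency is welcome, but without a verification loop it accomplishes nothing.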

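On efficiency, DeepSeek-R1's architecture is worth a closer look. Although it is a behemoth with 671 billion parameters, its Mixture-of-Experts (MoE) design activates only about 37 billion of them for each token. Picture a giant workshop stocked with 600 tools, where the craftsman picks up only the 3 best suited to the job at hand. This mechanism makes it 95% cheaper than o1 while keeping its reasoning ability dense. It is this MoE economy that makes running complex, multi-step Test-Time Compute loops on a very large model commercially viable.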
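To make the "pick only a few tools" idea concrete, here is a toy top-k routing sketch. It is not DeepSeek-R1's actual router: the eight scalar "experts", the gate scores, and the function names are all hypothetical, and in a real MoE layer the gate scores are computed from the input by a learned router rather than hard-coded.

```python
import math


def top_k_route(gate_logits, k):
    """Return indices of the k largest gate logits and their softmax weights."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top_idx = order[:k]
    peak = gate_logits[top_idx[0]]  # subtract the max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in top_idx]
    total = sum(exps)
    return top_idx, [e / total for e in exps]


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the k routed experts on x and mix their outputs by gate weight."""
    top_idx, weights = top_k_route(gate_logits, k)
    output = sum(w * experts[i](x) for i, w in zip(top_idx, weights))
    return output, top_idx


# Toy setup: 8 scalar "experts", each just scales its input differently.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 3.0, 0.0, 0.5, -0.3, 1.2]  # made-up router scores

y, used = moe_forward(10.0, experts, gate_logits, k=2)
print(f"experts run: {used} ({len(used)} of {len(experts)})")
print(f"active fraction from the figures above: {37e9 / 671e9:.1%}")
```

The other six experts are never evaluated, which is where the savings come from. With the figures quoted above, 37 billion of 671 billion parameters is roughly 5.5% of the model per token.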

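An Off-the-Shelf Engineering Pattern: Best-of-N with Verification

Test-Time Compute does not require a ten-million-dollar training budget, or even o3's weights. The core architecture is simple enough that an ordinary developer can reproduce it.

It comes down to three steps: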

Divergent Generation: raise the temperature so the model produces N different reasoning paths for the same problem.
Self-Verification: have the model itself (or a stronger verifier) critique each candidate.
Selection: pick the answer with the highest confidence.

Academia calls this Best-of-N with Verification, and it lines up closely with the paper "s1: Simple test-time scaling" (arXiv:2501.19393).

All you need is any mainstream LLM API (OpenAI, DeepSeek, or Llama 3 all work), a few cents of credit, and a short Python script.

The code is as follows:

import os
from typing import List

import numpy as np
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


# 1. Define structure for "System 2" thinking
class StepValidation(BaseModel):
    is_correct: bool = Field(description="Does the solution logically satisfy ALL constraints?")
    confidence_score: float = Field(description="0.0 to 1.0 confidence score")
    critique: str = Field(description="Brief analysis of potential logic gaps or missed constraints")


# 2. Divergent Thinking (Generate)
def generate_candidates(prompt: str, n: int = 5) -> List[str]:
    """Generates N distinct solution paths using high temperature."""
    candidates = []
    print(f"Generating {n} candidate solutions with gpt-4o-mini...")
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # Small, fast generator
            messages=[
                {"role": "system", "content": "You are a thoughtful problem solver. Show your work step by step."},
                {"role": "user", "content": prompt},
            ],
            temperature=0.8,  # High temp for diverse reasoning paths
        )
        # content can be None (e.g. on a refusal), so fall back to an empty string
        candidates.append(response.choices[0].message.content or "")
    return candidates


# 3. Convergent Thinking (Verify)
def verify_candidate(problem: str, candidate: str) -> float:
    """
    Uses the SAME small model to critique its own work.
    This proves that 'time to think' > 'model size'.
    """
    verification_prompt = f"""
    You are a strict logic reviewer.
    Review the solution below for logical fallacies or missed constraints.
    PROBLEM: {problem}
    PROPOSED SOLUTION:
    {candidate}
    Check your work. Does the solution actually fit the constraints?
    Rate the confidence from 0.0 (Wrong) to 1.0 (Correct).
    """
    # parse() returns the reply already validated against the StepValidation schema
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # Using the small model as a Verifier
        messages=[{"role": "user", "content": verification_prompt}],
        response_format=StepValidation,
    )
    parsed = response.choices[0].message.parsed
    # parsed is None when the model refuses; treat that as zero confidence
    return parsed.confidence_score if parsed is not None else 0.0


# 4. Main loop
def system2_solve(prompt: str, effort_level: int = 5) -> str:
    print(f"System 2 Activated: Effort Level {effort_level}")
    candidates = generate_candidates(prompt, n=effort_level)
    scores = []
    for i, cand in enumerate(candidates):
        score = verify_candidate(prompt, cand)
        scores.append(score)
        print(f" Path #{i+1} Confidence: {score:.2f}")
    best_index = int(np.argmax(scores))  # first index wins on ties
    print(f"Selected Path #{best_index+1} with confidence {scores[best_index]}")
    return candidates[best_index]


# 5. Execute
if __name__ == "__main__":
    # The "Cognitive Reflection Test" (Cyberpunk Edition)
    # System 1 instinct: 500 credits (WRONG)
    # System 2 logic: 250 credits (CORRECT)
    problem = """
    A corporate server rack and a cooling unit cost 2500 credits in total.
    The server rack costs 2000 credits more than the cooling unit.
    How much does the cooling unit cost?
    """
    answer = system2_solve(problem, effort_level=5)  # Increased effort to catch more failures
    print("\nFINAL ANSWER:\n", answer)

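A Real Test: The "Server Rack" Trap

I ran this script on a variant of the Cognitive Reflection Test, a kind of logic puzzle designed to lure brains (and AI) into fast, wrong answers.

The question: "The total is 2500, and the rack costs 2000 more than the cooling unit. How much does the cooling unit cost?"
System 1 (intuition) almost always blurts out 500 (because 2500 - 2000 = 500).
System 2 (logic) works out 250 (x + x + 2000 = 2500).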
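The arithmetic behind the two answers can be checked mechanically. The helper names below are illustrative only and are not part of the article's script: one function solves x + (x + difference) = total, the other tests a proposed pair of prices against both constraints.

```python
def solve_cooling_unit(total, difference):
    """Solve x + (x + difference) = total for x, the cheaper item's price."""
    return (total - difference) / 2


def satisfies_constraints(cooling, rack, total=2500, difference=2000):
    """True only if the prices add up to total AND differ by difference."""
    return cooling + rack == total and rack - cooling == difference


cooling = solve_cooling_unit(total=2500, difference=2000)
print(f"cooling unit: {cooling:.0f}, rack: {cooling + 2000:.0f}")
print("System 2 answer (250) valid:", satisfies_constraints(250, 2250))
print("System 1 answer (500) valid:", satisfies_constraints(500, 2000))
```

The intuitive answer gets the total right but leaves a gap of only 1500 between the two prices, so it fails the second constraint.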

Running the Best-of-N script gave a very typical result:

System 2 Activated: Effort Level 5
Generating 5 candidate solutions...
Path #1 Confidence: 0.10 <-- Model fell for the trap (500 credits)
Path #2 Confidence: 1.00 <-- Model derived the math (250 credits)
Path #3 Confidence: 0.00 <-- Model fell for the trap
...
Selected Path #2 with confidence 1.0

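Look at Path #1. In an ordinary application, this wrong answer of 500 credits is exactly what the user would receive. By generating 5 paths, we found that 40% of the results fell into the trap. The key point is that the same small model, acting as verifier, spotted the logical flaw and pulled out Path #2, which contains the correct derivation.

In this run, simply "thinking a little longer" pushed a model with 60% reliability up to 100%.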
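A back-of-the-envelope calculation shows why sampling more paths helps. It rests on two idealized assumptions that the experiment above does not guarantee: each path is independently correct with probability 0.6 (the rate observed in this run), and the verifier always picks a correct path whenever one exists.

```python
def prob_at_least_one_correct(p, n):
    """Chance that at least one of n independent samples is correct."""
    return 1 - (1 - p) ** n


# p = 0.6 is the single-path success rate seen in the run above.
for n in (1, 3, 5, 10):
    print(f"N={n:2d}: {prob_at_least_one_correct(0.6, n):.4f}")
```

Under these assumptions, five paths contain at least one correct answer about 99% of the time. That is the ceiling for Best-of-N at this effort level, and a real verifier that sometimes misjudges candidates will land somewhat below it.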

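The Compute Bill

This is certainly more expensive. But is it worth it?

The cost of my experiment did rise 40-fold, but the absolute figure was only 3 cents. Those 3 cents bought a 22% gain in accuracy. If you are doing medical reasoning or debugging a production system, that is dirt cheap. If you are just building a casual chatbot, it really is too expensive.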

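The New Model: Inference Budget

Looking ahead to 2026, the focus of architecture discussions will move from "whose model is smarter" to "what is our inference budget".

Future decisions may look like this:

System 1 (Standard API): latency must stay under 2 seconds, or the task is creative writing.
System 2 (DeepSeek-R1 / o3): accuracy comes first (math, code, logic), and 10 to 30 seconds of latency is tolerable.
System 3 (Custom Loops): critical decisions that need formal guarantees and must rely on multi-agent voting and verification.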
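The three tiers above can be read as a routing rule. The sketch below is one possible encoding, not a prescribed policy: the `Task` fields, the tier names, and the choice to send latency-constrained tasks to System 1 even when accuracy matters are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    max_latency_s: float          # how long the caller can wait
    accuracy_critical: bool       # math, code, logic
    needs_formal_guarantee: bool  # high-stakes decisions


def choose_tier(task):
    """Map a task's requirements to one of the three tiers listed above."""
    if task.needs_formal_guarantee:
        return "system3_custom_loop"      # multi-agent voting plus verification
    if task.accuracy_critical and task.max_latency_s >= 10:
        return "system2_reasoning_model"  # e.g. DeepSeek-R1 or o3
    return "system1_standard_api"         # fast path, including creative writing


print(choose_tier(Task(max_latency_s=1.5, accuracy_critical=False, needs_formal_guarantee=False)))
print(choose_tier(Task(max_latency_s=20, accuracy_critical=True, needs_formal_guarantee=False)))
print(choose_tier(Task(max_latency_s=60, accuracy_critical=True, needs_formal_guarantee=True)))
```

Putting the formal-guarantee check first means the most expensive tier is chosen whenever the stakes demand it, regardless of latency.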

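I suggest copying the code above and running it. Pick a logic puzzle or obscure bug that your current LLM regularly gets wrong, and watch it correct itself in real time.

You will find that we should stop treating an LLM as an "oracle" and start treating it as a "reasoning engine" with a configurable budget. The data scientists who understand inference-time compute are the ones who will define the next generation of AI products in 2026.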

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (arXiv:2408.03314).
s1: Simple test-time scaling (arXiv:2501.19393).
DeepSeek AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv:2501.12948).
https://avoid.overfit.cn/post/a2f09be2577e48b59d2f9f2fc5e6549c

Author: Cagatay Akcam
