NOTE
In the early hours of March 4, after posting "me stepping down. bye my beloved qwen," Junyang Lin went silent on social media for three weeks.
Today he published his first long post on X (Twitter) since his departure.
https://x.com/JustinLin610/status/2037116325210829168
In the post he does not discuss why he left, and he does not respond to rumors about where he is headed. The entire piece does one thing: it lays out his judgment on the next phase of AI.
From "making models think longer" to "making models think while they act."
The full original text follows.
Opening
The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.
That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.
1. What o1 and R1 actually taught us
The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.
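The verifiable-reward idea above can be made concrete with a minimal sketch: instead of a preference score, the reward is a deterministic pass/fail check of a sampled completion against held-out unit tests. All names here (`run_candidate`, `CASES`) are illustrative, not from any lab's actual stack.

```python
# Minimal sketch of a "verifiable reward": execute a model's code
# completion against unit tests and reward correctness, not plausibility.

def run_candidate(source: str, func_name: str, cases) -> float:
    """Run candidate code in a scratch namespace; score 1.0 only if
    every test case passes. Any crash or wrong answer gives 0.0."""
    ns: dict = {}
    try:
        exec(source, ns)                      # compile + run the completion
        fn = ns[func_name]
        for args, expected in cases:
            if fn(*args) != expected:
                return 0.0
    except Exception:
        return 0.0                            # crashes are simply reward 0
    return 1.0

# Two sampled "completions" for the same prompt: one correct, one buggy.
GOOD = "def add(a, b):\n    return a + b\n"
BAD  = "def add(a, b):\n    return a - b\n"
CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

rewards = [run_candidate(c, "add", CASES) for c in (GOOD, BAD)]
print(rewards)  # deterministic: correct code gets 1.0, buggy code gets 0.0
```

The point of the sketch is the property the post names: the signal is deterministic, stable, and cheap to scale, unlike asking annotators which answer they prefer.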
Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.
2. The real hard problem was never "merging thinking and instruct"
At the beginning of 2025, many of us in Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.
Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.
But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.
We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.
These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.
Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.
Other labs chose the opposite route. Anthropic publicly argued for an integrated model philosophy: Claude 3.7 Sonnet was introduced as a hybrid reasoning model where users could choose ordinary responses or extended thinking, and API users could set a thinking budget. Anthropic explicitly said they believed reasoning should be an integrated capability rather than a separate model. GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, coding, and agent capabilities; DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid inference.
The key question is whether the merge is organic. If thinking and instruct are merely co-located inside one checkpoint but still behave like two awkwardly stitched personalities, the product experience remains unnatural. A truly successful merge requires a smooth spectrum of reasoning effort. The model should be able to express multiple levels of effort, and ideally choose among them adaptively. GPT-style effort control points toward this: a policy over compute, rather than a binary switch.
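The "policy over compute, rather than a binary switch" framing can be sketched as a function from estimated difficulty to a thinking-token budget along a smooth spectrum. The difficulty heuristic, budget cap, and quadratic ramp below are invented purely for illustration; a real system would learn this allocation policy rather than hard-code it.

```python
# Hedged sketch: reasoning effort as a smooth allocation of compute,
# not an on/off thinking switch. All constants here are made up.

def thinking_budget(difficulty: float, max_budget: int = 32_768) -> int:
    """Interpolate a token budget from a difficulty estimate in [0, 1].
    0.0 -> answer immediately (no visible thinking),
    1.0 -> spend the full budget on a genuinely hard problem."""
    difficulty = min(max(difficulty, 0.0), 1.0)
    if difficulty < 0.1:          # trivial prompt: skip thinking entirely
        return 0
    # quadratic ramp: frugal on medium prompts, generous on hard ones
    return int(max_budget * difficulty ** 2)

for d in (0.05, 0.3, 0.7, 1.0):
    print(d, thinking_budget(d))
```

The interesting design question the post raises is who sets `difficulty`: the user (low/medium/high toggles), or the model itself, adaptively, from the prompt and context.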
3. Anthropic's direction as a useful course correction
Anthropic's public framing around Claude 3.7 and Claude 4 was restrained. They emphasized integrated reasoning, user-controlled thinking budgets, real-world tasks, coding quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 extended that by allowing reasoning to interleave with tool use, while Anthropic simultaneously emphasized coding, long-running tasks, and agent workflows as primary goals.
Producing a longer reasoning trace doesn't automatically make a model more intelligent. In many cases, excessive visible reasoning signals weak allocation. If the model is trying to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress, or failing to act. Anthropic's trajectory suggested a more disciplined view: thinking should be shaped by the target workload. If the target is coding, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the target is agent workflows, then thinking should improve execution quality over long horizons rather than producing impressive intermediate prose.
This emphasis on targeted utility points toward something larger: we are moving from the era of training models to the era of training agents. We made this explicit in the Qwen3 blog, writing that "we are transitioning from an era focused on training models to one centered on training agents," and linking future RL advances to environmental feedback for long-horizon reasoning. An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is defined by closed-loop interaction with the world.
4. What "agentic thinking" actually means
Agentic thinking is a different optimization target. Reasoning thinking is usually judged by the quality of internal deliberation before a final answer: can the model solve the theorem, write the proof, produce the correct code, or pass the benchmark. Agentic thinking is about whether the model can keep making progress while interacting with an environment.
The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?" Agentic thinking has to handle several things that pure reasoning models can mostly avoid:
+ When to stop deliberating and start acting
+ Which tool to call, and in what order
+ Making use of environment feedback that may be partial and noisy
+ Revising the plan after a failure
+ Keeping a coherent thread across many turns and many tool calls
Agentic thinking means a model that reasons through action.
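The loop described above can be made concrete with a toy closed-loop agent. The environment and policy below are stand-ins invented for illustration; the point is the shape of the loop (observe, act, revise from feedback), not any real agent API.

```python
# Toy "think in order to act" loop: the agent makes progress only by
# interacting, because the environment reveals information one step at a time.

def run_agent(policy, env, max_steps: int = 10):
    """Interleave deciding and acting until the environment reports done."""
    observation, trace = env.reset(), []
    for _ in range(max_steps):
        action = policy(observation)          # think just enough to pick an action
        observation, done = env.step(action)  # feedback from the world
        trace.append(action)
        if done:                              # stop deliberating: task solved
            break
    return trace

class GuessEnv:
    """Find a hidden number; feedback is only an updated (low, high) range."""
    def __init__(self, target=13, low=0, high=100):
        self.target, self.low, self.high = target, low, high
    def reset(self):
        return (self.low, self.high)
    def step(self, guess):
        if guess == self.target:
            return (self.low, self.high), True
        if guess < self.target:
            self.low = guess + 1
        else:
            self.high = guess - 1
        return (self.low, self.high), False

def bisect_policy(obs):
    low, high = obs
    return (low + high) // 2                  # revise the plan from feedback

trace = run_agent(bisect_policy, GuessEnv(), max_steps=10)
print(trace[-1])  # converges to the hidden number through interaction
```

No amount of isolated monologue solves this task, because the target is only observable through the environment's responses: a small illustration of why interaction, not trace length, is the optimization target.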
5. Why agentic RL infrastructure is harder
Once the objective shifts from solving benchmark problems to solving interactive tasks, the RL stack changes. The infrastructure used for classical reasoning RL isn't enough. In reasoning RL, you can often treat rollouts as mostly self-contained trajectories with relatively clean evaluators. In agentic RL, the policy is embedded inside a larger harness: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and orchestration frameworks. The environment is no longer a static verifier; it's part of the training system.
This creates a new systems requirement: training and inference must be more cleanly decoupled. Without that decoupling, rollout throughput collapses. Consider a coding agent that must execute generated code against a live test harness: the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below the GPU utilization you would expect from classical reasoning RL. Adding tool latency, partial observability, and stateful environments amplifies these inefficiencies. The result is that experimentation slows and becomes painful long before you reach the capability levels you are targeting.
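The decoupling argument can be sketched in miniature: rollout workers that pay environment latency and a trainer that consumes trajectories communicate only through a queue, so no single slow tool call stalls the learner. Everything here is a toy stand-in for a real distributed train/serve split.

```python
# Sketch: decouple rollout generation from training via a shared queue,
# so slow environment feedback doesn't starve the training side.

import queue
import threading
import time

trajectory_queue: "queue.Queue" = queue.Queue(maxsize=64)

def rollout_worker(worker_id: int, episodes: int):
    for ep in range(episodes):
        time.sleep(0.001)                       # stands in for env/tool latency
        trajectory_queue.put((worker_id, ep))   # completed trajectory

def trainer(total: int):
    seen = []
    while len(seen) < total:
        seen.append(trajectory_queue.get())     # consume whichever arrives first
    return seen

workers = [threading.Thread(target=rollout_worker, args=(i, 5)) for i in range(4)]
for w in workers:
    w.start()
batch = trainer(total=20)
for w in workers:
    w.join()
print(len(batch))  # 20 trajectories, gathered without blocking on any one env
```

In a coupled design, the trainer would wait on each environment in turn; here throughput is bounded by aggregate worker progress, which is the property the post says collapses when train and inference are not cleanly separated.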
The environment itself also becomes a first-class research artifact. In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, realism, coverage, difficulty, diversity of states, richness of feedback, exploit resistance, and scalability of rollout generation. Environment-building has started to become a real startup category rather than a side project. If the agent is being trained to operate in production-like settings, then the environment is part of the core capability stack.
6. The next frontier is more useful thinking
My expectation is that agentic thinking will become the dominant form of thinking. I think it may eventually replace much of the old static-monologue version of reasoning thinking: excessively long, isolated internal traces that try to compensate for lack of interaction by emitting more and more text. Even on very difficult math or coding tasks, a genuinely advanced system should have the right to search, simulate, execute, inspect, verify, and revise. The objective is to solve problems robustly and productively.
The hardest challenge in training such systems is reward hacking. As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous. A model with search might learn to look up answers directly during RL. A coding agent might exploit future information in a repository, misuse logs, or discover shortcuts that invalidate the task. An environment with hidden leaks can make the policy look superhuman while actually training it to cheat. This is where the agent era becomes much more delicate than the reasoning era. Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization. We should expect the next serious research bottlenecks to come from environment design, evaluator robustness, anti-cheating protocols, and more principled interfaces between policy and world. Still, the direction is clear. Tool-enabled thinking is simply more useful than isolated thinking, and has a far better chance of improving real productivity.
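One crude form of the anti-cheating gate the post calls for can be sketched directly: if the final answer already appears verbatim in what a search tool returned, the trajectory is discarded rather than rewarded. Real exploit detection is far subtler than a substring check; this only shows where such a gate sits in the reward path.

```python
# Illustrative reward-hacking guard: zero out credit for trajectories
# where a tool visibly leaked the answer, so RL can't learn to "look it up."

def credit_trajectory(tool_outputs: list, answer: str, task_reward: float) -> float:
    """Return the task reward unless the answer string was leaked by a tool."""
    answer = answer.strip()
    leaked = bool(answer) and any(answer in out for out in tool_outputs)
    return 0.0 if leaked else task_reward

honest = credit_trajectory(["docs about sorting algorithms"], "42", 1.0)
cheating = credit_trajectory(["...the answer is 42, per the test file..."], "42", 1.0)
print(honest, cheating)  # 1.0 0.0
```

A substring check is trivially evadable (paraphrased leaks, partial answers), which is exactly why the post expects evaluator robustness and environment design to become serious research bottlenecks rather than plumbing.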
Agentic thinking will also mean harness engineering. The core intelligence will increasingly come from how multiple agents are organized: an orchestrator that plans and routes work, specialized agents that act like domain experts, and sub-agents that execute narrower tasks while helping control context, avoid pollution, and preserve separation between different levels of reasoning. The future is a shift from training models to training agents, and from training agents to training systems.
智能體式思考也意味著 harness 工程會變得越來越重要。核心智能會越來越多地取決于多個 Agent 怎么組織:誰來編排分工,誰當領域專家,誰執行具體任務同時幫忙管上下文、防污染。從訓練模型到訓練智能體,再從訓練智能體到訓練系統
Closing
The first phase of the reasoning wave established something important: RL on top of language models can produce qualitatively stronger cognition when the feedback signal is reliable and the infrastructure can support it.
The deeper transition is from reasoning thinking to agentic thinking: from thinking longer to thinking in order to act. The core object of training has shifted. It is the model-plus-environment system, or more concretely, the agent and the harness around it. That changes what research artifacts matter most: model architecture and training data, yes, but also environment design, rollout infrastructure, evaluator robustness, and the interfaces through which multiple agents coordinate. It changes what "good thinking" means: the most useful trace for sustaining action under real-world constraints, rather than the longest or most visible one.
It also changes where the competitive edge will come from. In the reasoning era, the edge came from better RL algorithms, stronger feedback signals, and more scalable training pipelines. In the agentic era, the edge will come from better environments, tighter train-serve integration, stronger harness engineering, and the ability to close the loop between a model's decisions and the consequences those decisions produce.
Originally published on X (Twitter) by Junyang Lin (林俊旸).
Translated and compiled by 賽博禪心.
Notice: The content above (including the pictures and videos if any) is uploaded and posted by a user of NetEase Hao, which is a social media platform and only provides information storage services.