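GPT‑5.2 is our strongest model to date for math and science.
One of our hopes for strong AI is that it will accelerate scientific research for the benefit of everyone, helping researchers explore more ideas, test hypotheses faster, and turn discoveries into real-world impact.
Over the past year, we have worked closely with scientists across math, physics, biology, and computer science to understand where AI can help and where it still falls short. Last month, we published a paper compiling early case studies across math, physics, biology, computer science, astronomy, and materials science, showing how GPT‑5 has already begun contributing to scientific work. With GPT‑5.2, we are seeing these capabilities become more consistent and more reliable.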
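Strong performance on high-precision tasks
GPT‑5.2 Pro and GPT‑5.2 Thinking are our strongest models to date in science and mathematics.
Strong mathematical reasoning is a foundation for reliability in scientific and technical work. It lets a model follow multi-step logic, keep quantities dimensionally consistent, and avoid subtle errors that can compound in real analyses, from simulations and statistics to forecasting and modeling. Gains on benchmarks such as FrontierMath reflect not a single narrow skill but stronger general reasoning and abstraction, which carry over directly into scientific workflows such as coding, data analysis, and experimental design.
These capabilities are also closely tied to progress toward general intelligence. A system that can reason abstractly with consistency, stay coherent across long chains of thought, and generalize across domains shows the core traits of AGI: not task-specific tricks, but broad, transferable reasoning that can have real impact on science, engineering, and real-world decision-making.
We believe GPT‑5.2 Pro and GPT‑5.2 Thinking are currently the best models for supporting and accelerating scientific research. On GPQA Diamond, a graduate-level, Google-proof Q&A benchmark, GPT‑5.2 Pro scores 93.2%, followed closely by GPT‑5.2 Thinking at 92.4%.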
[Chart: GPQA Diamond (science questions), accuracy. GPT‑5.2 Pro: 93.2%; GPT‑5.2 Thinking: 92.4%; GPT‑5.1 Thinking: 88.1%.]
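In the GPQA Diamond evaluation, models answer multiple-choice questions in physics, chemistry, and biology. No tools were enabled, and reasoning effort was set to the maximum.
On FrontierMath (Tier 1–3), an expert-level mathematics evaluation, GPT‑5.2 Thinking set a new state of the art, solving 40.3% of problems.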
[Chart: FrontierMath (Tier 1–3) (advanced mathematics), accuracy. GPT‑5.2 Thinking: 40.3%; GPT‑5.1 Thinking: 31.0%.]
In the FrontierMath evaluation, models solve expert-level mathematics problems. A Python tool was enabled, and reasoning effort was set to the maximum.
Case study
GPT‑5.2 is not only strong at graduate-level science problems. We now regularly see our frontier models contributing solutions to previously unsolved—and increasingly subtle—questions in mathematics and the sciences.
In this case study, we describe how GPT‑5.2 Pro helped resolve an open research problem in statistical learning theory, documented in a new paper, On Learning-Curve Monotonicity for Maximum Likelihood Estimators.
The question (“If you collect more data, do your results reliably get better?”) shows up any time you fit a model from data. You can draw a learning curve that tracks average error as you add more examples. In the best case, the curve is monotone. More data means less error, every step of the way. That is the behavior people hope for, and often assume.
But over the last few years, researchers have learned that this intuition can fail. A line of work kicked off by an open problem posed at the Conference on Learning Theory (COLT) in 2019 by Viering, Mey, and Loog showed that the answer is often no. Even very simple, well-behaved toy setups can have non-monotonic learning curves, where adding data increases expected error. That surprise triggered a wave of follow-up papers. They expanded the list of settings where these reversals happen and proposed increasingly elaborate methods designed to restore monotone behavior.
Still, one of the most basic cases remained unresolved. What happens in the cleanest textbook situation, where the statistical model is actually correct and the data follow the familiar bell curve pattern, with a known mean but unknown standard deviation? Researchers already knew that small changes to this setup could break monotonic behavior. But the answer remained unknown in this core case.
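To make the setup concrete, here is a minimal sketch of a learning curve for Gaussian data with a known mean and an unknown spread. It is only an illustration under stated assumptions: it measures error as the squared error of the maximum likelihood variance estimate, which may not be the error measure studied in the paper, and the function names and parameters (`variance_mle`, `simulated_risk`, `exact_risk`, `trials`, `seed`) are hypothetical. For this particular error measure the expected error has the closed form 2σ⁴/n, so the simulated curve can be checked against it.

```python
import random


def variance_mle(samples, mean):
    """Maximum likelihood estimate of the variance when the mean is known."""
    return sum((x - mean) ** 2 for x in samples) / len(samples)


def simulated_risk(n, mean=0.0, sigma=1.0, trials=4000, seed=0):
    """Monte Carlo estimate of the expected squared error of the variance MLE.

    Assumption: squared error on the variance is used purely for illustration;
    the paper may analyze a different notion of error.
    """
    rng = random.Random(seed)
    true_var = sigma ** 2
    total = 0.0
    for _ in range(trials):
        samples = [rng.gauss(mean, sigma) for _ in range(n)]
        total += (variance_mle(samples, mean) - true_var) ** 2
    return total / trials


def exact_risk(n, sigma=1.0):
    """Closed-form expected squared error of the variance MLE: 2 * sigma^4 / n."""
    return 2.0 * sigma ** 4 / n


if __name__ == "__main__":
    # Print one point of the learning curve per sample size.
    for n in (1, 2, 4, 8, 16):
        print(n, round(simulated_risk(n), 3), round(exact_risk(n), 3))
```

Under this illustrative error measure the curve falls at every step as n grows. The sketch shows what a monotone learning curve looks like in this setting; it does not reproduce the paper's proof or its results.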
Our new paper demonstrates that in this clean setting, intuition prevails: learning is predictably improved by more data, rather than behaving in surprising or unstable ways. What makes this paper unusual is how the proof was obtained. The authors did not work out a strategy and then ask the model to fill in steps. They did not provide intermediate arguments or a proof outline. Instead, they asked GPT‑5.2 Pro to solve the open problem directly, and then carefully verified the proof, including review and validation by external subject-matter experts.
The authors then asked simple follow-up questions to see how far the idea could go. GPT‑5.2 Pro extended the result beyond the original problem to higher dimensional settings and other common statistical models. Throughout, the human role stayed focused on verification and clear writing, rather than supplying mathematical scaffolding.
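Looking ahead
This result points to a useful direction for how AI systems can support scientific research, particularly in domains with axiomatic theoretical foundations such as mathematics and theoretical computer science. In settings like these, frontier models can help explore proofs, test hypotheses, and surface connections that would otherwise take substantial human effort to uncover.
At the same time, these systems are not independent researchers. Expert judgment, verification, and deep domain understanding remain essential. Even highly capable models can make mistakes or rely on assumptions that have not been explicitly verified. But they can also produce detailed, well-structured arguments that merit careful study and refinement by researchers. Making reliable progress with AI therefore depends on workflows that emphasize verification, transparency, and collaboration.
Viewed as a case study, this result illustrates an emerging mode of research. Models like GPT‑5.2 can serve as tools that support mathematical reasoning and accelerate early-stage exploration, while responsibility for correctness, interpretation, and context remains with researchers. Used carefully, such systems may help streamline important parts of theoretical work without weakening the central role of human judgment in scientific inquiry.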

