How AI Really Computes ‘36 + 59’: Anthropic’s Breakthrough in Opening the Black Box

Understanding the "Black Box" of AI Reasoning

For years, language models like GPT and Claude have impressed users with their ability to generate fluent responses, perform calculations, and mimic reasoning. But a fundamental question has lingered: Do they actually "think" like humans, or are they just stringing together likely words based on patterns? Thanks to a groundbreaking paper by Anthropic, we now have the clearest picture yet of what's happening inside a model's head.

Anthropic, Language Models, AI Interpretability, Circuit Tracing, Attribution Graph, Claude 3.5 Haiku, Chain-of-Thought, AI Reasoning, Explainable AI, AI Safety, AI Black Box, AI Research, Generative AI


Circuit Tracing: Mapping the Neural Logic

In their recent research titled "Circuit Tracing: Revealing Computational Graphs in Language Models", Anthropic introduces a novel interpretability method. Their approach replaces parts of the model with trackable modules, which mimic the behavior of the original layers. By training these modules and analyzing their behavior, the team constructs attribution graphs—maps showing how internal features contribute to an answer.

A Peek Inside: How AI Adds 36 + 59

One of the most compelling demonstrations visualizes how Claude 3.5 Haiku computes "36 + 59". The model's output—"Add 6 and 9 to get 15. Add 30 and 50 to get 80. Add them to get 95."—sounds human-like. But internally, the model jumps directly to the answer through approximations such as “40 + 50 ≈ 90” and corrects from there, without actually performing step-by-step addition.

Rethinking Explainability and Trust in AI

Anthropic's work shows that many of the chain-of-thought explanations generated by LLMs are not accurate representations of internal logic. They may be rationalizations tailored for human readers, rather than actual reasoning pathways. This underscores the value of attribution graphs as tools for AI safety and transparency.

More Than Just Next-Word Prediction

Although LLMs are trained on next-token prediction, their actual behavior is more complex. Anthropic's findings reveal that models often internally know the answer in advance and then generate explanations for human consumption. This challenges the idea that LLMs only work word-by-word.

Chain-of-Thought Is a Scripted Performance

Anthropic’s research reveals that CoT outputs are not real-time reasoning steps. They are like a scripted performance: coherent explanations written after reaching a conclusion, not actual evidence of the thinking process.

Attribution Graphs: Google Maps for Neurons

Attribution graphs function like Google Maps for neurons. Each feature node represents a concept like "approximate sum" or "rhyme", and edges trace how they interact to form reasoning. This makes internal computation traceable and interpretable for researchers.

Planning Ahead: AI's Unexpected Foresight

Claude 3.5 Haiku doesn’t always generate text sequentially. In poetry tasks, for instance, it chooses a rhyming word before constructing the preceding line, showing evidence of forward planning.

Language-Independent Reasoning

The model activates shared conceptual features across different languages, suggesting it works in a language-agnostic conceptual space. This has important implications for multilingual AI systems.

Hallucinations and Misaligned Objectives

Even sophisticated models like Claude may hallucinate—generate incorrect yet plausible explanations. This becomes riskier when hidden objectives are introduced via fine-tuning. Attribution graphs help researchers detect and mitigate these failures.

Conclusion: The Black Box Is Cracking Open

This research by Anthropic is a milestone in making AI interpretable and safe. It shows that what AI says is not always what it does, and offers tools to begin opening the black box.

References

Circuit Tracing: Revealing Computational Graphs in Language Models - Anthropic, 2025

繁體中文摘要(擴寫版)

AI 模型是否真的會「思考」,一直是科技界與公眾所關心的核心問題。語言模型如 GPT 或 Claude 經常表現出擬人化的推理過程,但這些過程是否真的反映其內部邏輯,一直無從得知。近期由 AI 公司 Anthropic 發表的研究《Circuit Tracing: Revealing Computational Graphs in Language Models》,透過一種稱為「電路追蹤(Circuit Tracing)」的新方法,讓我們第一次得以觀察大型語言模型內部的「思考流程」,為可解釋性與 AI 安全帶來突破性的進展。

研究團隊以 Claude 3.5 Haiku 為對象,將模型中的 MLP 層替換為可追蹤模組,訓練這些模組模仿原模型的輸出行為,進而建立出一張名為「貢獻圖(Attribution Graph)」的推理結構圖。這張圖中的每個節點代表模型中重要的內部特徵,而每條邊則代表這些特徵對最終決策所做的貢獻。

在「36 + 59」的案例中,模型看似使用人類邏輯(6+9、30+50),實際上則是直接激活「40+50 ≈ 90」的估算概念,再進行誤差修正,得出正確答案 95。這說明模型產出的 Chain-of-Thought 敘述,其實是「說給人聽的故事」,而非真實的思考歷程。

值得注意的是,雖然語言模型的訓練目標是「預測下一個 token」,但實際運作上遠不只是逐字計算。研究發現,模型經常在生成回答前,內部早已完成推理與概念整合,並在此基礎上才寫出整段語句。這打破了過去我們對模型僅為「統計預測器」的印象。

研究也發現模型具備預先規劃能力。在創作押韻詩句時,它會先決定韻尾再構思開頭,展現前瞻性的結構安排。更令人驚訝的是,模型在面對不同語言提問時,會啟用相似概念特徵,反映出跨語言共通的抽象思考框架。

不過,Anthropic 也指出模型存在幻覺風險與錯誤路徑。一旦微調過程中引入隱藏目標,模型行為可能偏離預期,這對信任與安全性構成挑戰。貢獻圖提供了可觀察這些潛在偏差的重要工具。

總體而言,這項研究代表開啟 AI 黑盒的重要一步,未來在 AI 解釋性、安全性與決策可預測性上的應用潛力不容小覷。

Post a Comment

0 Comments