Do retrieval heads speak the same language?
This project analyzes retrieval heads in multilingual LLMs using Needle-in-a-Haystack tasks across English, German, and Chinese. We find that strong retrieval heads are largely language-agnostic and critical for performance. Masking them leads to significant accuracy drops, offering insights for optimizing KV caching and multilingual model efficiency.
Project Summary

1. Introduction
This project investigates retrieval heads in multilingual LLMs. Retrieval heads are attention heads responsible for pulling relevant information from long contexts. We extend existing mechanistic analyses to multilingual settings and show that:
- 30–40% of retrieval heads are language-specific.
- Strong retrieval heads tend to be language-agnostic.
- Masking top retrieval heads significantly degrades model performance.
Our findings offer insights for KV-cache pruning and multilingual model optimization.
2. Related Work
We build on work that identifies induction, suppression, and retrieval heads in monolingual models. Prior multilingual studies reveal latent English-centric behavior in many models. We unify these directions to analyze how retrieval heads behave across languages and translation tasks.
3. Experimental Setup
We use the Needle-in-a-Haystack (NIAH) task to locate retrieval heads in Qwen-2.5-3B and Phi-3.5-Mini. NIAH consists of a long context (the haystack), a query, and a needle (the answer embedded somewhere in the haystack). Following prior work, we compute a head's retrieval score as the fraction of needle tokens it copies from the context:

retrieval_score(h) = |g_h ∩ k| / |k|

where g_h is the set of needle tokens that head h attends to and copies during decoding, and k is the full set of needle tokens. We extend this to multilingual setups using translated haystacks built from Wikipedia dumps and evaluate in English, German, and Chinese.
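The score above can be sketched in a few lines. Here `attended` and `needle` are hypothetical sets of token positions standing in for g_h and k; in practice they would come from the model's attention patterns during decoding.

```python
def retrieval_score(attended: set, needle: set) -> float:
    """Fraction of needle tokens copied by a head: |g_h ∩ k| / |k|."""
    if not needle:
        return 0.0  # no needle tokens: score is undefined, treat as 0
    return len(attended & needle) / len(needle)


# Example: a head that copies 2 of the 3 needle tokens scores 2/3.
score = retrieval_score(attended={3, 5, 7}, needle={5, 7, 9})
```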

4. Results
We categorize retrieval heads by their overlap across languages, their strength, and their placement in the transformer. Key findings include:
Finding 1: Language-Agnostic vs Specific Heads
Only 40–70% of retrieval heads are shared across all three languages.
Finding 2: Strength Correlates with Generality
Strong heads (retrieval score ≥ 0.5) tend to be shared across languages. Weaker heads are often language-specific.
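The categorization behind Findings 1 and 2 can be sketched as a set operation over per-language scores. The scores and head indices `(layer, head)` below are purely illustrative, as is the 0.5 threshold taken from the finding above; only heads that are strong in every language count as language-agnostic.

```python
# Hypothetical per-language retrieval scores, keyed by (layer, head).
scores = {
    "en": {(12, 3): 0.8, (15, 1): 0.6, (9, 4): 0.2},
    "de": {(12, 3): 0.7, (15, 1): 0.5, (20, 2): 0.3},
    "zh": {(12, 3): 0.9, (7, 0): 0.6},
}
THRESHOLD = 0.5  # "strong" cutoff from Finding 2

# Strong heads per language, then split into shared vs language-specific.
strong = {lang: {h for h, s in heads.items() if s >= THRESHOLD}
          for lang, heads in scores.items()}
shared = set.intersection(*strong.values())          # language-agnostic
specific = {lang: heads - shared for lang, heads in strong.items()}
```

With these toy numbers, `(12, 3)` is strong in all three languages and lands in `shared`, while `(15, 1)` is strong only in English and German and stays language-specific.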

Finding 3: Retrieval Head Similarity Tracks Language Distance
Finding 4: Retrieval + Translation Fails in Hybrid Tasks
When the haystack and needle are in English but the response is expected in Chinese, Qwen-2.5 fails to activate its retrieval heads. This suggests that current retrieval-head identification methods do not generalize well to hybrid retrieval-translation tasks.

Finding 5: Masking Top Retrieval Heads Degrades All Languages
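The masking experiment can be illustrated with a framework-agnostic sketch. In a real run one would zero a head's contribution inside the attention module (e.g. via a forward hook); the function below is a hypothetical stand-in that assumes per-head output vectors are simply summed into the layer output, so masking a head means dropping its term from the sum.

```python
def masked_layer_output(head_outputs, masked_heads):
    """Sum per-head output vectors, zeroing heads in masked_heads.

    head_outputs: list of equal-length vectors, one per attention head.
    masked_heads: set of head indices to ablate.
    """
    dim = len(head_outputs[0])
    total = [0.0] * dim
    for h, out in enumerate(head_outputs):
        if h in masked_heads:
            continue  # ablated head contributes nothing
        for i, v in enumerate(out):
            total[i] += v
    return total


# Masking head 1 removes its [2.0, 2.0] contribution from the sum.
out = masked_layer_output([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], {1})
```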

5. Conclusion and Future Work
We find that strong retrieval heads tend to be language-agnostic and essential across tasks. These insights can guide KV-caching and pruning strategies in multilingual systems. Future work should explore how retrieval heads emerge during training and how to adapt retrieval-translation experiments to better reflect real-world QA settings.