Do retrieval heads speak the same language?

This project analyzes retrieval heads in multilingual LLMs using Needle-in-a-Haystack tasks across English, German, and Chinese. We find that strong retrieval heads are largely language-agnostic and critical for performance. Masking them leads to significant accuracy drops, offering insights for optimizing KV caching and multilingual model efficiency.

Project Summary

1. Introduction

This project investigates retrieval heads in multilingual LLMs. Retrieval heads are attention heads responsible for pulling relevant information from long contexts. We extend existing mechanistic analyses to multilingual settings and show that:

  • 30–40% of retrieval heads are language-specific.
  • Strong retrieval heads tend to be language-agnostic.
  • Masking top retrieval heads significantly degrades model performance.

Our findings offer insights for KV-cache pruning and multilingual model optimization.

2. Related Work

We build on work that identifies induction, suppression, and retrieval heads in monolingual models. Prior multilingual studies reveal latent English-centric behavior in many models. We unify these directions to analyze how retrieval heads behave across languages and translation tasks.

3. Experimental Setup

We use the Needle-in-a-Haystack (NIAH) task to locate retrieval heads in Qwen-2.5-3B-Instruct and Phi-3.5-Mini-Instruct. A NIAH prompt consists of a long distractor context (the haystack), a needle (the answer sentence hidden inside it), and a query about the needle. We calculate the retrieval score of each head h as:

|g_h ∩ k| / |k|

where k is the set of needle tokens and g_h is the set of needle tokens that head h copies from the context during decoding.
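A minimal sketch of how this score can be tracked per head during greedy decoding is shown below. It assumes Hugging Face-style per-step attention tensors; the bookkeeping structure (`copy_events`) and the copy criterion (top-attended position inside the needle plus an exact token match) follow the common retrieval-head recipe but are illustrative rather than the exact implementation used here.

```python
def update_copy_events(attentions, step_token_id, context_ids,
                       needle_span, copy_events):
    """Record, for one decoding step, which heads 'copied' a needle token.

    attentions: per-layer attention tensors of shape [batch, heads, 1, ctx_len]
                for the current step (e.g. from output_attentions=True).
    step_token_id: id of the token generated at this step.
    context_ids: 1-D tensor of context token ids.
    needle_span: (start, end) positions of the needle inside the context.
    copy_events: dict (layer, head) -> set of needle positions copied so far.
    """
    start, end = needle_span
    for layer, attn in enumerate(attentions):
        top_pos = attn[0, :, -1, :].argmax(dim=-1)  # most-attended position per head
        for head, pos in enumerate(top_pos.tolist()):
            # a "copy": top attention falls inside the needle AND the emitted
            # token equals the context token at that position
            if start <= pos < end and context_ids[pos].item() == step_token_id:
                copy_events.setdefault((layer, head), set()).add(pos)

def retrieval_scores(copy_events, needle_positions):
    """Retrieval score per head: |g_h ∩ k| / |k|."""
    k = set(needle_positions)
    return {head: len(g_h & k) / len(k) for head, g_h in copy_events.items()}
```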

We extend this to multilingual setups using translated haystacks from Wikipedia dumps and evaluate across English, German, and Chinese.
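A sketch of how one multilingual NIAH prompt can be assembled is shown below; the German needle/query strings and the placeholder haystack are hypothetical examples, not the actual data drawn from the Wikipedia dumps.

```python
def build_niah_prompt(haystack_paragraphs, needle, query, depth=0.5):
    """Insert the needle at a relative depth into the haystack, then append the query."""
    idx = int(len(haystack_paragraphs) * depth)
    paragraphs = haystack_paragraphs[:idx] + [needle] + haystack_paragraphs[idx:]
    return "\n\n".join(paragraphs) + "\n\n" + query

# Hypothetical German example (the real haystacks come from Wikipedia dumps):
german_paragraphs = ["(paragraph from the German Wikipedia ...)"] * 200
prompt = build_niah_prompt(
    german_paragraphs,
    needle="Die beste Aktivität in San Francisco ist ein Picknick im Dolores Park.",
    query="Was ist die beste Aktivität in San Francisco?",
    depth=0.5,
)
```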

Figure 1 shows the retrieval-accuracy drop for German; Figure 2 illustrates the multilingual NIAH setup.

4. Results

We categorize retrieval heads by their overlap across languages, their strength, and their placement in the transformer. Key findings include:

Finding 1: Language-Agnostic vs Specific Heads

Only 40–70% of retrieval heads are shared across all three languages.

Shared heads dominate strong retrieval behavior, while language-specific heads tend to appear in later transformer layers.
[Figure: Top: Qwen-2.5-3B-Instruct; bottom: Phi-3.5-Mini-Instruct.]

Finding 2: Strength Correlates with Generality

Strong heads (retrieval score ≥ 0.5) tend to be shared across languages. Weaker heads are often language-specific.
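A sketch of the categorization behind Findings 1 and 2, assuming per-language score dictionaries keyed by (layer, head) like the one produced above; the 0.5 threshold comes from the text, while the minimum-activity cutoff is a hypothetical parameter.

```python
STRONG = 0.5  # "strong" retrieval-score threshold from Finding 2

def categorize_heads(scores_by_lang, active=0.1):
    """Split heads into shared vs language-specific retrieval heads.

    scores_by_lang: dict language -> dict (layer, head) -> retrieval score.
    active: minimum score for a head to count as a retrieval head at all
            (hypothetical cutoff, not specified in the write-up).
    """
    retrieval_heads = {
        lang: {h for h, s in scores.items() if s >= active}
        for lang, scores in scores_by_lang.items()
    }
    shared = set.intersection(*retrieval_heads.values())
    language_specific = set.union(*retrieval_heads.values()) - shared
    return shared, language_specific

def strong_heads(scores_by_lang, threshold=STRONG):
    """Heads whose maximum score across languages reaches the 'strong' threshold."""
    union = set().union(*(scores.keys() for scores in scores_by_lang.values()))
    return {
        h for h in union
        if max(scores.get(h, 0.0) for scores in scores_by_lang.values()) >= threshold
    }
```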

Finding 3: Retrieval Head Similarity Tracks Language Distance

Rank differences between languages' retrieval heads are most pronounced for weaker, language-specific attention heads, and the rankings correlate more strongly between English and German than between English and Chinese, mirroring the linguistic distance between the language pairs.
[Figure: Top: Qwen-2.5-3B-Instruct; bottom: Phi-3.5-Mini-Instruct.]
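One way to quantify this ranking similarity is a Spearman rank correlation over per-head retrieval scores for each language pair; the use of scipy here is an assumption about tooling rather than the project's actual analysis code.

```python
from scipy.stats import spearmanr

def head_rank_correlation(scores_a, scores_b):
    """Spearman correlation between two languages' retrieval-score rankings,
    computed over the heads that appear in either score dictionary."""
    heads = sorted(set(scores_a) | set(scores_b))
    a = [scores_a.get(h, 0.0) for h in heads]
    b = [scores_b.get(h, 0.0) for h in heads]
    return spearmanr(a, b).correlation

# Expected pattern from Finding 3 (illustrative comparison, not measured values):
# head_rank_correlation(scores_by_lang["en"], scores_by_lang["de"]) >
# head_rank_correlation(scores_by_lang["en"], scores_by_lang["zh"])
```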

Finding 4: Retrieval + Translation Fails in Hybrid Tasks

When the haystack and needle are in English but the response is expected in Chinese, Qwen-2.5 fails to activate its retrieval heads. This suggests that retrieval behavior identified with standard NIAH probes does not transfer well to hybrid retrieval-translation tasks.

Finding 5: Masking Top Retrieval Heads Degrades All Languages

ROUGE scores drop more when language-agnostic heads are masked than when language-specific heads are masked, and the degradation appears in all three languages.
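A minimal sketch of the masking intervention, assuming a Hugging Face-style decoder where each attention block's output projection is `self_attn.o_proj` and its input is laid out as [heads × head_dim] (true for typical Qwen and Phi checkpoints, but worth verifying for the exact model): zero the slice belonging to each selected head with a forward pre-hook.

```python
def mask_heads(model, heads_to_mask, head_dim):
    """Zero the contribution of selected attention heads via forward pre-hooks.

    heads_to_mask: iterable of (layer, head) pairs, e.g. the top retrieval heads.
    Returns the hook handles so the intervention can be undone later.
    """
    by_layer = {}
    for layer, head in heads_to_mask:
        by_layer.setdefault(layer, []).append(head)

    handles = []
    for layer_idx, heads in by_layer.items():
        o_proj = model.model.layers[layer_idx].self_attn.o_proj  # assumed module path

        def pre_hook(module, args, heads=heads):
            hidden = args[0].clone()  # [batch, seq, num_heads * head_dim]
            for h in heads:
                hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
            return (hidden,)

        handles.append(o_proj.register_forward_pre_hook(pre_hook))
    return handles

# Undo the masking once the ROUGE evaluation is done:
# for handle in handles: handle.remove()
```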

5. Conclusion and Future Work

We find that strong retrieval heads tend to be language-agnostic and essential across tasks. These insights can guide KV-caching and pruning strategies in multilingual systems. Future work should explore how retrieval heads emerge during training and how to adapt retrieval-translation experiments to better reflect real-world QA settings.