Do retrieval heads speak the same language?

This project analyzes retrieval heads in multilingual LLMs using Needle-in-a-Haystack tasks across English, German, and Chinese. We find that strong retrieval heads are largely language-agnostic and critical for performance. Masking them leads to significant accuracy drops, offering insights for optimizing KV caching and multilingual model efficiency.

Project Summary

1. Introduction

This project investigates retrieval heads in multilingual LLMs. Retrieval heads are attention heads responsible for pulling relevant information from long contexts. We extend existing mechanistic analyses to multilingual settings and show that:

  • 30–40% of retrieval heads are language-specific.
  • Strong retrieval heads tend to be language-agnostic.
  • Masking top retrieval heads significantly degrades model performance.

Our findings offer insights for KV-cache pruning and multilingual model optimization.

2. Related Work

We build on work that identifies induction, suppression, and retrieval heads in monolingual models. Prior multilingual studies reveal latent English-centric behavior in many models. We unify these directions to analyze how retrieval heads behave across languages and translation tasks.

3. Experimental Setup

We use the Needle-in-a-Haystack (NIAH) task to locate retrieval heads in Qwen-2.5-3B-Instruct and Phi-3.5-Mini-Instruct. A NIAH prompt consists of a long distractor context (the haystack), a needle (the answer sentence hidden inside it), and a query about the needle. We calculate the retrieval score of each head h as:

|g_h ∩ k| / |k|

where k is the set of needle tokens and g_h is the set of needle tokens that head h copies from the context during decoding.
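A minimal sketch of how this score can be tracked per head during greedy decoding is shown below. It assumes Hugging Face-style per-step attention tensors; the bookkeeping structure (`copy_events`) and the copy criterion (top-attended position inside the needle plus an exact token match) follow the common retrieval-head recipe but are illustrative rather than the exact implementation used here.

```python
def update_copy_events(attentions, step_token_id, context_ids,
                       needle_span, copy_events):
    """Record, for one decoding step, which heads 'copied' a needle token.

    attentions: per-layer attention tensors of shape [batch, heads, 1, ctx_len]
                for the current step (e.g. from output_attentions=True).
    step_token_id: id of the token generated at this step.
    context_ids: 1-D tensor of context token ids.
    needle_span: (start, end) positions of the needle inside the context.
    copy_events: dict (layer, head) -> set of needle positions copied so far.
    """
    start, end = needle_span
    for layer, attn in enumerate(attentions):
        top_pos = attn[0, :, -1, :].argmax(dim=-1)  # most-attended position per head
        for head, pos in enumerate(top_pos.tolist()):
            # a "copy": top attention falls inside the needle AND the emitted
            # token equals the context token at that position
            if start <= pos < end and context_ids[pos].item() == step_token_id:
                copy_events.setdefault((layer, head), set()).add(pos)

def retrieval_scores(copy_events, needle_positions):
    """Retrieval score per head: |g_h ∩ k| / |k|."""
    k = set(needle_positions)
    return {head: len(g_h & k) / len(k) for head, g_h in copy_events.items()}
```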

We extend this to multilingual setups using translated haystacks from Wikipedia dumps and evaluate across English, German, and Chinese.
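A sketch of how one multilingual NIAH prompt can be assembled is shown below; the German needle/query strings and the placeholder haystack are hypothetical examples, not the actual data drawn from the Wikipedia dumps.

```python
def build_niah_prompt(haystack_paragraphs, needle, query, depth=0.5):
    """Insert the needle at a relative depth into the haystack, then append the query."""
    idx = int(len(haystack_paragraphs) * depth)
    paragraphs = haystack_paragraphs[:idx] + [needle] + haystack_paragraphs[idx:]
    return "\n\n".join(paragraphs) + "\n\n" + query

# Hypothetical German example (the real haystacks come from Wikipedia dumps):
german_paragraphs = ["(paragraph from the German Wikipedia ...)"] * 200
prompt = build_niah_prompt(
    german_paragraphs,
    needle="Die beste Aktivität in San Francisco ist ein Picknick im Dolores Park.",
    query="Was ist die beste Aktivität in San Francisco?",
    depth=0.5,
)
```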

Figure 1 shows the retrieval-accuracy drop for German; Figure 2 illustrates the multilingual NIAH setup.

4. Results

We categorize retrieval heads by their overlap across languages, their strength, and their placement in the transformer. Key findings include:

Finding 1: Language-Agnostic vs Specific Heads

Only 40–70% of retrieval heads are shared across all three languages.

Shared heads dominate strong retrieval behavior, while language-specific heads tend to appear in later transformer layers.
[Figure: Top: Qwen-2.5-3B-Instruct; bottom: Phi-3.5-Mini-Instruct.]

Finding 2: Strength Correlates with Generality

Strong heads (retrieval score ≥ 0.5) tend to be shared across languages. Weaker heads are often language-specific.
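A sketch of the categorization behind Findings 1 and 2, assuming per-language score dictionaries keyed by (layer, head) like the one produced above; the 0.5 threshold comes from the text, while the minimum-activity cutoff is a hypothetical parameter.

```python
STRONG = 0.5  # "strong" retrieval-score threshold from Finding 2

def categorize_heads(scores_by_lang, active=0.1):
    """Split heads into shared vs language-specific retrieval heads.

    scores_by_lang: dict language -> dict (layer, head) -> retrieval score.
    active: minimum score for a head to count as a retrieval head at all
            (hypothetical cutoff, not specified in the write-up).
    """
    retrieval_heads = {
        lang: {h for h, s in scores.items() if s >= active}
        for lang, scores in scores_by_lang.items()
    }
    shared = set.intersection(*retrieval_heads.values())
    language_specific = set.union(*retrieval_heads.values()) - shared
    return shared, language_specific

def strong_heads(scores_by_lang, threshold=STRONG):
    """Heads whose maximum score across languages reaches the 'strong' threshold."""
    union = set().union(*(scores.keys() for scores in scores_by_lang.values()))
    return {
        h for h in union
        if max(scores.get(h, 0.0) for scores in scores_by_lang.values()) >= threshold
    }
```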

Finding 3: Retrieval Head Similarity Tracks Language Distance

Rank differences between languages' retrieval heads are most pronounced for weaker, language-specific attention heads, and the rankings correlate more strongly between English and German than between English and Chinese, mirroring the linguistic distance between the language pairs.
[Figure: Top: Qwen-2.5-3B-Instruct; bottom: Phi-3.5-Mini-Instruct.]
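One way to quantify this ranking similarity is a Spearman rank correlation over per-head retrieval scores for each language pair; the use of scipy here is an assumption about tooling rather than the project's actual analysis code.

```python
from scipy.stats import spearmanr

def head_rank_correlation(scores_a, scores_b):
    """Spearman correlation between two languages' retrieval-score rankings,
    computed over the heads that appear in either score dictionary."""
    heads = sorted(set(scores_a) | set(scores_b))
    a = [scores_a.get(h, 0.0) for h in heads]
    b = [scores_b.get(h, 0.0) for h in heads]
    return spearmanr(a, b).correlation

# Expected pattern from Finding 3 (illustrative comparison, not measured values):
# head_rank_correlation(scores_by_lang["en"], scores_by_lang["de"]) >
# head_rank_correlation(scores_by_lang["en"], scores_by_lang["zh"])
```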

Finding 4: Retrieval + Translation Fails in Hybrid Tasks

When the haystack and needle are in English but the response is expected in Chinese, Qwen-2.5 fails to activate its retrieval heads. This suggests that retrieval behavior identified with standard NIAH probes does not transfer well to hybrid retrieval-translation tasks.

Finding 5: Masking Top Retrieval Heads Degrades All Languages

ROUGE scores drop more when language-agnostic heads are masked than when language-specific heads are masked, and the degradation appears in all three languages.
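A minimal sketch of the masking intervention, assuming a Hugging Face-style decoder where each attention block's output projection is `self_attn.o_proj` and its input is laid out as [heads × head_dim] (true for typical Qwen and Phi checkpoints, but worth verifying for the exact model): zero the slice belonging to each selected head with a forward pre-hook.

```python
def mask_heads(model, heads_to_mask, head_dim):
    """Zero the contribution of selected attention heads via forward pre-hooks.

    heads_to_mask: iterable of (layer, head) pairs, e.g. the top retrieval heads.
    Returns the hook handles so the intervention can be undone later.
    """
    by_layer = {}
    for layer, head in heads_to_mask:
        by_layer.setdefault(layer, []).append(head)

    handles = []
    for layer_idx, heads in by_layer.items():
        o_proj = model.model.layers[layer_idx].self_attn.o_proj  # assumed module path

        def pre_hook(module, args, heads=heads):
            hidden = args[0].clone()  # [batch, seq, num_heads * head_dim]
            for h in heads:
                hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
            return (hidden,)

        handles.append(o_proj.register_forward_pre_hook(pre_hook))
    return handles

# Undo the masking once the ROUGE evaluation is done:
# for handle in handles: handle.remove()
```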

5. Conclusion and Future Work

We find that strong retrieval heads tend to be language-agnostic and essential across tasks. These insights can guide KV-caching and pruning strategies in multilingual systems. Future work should explore how retrieval heads emerge during training and how to adapt retrieval-translation experiments to better reflect real-world QA settings.