Attention-Aware DPO for Reducing Hallucinations in Multi-Image QA

We introduce an attention-aware, multi-image-augmented preference alignment method that improves accuracy by 8.5%, and further enhance inference-time alignment through adaptive attention scaling, yielding a 10% performance gain over the base model.

Harsh Sutariya*, Jeet Patel*, Shaswat Patel*, Vishvesh Trivedi*
Equal contribution. Authors listed in ascending lexicographical order.

[Report] [Code] [Slides]

Project Summary

This project explores hallucination reduction in Large Vision-Language Models (LVLMs) when answering queries over multiple images. We propose Attention-aware Direct Preference Optimization (AA-DPO) and extend AdaptVis for inference-time optimization, achieving performance improvements in both alignment and answer quality.

1. Introduction

LVLMs excel at single-image reasoning, but multi-image tasks expose alignment flaws. We augment Direct Preference Optimization (DPO) with attention-aware penalties that discourage attention to the wrong images. Our method improves accuracy by 8.5%, and by 10% when combined with AdaptVis-based inference-time scaling.

The model attends more strongly to the target image after fine-tuning.

2. Related Work

Prior work mitigates hallucinations via preference learning (e.g., PPO, DPO) or contrastive/inference-time decoding. However, most approaches focus on single-image scenarios. Our method introduces an explicit attention-based training signal for multi-image QA.

3. Method

We modify the Direct Preference Optimization (DPO) loss to incorporate an attention penalty that discourages misallocated focus on irrelevant images. Specifically, the loss function combines:

  • L_DPO: Encourages higher probability for preferred answers
  • L_attn: Penalizes low attention on the correct image

The final objective is L_total = L_DPO + λ · L_attn, with attention weights extracted from decoder layers 14 to 22. These layers consistently show high focus on the target image.

Top: Training pipeline overview. Bottom: Layers 14–22 show strongest attention to target image.
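
Below is a minimal PyTorch-style sketch of how such a combined objective could be computed. The helper's inputs (per-layer attention maps, a mask over the target image's visual tokens, pre-computed log-probabilities) and the exact form of the attention penalty are illustrative assumptions, not the project's exact implementation.

```python
import torch
import torch.nn.functional as F

def aa_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                attn_maps, target_image_mask,
                beta=0.1, lam=0.5, layers=range(14, 23)):
    """Sketch of an attention-aware DPO objective: standard DPO term plus a
    penalty for low attention mass on the target image (lam is illustrative)."""
    # Standard DPO term: log-sigmoid of the scaled preference margin.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    l_dpo = -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

    # Attention penalty: average attention mass on the target image's visual
    # tokens over decoder layers 14-22; penalize the shortfall from 1.
    # attn_maps: list of [batch, heads, query_len, key_len] tensors, one per layer.
    # target_image_mask: [batch, key_len] mask over the correct image's tokens.
    mass = []
    for layer in layers:
        a = attn_maps[layer].mean(dim=1)   # average over heads
        a = a.mean(dim=1)                  # average over query positions
        mass.append((a * target_image_mask.float()).sum(-1))
    target_mass = torch.stack(mass, dim=0).mean(dim=0)      # [batch]
    l_attn = (1.0 - target_mass).clamp(min=0).mean()

    return l_dpo + lam * l_attn
```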

Inference-Time Optimization

At inference, we extend AdaptVis, which scales attention scores based on model confidence. High confidence leads to sharper focus on visual tokens; low confidence causes smoothing to reduce overcommitment to incorrect regions.
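
The sketch below illustrates this confidence-gated scaling. The confidence signal (e.g., the maximum next-token probability), the threshold, and the sharpen/smooth coefficients are assumptions for illustration; the actual AdaptVis gating may differ.

```python
import torch

def adaptvis_scale(attn_logits, image_token_mask, confidence,
                   threshold=0.6, alpha_sharpen=1.5, alpha_smooth=0.7):
    """Confidence-gated attention scaling (sketch of the AdaptVis idea).

    attn_logits: [batch, heads, query_len, key_len] pre-softmax attention scores
    image_token_mask: [key_len] bool mask marking visual tokens
    confidence: [batch] model confidence (e.g., max next-token probability)
    """
    # Sharpen attention on visual tokens when the model is confident,
    # smooth it when the model is unsure, to avoid overcommitting to
    # the wrong image region.
    scale = torch.where(confidence > threshold,
                        torch.full_like(confidence, alpha_sharpen),
                        torch.full_like(confidence, alpha_smooth))
    scale = scale.view(-1, 1, 1, 1)                  # broadcast over heads/positions
    scaled = torch.where(image_token_mask.view(1, 1, 1, -1),
                         attn_logits * scale, attn_logits)
    return scaled.softmax(dim=-1)
```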

Evaluation Strategy

To assess performance, we use the PixMo dataset, covering Sequence, Collage, and Pic-in-Pic scenarios. Each format includes 500 queries with 2–8 images. We evaluate models using a rubric scored by a strong LLM (Gemini, 2025) on:

  • Relevance: Does the answer directly address the question?
  • Accuracy: Is the response factually correct?
  • Clarity: Is the answer coherent and unambiguous?
  • Completeness: Does the answer cover all necessary information?

We prompted the LLM with a carefully crafted rubric (shown below) to ensure consistent, interpretable evaluations.

Evaluation prompt used for LLM-as-a-Judge scoring on relevance, accuracy, clarity, and completeness.
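
As a rough illustration of how this rubric scoring can be driven programmatically, the following sketch feeds a placeholder rubric to a judge LLM and parses the four 1–5 scores. The rubric text here is not the project's actual prompt (shown above), and `judge_fn` stands in for whatever client is used to call the judge model.

```python
import json
import re

# Placeholder rubric skeleton; the project's actual prompt is shown in the figure above.
RUBRIC = (
    "You are grading an answer to a multi-image question.\n"
    "Score it from 1 to 5 on each criterion: relevance, accuracy, clarity, completeness.\n"
    "Respond with a JSON object whose keys are the four criteria and values are integers.\n\n"
    "Question: {question}\nReference answer: {reference}\nCandidate answer: {candidate}"
)

def score_answer(judge_fn, question, reference, candidate):
    """judge_fn: any callable that sends a prompt to the judge LLM and returns its reply text."""
    reply = judge_fn(RUBRIC.format(question=question, reference=reference, candidate=candidate))
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)   # pull out the JSON object
    return {k: int(v) for k, v in json.loads(match.group(0)).items()}

def mean_scores(all_scores):
    """Average per-query rubric scores into dataset-level metrics."""
    keys = ("relevance", "accuracy", "clarity", "completeness")
    return {k: sum(s[k] for s in all_scores) / len(all_scores) for k in keys}
```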

4. Results

We evaluate on the PixMo dataset across Sequence, Collage, and Pic-in-Pic formats. Metrics include Relevance, Accuracy, Clarity, and Completeness (1–5 scale), as judged by a strong LLM.

Our method (w/ AdaptVis) achieves top scores across all evaluation metrics.

AA-DPO improves average accuracy to 3.20 (from 2.95 baseline), with attention ratios better aligned to target images. With AdaptVis, performance improves further, especially on ambiguous visual compositions.

5. Conclusion

Our pipeline enhances LVLM alignment for multi-image inputs by integrating attention penalties into training. Future directions include benchmarking on complex datasets like MANTIS and exploring attention-based loss variants. A difficulty-tiered benchmark using CLIP embeddings is also proposed.
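
As a rough sketch of the proposed CLIP-based difficulty tiering, one could embed the images in a query and use their mean pairwise similarity as a difficulty proxy (more similar images plausibly make it harder to attend to the correct one). The checkpoint and thresholds below are illustrative assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def difficulty_tier(images, easy_thresh=0.55, hard_thresh=0.8):
    """Tier a multi-image query by how visually similar its images are.

    Assumption: mean pairwise CLIP similarity is a reasonable difficulty proxy;
    the thresholds are illustrative, not calibrated values from the project.
    """
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sim = feats @ feats.T                          # pairwise cosine similarities
    n = sim.shape[0]
    mean_sim = (sim.sum() - n) / (n * (n - 1))     # exclude the diagonal (self-similarity = 1)
    if mean_sim < easy_thresh:
        return "easy"
    return "medium" if mean_sim < hard_thresh else "hard"
```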

Side-by-side output comparison between the baseline, MIA-DPO, and our method. Our method attends to the correct image, producing more precise, fine-grained captions.