Dynamic Fusion with Intra- and Inter-modality Attention Flow for Visual Question Answering

Sauce : https://arxiv.org/abs/1812.05252

Novelty : The inter- and intra-modality relations had never been jointly investigated in a unified framework for solving the VQA problem. The authors argue that, for VQA, the intra-modality relations within each modality are complementary to the inter-modality relations, which were mostly ignored by existing VQA methods.

Overall Approach of DFAF (Dynamic Fusion with intra- and inter-modality Attention Flow) :
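The full DFAF block diagram is given in the paper; as a rough illustration of the core idea (not the paper's exact architecture), the sketch below runs one intra-modality self-attention pass within each modality followed by one inter-modality attention pass between image regions and question words. The single-head attention, tensor names and dimensions are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attend(query, key, value):
    """Scaled dot-product attention (illustrative, single head)."""
    d = query.size(-1)
    weights = F.softmax(query @ key.transpose(-2, -1) / d ** 0.5, dim=-1)
    return weights @ value

# Toy inputs: 36 image regions and 14 question words, both projected to 512-d.
regions = torch.randn(1, 36, 512)   # visual features (e.g. from an object detector)
words = torch.randn(1, 14, 512)     # question word features (e.g. from an RNN)

# Intra-modality attention: each modality attends within itself.
regions_intra = attend(regions, regions, regions)
words_intra = attend(words, words, words)

# Inter-modality attention flow: each modality attends to the other one.
regions_inter = attend(regions_intra, words_intra, words_intra)
words_inter = attend(words_intra, regions_intra, regions_intra)

print(regions_inter.shape, words_inter.shape)  # (1, 36, 512) and (1, 14, 512)
```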

Hierarchical Question-Image Co-Attention for Visual Question Answering

Sauce : https://arxiv.org/abs/1606.00061

The paper presents a novel Co-Attention mechanism for VQA that jointly reasons about image and question attentions. (SOTA - 2017)

The novelty lies in the hierarchical question encoding and a new co-attention mechanism, both of which work well in practice.

Question Hierarchy :

A hierarchical architecture is built that co-attends to the image and question at three levels : (a) word level, (b) phrase level and (c) question level.
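A minimal sketch of the question hierarchy, assuming word embeddings for the word level, unigram/bigram/trigram 1-D convolutions with max pooling over n-gram sizes for the phrase level, and an LSTM for the question level (dimensions and layer choices are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class QuestionHierarchy(nn.Module):
    """Word-, phrase- and question-level features (illustrative sketch)."""
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Unigram/bigram/trigram convolutions for phrase-level features.
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for k in (1, 2, 3)
        ])
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        words = self.embed(tokens)                  # word level
        x = words.transpose(1, 2)                   # (batch, dim, seq_len) for conv
        grams = [conv(x)[..., :tokens.size(1)] for conv in self.convs]
        # Max over the n-gram sizes at every position gives the phrase level.
        phrases = torch.stack(grams, dim=-1).max(dim=-1).values.transpose(1, 2)
        question, _ = self.lstm(phrases)            # question level
        return words, phrases, question

tokens = torch.randint(0, 1000, (2, 14))
w, p, q = QuestionHierarchy()(tokens)
print(w.shape, p.shape, q.shape)  # each (2, 14, 256)
```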

Where to look: Focus Regions for Visual Question Answering

Sauce : https://arxiv.org/abs/1511.07394

Key features of this paper :

  • Presents an image-region selection mechanism that learns to identify image regions relevant to the question.
  • Provides a general baseline for the VQA task, along with a margin-based loss for multiple-choice VQA that outperforms previous baselines (a sketch of such a loss follows this list).
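A minimal sketch of a margin-based (hinge) loss over multiple-choice answer scores, under the assumption that the correct choice should beat every other choice by a fixed margin; the margin value and the scoring model producing the scores are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def multiple_choice_margin_loss(scores, correct_idx, margin=1.0):
    """Hinge-style loss: the correct choice's score should exceed every
    other choice's score by at least `margin`."""
    batch = torch.arange(scores.size(0))
    correct = scores[batch, correct_idx].unsqueeze(1)        # (batch, 1)
    losses = F.relu(margin + scores - correct)               # (batch, num_choices)
    mask = F.one_hot(correct_idx, scores.size(1)).bool()
    losses = losses.masked_fill(mask, 0.0)                   # correct choice contributes zero
    return losses.mean()

scores = torch.randn(4, 18, requires_grad=True)  # e.g. 18 candidate answers per question
correct_idx = torch.tensor([0, 3, 7, 2])
loss = multiple_choice_margin_loss(scores, correct_idx)
loss.backward()
print(loss.item())
```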

Approach :

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

Sauce : https://arxiv.org/abs/1902.03751

The paper gives a way of adding human attention supervision so that the model looks at the appropriate image locations when searching for answers. Using human-attention annotations for only 6% of the training data (the rest of the training data is used as is), the model achieved a new SOTA (8% above the previous one).

The problem tackled in this paper is the strong language bias in VQA models: such models give too little importance to visual features. To counter this, the authors devise a technique that adds human attention over the visual features so that the model focuses on the image as well.
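As a hedged illustration of one way such an attention-alignment term can be added (not HINT's exact formulation), the sketch below penalises the model whenever its region-importance scores rank a pair of regions opposite to the human-attention ranking.

```python
import torch
import torch.nn.functional as F

def ranking_alignment_loss(model_importance, human_attention, margin=0.0):
    """Pairwise ranking loss: for every pair of regions where humans rate
    region i as more important than region j, penalise the model if its
    importance scores rank them the other way round."""
    # Pairwise score differences, shape (batch, R, R).
    human_diff = human_attention.unsqueeze(2) - human_attention.unsqueeze(1)
    model_diff = model_importance.unsqueeze(2) - model_importance.unsqueeze(1)
    # Only count pairs where humans deem region i more important than region j.
    pair_mask = (human_diff > 0).float()
    loss = F.relu(margin - model_diff) * pair_mask
    return loss.sum() / pair_mask.sum().clamp(min=1.0)

# Toy example: importance scores for 36 region proposals.
model_importance = torch.randn(2, 36, requires_grad=True)   # e.g. gradient-based sensitivities
human_attention = torch.rand(2, 36)                         # e.g. normalised human attention per region
loss = ranking_alignment_loss(model_importance, human_attention)
loss.backward()
print(loss.item())
```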

Points :

Dual Attention Network for Multimodal reasoning and Matching

Sauce : https://arxiv.org/abs/1611.00471

The authors propose 2 architectures, r-DAN (reasoning DAN) and m-DAN (matching DAN).

r-DAN : The reasoning model allows visual and textual attentions to steer each other during collaborative inference, which is useful for tasks such as Visual Question Answering (VQA).
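A minimal sketch of the collaborative-inference idea, assuming a shared memory vector that conditions both attentions and is refined by the attended features at each reasoning step; the scoring function and the element-wise update rule are simplified stand-ins, not the paper's learned projections.

```python
import torch
import torch.nn.functional as F

def attend(features, memory):
    """Attend over a set of features conditioned on the shared memory vector."""
    scores = features @ memory.unsqueeze(-1)             # (batch, N, 1)
    weights = F.softmax(scores, dim=1)
    return (weights * features).sum(dim=1)               # (batch, dim)

# Toy inputs: 36 region features, 14 word features, shared 512-d memory.
regions = torch.randn(1, 36, 512)
words = torch.randn(1, 14, 512)
memory = torch.zeros(1, 512)

# A few collaborative reasoning steps: both attentions are steered by the same
# memory, and the memory is refined by what each modality attended to.
for _ in range(2):
    v = attend(regions, memory)      # visual attention conditioned on memory
    u = attend(words, memory)        # textual attention conditioned on memory
    memory = memory + v * u          # joint update via element-wise product

print(memory.shape)  # torch.Size([1, 512])
```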