Dynamic Fusion with Intra- and Inter-modality Attention Flow for Visual Question Answering

Sauce : https://arxiv.org/abs/1812.05252

Novelty : The inter- and intra-modality relations had never been jointly investigated in a unified framework for solving the VQA problem. The authors argue that, for VQA, the intra-modality relations within each modality are complementary to the inter-modality relations, which were mostly ignored by existing VQA methods.

Overall Approach of DFAF (Dynamic Fusion with intra- and inter-modality Attention Flow) :
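The full DFAF block diagram is given in the paper; as a rough illustration of the core idea (not the paper's exact architecture), the sketch below runs one intra-modality self-attention pass within each modality followed by one inter-modality attention pass between image regions and question words. The single-head attention, tensor names and dimensions are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attend(query, key, value):
    """Scaled dot-product attention (illustrative, single head)."""
    d = query.size(-1)
    weights = F.softmax(query @ key.transpose(-2, -1) / d ** 0.5, dim=-1)
    return weights @ value

# Toy inputs: 36 image regions and 14 question words, both projected to 512-d.
regions = torch.randn(1, 36, 512)   # visual features (e.g. from an object detector)
words = torch.randn(1, 14, 512)     # question word features (e.g. from an RNN)

# Intra-modality attention: each modality attends within itself.
regions_intra = attend(regions, regions, regions)
words_intra = attend(words, words, words)

# Inter-modality attention flow: each modality attends to the other one.
regions_inter = attend(regions_intra, words_intra, words_intra)
words_inter = attend(words_intra, regions_intra, regions_intra)

print(regions_inter.shape, words_inter.shape)  # (1, 36, 512) and (1, 14, 512)
```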

Hierarchical Question-Image Co-Attention for Visual Question Answering

Sauce : https://arxiv.org/abs/1606.00061

The paper presents a novel Co-Attention mechanism for VQA that jointly reasons about image and question attentions. (SOTA - 2017)

The novelty lies in the hierarchical question encoding and a new co-attention mechanism, both of which work well in practice.

Question Hierarchy :

A hierarchical architecture is built that co-attends to the image and question at three levels : (a) word level, (b) phrase level and (c) question level.
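A minimal sketch of the question hierarchy, assuming word embeddings for the word level, unigram/bigram/trigram 1-D convolutions with max pooling over n-gram sizes for the phrase level, and an LSTM for the question level (dimensions and layer choices are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class QuestionHierarchy(nn.Module):
    """Word-, phrase- and question-level features (illustrative sketch)."""
    def __init__(self, vocab_size=1000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Unigram/bigram/trigram convolutions for phrase-level features.
        self.convs = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for k in (1, 2, 3)
        ])
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        words = self.embed(tokens)                  # word level
        x = words.transpose(1, 2)                   # (batch, dim, seq_len) for conv
        grams = [conv(x)[..., :tokens.size(1)] for conv in self.convs]
        # Max over the n-gram sizes at every position gives the phrase level.
        phrases = torch.stack(grams, dim=-1).max(dim=-1).values.transpose(1, 2)
        question, _ = self.lstm(phrases)            # question level
        return words, phrases, question

tokens = torch.randint(0, 1000, (2, 14))
w, p, q = QuestionHierarchy()(tokens)
print(w.shape, p.shape, q.shape)  # each (2, 14, 256)
```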

Where to look: Focus Regions for Visual Question Answering

Sauce : https://arxiv.org/abs/1511.07394

Key features of this paper :

  • Presents an image-region selection mechanism that learns to identify image regions relevant to the question.
  • Provides a general baseline for the VQA task, along with a margin-based loss for multiple-choice VQA that outperforms previous baselines (a sketch of such a loss follows this list).
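A minimal sketch of a margin-based (hinge) loss over multiple-choice answer scores, under the assumption that the correct choice should beat every other choice by a fixed margin; the margin value and the scoring model producing the scores are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def multiple_choice_margin_loss(scores, correct_idx, margin=1.0):
    """Hinge-style loss: the correct choice's score should exceed every
    other choice's score by at least `margin`."""
    batch = torch.arange(scores.size(0))
    correct = scores[batch, correct_idx].unsqueeze(1)        # (batch, 1)
    losses = F.relu(margin + scores - correct)               # (batch, num_choices)
    mask = F.one_hot(correct_idx, scores.size(1)).bool()
    losses = losses.masked_fill(mask, 0.0)                   # correct choice contributes zero
    return losses.mean()

scores = torch.randn(4, 18, requires_grad=True)  # e.g. 18 candidate answers per question
correct_idx = torch.tensor([0, 3, 7, 2])
loss = multiple_choice_margin_loss(scores, correct_idx)
loss.backward()
print(loss.item())
```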

Approach :

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

Sauce : https://arxiv.org/abs/1902.03751

The paper gives a way of adding human attention supervision so that the model looks at the appropriate image locations when searching for answers. Using human-attention annotations for only 6% of the training data (the rest of the training data is used as is), the model achieved a new SOTA (8% above the previous one).

The problem tackled in this paper is the strong language bias in VQA models: such models give too little importance to visual features. To counter this, the authors devise a technique that adds human attention over the visual features so that the model focuses on the image as well.
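As a hedged illustration of one way such an attention-alignment term can be added (not HINT's exact formulation), the sketch below penalises the model whenever its region-importance scores rank a pair of regions opposite to the human-attention ranking.

```python
import torch
import torch.nn.functional as F

def ranking_alignment_loss(model_importance, human_attention, margin=0.0):
    """Pairwise ranking loss: for every pair of regions where humans rate
    region i as more important than region j, penalise the model if its
    importance scores rank them the other way round."""
    # Pairwise score differences, shape (batch, R, R).
    human_diff = human_attention.unsqueeze(2) - human_attention.unsqueeze(1)
    model_diff = model_importance.unsqueeze(2) - model_importance.unsqueeze(1)
    # Only count pairs where humans deem region i more important than region j.
    pair_mask = (human_diff > 0).float()
    loss = F.relu(margin - model_diff) * pair_mask
    return loss.sum() / pair_mask.sum().clamp(min=1.0)

# Toy example: importance scores for 36 region proposals.
model_importance = torch.randn(2, 36, requires_grad=True)   # e.g. gradient-based sensitivities
human_attention = torch.rand(2, 36)                         # e.g. normalised human attention per region
loss = ranking_alignment_loss(model_importance, human_attention)
loss.backward()
print(loss.item())
```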

Points :

Dual Attention Network for Multimodal reasoning and Matching

Sauce : https://arxiv.org/abs/1611.00471

The authors propose 2 architectures, r-DAN (reasoning DAN) and m-DAN (matching DAN).

r-DAN : The reasoning model allows visual and textual attentions to steer each other during collaborative inference, which is useful for tasks such as Visual Question Answering (VQA).
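A minimal sketch of the collaborative-inference idea, assuming a shared memory vector that conditions both attentions and is refined by the attended features at each reasoning step; the scoring function and the element-wise update rule are simplified stand-ins, not the paper's learned projections.

```python
import torch
import torch.nn.functional as F

def attend(features, memory):
    """Attend over a set of features conditioned on the shared memory vector."""
    scores = features @ memory.unsqueeze(-1)             # (batch, N, 1)
    weights = F.softmax(scores, dim=1)
    return (weights * features).sum(dim=1)               # (batch, dim)

# Toy inputs: 36 region features, 14 word features, shared 512-d memory.
regions = torch.randn(1, 36, 512)
words = torch.randn(1, 14, 512)
memory = torch.zeros(1, 512)

# A few collaborative reasoning steps: both attentions are steered by the same
# memory, and the memory is refined by what each modality attended to.
for _ in range(2):
    v = attend(regions, memory)      # visual attention conditioned on memory
    u = attend(words, memory)        # textual attention conditioned on memory
    memory = memory + v * u          # joint update via element-wise product

print(memory.shape)  # torch.Size([1, 512])
```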