Hierarchical Question-Image Co-Attention for Visual Question Answering

Sauce : https://arxiv.org/abs/1606.00061

The paper presents a novel co-attention mechanism for VQA that jointly reasons about image attention and question attention. (SOTA - 2017)

The novelty lies in the hierarchical question encoding and a new attention mechanism, which seems to work just fine.

Question Hierarchy :

A hierarchical architecture was built that co-attends to the image and question at 3 levels : (a) word level, (b) phrase level and (c) question level.

Method :

Notations :

T : number of words in the question sentence.
Q (question features) = {q_1, . . . , q_T} : q_t is the feature vector of the t-th word.
q_t^w : word embedding @ position t
q_t^p : phrase embedding @ position t
q_t^s : question embedding @ position t
V (image features) = {v_1, . . . , v_N} : v_n is the feature vector at spatial location n.
The co-attention features of the image and question at each level in the hierarchy are denoted v^r and q^r, where r ∈ {w, p, s}.

Question Hierarchy :

Word Embeddings :

Q = {q_1, . . . , q_T} : one-hot encodings of the question words. These are embedded into a vector space to obtain Q^w = {q_1^w, . . . , q_T^w}.
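
To make this concrete, here is a minimal PyTorch sketch of the word-embedding step (the vocabulary size, embedding dimension d and question length T below are assumed values, not taken from the paper):

```python
import torch
import torch.nn as nn

vocab_size, d, T = 10000, 512, 26        # assumed sizes, for illustration only

embed = nn.Embedding(vocab_size, d)      # maps each one-hot word index to q_t^w
question = torch.randint(0, vocab_size, (1, T))   # one question as T word indices
Qw = embed(question)                     # (1, T, d): word-level features {q_1^w, ..., q_T^w}
```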

Phrase embeddings : We apply 1D convolutions on the word embedding vectors to obtain phrase embeddings (at each word location, we compute the inner product of the word vectors with filters of three window sizes: unigram, bigram and trigram). For the t-th word, the convolution output with window size s is:

q̂_{s,t}^p = tanh(W_c^s q_{t:t+s-1}^w), s ∈ {1, 2, 3}

Zero padding is performed appropriately to keep the sequence length the same as before. Given the convolution results, we then apply max-pooling across the different n-grams at each word location to obtain the phrase-level features:

q_t^p = max(q̂_{1,t}^p, q̂_{2,t}^p, q̂_{3,t}^p), t ∈ {1, . . . , T}
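
A possible PyTorch sketch of this step, assuming the same dimension d as above; the module name, padding choices and sizes are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseEmbedding(nn.Module):
    """Unigram/bigram/trigram 1D convolutions over the word embeddings,
    then a max across the three n-gram responses at each word location."""
    def __init__(self, d=512):
        super().__init__()
        self.conv1 = nn.Conv1d(d, d, kernel_size=1)              # unigram
        self.conv2 = nn.Conv1d(d, d, kernel_size=2)              # bigram
        self.conv3 = nn.Conv1d(d, d, kernel_size=3, padding=1)   # trigram

    def forward(self, Qw):                        # Qw: (batch, T, d)
        x = Qw.transpose(1, 2)                    # Conv1d wants (batch, d, T)
        uni = torch.tanh(self.conv1(x))                       # (batch, d, T)
        bi = torch.tanh(self.conv2(F.pad(x, (0, 1))))         # pad right so length stays T
        tri = torch.tanh(self.conv3(x))                       # (batch, d, T)
        # max-pool across the three n-gram responses at each position
        Qp = torch.stack([uni, bi, tri], dim=0).max(dim=0).values
        return Qp.transpose(1, 2)                 # (batch, T, d) phrase features q_t^p
```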

Question embeddings : The question embedding is obtained by passing the phrase embeddings through an LSTM; the hidden state at each time step t gives the question-level feature q_t^s.
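
A minimal sketch of this step; the phrase features Qp are assumed to come from the previous sketch, and the hidden size is an assumption:

```python
import torch
import torch.nn as nn

d, T = 512, 26                                   # assumed sizes
lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

Qp = torch.randn(1, T, d)                        # phrase-level features q_t^p
Qs, _ = lstm(Qp)                                 # (1, T, d): hidden state at step t is q_t^s
```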

Co-Attention :

Parallel Co-Attention : Attends to the image and question simultaneously.
V ∈ R^{d×N} : visual feature map
Q ∈ R^{d×T} : question feature map
We first calculate an affinity matrix C ∈ R^{T×N}:

C = tanh(Q^T W_b V), where W_b ∈ R^{d×d}

The affinity matrix C transforms question attention space to image attention space (and C^T does the reverse). Using C, image and question attention maps and weights are computed as:

H^v = tanh(W_v V + (W_q Q) C),  H^q = tanh(W_q Q + (W_v V) C^T)
a^v = softmax(w_hv^T H^v),  a^q = softmax(w_hq^T H^q)

where W_v, W_q ∈ R^{k×d} and w_hv, w_hq ∈ R^k. Based on the above attention weights, the image and question attention vectors are calculated as the weighted sums of the image features and question features:

v̂ = Σ_n a_n^v v_n,  q̂ = Σ_t a_t^q q_t
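
A sketch of parallel co-attention following the equations above; the dimensions d and k, the parameter initialisation and the batching are assumptions for illustration rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Attends to image regions and question positions jointly via the
    affinity matrix C = tanh(Q^T W_b V)."""
    def __init__(self, d=512, k=256):
        super().__init__()
        self.Wb = nn.Parameter(0.01 * torch.randn(d, d))    # d x d
        self.Wv = nn.Parameter(0.01 * torch.randn(k, d))    # k x d
        self.Wq = nn.Parameter(0.01 * torch.randn(k, d))    # k x d
        self.whv = nn.Parameter(0.01 * torch.randn(k))      # k
        self.whq = nn.Parameter(0.01 * torch.randn(k))      # k

    def forward(self, V, Q):
        # V: (batch, d, N) image features, Q: (batch, d, T) question features
        C = torch.tanh(Q.transpose(1, 2) @ self.Wb @ V)                    # (batch, T, N)
        Hv = torch.tanh(self.Wv @ V + (self.Wq @ Q) @ C)                   # (batch, k, N)
        Hq = torch.tanh(self.Wq @ Q + (self.Wv @ V) @ C.transpose(1, 2))   # (batch, k, T)
        av = F.softmax(torch.einsum('k,bkn->bn', self.whv, Hv), dim=-1)    # image attention
        aq = F.softmax(torch.einsum('k,bkt->bt', self.whq, Hq), dim=-1)    # question attention
        v_hat = torch.einsum('bn,bdn->bd', av, V)      # weighted sum of image features
        q_hat = torch.einsum('bt,bdt->bd', aq, Q)      # weighted sum of question features
        return v_hat, q_hat

# Example at one level of the hierarchy (assumed sizes): 196 image regions, 26 words.
V = torch.randn(1, 512, 196)
Q = torch.randn(1, 512, 26)
v_hat, q_hat = ParallelCoAttention()(V, Q)             # each (1, 512)
```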

The parallel co-attention is done at each level in the hierarchy, leading to v^r and q^r where r ∈ {w, p, s}.

Encoding for answer prediction :

Considering VQA as a classification task, the attended image and question features from the three levels are combined recursively:

h^w = tanh(W_w (q^w + v^w))
h^p = tanh(W_p [(q^p + v^p) ; h^w])
h^s = tanh(W_s [(q^s + v^s) ; h^p])
p = softmax(W_h h^s)

where W_w, W_p, W_s and W_h are again parameters of the model, [.] is the concatenation operation on 2 vectors, and p is the probability of the final answer.
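
A sketch of this recursive encoding; the hidden size and the size of the answer vocabulary are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPredictor(nn.Module):
    """Recursively merges the attended features from the word, phrase and
    question levels, then classifies over the answer vocabulary."""
    def __init__(self, d=512, hidden=512, num_answers=1000):
        super().__init__()
        self.Ww = nn.Linear(d, hidden)
        self.Wp = nn.Linear(d + hidden, hidden)
        self.Ws = nn.Linear(d + hidden, hidden)
        self.Wh = nn.Linear(hidden, num_answers)

    def forward(self, qw, vw, qp, vp, qs, vs):           # each (batch, d)
        hw = torch.tanh(self.Ww(qw + vw))
        hp = torch.tanh(self.Wp(torch.cat([qp + vp, hw], dim=1)))
        hs = torch.tanh(self.Ws(torch.cat([qs + vs, hp], dim=1)))
        return F.softmax(self.Wh(hs), dim=1)             # p: distribution over answers
```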
