Hierarchical Question-Image Co-Attention for Visual Question Answering

Sauce : https://arxiv.org/abs/1606.00061

The paper presents a novel co-attention mechanism for VQA that jointly reasons about image attention and question attention. (SOTA - 2017)

The novelty lies in the hierarchical question encoding and a new attention mechanism, which seems to work just fine.

Question Hierarchy :

A hierarchical architecture was built that co-attends to the image and question at 3 levels : (a) word level, (b) phrase level and (c) question level.

Method :

Notations :

T : number of words in the question sentence.
Q (question features) = {q_1, . . . , q_T} : q_t is the feature vector of the t-th word.
q_t^w : word embedding @ position t
q_t^p : phrase embedding @ position t
q_t^s : question embedding @ position t
V (image features) = {v_1, . . . , v_N} : v_n is the feature vector at spatial location n.
The co-attention features of the image and question at each level in the hierarchy are denoted v^r and q^r, where r ∈ {w, p, s}.

Question Hierarchy :

Word Embeddings :

Q = {q_1, . . . , q_T} : one-hot encodings of the question words. These are embedded into a vector space to obtain Q^w = {q_1^w, . . . , q_T^w}.
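
To make this concrete, here is a minimal PyTorch sketch of the word-embedding step (the vocabulary size, embedding dimension d and question length T below are assumed values, not taken from the paper):

```python
import torch
import torch.nn as nn

vocab_size, d, T = 10000, 512, 26        # assumed sizes, for illustration only

embed = nn.Embedding(vocab_size, d)      # maps each one-hot word index to q_t^w
question = torch.randint(0, vocab_size, (1, T))   # one question as T word indices
Qw = embed(question)                     # (1, T, d): word-level features {q_1^w, ..., q_T^w}
```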

Phrase embeddings : We apply 1D convolutions on the word embedding vectors to obtain phrase embeddings (at each word location, we compute the inner product of the word vectors with filters of three window sizes: unigram, bigram and trigram). For the t-th word, the convolution output with window size s is:

q̂_{s,t}^p = tanh(W_c^s q_{t:t+s-1}^w), s ∈ {1, 2, 3}

Zero padding is performed appropriately to keep the sequence length the same as before. Given the convolution results, we then apply max-pooling across the different n-grams at each word location to obtain the phrase-level features:

q_t^p = max(q̂_{1,t}^p, q̂_{2,t}^p, q̂_{3,t}^p), t ∈ {1, . . . , T}
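
A possible PyTorch sketch of this step, assuming the same dimension d as above; the module name, padding choices and sizes are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseEmbedding(nn.Module):
    """Unigram/bigram/trigram 1D convolutions over the word embeddings,
    then a max across the three n-gram responses at each word location."""
    def __init__(self, d=512):
        super().__init__()
        self.conv1 = nn.Conv1d(d, d, kernel_size=1)              # unigram
        self.conv2 = nn.Conv1d(d, d, kernel_size=2)              # bigram
        self.conv3 = nn.Conv1d(d, d, kernel_size=3, padding=1)   # trigram

    def forward(self, Qw):                        # Qw: (batch, T, d)
        x = Qw.transpose(1, 2)                    # Conv1d wants (batch, d, T)
        uni = torch.tanh(self.conv1(x))                       # (batch, d, T)
        bi = torch.tanh(self.conv2(F.pad(x, (0, 1))))         # pad right so length stays T
        tri = torch.tanh(self.conv3(x))                       # (batch, d, T)
        # max-pool across the three n-gram responses at each position
        Qp = torch.stack([uni, bi, tri], dim=0).max(dim=0).values
        return Qp.transpose(1, 2)                 # (batch, T, d) phrase features q_t^p
```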

Question embeddings : The question embedding is obtained by passing the phrase embeddings through an LSTM; the hidden state at each time step t gives the question-level feature q_t^s.
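
A minimal sketch of this step; the phrase features Qp are assumed to come from the previous sketch, and the hidden size is an assumption:

```python
import torch
import torch.nn as nn

d, T = 512, 26                                   # assumed sizes
lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)

Qp = torch.randn(1, T, d)                        # phrase-level features q_t^p
Qs, _ = lstm(Qp)                                 # (1, T, d): hidden state at step t is q_t^s
```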

Co-Attention :

Parallel Co-Attention : Attends to the image and question simultaneously.
V ∈ R^{d×N} : visual feature map
Q ∈ R^{d×T} : question feature map
We first calculate an affinity matrix C ∈ R^{T×N}:

C = tanh(Q^T W_b V), where W_b ∈ R^{d×d}

The affinity matrix C transforms question attention space to image attention space (and C^T does the reverse). Using C, image and question attention maps and weights are computed as:

H^v = tanh(W_v V + (W_q Q) C),  H^q = tanh(W_q Q + (W_v V) C^T)
a^v = softmax(w_hv^T H^v),  a^q = softmax(w_hq^T H^q)

where W_v, W_q ∈ R^{k×d} and w_hv, w_hq ∈ R^k. Based on the above attention weights, the image and question attention vectors are calculated as the weighted sums of the image features and question features:

v̂ = Σ_n a_n^v v_n,  q̂ = Σ_t a_t^q q_t
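
A sketch of parallel co-attention following the equations above; the dimensions d and k, the parameter initialisation and the batching are assumptions for illustration rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Attends to image regions and question positions jointly via the
    affinity matrix C = tanh(Q^T W_b V)."""
    def __init__(self, d=512, k=256):
        super().__init__()
        self.Wb = nn.Parameter(0.01 * torch.randn(d, d))    # d x d
        self.Wv = nn.Parameter(0.01 * torch.randn(k, d))    # k x d
        self.Wq = nn.Parameter(0.01 * torch.randn(k, d))    # k x d
        self.whv = nn.Parameter(0.01 * torch.randn(k))      # k
        self.whq = nn.Parameter(0.01 * torch.randn(k))      # k

    def forward(self, V, Q):
        # V: (batch, d, N) image features, Q: (batch, d, T) question features
        C = torch.tanh(Q.transpose(1, 2) @ self.Wb @ V)                    # (batch, T, N)
        Hv = torch.tanh(self.Wv @ V + (self.Wq @ Q) @ C)                   # (batch, k, N)
        Hq = torch.tanh(self.Wq @ Q + (self.Wv @ V) @ C.transpose(1, 2))   # (batch, k, T)
        av = F.softmax(torch.einsum('k,bkn->bn', self.whv, Hv), dim=-1)    # image attention
        aq = F.softmax(torch.einsum('k,bkt->bt', self.whq, Hq), dim=-1)    # question attention
        v_hat = torch.einsum('bn,bdn->bd', av, V)      # weighted sum of image features
        q_hat = torch.einsum('bt,bdt->bd', aq, Q)      # weighted sum of question features
        return v_hat, q_hat

# Example at one level of the hierarchy (assumed sizes): 196 image regions, 26 words.
V = torch.randn(1, 512, 196)
Q = torch.randn(1, 512, 26)
v_hat, q_hat = ParallelCoAttention()(V, Q)             # each (1, 512)
```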

The parallel co-attention is done at each level in the hierarchy, leading to v^r and q^r where r ∈ {w, p, s}.

Encoding for answer prediction :

Considering VQA as a classification task, the attended image and question features from the three levels are combined recursively:

h^w = tanh(W_w (q^w + v^w))
h^p = tanh(W_p [(q^p + v^p) ; h^w])
h^s = tanh(W_s [(q^s + v^s) ; h^p])
p = softmax(W_h h^s)

where W_w, W_p, W_s and W_h are again parameters of the model, [.] is the concatenation operation on 2 vectors, and p is the probability of the final answer.
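
A sketch of this recursive encoding; the hidden size and the size of the answer vocabulary are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPredictor(nn.Module):
    """Recursively merges the attended features from the word, phrase and
    question levels, then classifies over the answer vocabulary."""
    def __init__(self, d=512, hidden=512, num_answers=1000):
        super().__init__()
        self.Ww = nn.Linear(d, hidden)
        self.Wp = nn.Linear(d + hidden, hidden)
        self.Ws = nn.Linear(d + hidden, hidden)
        self.Wh = nn.Linear(hidden, num_answers)

    def forward(self, qw, vw, qp, vp, qs, vs):           # each (batch, d)
        hw = torch.tanh(self.Ww(qw + vw))
        hp = torch.tanh(self.Wp(torch.cat([qp + vp, hw], dim=1)))
        hs = torch.tanh(self.Ws(torch.cat([qs + vs, hp], dim=1)))
        return F.softmax(self.Wh(hs), dim=1)             # p: distribution over answers
```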
