Source : https://arxiv.org/abs/1606.00061
The paper presents a novel co-attention mechanism for VQA that jointly reasons about image and question attention. (SOTA as of 2017)
The novelty lies in the hierarchical question encoding and the new co-attention mechanism, both of which work well in practice.
A hierarchical architecture is built that co-attends to the image and question at 3 levels : (a) word level, (b) phrase level and (c) question level.
Notation :
- T : number of words in the question sentence.
- Q (question features) = {q_1, . . . , q_T} : q_t is the feature vector of the t-th word.
- q_t^w : word embedding at position t; q_t^p : phrase embedding at position t; q_t^s : question embedding at position t.
- V (image features) = {v_1, . . . , v_N} : v_n is the feature vector at spatial location n.
- The co-attention features of the image and question at each level in the hierarchy are denoted v̂^r and q̂^r, where r ∈ {w, p, s}.
Word Embeddings :
Q = {q_1, . . . , q_T} are one-hot encodings of the question words. We embed these words into a vector space to obtain Q^w = {q_1^w, . . . , q_T^w}.
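A minimal sketch of this lookup, assuming PyTorch; the vocabulary size, feature dimension and example word indices are made-up values:

```python
# Word-level encoding sketch: one-hot word indices -> dense embeddings Q^w.
# vocab_size and d are hypothetical; a real model would set them from data.
import torch
import torch.nn as nn

vocab_size, d = 10000, 512
embed = nn.Embedding(vocab_size, d)

question = torch.tensor([[4, 87, 911, 23]])  # (batch=1, T=4) word indices
Qw = embed(question)                         # (1, T, d) word embeddings Q^w
```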
Phrase embeddings :
We apply 1D convolutions on the word embedding vectors to obtain phrase embeddings: at each word location, we compute the inner product of the word vectors with filters of three window sizes (unigram, bigram and trigram). For the t-th word, the convolution output with window size s is given by:

q̂_{s,t}^p = tanh(W_c^s q_{t:t+s-1}^w), s ∈ {1, 2, 3}

Zero padding is performed appropriately so the output has the same sequence length as the input. Given the convolution results, we then apply max-pooling across the different n-grams at each word location to obtain the phrase-level features:

q_t^p = max(q̂_{1,t}^p, q̂_{2,t}^p, q̂_{3,t}^p), t ∈ {1, . . . , T}
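A sketch of the phrase-level step under the same assumptions (PyTorch, made-up sizes): three 1D convolutions with window sizes 1, 2 and 3, zero-padded and trimmed to keep length T, then an element-wise max across the three scales at each position:

```python
# Phrase-level encoding sketch: unigram/bigram/trigram convolutions over the
# word embeddings, trimmed to length T, then max-pooled across n-gram scales.
import torch
import torch.nn as nn

d, T = 512, 4
Qw = torch.randn(1, T, d)            # word embeddings from the previous step

convs = nn.ModuleList(
    [nn.Conv1d(d, d, kernel_size=s, padding=s // 2) for s in (1, 2, 3)]
)

x = Qw.transpose(1, 2)               # Conv1d expects (batch, d, T)
outs = [torch.tanh(conv(x))[..., :T] for conv in convs]  # even kernel pads one extra step
Qp = torch.stack(outs).max(dim=0).values.transpose(1, 2)  # (1, T, d) phrase features
```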
Question embeddings : The question embedding is formed by passing the phrase embeddings through an LSTM and taking the hidden state at each time step as q_t^s.
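A matching sketch for the question level, under the same assumptions:

```python
# Question-level encoding sketch: an LSTM over the phrase features; its hidden
# state at step t serves as the question-level embedding q_t^s.
import torch
import torch.nn as nn

d, T = 512, 4
Qp = torch.randn(1, T, d)            # phrase embeddings from the previous step

lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True)
Qs, _ = lstm(Qp)                     # (1, T, d): one hidden state per word position
```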
Parallel Co-Attention : Attends to the image and question simultaneously.
V ∈ R^{d×N} : visual feature map
Q ∈ R^{d×T} : question feature map
We first calculate an affinity matrix C ∈ R^{T×N} as follows:

C = tanh(Q^T W_b V), where W_b ∈ R^{d×d}

The affinity matrix C transforms the question attention space into the image attention space (and C^T does the reverse). Treating the affinity matrix as a feature, the image and question attention maps are predicted as:

H^v = tanh(W_v V + (W_q Q) C), H^q = tanh(W_q Q + (W_v V) C^T)
a^v = softmax(w_hv^T H^v), a^q = softmax(w_hq^T H^q)

where W_v, W_q ∈ R^{k×d} and w_hv, w_hq ∈ R^k. Based on the above attention weights, the image and question attention vectors are calculated as the weighted sums of the image features and question features:

v̂ = Σ_{n=1}^{N} a_n^v v_n, q̂ = Σ_{t=1}^{T} a_t^q q_t
The parallel co-attention is done at each level in the hierarchy, leading to v̂^r and q̂^r, where r ∈ {w, p, s}.
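A minimal sketch of one parallel co-attention pass, following the equations above (PyTorch assumed; d, k, N, T and the random initialisation are arbitrary illustration choices):

```python
# Parallel co-attention sketch: affinity matrix, attention maps over both
# modalities, and attended feature vectors v_hat / q_hat.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, k, N, T = 512, 256, 196, 4
V = torch.randn(1, d, N)                    # image features, d x N
Q = torch.randn(1, d, T)                    # question features, d x T

Wb = nn.Parameter(torch.randn(d, d) * 0.01)
Wv = nn.Parameter(torch.randn(k, d) * 0.01)
Wq = nn.Parameter(torch.randn(k, d) * 0.01)
whv = nn.Parameter(torch.randn(1, k) * 0.01)
whq = nn.Parameter(torch.randn(1, k) * 0.01)

C = torch.tanh(Q.transpose(1, 2) @ Wb @ V)              # (1, T, N) affinity
Hv = torch.tanh(Wv @ V + (Wq @ Q) @ C)                  # (1, k, N)
Hq = torch.tanh(Wq @ Q + (Wv @ V) @ C.transpose(1, 2))  # (1, k, T)
av = F.softmax(whv @ Hv, dim=-1)                        # (1, 1, N) image attention
aq = F.softmax(whq @ Hq, dim=-1)                        # (1, 1, T) question attention
v_hat = (V * av).sum(dim=-1)                            # (1, d) attended image vector
q_hat = (Q * aq).sum(dim=-1)                            # (1, d) attended question vector
```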
Considering VQA as a classification task, the answer prediction is based on the co-attended features from all three levels, fused recursively:

h^w = tanh(W_w (q̂^w + v̂^w))
h^p = tanh(W_p [(q̂^p + v̂^p), h^w])
h^s = tanh(W_s [(q̂^s + v̂^s), h^p])
p = softmax(W_h h^s)

where W_w, W_p, W_s and W_h are again parameters of the model, [·, ·] is the concatenation operation on two vectors, and p is the probability distribution over the final answers.
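A sketch of this recursive fusion (PyTorch assumed; the feature size and number of answer classes are made-up, and nn.Linear stands in for the W matrices):

```python
# Answer prediction sketch: fuse the attended features level by level,
# then classify over a fixed answer vocabulary with a softmax.
import torch
import torch.nn as nn

d, n_answers = 512, 1000
vw, qw = torch.randn(1, d), torch.randn(1, d)   # word-level co-attention features
vp, qp = torch.randn(1, d), torch.randn(1, d)   # phrase-level
vs, qs = torch.randn(1, d), torch.randn(1, d)   # question-level

Ww = nn.Linear(d, d)
Wp = nn.Linear(2 * d, d)
Ws = nn.Linear(2 * d, d)
Wh = nn.Linear(d, n_answers)

hw = torch.tanh(Ww(qw + vw))
hp = torch.tanh(Wp(torch.cat([qp + vp, hw], dim=1)))  # [.,.] = concatenation
hs = torch.tanh(Ws(torch.cat([qs + vs, hp], dim=1)))
p = torch.softmax(Wh(hs), dim=1)                      # distribution over answers
```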