Where to look: Focus Regions for Visual Question Answering

Source : https://arxiv.org/abs/1511.07394

Key features of this paper :

  • Presents an image-region selection mechanism that learns to identify image regions relevant to questions.
  • Provides a general baseline for the VQA task, along with a margin-based loss for multiple-choice VQA that outperforms previous baselines.

Approach :

Their method learns to embed the textual question and the set of visual image regions into a latent space where the inner product yields a relevance weighting for each region.

Margin objective :

  • yp : score of the correct answer
  • yn : score of the highest-scoring incorrect answer
  • ap : fraction of annotators giving answer p
  • an : fraction of annotators giving answer n

This objective pushes the correct answer's score above the best incorrect answer's score by a margin determined by how strongly the annotators agreed on each answer.
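From these definitions, the objective is presumably a hinge loss of roughly the following form (a sketch inferred from the notes above, not the paper's exact equation):

```latex
% Sketch: hinge loss whose margin is the gap in annotator agreement.
% y_p, y_n : scores of the correct and best incorrect answers
% a_p, a_n : fractions of annotators choosing answers p and n
\mathcal{L} = \max\bigl(0,\; (a_p - a_n) + y_n - y_p\bigr)
```

That is, the correct answer's score must exceed the best incorrect score by at least the gap in annotator agreement, ap − an.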

Region Selection Layer :

  • Xr : matrix whose columns are the region feature vectors
  • Gr : projection of all region features (the column vectors of Xr)
  • xL : question/answer embedding vector
  • sL,r : softmax scores used for the weighted average

Next, the text features are concatenated with the image features of each region to produce one feature vector per region (in the paper's figure, this appears as the horizontal stacking of Xr with repeated copies of xL). Each concatenated vector is linearly projected with W, and the weighted average over regions is computed using sL,r to obtain the feature vector aL for each question/answer pair, which is then fed through ReLU and batch-normalization layers. A minimal sketch of this layer is given below.
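The following PyTorch sketch shows one way this region-selection layer could be wired up. The layer types, the hidden size, and the module/variable names are assumptions for illustration; only the 5096-d region features and 1500-d text features come from the paper.

```python
# Sketch of the region-selection layer (assumed wiring, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionSelection(nn.Module):
    def __init__(self, region_dim=5096, text_dim=1500, hidden_dim=900):
        super().__init__()
        # G: projects each region feature so its inner product with the
        # question/answer embedding x_L yields a relevance score.
        self.G = nn.Linear(region_dim, text_dim, bias=False)
        # W: projects each concatenated (region, text) feature before averaging.
        self.W = nn.Linear(region_dim + text_dim, hidden_dim)
        self.bn = nn.BatchNorm1d(hidden_dim)

    def forward(self, X_r, x_L):
        # X_r: (batch, num_regions, region_dim) region features
        # x_L: (batch, text_dim) question/answer embedding
        G_r = self.G(X_r)                                       # (B, R, text_dim)
        scores = torch.bmm(G_r, x_L.unsqueeze(2)).squeeze(2)    # (B, R) inner products
        s = F.softmax(scores, dim=1)                            # s_{L,r}: region weights
        # Concatenate the text vector with every region feature.
        x_rep = x_L.unsqueeze(1).expand(-1, X_r.size(1), -1)    # (B, R, text_dim)
        feats = self.W(torch.cat([X_r, x_rep], dim=2))          # (B, R, hidden_dim)
        # Weighted average over regions, then ReLU and batch normalization.
        a_L = torch.bmm(s.unsqueeze(1), feats).squeeze(1)       # (B, hidden_dim)
        return self.bn(F.relu(a_L))
```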

Language representations :

  • Words are represented with 300-dimensional word2vec vectors, chosen for their simplicity and compact representation.

  • Using averages across word2vec vectors, they construct fixed-length vectors for each question-answer pair, which their model then learns to score.

  • They also tested an LSTM for the language model, but the results with the LSTM were far worse than those obtained with word2vec.

  • They used the Stanford Parser together with word2vec to compute bins, which were concatenated to form the final language vector as follows (a sketch of this step appears after the list) :

    • Bin 1 : Captures the type of question by averaging the word2vec representations of the first two words. For example, “How many” tends to require a numerical answer, while “Is there” requires a yes or no answer.
    • Bin 2 : Contains the nominal subject, encoding the subject of the question.
    • Bin 3 : Contains the average of all other noun words.
    • Bin 4 : Contains the average of all remaining words, excluding determiners such as “a,” “the,” and “few.”
  • Each bin then contains a 300-dimensional representation; these are concatenated with a fifth bin for the words in the candidate answer to yield a 1500-dimensional question/answer representation.
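The sketch below shows how the 1500-dimensional question/answer vector could be assembled from these bins. The function names are illustrative, and tokenization plus the dependency parse (which supplies the nominal subject and other nouns, via the Stanford Parser in the paper) are assumed to happen elsewhere.

```python
# Sketch: average word2vec vectors per bin and concatenate into a 1500-d vector.
import numpy as np

EMB_DIM = 300

def avg_vec(words, w2v):
    """Average the word2vec vectors of `words`; zero vector if the bin is empty."""
    vecs = [w2v[w] for w in words if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(EMB_DIM)

def build_language_vector(question_tokens, nsubj, other_nouns, remaining,
                          answer_tokens, w2v):
    bins = [
        avg_vec(question_tokens[:2], w2v),  # Bin 1: first two words (question type)
        avg_vec(nsubj, w2v),                # Bin 2: nominal subject
        avg_vec(other_nouns, w2v),          # Bin 3: other nouns
        avg_vec(remaining, w2v),            # Bin 4: remaining words (no determiners)
        avg_vec(answer_tokens, w2v),        # Answer bin
    ]
    return np.concatenate(bins)             # 5 x 300 = 1500-d representation
```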

Visual Features :

Candidate regions are first selected by extracting the top-ranked 99 Edge Boxes from the image after performing non-max suppression with a 0.2 intersection-over-union overlap criterion.

Non-max suppression : to pick the best bounding boxes from the many overlapping predictions produced by object detectors, non-max suppression “suppresses” the less likely, heavily overlapping boxes and keeps only the best ones (sketched below).
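A standard greedy NMS pass of the kind used to filter the Edge Boxes; this is a generic sketch (box format and thresholds are parameters chosen here), not the authors' implementation.

```python
# Greedy non-max suppression: keep the best-scoring box, drop boxes that
# overlap it above the IoU threshold, repeat until top_k boxes are kept.
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes; all boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.2, top_k=99):
    order = np.argsort(scores)[::-1]        # box indices, best score first
    keep = []
    while order.size > 0 and len(keep) < top_k:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep
```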

They extract features using the VGG-S network [3], concatenating the output of the last fully connected layer (4096 dimensions) and the pre-softmax layer (1000 dimensions) to get a 5096-dimensional feature per region. The pre-softmax classification layer is included to provide a more direct signal about objects from the ImageNet classification task; a sketch of this concatenation follows.
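A rough sketch of building the 5096-d per-region descriptor. torchvision does not ship VGG-S, so VGG-16 stands in here purely to illustrate concatenating the penultimate fully connected activations with the pre-softmax class scores.

```python
# Sketch: concatenate last-FC activations with pre-softmax scores per region.
# VGG-16 is a stand-in for the paper's VGG-S, for illustration only.
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

def region_feature(crop):
    """crop: (1, 3, 224, 224) tensor holding one candidate region."""
    with torch.no_grad():
        x = vgg.features(crop)
        x = vgg.avgpool(x)
        x = torch.flatten(x, 1)
        fc = vgg.classifier[:-1](x)         # 4096-d penultimate FC activations
        logits = vgg.classifier[-1](fc)     # 1000-d pre-softmax class scores
    return torch.cat([fc, logits], dim=1)   # 5096-d region descriptor
```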
