- The paper demonstrates how simple CNNs, built on top of word embeddings, can be used for sentence classification tasks.
- [Link to the paper](https://arxiv.org/abs/1408.5882)
- Implementation (sketched in code below this list):
  - Pad input sentences so that they are all of the same length.
  - Map each word in the padded sentence to its word embedding (either randomly initialized or initialized from pre-trained word2vec vectors) to obtain a matrix representing the sentence.
  - Apply convolution layers with multiple filter widths, each producing several feature maps.
  - Apply a max-over-time pooling operation over each feature map.
  - Concatenate the pooled features from the different filters and feed them to a fully-connected layer with softmax activation.
  - The softmax outputs a probability distribution over the labels.
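
Taken together, these steps can be sketched as a small PyTorch module. The class, argument names, and default sizes below are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceCNN(nn.Module):
    """Sketch of the architecture: embed -> convolve -> max-over-time -> softmax."""
    def __init__(self, vocab_size, embed_dim=300, num_classes=2,
                 filter_widths=(3, 4, 5), num_maps=100, dropout=0.5):
        super().__init__()
        # Index 0 is reserved for the padding token added in the first step.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One convolution per filter width, each with `num_maps` feature maps.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_maps, kernel_size=w) for w in filter_widths
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_maps * len(filter_widths), num_classes)

    def forward(self, token_ids):              # (batch, padded_len)
        x = self.embedding(token_ids)          # (batch, padded_len, embed_dim)
        x = x.transpose(1, 2)                  # channels-first layout for Conv1d
        # ReLU convolution, then max-over-time pooling for each filter width.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)    # concatenated pooled features
        return self.fc(self.dropout(features)) # logits; softmax is applied in the loss
```
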
- Hyperparameters (a training sketch follows this list):
  - ReLU activation for the convolution layers.
  - Filter windows of 3, 4, and 5, with 100 feature maps each.
  - Dropout (rate 0.5) on the penultimate layer for regularisation.
  - l2 norm constraint of s = 3 on the weight vectors.
  - Batch size of 50.
  - Adadelta update rule.
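
A hedged sketch of one training step wiring these hyperparameters together, reusing the `SentenceCNN` class from the sketch above. The rescaling loop is one reading of the paper's l2 norm constraint, not its original code:

```python
import torch
import torch.nn as nn

model = SentenceCNN(vocab_size=20_000, num_classes=2)  # assumed sizes
optimizer = torch.optim.Adadelta(model.parameters())   # Adadelta update rule
criterion = nn.CrossEntropyLoss()                      # applies log-softmax internally

def train_step(token_ids, labels, s=3.0):
    """One update on a batch of 50 padded sentences with their labels."""
    optimizer.zero_grad()
    loss = criterion(model(token_ids), labels)
    loss.backward()
    optimizer.step()
    # l2 norm constraint: after the gradient step, rescale any weight vector
    # of the softmax layer whose norm exceeds s back down to norm s.
    with torch.no_grad():
        w = model.fc.weight                            # (num_classes, features)
        norms = w.norm(dim=1, keepdim=True).clamp(min=1e-12)
        w.mul_(norms.clamp(max=s) / norms)
    return loss.item()
```
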
- Model variants (illustrated in code below this list):
  - CNN-rand: word vectors are randomly initialized and learned during training.
  - CNN-static: uses pre-trained word2vec vectors and keeps them fixed during training.
  - CNN-non-static: same as CNN-static, but the word vectors are fine-tuned during training.
  - CNN-multichannel: uses two sets of word vectors (channels); one set is updated during training while the other is kept static.
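
In code, the four variants differ only in how the embedding layer is set up. A rough PyTorch illustration, where `w2v` stands in for a real pre-trained word2vec matrix:

```python
import torch
import torch.nn as nn

w2v = torch.randn(20_000, 300)  # placeholder for the pre-trained word2vec matrix

# CNN-rand: word vectors are randomly initialized and learned from scratch.
rand_emb = nn.Embedding(20_000, 300)

# CNN-static: pre-trained vectors, kept frozen throughout training.
static_emb = nn.Embedding.from_pretrained(w2v, freeze=True)

# CNN-non-static: pre-trained vectors, fine-tuned during training.
non_static_emb = nn.Embedding.from_pretrained(w2v, freeze=False)

# CNN-multichannel: two copies of the pre-trained vectors; every filter is
# applied to both channels, but gradients only update the non-static copy.
multichannel = (nn.Embedding.from_pretrained(w2v, freeze=True),
                nn.Embedding.from_pretrained(w2v, freeze=False))
```
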
- Datasets:
  - Sentiment analysis datasets, including Movie Reviews (MR) and Customer Reviews (CR).
  - A question classification dataset (TREC).
  - The maximum number of classes for any dataset is 6.
- Observations:
  - Achieves good results on the benchmarks despite being a simple architecture.
  - Word vectors fine-tuned via the non-static channel learn more meaningful, task-specific representations.
- Limitations:
  - The benchmark datasets are small, with few labels.
  - The reported results are not very detailed or exhaustive.