## Translating Videos to Natural Language Using Deep Recurrent Neural Networks
Paper: NAACL-HLT 2015 PDF
Download Model: NAACL15_VGG_MEAN_POOL_MODEL (220MB)
The model is an improved version of the mean-pooled model described in the NAACL-HLT 2015 paper. It uses video frame features from the 16-layer VGG model and is trained only on the YouTube video dataset.
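
As an illustration of what "mean pooled" refers to, the sketch below averages per-frame feature vectors into a single video-level vector. It is a minimal pure-Python example, not the code used to train this model: the function name `mean_pool_frames` and the toy 4-dimensional features are assumptions made for readability, and real CNN frame features are much larger.

```python
from typing import List, Sequence


def mean_pool_frames(frame_features: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame feature vectors into one video-level vector.

    Hypothetical helper for illustration; not part of the released model code.
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frames must have the same feature dimension")
    n = len(frame_features)
    # zip(*...) walks the frames one feature dimension at a time.
    return [sum(column) / n for column in zip(*frame_features)]


# Toy input: 3 frames, each with a 4-dimensional feature vector.
frames = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
video_vector = mean_pool_frames(frames)
print(video_vector)  # [2.0, 1.0, 2.0, 2.0]
```

With features held in a NumPy array of shape `(num_frames, dim)`, the same operation is `features.mean(axis=0)`.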
Translating Videos to Natural Language Using Deep Recurrent Neural Networks
S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, K. Saenko
North American Chapter of the Association for Computational Linguistics – Human Language Technologies
NAACL-HLT 2015
Please consider citing the above paper if you use this model.
The METEOR score of this model is 27.7% on the YouTube (MSVD) video test dataset (refer to Table 2 in the Sequence to Sequence - Video to Text paper).
The models are currently supported by the recurrent branch of the Caffe fork by Jeff Donahue and Subhashini Venugopalan, but are not yet compatible with the master branch of Caffe.
More details on the code and data can be found on this Project Page.
The prototxts for the network and solver can also be found here: https://github.com/vsubhashini/caffe/tree/recurrent/examples/youtube