Cross Modal Content Based Objective for Learning Adequate Multimodal Representations

MVRL Workshop @ ICML 2016

Presenter: Seungwhan (Shane) Moon

Adhiguna Kuncoro, Akash Bharadwaj, Volkan Cirik, Louis-Philippe Morency, Chris Dyer

Multimodal Machine Translation - Are pictures worth a thousand words?

Input: [image] + “Two young, White males are outside near many bushes.”

Output: “Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.”

Multimodal Machine Translation - Motivation

Current leading MT system output:

Multimodal Machine Translation - Reconciling Two Separate Tasks

Multimodal Machine Translation: Caption (Source Language) + Image → Caption (Target Language)

Machine Translation: Caption (Source Language) → Caption (Target Language)

Challenge: decoder is often overly fluent but inadequate

Image Captioning: Image → Caption (Target Language)

Challenge: CNN representation is competent for a discriminative task, but not for a generative task

Baselines

NMT Enc-Dec Model (= Baseline Unimodal Model: “Blind”)

NMMT (+ Text Attention)

Our Approach

CNN Filtering

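Motivation for CNN Filtering

Source-text views may not be relevant, since the English and German captions are all i.i.d. and crowdsourced: each caption was written independently, so a German caption need not describe the same details as an English caption of the same image.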

English Captions

- A man in jeans is reclining on a green metal bench along a busy sidewalk and crowded street .
- A white male with a blue sweater and gray pants laying on a sidewalk bench .
- A man in a blue shirt and gray pants is sleeping on a sidewalk bench .
- A man sleeping on a bench in a city area .

German Captions

- viele autos am straße rand geparkt , auf diesem liegt ein mann auf einer bank (= many cars parked on the roadside , in this a man lying on a bench)
- ein seite streifen mit parkenden autos und metall säulen die bis zum flucht punkt des bildes alle auf einer linie hinter einander stehen . (= a side strip with parked cars and metal columns that appear on a line to the vanishing point of the image , all in a row.)
- ein mann liegt neben geparkten autos auf einer bank . (= A man lies next to parked cars on a bench .)

Filtering CNN Representation

Hypothesis: Strong regularization of projection from CNN features is useful for input to generative process

Filtering CNN Representation

[Figure: CNN → FC7 (Optional) and Bag of content words (one branch also marked Optional) → To decoder for generative process]

Bag of content words:

- Bag of Verbs: laying, sleeping, leaning, ...
- Bag of Nouns: man, bench, cars, ...
- Bag of Adjectives: green, white, blue, ...
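As a rough illustration of the filtering idea, the sketch below projects an image feature vector onto one logit per content word and scores it against a multi-hot bag-of-content-words target built from a caption. Everything in it is an assumption made for illustration: the nine-word `VOCAB` (taken from the example words above), the helper names `multi_hot`, `project`, and `bow_loss`, the 4-dimensional stand-in for fc7 features, and the use of mean binary cross-entropy as the loss. It is not the presenters' stated training objective.

```python
import math
import random

# Hypothetical content-word vocabulary, taken from the example words above.
VOCAB = ["laying", "sleeping", "leaning", "man", "bench", "cars",
         "green", "white", "blue"]

def multi_hot(caption_words, vocab=VOCAB):
    """Bag-of-content-words target: 1.0 if the vocab word occurs in the caption."""
    present = set(caption_words)
    return [1.0 if w in present else 0.0 for w in vocab]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def project(features, weights, bias):
    """Linear projection of image features to one logit per vocab word."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def bow_loss(logits, target):
    """Mean binary cross-entropy over the vocabulary (an assumed loss)."""
    eps = 1e-12
    total = 0.0
    for z, t in zip(logits, target):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
    return total / len(target)

# Toy usage: a 4-dimensional stand-in for the CNN features.
rng = random.Random(0)
features = [rng.uniform(-1, 1) for _ in range(4)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in VOCAB]
bias = [0.0] * len(VOCAB)

caption = "a man sleeping on a bench in a city area".split()
target = multi_hot(caption)
loss = bow_loss(project(features, weights, bias), target)
print(target, round(loss, 4))
```

In a setup like this, the sigmoid outputs (or a thresholded version of them) would be what is handed to the decoder, optionally alongside the real-valued FC7 vector.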

Filtering CNN Representation

CNN Filtering Training Objective:

Datasets

▪ Main Dataset: Flickr30k (Multi-Modal, Multi-Lingual):
  ▪ Training set: only 29k professionally-translated English-German captions.
  ▪ Dev set: 1,014 sentences, blind test set: 1,000 sentences
▪ Multi-lingual word embedding: trained to map each German and English word to the same space. Trained purely on the dataset (no external resources) using the Berkeley word aligner as pre-processing

Results (Image + English → German)

Model               Visual Features      Meteor
NMMT + Attention    fc7 + CNN Filter     30.28
NMMT + Attention    fc7                  29.54
NMMT                fc7 + CNN Filter     19.32
NMMT                fc7                  18.72
NMT                 N/A                  18.8
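To make the size of the effect explicit, the short sketch below computes the Meteor gain from adding the CNN filter for each model, using only the scores reported above. The dictionary layout and the helper name `filter_gain` are illustrative choices, not part of the original slides.

```python
# Meteor scores copied from the results above.
scores = {
    ("NMMT + Attention", "fc7 + CNN Filter"): 30.28,
    ("NMMT + Attention", "fc7"): 29.54,
    ("NMMT", "fc7 + CNN Filter"): 19.32,
    ("NMMT", "fc7"): 18.72,
    ("NMT", "N/A"): 18.8,
}

def filter_gain(model):
    """Meteor difference between 'fc7 + CNN Filter' and plain 'fc7'."""
    with_filter = scores[(model, "fc7 + CNN Filter")]
    without_filter = scores[(model, "fc7")]
    return round(with_filter - without_filter, 2)

gains = {m: filter_gain(m) for m in ("NMMT + Attention", "NMMT")}
print(gains)
```

The filter adds 0.74 Meteor points with attention and 0.60 without. Note also that NMMT with plain fc7 (18.72) scores slightly below the text-only NMT baseline (18.8), while NMMT with the filter (19.32) scores above it.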

Takeaway

▪ Strong regularization of the CNN representation helps a generative process
▪ It is complementary with the real-valued FC7!



Questions?

Presenter: Seungwhan (Shane) Moon
