Abstract
We propose an automated image selection system that assists photo editors in choosing suitable images for news articles. The system fuses multiple textual sources extracted from a news article and accepts multilingual input. It uses character-level word embeddings, which help both to model morphologically rich languages such as German and to transfer knowledge across closely related languages. The text encoder adopts a hierarchical self-attention mechanism that attends to key words within a piece of text and to the informative components of a news article. We evaluate the system extensively on a large-scale text-image database of multimodal, multilingual news articles collected from Swiss local news media websites. We compare the system against multiple baselines, with ablation studies, and show that it outperforms existing text-image retrieval methods in a weakly supervised learning setting. We also offer insights into the advantages of using multiple textual sources and multilingual data.
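To make the idea of hierarchical attention over several textual sources more concrete, the following is a minimal sketch in plain Python. It is not the system's actual encoder: it assumes a simple dot-product attention pooling with fixed query vectors, whereas a real model would learn these parameters. The function names (`attention_pool`, `encode_article`), the source names (`title`, `caption`), and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Pool vectors into one vector, weighting each by softmax(dot(vector, query))."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def encode_article(sources, word_query, source_query):
    """Two-level pooling: words -> one vector per source, sources -> article vector.

    sources maps a (hypothetical) source name to a list of word vectors.
    Returns the article vector and the attention weight given to each source.
    """
    names = list(sources)
    # Level 1: attend over the words inside each textual source.
    source_vectors = [attention_pool(sources[n], word_query)[0] for n in names]
    # Level 2: attend over the sources that make up the article.
    article_vector, source_weights = attention_pool(source_vectors, source_query)
    return article_vector, dict(zip(names, source_weights))


# Toy input: two sources with 2-dimensional word vectors (illustrative values only).
example_sources = {
    "title": [[1.0, 0.0], [0.0, 1.0]],
    "caption": [[0.5, 0.5]],
}
article_vec, per_source_weight = encode_article(
    example_sources, word_query=[1.0, 0.0], source_query=[0.0, 1.0]
)
```

The two levels share one pooling routine, so the per-source weights in `per_source_weight` can be inspected to see which source the sketch considers more informative for a given query.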
Upgrading the Newsroom: An Automated Image Selection System for News Articles