
Upgrading the Newsroom: An Automated Image Selection System for News Articles

Published: 05 July 2020

Abstract

We propose an automated image selection system to assist photo editors in selecting suitable images for news articles. The system fuses multiple textual sources extracted from news articles and accepts multilingual inputs. It is equipped with character-level word embeddings, which help both in modeling morphologically rich languages, e.g., German, and in transferring knowledge across closely related languages. The text encoder adopts a hierarchical self-attention mechanism that attends both to key words within a piece of text and to the informative components of a news article. We extensively evaluate our system on a large-scale text-image database of multimodal multilingual news articles collected from Swiss local news media websites. The system is compared with multiple baselines in ablation studies and is shown to outperform existing text-image retrieval methods in a weakly supervised learning setting. In addition, we offer insights into the advantages of using multiple textual sources and multilingual data.
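The hierarchical self-attention described in the abstract pools word vectors within each textual source (e.g., title, caption, body) and then pools the resulting source vectors into one article representation. A minimal numpy sketch of this two-level attention pooling is shown below; the function names and the learned query vectors are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def attention_pool(vectors, query):
    """Softmax-attention pooling: weight each row of `vectors` by its
    scaled dot-product similarity to a learned `query` vector.
    vectors: (n, d) array; query: (d,) array; returns a (d,) array."""
    scores = vectors @ query / np.sqrt(vectors.shape[1])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ vectors

def encode_article(sources, word_query, source_query):
    """Hierarchical pooling (illustrative): first attend over words
    within each textual source, then attend over the pooled source
    vectors to produce a single article embedding.
    sources: list of (n_i, d) arrays, one per article component."""
    source_vecs = np.stack([attention_pool(s, word_query) for s in sources])
    return attention_pool(source_vecs, source_query)
```

In a trained system the query vectors would be learned parameters, so the encoder can downweight uninformative components (e.g., a boilerplate section) relative to informative ones (e.g., the title).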

