Research Article

On the Effectiveness of Images in Multi-modal Text Classification: An Annotation Study

Published: 10 March 2023

Abstract

Combining different input modalities beyond text is a key challenge for natural language processing. Previous work has been inconclusive as to the true utility of images as a supplementary information source for text classification tasks, motivating this large-scale human study of labelling performance given text only, images only, or both text and images. To this end, we create a new dataset accompanied by a novel annotation method—Japanese Entity Labeling with Dynamic Annotation—to deepen our understanding of the effectiveness of images for multi-modal text classification. Through a careful comparative analysis of human performance and the performance of state-of-the-art multi-modal text classification models, we gain valuable insights into the differences between human and model performance, and the conditions under which images are beneficial for text classification.


Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3 (March 2023), 570 pages.
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3579816

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 10 December 2021
• Revised: 11 June 2022
• Accepted: 18 September 2022
• Online AM: 7 October 2022
• Published: 10 March 2023
