Abstract
Combining different input modalities beyond text is a key challenge for natural language processing. Previous work has been inconclusive as to the true utility of images as a supplementary information source for text classification tasks, motivating this large-scale human study of labelling performance given text only, images only, or both text and images. To this end, we create a new dataset accompanied by a novel annotation method, Japanese Entity Labeling with Dynamic Annotation, to deepen our understanding of the effectiveness of images for multi-modal text classification. By performing a careful comparative analysis of human performance against that of state-of-the-art multi-modal text classification models, we gain valuable insights into the differences between human and model performance, and into the conditions under which images are beneficial for text classification.
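The three labelling conditions compared in the study (text only, images only, and both) correspond to a standard setup in multi-modal classification. The following is a minimal, illustrative late-fusion sketch of that setup, not the paper's actual models: it assumes a text encoder and an image encoder each yield a fixed-size feature vector (stand-ins for, e.g., BERT and ResNet pooled features), and a separate linear head scores each input condition. All dimensions, weights, and names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

TEXT_DIM, IMAGE_DIM, NUM_CLASSES = 8, 4, 3

# Stand-ins for encoder outputs (e.g. a pooled BERT text vector
# and a pooled ResNet image vector); random here for illustration.
text_feat = rng.normal(size=TEXT_DIM)
image_feat = rng.normal(size=IMAGE_DIM)

# One linear classification head per input condition.
W_text = rng.normal(size=(NUM_CLASSES, TEXT_DIM))
W_image = rng.normal(size=(NUM_CLASSES, IMAGE_DIM))
W_both = rng.normal(size=(NUM_CLASSES, TEXT_DIM + IMAGE_DIM))

def softmax(z):
    # Numerically stable softmax over class logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# The three conditions mirroring the annotation study:
# text only, images only, and both modalities fused by concatenation.
p_text = softmax(W_text @ text_feat)
p_image = softmax(W_image @ image_feat)
p_both = softmax(W_both @ np.concatenate([text_feat, image_feat]))

for name, p in [("text", p_text), ("image", p_image), ("both", p_both)]:
    print(name, p.argmax())
```

Concatenation-based late fusion is only one design point; models such as those evaluated in the paper typically fuse modalities with cross-attention instead, but the experimental contrast between the three input conditions is the same.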
On the Effectiveness of Images in Multi-modal Text Classification: An Annotation Study