Abstract
The analysis of social networks, such as the socially connected Internet of Things, has shown that intelligent information processing technology deeply influences industrial systems for Smart Cities. The goal of social media representation learning is to learn dense, low-dimensional, and continuous representations for multimodal data within social networks, facilitating many real-world applications. Since social media images are usually accompanied by rich metadata (e.g., textual descriptions, tags, groups, and submitting users), modeling the image alone is not sufficient to capture the comprehensive information that social media images carry. In this work, we treat the image and its textual description as multimodal content, and transform the other metadata into links between contents (for example, two images marked with the same tag or submitted by the same user). Based on the multimodal content and social links, we propose a Deep Attentive Multimodal Graph Embedding model, named DAMGE, for more effective social image representation learning. We conduct extensive experiments on both small- and large-scale datasets, and the results confirm the superiority of the proposed model on the tasks of social image classification and link prediction.
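The link construction described in the abstract (two images are connected when they share a tag or a submitting user) can be illustrated with a small Python sketch. This is only an illustration under assumptions, not the DAMGE implementation: the function name `build_links`, the metadata layout with `tags` and `user` keys, and the toy image records are all hypothetical, and the abstract does not say how links are weighted or filtered.

```python
from collections import defaultdict
from itertools import combinations


def build_links(items):
    """Return undirected links between images that share a tag or a user.

    `items` maps an image id to a dict with optional "tags" (iterable of str)
    and "user" (str) entries. This layout is an assumption for illustration.
    """
    # Group image ids by each piece of shared metadata.
    groups = defaultdict(set)
    for image_id, meta in items.items():
        for tag in meta.get("tags", ()):
            groups[("tag", tag)].add(image_id)
        user = meta.get("user")
        if user is not None:
            groups[("user", user)].add(image_id)

    # Every pair of images inside one group becomes a link.
    links = set()
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            links.add((a, b))
    return links


# Hypothetical toy data, not taken from the paper's datasets.
images = {
    "img1": {"tags": ["sunset", "beach"], "user": "alice"},
    "img2": {"tags": ["beach"], "user": "bob"},
    "img3": {"tags": ["city"], "user": "alice"},
    "img4": {"tags": ["forest"], "user": "carol"},
}
links = build_links(images)
```

In this toy example, `img1` and `img2` are linked through the shared tag `beach`, `img1` and `img3` through the shared user `alice`, and `img4` stays isolated. Grouping by metadata value first avoids comparing every pair of images directly, although very popular tags would still produce many pairs.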
Index Terms
Deep Attentive Multimodal Network Representation Learning for Social Media Images