https://doi.org/10.1145/3472749.3474765
Research Article
Open Access

Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning

Published: 12 October 2021

ABSTRACT

Mobile user interface summarization generates succinct language descriptions of mobile screens that convey their important content and functionality, which is useful for many language-based application scenarios. We present Screen2Words, a novel screen summarization approach that automatically encapsulates the essential information of a UI screen into a coherent language phrase. Summarizing mobile screens requires a holistic understanding of the multi-modal data of mobile UIs, including text, images, structure, and UI semantics, which motivates our multi-modal learning approach. We collected and analyzed a large-scale screen summarization dataset annotated by human workers. The dataset contains more than 112k language summaries across ∼22k unique UI screens. We then experimented with a set of deep models with different configurations. Our evaluation of these models with both automatic accuracy metrics and human ratings shows that our approach can generate high-quality summaries for mobile screens. We demonstrate potential use cases of Screen2Words and open-source our dataset and model to lay the foundation for further bridging language and user interfaces.
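
As a concrete illustration of the multi-modal setup the abstract describes, the sketch below shows one way a screen summarizer could fuse the named modalities: screenshot pixels, the text of UI elements, and view-hierarchy structure, with a Transformer decoder that emits the summary phrase. This is a minimal PyTorch sketch under our own assumptions, not the authors' released Screen2Words model; every name in it (ScreenSummarizer, num_ui_types, elem_tokens, and so on) is illustrative.

    import torch
    import torch.nn as nn
    import torchvision.models as tv

    class ScreenSummarizer(nn.Module):
        """Toy multi-modal encoder-decoder: screenshot + element text + element types -> summary tokens."""
        def __init__(self, vocab_size, d_model=256, num_ui_types=32):
            super().__init__()
            # Image branch: a small ResNet backbone over the raw screenshot.
            backbone = tv.resnet18(weights=None)
            self.image_encoder = nn.Sequential(*list(backbone.children())[:-1])  # -> (B, 512, 1, 1)
            self.image_proj = nn.Linear(512, d_model)
            # Text branch: embeddings of tokens taken from the UI elements' text.
            self.token_embed = nn.Embedding(vocab_size, d_model)
            # Structure branch: embeddings of UI element types (button, image, checkbox, ...).
            self.type_embed = nn.Embedding(num_ui_types, d_model)
            # A Transformer encoder fuses per-element features with the global image feature;
            # the decoder attends over the fused memory to generate the summary phrase.
            self.fuser = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, screenshot, elem_tokens, elem_types, summary_tokens):
            # screenshot: (B, 3, H, W); elem_tokens, elem_types: (B, N); summary_tokens: (B, T)
            img = self.image_proj(self.image_encoder(screenshot).flatten(1)).unsqueeze(1)  # (B, 1, D)
            elems = self.token_embed(elem_tokens) + self.type_embed(elem_types)            # (B, N, D)
            memory = self.fuser(torch.cat([img, elems], dim=1))                            # (B, 1+N, D)
            tgt = self.token_embed(summary_tokens)                                         # (B, T, D)
            return self.out(self.decoder(tgt, memory))                                     # (B, T, vocab)

Training such a model would then minimize token-level cross-entropy between the decoder's output and the human-written summaries in the dataset; the dataset and the authors' actual model are available through their open-source release.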


Supplemental Material

p498-video_preview.mp4

