Editorial Notes
A corrigendum was issued for this paper on August 8, 2022. You can download the corrigendum from the Supplemental Material section of this citation page.
Abstract

This research establishes a better understanding of the syntax choices in speech interactions and of how speech, gesture, and multimodal gesture-and-speech interactions are produced by users in unconstrained object manipulation environments in augmented reality. The work presents a multimodal elicitation study conducted with 24 participants. The canonical referents for translation, rotation, and scale were used along with some abstract referents (create, destroy, and select). In this study, time windows for gesture and speech multimodal interactions are developed using the start and stop times of gestures and speech as well as the stroke times of gestures. While gestures commonly precede speech by 81 ms, we find that the stroke of the gesture is commonly within 10 ms of the start of speech, indicating that the information content of a gesture and its co-occurring speech are well aligned with each other. Lastly, the trends across the most common proposals for each modality are examined, showing that disagreement between proposals is often caused by variation in hand posture or syntax. This allows us to present aliasing recommendations that increase the percentage of users' natural interactions captured by future multimodal interactive systems.
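The timing analysis the abstract describes can be illustrated with a short sketch. The Python below is a hypothetical reconstruction, not the authors' code: the `Trial` fields and the example timestamps are assumptions chosen to mirror the reported ~81 ms gesture lead and ~10 ms stroke-to-speech alignment. It shows how per-trial onset, offset, and stroke timestamps could be turned into gesture-speech offsets and a multimodal time window of the kind the paper develops.

```python
# Minimal sketch (assumed structure, not the authors' implementation) of
# deriving gesture-speech alignment offsets and a multimodal time window
# from annotated per-trial timestamps.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trial:
    gesture_start: float   # seconds from trial onset
    gesture_stroke: float  # time of the gesture's information-carrying stroke
    gesture_end: float
    speech_start: float
    speech_end: float

def alignment_offsets(trials):
    """Mean offsets (ms) of speech onset relative to gesture onset and stroke."""
    start_offsets = [(t.speech_start - t.gesture_start) * 1000 for t in trials]
    stroke_offsets = [(t.speech_start - t.gesture_stroke) * 1000 for t in trials]
    return mean(start_offsets), mean(stroke_offsets)

def multimodal_window(t: Trial) -> tuple[float, float]:
    """Window spanning the full multimodal interaction: earliest onset to latest offset."""
    return (min(t.gesture_start, t.speech_start),
            max(t.gesture_end, t.speech_end))

# Illustrative timestamps only (not study data).
trials = [Trial(0.00, 0.09, 0.80, 0.08, 0.95),
          Trial(0.05, 0.13, 0.90, 0.13, 1.10)]
start_ms, stroke_ms = alignment_offsets(trials)
print(f"gesture onset leads speech onset by {start_ms:.0f} ms on average")
print(f"stroke-to-speech-onset offset: {stroke_ms:.0f} ms")
print(f"window for trial 1: {multimodal_window(trials[0])}")
```

Under this sketch, a positive onset offset means the gesture began before its co-occurring speech, and a near-zero stroke offset reflects the paper's finding that the gesture's information-carrying stroke aligns closely with speech onset.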
Supplemental Material
Available for Download
Corrigendum to "Understanding Gesture and Speech Multimodal Interactions for Manipulation Tasks in Augmented Reality Using Unconstrained Elicitation" by Williams et al., Proceedings of the ACM on Human-Computer Interaction, Volume 4, Issue ISS (PACMHCI 4:ISS).
Recommendations
Eliciting Multimodal Gesture+Speech Interactions in a Multi-Object Augmented Reality Environment
VRST '22: Proceedings of the 28th ACM Symposium on Virtual Reality Software and Technology. As augmented reality (AR) technology and hardware become more mature and affordable, researchers have been exploring more intuitive and discoverable interaction techniques for immersive environments. This paper investigates multimodal interaction for ...
Multimodal augmented reality: the norm rather than the exception
MVAR '16: Proceedings of the 2016 workshop on Multimodal Virtual and Augmented Reality. Augmented reality (AR) is commonly seen as a technology that overlays virtual imagery onto a participant's view of the world. In line with this, most AR research is focused on what we see. In this paper, we challenge this focus on vision and make a case ...
Using Hand Gesture and Speech in a Multimodal Augmented Reality Environment
Gesture-Based Human-Computer Interaction and Simulation. In this work we describe a 3D authoring tool which takes advantage of multimodal interfaces such as gestures and speech. This tool allows real-time Augmented Reality aimed to aid the tasks of interior architects and designers. This approach intends to ...