ABSTRACT
We propose a novel deep neural network-based learning framework that extracts acoustic information from variable-length sequences of vocal tract shaping during speech production, captured by real-time magnetic resonance imaging (rtMRI), and translates them into text. In our experiments, the framework achieved a 40.6% phoneme error rate (PER) at the sentence level, substantially better than existing models. We also analyzed how the geometry of articulation in each sub-region of the vocal tract varies with emotion and gender. Results suggest that the distortion of each sub-region is affected by both emotion and gender.
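The phoneme error rate reported above is conventionally computed as the Levenshtein edit distance (substitutions, insertions, deletions) between the predicted and reference phoneme sequences, normalized by the reference length. A minimal sketch, not the authors' evaluation code:

```python
def phoneme_error_rate(ref, hyp):
    """Levenshtein distance between phoneme sequences, normalized by reference length."""
    m, n = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / m

# Hypothetical example: one substitution ("ae" -> "ah") out of three phonemes
print(phoneme_error_rate("k ae t".split(), "k ah t".split()))  # 0.333...
```

A 40.6% PER thus means that, on average, roughly 0.4 edit operations per reference phoneme are needed to turn the predicted sequence into the reference.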