Abstract
This article investigates the correlations between multimedia objects (particularly speech and text) involved in language lectures in order to design an effective presentation mechanism for web-based learning. The cross-media correlations are classified into implicit relations (retrieved by computing) and explicit relations (recorded during the preprocessing stage). The implicit temporal correlation between speech and text serves primarily to support supplementary lecture navigation aids such as tele-pointer movement, lip-sync movement, and content scrolling. We propose a speech-text alignment framework, using an iterative algorithm based on local alignment, to probe many-to-one temporal correlations rather than one-to-one correlations only. The proposed framework is a more practical method for analyzing general language lectures, and the algorithm's time complexity matches the best-possible computation cost, O(nm), without introducing additional computation. In addition, we show the feasibility of creating vivid presentations by exploiting implicit relations and artificially simulating some explicit media. To facilitate the navigation of integrated multimedia documents, we develop several visualization techniques for describing media correlations, including guidelines for speech-text correlations, visible-automatic scrolling, and levels of detail of the timeline, to provide intuitive and easy-to-use random access mechanisms. We evaluated the performance of the analysis method and human perception of the synchronized presentation. Overall, about 99.5% of the analyzed words have a temporal error within 0.5 sec, and the subjective evaluation shows that the synchronized presentation is highly acceptable to viewers.
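To make the O(nm) cost of local alignment concrete, the following is a minimal sketch of a generic Smith-Waterman-style local alignment over word sequences. It is not the article's iterative framework: the function name `local_align`, the scoring values (`match`, `mismatch`, `gap`), and the use of exact word equality instead of a phonetic comparison are all assumptions made for illustration.

```python
from typing import List, Tuple


def local_align(recognized: List[str], transcript: List[str],
                match: int = 2, mismatch: int = -1, gap: int = -1
                ) -> Tuple[int, List[Tuple[int, int]]]:
    """Generic local alignment between two word sequences (illustrative only).

    Returns the best local score and the list of matched index pairs
    (index in `recognized`, index in `transcript`).
    Scoring values are hypothetical defaults, not taken from the article.
    """
    n, m = len(recognized), len(transcript)
    # score[i][j]: best score of a local alignment ending at
    # recognized[i - 1] and transcript[j - 1]; filling it costs O(n * m).
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if recognized[i - 1] == transcript[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,  # align the two words
                              score[i - 1][j] + gap,      # skip a spoken word
                              score[i][j - 1] + gap)      # skip a text word
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the local score drops to zero.
    pairs: List[Tuple[int, int]] = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if recognized[i - 1] == transcript[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            if sub == match:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    spoken = ["um", "the", "quick", "brown", "fox", "er"]
    text = ["a", "quick", "brown", "fox", "jumps"]
    print(local_align(spoken, text))
```

Because a local alignment ignores unmatched material on either side of the best-scoring region, one plausible way to look for many-to-one correlations (for example, a lecturer repeating a sentence) would be to rerun the search on the speech left unaligned after each pass. That extension is an assumption about how such an iteration could work, not a description of the authors' algorithm.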
Index Terms
Exploring many-to-one speech-to-text correlation for web-based language learning