Abstract
A completely automated public Turing test to tell computers and humans apart (CAPTCHA) is a challenge-response test that is widely used on the Internet to distinguish human users from fraudulent computer programs, often referred to as bots. To enable access for visually impaired users, most Web sites offer audio CAPTCHAs in addition to the conventional image-based scheme. Recent research has shown that most currently available audio CAPTCHAs are insecure, as they can be broken by means of machine learning at relatively low cost. Moreover, most audio CAPTCHAs suffer from low human success rates that arise from severe signal distortions.
This article proposes two audio CAPTCHA schemes that systematically exploit differences between humans and computers in auditory perception and language understanding, yielding a better trade-off between usability and security than currently available schemes. Furthermore, we provide a detailed analysis of Google's prominent reCAPTCHA, which serves as a baseline when evaluating our proposed CAPTCHA designs.
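To make the perception-oriented idea above concrete, the following is a minimal, hypothetical Python sketch, not the scheme proposed in the article: it merely overlays a spoken challenge with background noise at a chosen signal-to-noise ratio, the kind of distortion that conventional audio CAPTCHAs rely on and that tends to degrade automatic speech recognition more than human listening when the interference is speech-like. The function name mix_at_snr and the synthetic placeholder signals are illustrative assumptions, not part of the article.

```python
# Hypothetical illustration (not the authors' scheme): mix a spoken challenge
# with background noise at a requested signal-to-noise ratio (SNR) in dB.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the speech-to-noise power ratio equals snr_db."""
    noise = np.resize(noise, speech.shape)            # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12             # avoid division by zero
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

# Example with synthetic signals; a real CAPTCHA would load recorded audio instead.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # 1 s of placeholder "speech" at 16 kHz
babble = rng.standard_normal(16000)   # placeholder background interference
challenge = mix_at_snr(speech, babble, snr_db=5.0)
```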
Supplemental Material
Supplemental movie, appendix, image, and software files for "Toward Improved Audio CAPTCHAs Based on Auditory Perception and Language Understanding" are available for download.