Abstract
Website privacy policies are often long and difficult to understand. While research shows that Internet users care about their privacy, they do not have the time to understand the policies of every website they visit, and most users hardly ever read privacy policies. Some recent efforts have aimed to use a combination of crowdsourcing, machine learning, and natural language processing to interpret privacy policies at scale, thus producing annotations for use in interfaces that inform Internet users of salient policy details. However, little attention has been devoted to studying the accuracy of crowdsourced privacy policy annotations, how crowdworker productivity can be enhanced for such a task, and the levels of granularity that are feasible for automatic analysis of privacy policies. In this article, we present a trajectory of work addressing each of these topics. We include analyses of crowdworker performance, an evaluation of a method to make a privacy-policy-oriented task easier for crowdworkers, a coarse-grained approach to labeling segments of policy text with descriptive themes, and a fine-grained approach to identifying user choices described in policy text. Together, the results from these efforts show the effectiveness of using automated and semi-automated methods for extracting from privacy policies the data practice details that are salient to Internet users’ interests.
Analyzing Privacy Policies at Scale: From Crowdsourcing to Automated Annotations