Analyzing Privacy Policies at Scale: From Crowdsourcing to Automated Annotations

Published: 04 December 2018

Abstract

Website privacy policies are often long and difficult to understand. While research shows that Internet users care about their privacy, they do not have the time to understand the policies of every website they visit, and most users hardly ever read privacy policies. Some recent efforts have aimed to use a combination of crowdsourcing, machine learning, and natural language processing to interpret privacy policies at scale, thus producing annotations for use in interfaces that inform Internet users of salient policy details. However, little attention has been devoted to studying the accuracy of crowdsourced privacy policy annotations, how crowdworker productivity can be enhanced for such a task, and the levels of granularity that are feasible for automatic analysis of privacy policies. In this article, we present a trajectory of work addressing each of these topics. We include analyses of crowdworker performance, evaluation of a method to make a privacy-policy oriented task easier for crowdworkers, a coarse-grained approach to labeling segments of policy text with descriptive themes, and a fine-grained approach to identifying user choices described in policy text. Together, the results from these efforts show the effectiveness of using automated and semi-automated methods for extracting from privacy policies the data practice details that are salient to Internet users’ interests.

