Abstract
Generating test cases that exercise all possible states of a program is often very expensive and practically infeasible, especially for a medium or large industrial system. In practice, industrial clients of such a system often have a set of input data collected either before the system is built or after the deployment of a previous version. Such data are highly valuable: they represent the operations that matter in a client's daily business and may be used to test the system extensively. However, such data often carry sensitive information and cannot be released to third-party development houses. For example, a healthcare provider may have a set of patient records that are strictly confidential and cannot be used by any third party. Masking sensitive values alone may not be sufficient, as correlations among fields in the data can reveal the masked information. Masked data may also behave differently in the system and so become less useful than the original data for testing and debugging.
To release private data for testing and debugging, this paper proposes the kb-anonymity model, which combines the k-anonymity model commonly used in the data mining and database areas with the concept of program behavior preservation. Like k-anonymity, kb-anonymity replaces some information in the original data to preserve privacy so that the replaced data can be released to third-party developers. Unlike k-anonymity, kb-anonymity ensures that the replaced data exhibit the same kind of program behavior as the original data, so that they remain useful for testing and debugging. We also provide a concrete version of the model under three particular configurations and have successfully applied our prototype implementation to three open source programs, demonstrating the utility and scalability of our prototype.
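The combination of k-anonymity with behavior preservation can be illustrated with a toy example. The following is a minimal sketch, not the paper's algorithm or prototype: it assumes that "program behavior" means the sequence of branches a record takes through a small hypothetical program, and it sidesteps the question of how replacement values are generated by drawing them from a hand-written candidate pool. The names `behavior`, `group_by_behavior`, and `anonymize`, and the `(age, income)` record layout, are all invented for illustration.

```python
from collections import defaultdict


def behavior(record):
    """Hypothetical program under test: return the branch path a record takes."""
    age, income = record
    path = []
    path.append("minor" if age < 18 else "adult")
    path.append("low" if income < 30000 else "high")
    return tuple(path)


def group_by_behavior(records):
    """Partition the original records by the path they exercise."""
    groups = defaultdict(list)
    for r in records:
        groups[behavior(r)].append(r)
    return dict(groups)


def anonymize(records, k, candidates):
    """Release one substitute per behavior group that has at least k originals.

    A substitute must follow the same path as its group and must not be
    identical to any original record. Groups with fewer than k originals
    release nothing (an assumption of this sketch).
    """
    originals = set(records)
    released = []
    for path, members in group_by_behavior(records).items():
        if len(members) < k:
            continue  # too few originals to hide among
        for c in candidates:
            if c not in originals and behavior(c) == path:
                released.append(c)
                break
    return released


# Toy data: (age, income) pairs standing in for confidential records.
private_records = [(10, 0), (12, 100), (40, 50000), (45, 60000), (30, 1000)]
candidate_pool = [(15, 500), (50, 70000), (35, 2000)]
released = anonymize(private_records, k=2, candidates=candidate_pool)
```

In this sketch each released record stands in for at least `k` original records that drive the program down the same path, so the released data exercise the same branches without exposing any original row. The lone record `(30, 1000)` is the only one on its path, so nothing is released for it.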
kb-anonymity: a model for anonymized behaviour-preserving test and debugging data
PLDI '11: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation