DOI: 10.1145/3575693.3575716 (ACM Conferences, ASPLOS Conference Proceedings)

Copy-on-Pin: The Missing Piece for Correct Copy-on-Write

Published: 30 January 2023

ABSTRACT

Operating systems utilize Copy-on-Write (COW) to conserve memory and improve performance. During the last two decades, a series of COW-related bugs - which compromised security, corrupted memory and degraded performance - was found. The majority of these bugs are related to page "pinning", which operating systems employ to access process memory efficiently and to perform direct I/O. Unfortunately, the true cause of these bugs is not well understood, resulting in incomplete bug fixes. We show this by: (1) surveying previously reported pinning-related COW bugs; (2) uncovering new such bugs in Linux, FreeBSD, and NetBSD; and (3) showing that they occur because the COW logic does not consider page pinnings correctly, resulting in incorrect behavior (e.g., I/O of stale data). We then address the underlying problem by deriving when/how shared pages must be copied and under which conditions pinned pages can be shared to maintain correctness. Based on this assessment, we introduce the "Copy-on-Pin (COP)" scheme, an extension of the COW mechanism that handles pinned pages correctly by ensuring pinned pages and shared pages are mutually exclusive. However, we find that a naive implementation of this scheme hampers performance and increases complexity if pages are copied only when strictly necessary. To compensate, we introduce a relaxed-COP design, which does not require precise tracking of page sharing, maintains correctness without increasing complexity, and (while potentially needlessly copying pages in some corner cases) marginally improves performance. Our relaxed-COP solution has been integrated into Linux 5.19.
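The invariant stated in the abstract, that pinned pages and shared pages are mutually exclusive, can be illustrated with a toy user-space model. This is only a sketch under stated assumptions: the `Page` and `Process` classes and every method name are hypothetical, each process maps a single page, and every pin is treated alike. It is not the Linux implementation and it models only the strict scheme, not the relaxed-COP design.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Page:
    """Hypothetical stand-in for a physical page (not a kernel structure)."""
    data: bytearray
    mappers: int = 1  # number of processes mapping this page
    pins: int = 0     # outstanding pins, e.g. for direct I/O


class Process:
    """Toy process that maps exactly one page."""

    def __init__(self, page: Page):
        self.page = page

    def _unshare(self) -> None:
        # Give this process a private copy if the page is currently shared.
        if self.page.mappers > 1:
            self.page.mappers -= 1
            self.page = Page(bytearray(self.page.data))

    def fork(self) -> "Process":
        if self.page.pins > 0:
            # A pinned page must stay exclusive: copy for the child right away.
            return Process(Page(bytearray(self.page.data)))
        # Otherwise share the page; a later write or pin will unshare it.
        self.page.mappers += 1
        return Process(self.page)

    def write(self, data: bytes) -> None:
        self._unshare()  # classic copy-on-write
        self.page.data[:] = data

    def pin(self) -> Page:
        self._unshare()  # copy-on-pin: never pin a shared page
        self.page.pins += 1
        return self.page

    def unpin(self, page: Page) -> None:
        page.pins -= 1


# Usage: a pin taken after fork() keeps referring to the pinner's own page.
parent = Process(Page(bytearray(b"old")))
child = parent.fork()        # page is shared between parent and child
pinned = parent.pin()        # parent receives a private copy, then pins it
pinned.data[:] = b"new"      # simulated device write through the pin
parent.write(b"xyz")         # no copy needed: the page is already exclusive
```

In this model the simulated device write and the later write by the parent both land on the page the parent still maps, so the pin never ends up pointing at a stale copy, and the child continues to see the contents it inherited at fork time.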

