ABSTRACT
Operating systems use Copy-on-Write (COW) to conserve memory and improve performance. Over the last two decades, a series of COW-related bugs has been found that compromised security, corrupted memory, and degraded performance. The majority of these bugs are related to page "pinning", which operating systems employ to access process memory efficiently and to perform direct I/O. Unfortunately, the true cause of these bugs is not well understood, resulting in incomplete bug fixes. We show this by: (1) surveying previously reported pinning-related COW bugs; (2) uncovering new such bugs in Linux, FreeBSD, and NetBSD; and (3) showing that they occur because the COW logic does not consider page pinnings correctly, resulting in incorrect behavior (e.g., I/O of stale data). We then address the underlying problem by deriving when and how shared pages must be copied and under which conditions pinned pages can be shared while maintaining correctness. Based on this analysis, we introduce the "Copy-on-Pin (COP)" scheme, an extension of the COW mechanism that handles pinned pages correctly by ensuring that pinned pages and shared pages are mutually exclusive. However, we find that a naive implementation of this scheme hampers performance and increases complexity if pages are copied only when strictly necessary. To compensate, we introduce a relaxed-COP design, which does not require precise tracking of page sharing, maintains correctness without increasing complexity, and marginally improves performance, at the cost of potentially copying pages needlessly in some corner cases. Our relaxed-COP solution has been integrated into Linux 5.19.
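The invariant stated above, that a page is never pinned and shared at the same time, can be illustrated with a small toy model. The sketch below is only an illustration under simplifying assumptions: each address space holds a single page, sharing is tracked precisely with a counter (which the relaxed-COP design explicitly avoids), and all names (`Page`, `AddressSpace`, `mappers`, `pins`) are hypothetical and do not correspond to kernel data structures.

```python
from dataclasses import dataclass


@dataclass
class Page:
    data: bytearray
    mappers: int = 1  # address spaces mapping this page (toy sharing count)
    pins: int = 0     # outstanding pins, e.g. in-flight direct I/O


class AddressSpace:
    """Toy address space holding exactly one page (an assumption of this sketch)."""

    def __init__(self, page: Page):
        self.page = page

    def _private_copy(self) -> None:
        # Break sharing: the old page stays with the remaining mappers.
        old = self.page
        old.mappers -= 1
        self.page = Page(bytearray(old.data))

    def write(self, offset: int, value: int) -> None:
        # Classic copy-on-write: copy a shared page before modifying it.
        if self.page.mappers > 1:
            self._private_copy()
        self.page.data[offset] = value

    def pin(self) -> Page:
        # Copy-on-pin: a shared page is copied before it is pinned,
        # so the pin always refers to a page private to this address space.
        if self.page.mappers > 1:
            self._private_copy()
        self.page.pins += 1
        return self.page

    def unpin(self, page: Page) -> None:
        page.pins -= 1

    def fork(self) -> "AddressSpace":
        # A pinned page must not become shared: the child gets its own copy.
        if self.page.pins > 0:
            return AddressSpace(Page(bytearray(self.page.data)))
        self.page.mappers += 1
        return AddressSpace(self.page)
```

In this model, a write by the pinning process after `fork()` lands in the same page the pin refers to, so I/O through the pin does not observe stale data, and the other process never sees the write.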
Copy-on-Pin: The Missing Piece for Correct Copy-on-Write