tutorial

Catalyst: GPU-assisted rapid memory deduplication in virtualization environments

Published: 08 April 2017

Abstract

Content-based page sharing techniques improve memory efficiency in virtualized systems by identifying and merging identical pages. Kernel Same-page Merging (KSM), a Linux kernel utility for page sharing, sequentially scans the memory pages of virtual machines to deduplicate them. Sequential scanning has several undesirable side effects: CPU cycles are wasted when no sharing opportunities exist, and the rate at which sharing is discovered depends on the scanning rate and the corresponding CPU availability. In this work, we exploit the presence of GPUs on modern systems to enable rapid memory sharing through targeted scanning of pages. Our solution, Catalyst, works in two phases: in the first, the GPU processes the pages of virtual machines to identify likely candidates for sharing; in the second, page-level similarity checks are performed on this targeted set of shareable pages. Opportunistic use of the GPU to produce sharing hints enables rapid, low-overhead duplicate detection and sharing of memory pages in virtualization environments. We evaluate Catalyst against various benchmarks and workloads to demonstrate that it can achieve higher memory sharing in less time than different scan rate configurations of KSM, at lower or comparable compute costs.
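The two-phase idea described in the abstract can be illustrated with a minimal CPU-only sketch. This is not the Catalyst implementation: the GPU pass is stood in for by a plain Python loop, the CRC32 checksum is an assumed choice of hint function, and the names `hint_phase`, `verify_phase`, `deduplicate`, and `PAGE_SIZE` are hypothetical. The sketch only shows how cheap hints can narrow the set of pages that need a full comparison.

```python
import zlib
from collections import defaultdict

PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical constant)


def hint_phase(pages):
    """Phase 1 (CPU stand-in for the GPU pass): group page ids by a cheap checksum.

    CRC32 is an assumption made for this sketch. Only groups holding more
    than one page are returned as sharing hints.
    """
    buckets = defaultdict(list)
    for page_id, data in pages.items():
        buckets[zlib.crc32(data)].append(page_id)
    return [ids for ids in buckets.values() if len(ids) > 1]


def verify_phase(pages, hints):
    """Phase 2: full byte-wise comparison, restricted to hinted groups.

    Returns a dict mapping each duplicate page id to the id of the
    page whose contents it can share.
    """
    merged = {}
    for ids in hints:
        representatives = []  # pages with distinct contents seen in this group
        for page_id in ids:
            for rep in representatives:
                if pages[page_id] == pages[rep]:
                    merged[page_id] = rep
                    break
            else:
                # No identical page seen yet (checksum collision or first page)
                representatives.append(page_id)
    return merged


def deduplicate(pages):
    """Run both phases and return the duplicate-to-representative mapping."""
    return verify_phase(pages, hint_phase(pages))


if __name__ == "__main__":
    zero = bytes(PAGE_SIZE)
    pages = {
        0: zero,
        1: b"\x01" * PAGE_SIZE,
        2: zero,
        3: b"\x02" * PAGE_SIZE,
        4: b"\x01" * PAGE_SIZE,
    }
    print(deduplicate(pages))
```

In this sketch a checksum match is treated only as a hint: pages are marked as shareable solely after the byte-wise comparison in the second phase, so a checksum collision costs one extra comparison but never causes a wrong merge.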



Published in

ACM SIGPLAN Notices, Volume 52, Issue 7 (VEE '17), July 2017, 256 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3140607

VEE '17: Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, April 2017, 261 pages
ISBN: 9781450349482
DOI: 10.1145/3050748

Copyright © 2017 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • tutorial
  • Research
  • Refereed limited
