Abstract
New memory technologies are blurring the previously distinct performance characteristics of adjacent layers in the memory hierarchy. Such layers no longer differ by orders of magnitude in request latency or capacity. Moving beyond the traditional single-layer view of caching, we must now recast the problem as a data placement challenge: which data should be cached in faster memory when it could instead be served directly from slower memory? We present CHOPT, an offline algorithm for data placement across multiple tiers of memory with asymmetric read and write costs. We show that CHOPT is optimal and can therefore serve as an upper bound on the performance gain of any data placement algorithm. We also demonstrate an approximation of CHOPT that makes its execution time practical for long traces by spatially sampling requests, incurring a small average error of 0.2% on representative workloads at a sampling ratio of 1%. Our evaluation of CHOPT on more than 30 production traces and benchmarks shows that optimal data placement decisions could improve average request latency by 8.2%-44.8% compared with the long-established gold standard: Belady's and Mattson's offline, evict-farthest-in-the-future optimal algorithms. Our results identify substantial improvement opportunities for future online memory management research.
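To make the baseline and the sampling idea concrete, the following is a minimal sketch and not CHOPT itself. It shows the classic single-layer, evict-farthest-in-the-future policy that the abstract uses as its point of comparison, together with one simple way to sample a trace spatially (by key rather than by request). The function names `belady_misses` and `spatial_sample`, the hash function, and the modulus are assumptions made for illustration; the paper's actual sampling scheme and cost model may differ.

```python
import hashlib
from collections import defaultdict, deque


def belady_misses(trace, capacity):
    """Count misses of the evict-farthest-in-the-future policy on one cache layer.

    Assumption: every missed key is inserted (no bypass) and all requests
    cost the same, unlike the asymmetric read/write costs the abstract describes.
    """
    # For every key, record the positions at which it is requested.
    positions = defaultdict(deque)
    for i, key in enumerate(trace):
        positions[key].append(i)

    cache = set()
    misses = 0
    for key in trace:
        positions[key].popleft()  # the head is now the key's next use, if any
        if key in cache:
            continue
        misses += 1
        if capacity == 0:
            continue
        if len(cache) >= capacity:
            # Evict the resident key that is reused farthest in the future (or never).
            victim = max(
                cache,
                key=lambda k: positions[k][0] if positions[k] else float("inf"),
            )
            cache.remove(victim)
        cache.add(key)
    return misses


def spatial_sample(trace, rate, modulus=1 << 24):
    """Keep all requests for a hash-selected subset of keys (hypothetical helper).

    A key is kept when hash(key) mod modulus falls below rate * modulus, so
    either every request to a key survives or none does.
    """
    threshold = int(rate * modulus)

    def keep(key):
        digest = hashlib.blake2b(str(key).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % modulus < threshold

    return [key for key in trace if keep(key)]


if __name__ == "__main__":
    trace = ["a", "b", "c", "a", "b", "d", "a"]
    print(belady_misses(trace, capacity=2))
    print(spatial_sample(trace, rate=0.5))
```

When a sampled trace is simulated, the cache capacity is usually scaled by the same ratio (for example, 1% of the keys against 1% of the capacity), which is what lets a small sample approximate the behavior of the full trace.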