
Mining block correlations to improve storage performance

Published: 01 May 2005

Abstract

Block correlations are common semantic patterns in storage systems. They can be exploited to improve the effectiveness of storage caching, prefetching, data layout, and disk scheduling. Unfortunately, information about block correlations is unavailable at the storage system level, and previous approaches for discovering file correlations in file systems do not scale well enough to discover block correlations in storage systems.

In this article, we propose two algorithms, C-Miner and C-Miner*, that use a data mining technique called frequent sequence mining to discover block correlations in storage systems. Both algorithms run reasonably fast with feasible space requirements, indicating that they are practical for dynamically inferring correlations in a storage system. C-Miner is a direct application of a frequent sequence mining algorithm with a few modifications. C-Miner*, by contrast, is redesigned for mining block correlations, with concessions made for the specific problem of long sequences in storage system traces. As a result, C-Miner* can discover 7–109% more correlation rules than C-Miner while running 2–15 times faster. We have also evaluated the benefits of block correlation-directed prefetching and data layout through experiments. Our results using real system workloads show that correlation-directed prefetching and data layout can reduce average I/O response time by 12–30% compared to the base case, and by 7–25% compared to the commonly used sequential prefetching scheme, for most workloads.
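To give a rough sense of what mining correlations from a block trace can look like, the sketch below is a deliberately simplified illustration and not C-Miner or C-Miner* themselves. It assumes the trace is a plain list of block numbers, cuts it into short non-overlapping windows, counts ordered block pairs that occur within a small gap, and keeps those that meet a support and confidence threshold. The function name `mine_pair_rules` and the parameters `window`, `gap`, `min_support`, and `min_confidence` are hypothetical and chosen only for this example.

```python
from collections import Counter


def mine_pair_rules(trace, window=4, gap=2, min_support=2, min_confidence=0.5):
    """Toy frequent-pair miner over a block trace (an illustration, not C-Miner)."""
    # Cut the long trace into short non-overlapping windows so that
    # support can be counted as "number of windows containing the pattern".
    windows = [trace[i:i + window] for i in range(0, len(trace), window)]

    block_support = Counter()
    pair_support = Counter()
    for w in windows:
        # Each block and each ordered pair counts at most once per window.
        block_support.update(set(w))
        pairs = set()
        for i, a in enumerate(w):
            # Only pair a with blocks accessed at most `gap` positions later.
            for b in w[i + 1:i + 1 + gap]:
                if a != b:
                    pairs.add((a, b))
        pair_support.update(pairs)

    # A rule a -> b is kept if it is frequent and reliable enough.
    rules = {}
    for (a, b), support in pair_support.items():
        confidence = support / block_support[a]
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules


# Hypothetical trace: blocks 10, 11, and 12 tend to be accessed close together.
trace = [10, 11, 50, 12, 10, 11, 70, 12, 10, 90, 11, 12, 33, 34, 35, 36]
rules = mine_pair_rules(trace)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support}, confidence={confidence:.2f}")
```

On this made-up trace the sketch reports the rules 10 -> 11 and 11 -> 12. A prefetcher could use such rules to fetch block 11 when block 10 is requested. Longer patterns and the efficiency techniques the abstract alludes to are outside the scope of this sketch.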

