Abstract
Block correlations are common semantic patterns in storage systems. They can be exploited to improve the effectiveness of storage caching, prefetching, data layout, and disk scheduling. Unfortunately, information about block correlations is unavailable at the storage system level, and previous approaches for discovering file correlations in file systems do not scale well enough to discover block correlations in storage systems. In this article, we propose two algorithms, C-Miner and C-Miner*, that use a data mining technique called frequent sequence mining to discover block correlations in storage systems. Both algorithms run reasonably fast with feasible space requirements, indicating that they are practical for dynamically inferring correlations in a storage system. C-Miner is a direct application of a frequent-sequence mining algorithm with a few modifications; C-Miner*, by comparison, is redesigned for mining block correlations, making concessions for the specific problem of long sequences in storage system traces. As a result, C-Miner* can discover 7--109% more correlation rules than C-Miner while running 2--15 times faster. We have also evaluated the benefits of block correlation-directed prefetching and data layout through experiments. Our results using real system workloads show that, for most workloads, correlation-directed prefetching and data layout can reduce average I/O response time by 12--30% compared to the base case and by 7--25% compared to the commonly used sequential prefetching scheme.
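To make the idea of a block correlation rule concrete, the following is a minimal sketch, not C-Miner or C-Miner* themselves. It assumes a trace is simply a list of block numbers and counts how often one block is followed by another within a short window, which is a much simpler stand-in for frequent sequence mining. The function name `mine_pair_rules` and the parameters `window`, `min_support`, and `min_confidence` are hypothetical choices made for this illustration.

```python
from collections import Counter


def mine_pair_rules(trace, window=4, min_support=2, min_confidence=0.5):
    """Return rules (a, b, support, confidence): 'an access to a is often followed by b'.

    Illustrative simplification only: it considers pairs of blocks, whereas
    frequent sequence mining also finds longer subsequences.
    """
    block_counts = Counter(trace)  # how many times each block is accessed
    pair_counts = Counter()        # accesses to a that are followed by b within `window`
    for i, a in enumerate(trace):
        # Count each distinct follower once per access to a.
        followers = set(trace[i + 1 : i + 1 + window])
        followers.discard(a)
        for b in followers:
            pair_counts[(a, b)] += 1

    rules = []
    for (a, b), support in pair_counts.items():
        confidence = support / block_counts[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    # Strongest rules first; ties are broken by block number for determinism.
    return sorted(rules, key=lambda r: (-r[2], -r[3], r[0], r[1]))


if __name__ == "__main__":
    # Hypothetical trace in which block 2 always follows block 1.
    example_trace = [1, 2, 9, 1, 2, 7, 1, 2, 5]
    print(mine_pair_rules(example_trace, window=1))  # prints [(1, 2, 3, 1.0)]
```

A rule such as `(1, 2, 3, 1.0)` could then drive prefetching (fetch block 2 when block 1 is read) or layout (place the two blocks next to each other), which is the kind of use the abstract evaluates.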
Index Terms
Mining block correlations to improve storage performance