research-article

On the Lifecycle of the File

Published: 18 February 2019

Abstract

Users and Operating Systems (OSs) have vastly different views of files. OSs use files to persist data and structured information. To accomplish this, OSs treat files as named collections of bytes managed in hierarchical file systems. Despite their critical role in computing, little attention is paid to the lifecycle of the file, the evolution of file contents, or the evolution of file metadata. In contrast, users have rich mental models of files: they group files into projects, send data repositories to others, work on documents over time, and stash them aside for future use. Current OSs and Revision Control Systems ignore such mental models, persisting a selective, manually designated history of revisions. Preserving the mental model allows applications to better match how users view their files, making file processing and archiving tools more effective. We propose two mechanisms that OSs can adopt to better preserve the mental model: File Lifecycle Events (FLEs) that record a file’s progression and Complex File Events (CFEs) that combine them into meaningful patterns. We present the Complex File Events Engine (CoFEE), which uses file system monitoring and an extensible rulebase (Drools) to detect FLEs and convert them into complex ones. CFEs are persisted in NoSQL stores for later querying.
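
To make the FLE-to-CFE idea concrete, the following minimal sketch folds a stream of low-level lifecycle events into one higher-level event. It is not CoFEE's implementation: CoFEE uses a Drools rulebase, whereas this is plain Python. The `FileEvent` and `ComplexEvent` types, the event names (`CREATE`, `MODIFY`, `RENAME`, `DELETE`), the `.tmp` suffix convention, and the "safe save" rule itself are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class FileEvent:
    """A hypothetical File Lifecycle Event (FLE); field names are illustrative."""
    kind: str                    # e.g. "CREATE", "MODIFY", "RENAME", "DELETE"
    path: str                    # path the event applies to
    dest: Optional[str] = None   # rename target, if any


@dataclass(frozen=True)
class ComplexEvent:
    """A hypothetical Complex File Event (CFE) derived from several FLEs."""
    kind: str
    path: str
    sources: tuple


def detect_safe_saves(events: Iterable[FileEvent]) -> List[ComplexEvent]:
    """Assumed rule: a temp file is created and later renamed onto a target.

    The CREATE and RENAME events are reported together as one SAFE_SAVE
    on the target path instead of as two unrelated events.
    """
    pending = {}   # temp path -> CREATE event awaiting a matching RENAME
    found = []
    for ev in events:
        if ev.kind == "CREATE" and ev.path.endswith(".tmp"):
            pending[ev.path] = ev
        elif ev.kind == "RENAME" and ev.path in pending and ev.dest:
            created = pending.pop(ev.path)
            found.append(ComplexEvent("SAFE_SAVE", ev.dest, (created, ev)))
        elif ev.kind == "DELETE":
            pending.pop(ev.path, None)   # temp file discarded, so no save
    return found


if __name__ == "__main__":
    stream = [
        FileEvent("CREATE", "report.doc.tmp"),
        FileEvent("MODIFY", "report.doc.tmp"),
        FileEvent("RENAME", "report.doc.tmp", dest="report.doc"),
    ]
    for cfe in detect_safe_saves(stream):
        print(cfe.kind, cfe.path)
```

A rule engine would express the same pattern declaratively and let new patterns be added without changing the detection loop, which is presumably why the abstract describes the rulebase as extensible.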

Published in

ACM Transactions on Storage, Volume 15, Issue 1
Special Issue on ACM International Systems and Storage Conference (SYSTOR) 2018
February 2019
194 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3311821
Editor: Sam H. Noh

Copyright © 2019 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 18 February 2019
• Accepted: 1 October 2018
• Revised: 1 September 2018
• Received: 1 February 2018

Qualifiers

• research-article
• Research
• Refereed
