Abstract
Users and Operating Systems (OSs) have vastly different views of files. OSs use files to persist data and structured information. To accomplish this, OSs treat files as named collections of bytes managed in hierarchical file systems. Despite the critical role files play in computing, little attention is paid to the lifecycle of the file, the evolution of file contents, or the evolution of file metadata. In contrast, users have rich mental models of files: they group files into projects, send data repositories to others, work on documents over time, and stash them aside for future use. Current OSs and Revision Control Systems ignore such mental models, persisting only a selective, manually designated history of revisions. Preserving the mental model allows applications to better match how users view their files, making file processing and archiving tools more effective. We propose two mechanisms that OSs can adopt to better preserve the mental model: File Lifecycle Events (FLEs), which record a file’s progression, and Complex File Events (CFEs), which combine FLEs into meaningful patterns. We present the Complex File Events Engine (CoFEE), which uses file system monitoring and an extensible rulebase (Drools) to detect FLEs and combine them into CFEs. CFEs are persisted in NoSQL stores for later querying.
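To make the relationship between the two event types concrete, the following is a minimal sketch of how a single rule might fold a sequence of low-level events into one complex event. It is written in Python for brevity and is not CoFEE's implementation, which the abstract says is built on a Drools rulebase. The class names `FLE` and `CFE`, their fields, the "safe save" rule, and the two-second window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FLE:
    """One low-level File Lifecycle Event (field names are hypothetical)."""
    kind: str                   # "CREATE", "MODIFY", "DELETE", or "RENAME"
    path: str                   # affected path (the source path for RENAME)
    timestamp: float            # event time in seconds
    dest: Optional[str] = None  # destination path, used by RENAME only


@dataclass(frozen=True)
class CFE:
    """One Complex File Event derived from several FLEs (hypothetical)."""
    kind: str
    path: Optional[str]
    parts: tuple                # the FLEs that were combined


def detect_safe_save(events: List[FLE], window: float = 2.0) -> List[CFE]:
    """Hypothetical rule: a temporary file that is created and then renamed
    onto a target path within `window` seconds counts as one SAVE of the
    target."""
    created = {}  # path -> the CREATE event last seen for that path
    found = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.kind == "CREATE":
            created[ev.path] = ev
        elif ev.kind == "RENAME" and ev.path in created:
            first = created.pop(ev.path)
            if ev.timestamp - first.timestamp <= window:
                found.append(CFE("SAVE", ev.dest, (first, ev)))
    return found


if __name__ == "__main__":
    stream = [
        FLE("CREATE", "/docs/~report.tmp", 10.0),
        FLE("MODIFY", "/docs/~report.tmp", 10.2),
        FLE("RENAME", "/docs/~report.tmp", 10.5, dest="/docs/report.docx"),
        FLE("CREATE", "/docs/notes.txt", 11.0),
    ]
    for cfe in detect_safe_save(stream):
        print(cfe.kind, cfe.path)
```

In this sketch the create-then-rename sequence is reported as a single save of the user-visible document, which is closer to how a user would describe the action than the raw events are. A rule engine would let many such patterns be declared and extended without changing the monitoring code.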
On the Lifecycle of the File