A Closer Look into IPFS: Accessibility, Content, and Performance

The InterPlanetary File System (IPFS) has recently gained considerable attention. While prior research has focused on characterizing its performance and application support, it remains unclear (1) what kinds of files are stored in IPFS, (2) who provides these files, (3) whether these files are always accessible, and (4) what affects file access performance. To answer these questions, we measure and analyze over 4 million files associated with CIDs (content IDs) that appeared in publicly available IPFS datasets. Our results reveal the following key findings. (1) Mixed file accessibility: although IPFS is not designed for permanent storage, accessing a non-trivial portion of files, such as those of NFTs (non-fungible tokens) and video streams, often requires multiple retrieval attempts, potentially blocking NFT transactions and degrading the user experience. (2) Dominance of NFT and video files: about 50% of stored files are NFT-related, followed by a large portion of video files, about half of which are pirated movies and adult content. (3) Centralization of content providers: a small number of peers (the top 50), mostly cloud nodes hosted by tech companies, serve a large portion (95%) of files, deviating from IPFS's intended design goal. (4) High variation in download throughput and lookup time: large file retrievals experience lower average throughput because of the added overhead of resolving file chunk CIDs, and looking up files hosted by non-cloud nodes takes longer. We hope that our findings offer valuable insights for (1) IPFS application developers, who can take these characteristics into account when building applications on top of IPFS, and (2) IPFS system developers, who can improve IPFS and similar systems being developed for Web3.


INTRODUCTION
The traditional web architecture uses a client-server model that divides network nodes into servers and clients. In this structure, many clients depend heavily on one or a few dominant servers for their data needs. This centralization, often required for efficient data management and reliable data accessibility, raises concerns about data and content being concentrated among a few tech giants. As such, Web 3.0 (also known as Web3) envisions that the next-generation web should incorporate a decentralized structure, exemplified by the operation of cryptocurrencies such as Bitcoin, Ethereum and Metaverse, which have frequently made media headlines in the past few years.
The InterPlanetary File System (IPFS), which combines content-based addressing with a P2P structure, is an early effort towards this vision. By harnessing the power of distributed P2P networks, IPFS aims to replace traditional web protocols, such as the Hypertext Transfer Protocol (HTTP), in the next generation of the Internet and to enable a more resilient and decentralized alternative for accessing and sharing information online.
Since its first release in 2015, IPFS has gained increasing popularity in the field of Web3 technologies. This burgeoning interest has attracted a number of studies that examined its performance [1,6,7], decentralization [3], content duplication [5], as well as its design and implementation, deployment experience [2,7], and potential to support applications such as video streaming [8], among others. While these studies provided insights into various aspects of IPFS, there remains a gap in understanding the actual content stored in IPFS. Furthermore, prior studies [1,6,7] have limitations in their performance analysis: they rely solely on dummy files and private clients for experiments, and mostly neglect large files (over 16MB). Since IPFS operates as a P2P network, it is crucial to assess the performance that users experience in practice. This motivates our study, which focuses on the analysis of IPFS public gateway and network traces. Our study aims to evaluate the performance of IPFS and to address the following questions regarding content and its providers.
Q1: What kind of files are stored in IPFS? Answering this question will provide us with a better understanding of whether the content shared in IPFS aligns well with our current web (e.g., Web 2.0).
On the other hand, as a storage system, data availability/accessibility is also important. This leads to our second question, Q2: Are data stored in IPFS always accessible? While hosting content across peers can minimize content centralization, it is also desirable to maintain comparable accessibility in a decentralized fashion. It is worth noting that IPFS does not aim to provide permanent file storage. But as a storage system, it is crucial to enable persistent and highly available storage for certain files, such as non-fungible token (NFT) files and video streaming files, in order to support transactions or video streaming applications.
Lastly, Q3: Who are the content providers? A common concern of traditional web services is that the content is predominantly served by or through a few large organizations, such as tech giants. IPFS aims to counteract this centralized trend by promoting decentralization. Therefore, an analysis of content providers can provide insights into the extent to which IPFS achieves its design goal.
In this paper, we start with a list of content IDs (CIDs) extracted from a publicly available two-week-long gateway log [4] and retrieve 4 million files corresponding to those CIDs. During file retrieval, we also instrument our clients to timestamp the retrieval phases, including lookup and downloading. Analysis of these data can shed light on Q4: How is the file retrieval performance? By analyzing the downloaded files and the information obtained during the retrieval process, we set out to answer the above questions. Furthermore, our collected information enables us to evaluate the file access performance on a large scale.
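As an illustration of what timestamping the lookup and download phases can look like, the following is a minimal Python sketch. It is not the instrumentation used in this study: `find_providers` and `fetch` are hypothetical callables standing in for whatever IPFS client is used, and computing throughput over the download phase only (excluding lookup) is an assumption made here for clarity.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievalTiming:
    cid: str
    lookup_s: float    # time spent finding providers for the CID
    download_s: float  # time spent transferring the content
    size_bytes: int

    @property
    def throughput_bps(self) -> float:
        # Assumption: throughput covers the download phase only.
        if self.download_s <= 0:
            return float("inf")
        return self.size_bytes / self.download_s


def timed_retrieve(
    cid: str,
    find_providers: Callable[[str], List[str]],   # hypothetical lookup hook
    fetch: Callable[[str, List[str]], bytes],     # hypothetical download hook
    clock: Callable[[], float] = time.monotonic,
) -> RetrievalTiming:
    """Run lookup and download for one CID and timestamp each phase."""
    t0 = clock()
    providers = find_providers(cid)
    t1 = clock()
    data = fetch(cid, providers)
    t2 = clock()
    return RetrievalTiming(cid, t1 - t0, t2 - t1, len(data))
```

Separating the two phases in this way is what makes it possible to tell whether a slow retrieval is dominated by finding a provider or by transferring the bytes.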

MAIN FINDINGS
File Availability. Our analysis of file accessibility in IPFS shows mixed results: while the majority of the files corresponding to the 4 million+ CIDs extracted from a one-year-old gateway log are retrievable, only about half can be downloaded right away, and downloading the other half takes multiple attempts over 6 days. Furthermore, a small portion of the files remains inaccessible after repeated attempts for a week. For NFTs, about 20% of the corresponding files are not instantly accessible, which can block NFT business transactions. We repeat the experiments on CIDs that are six months old, one month old, and zero days old (crawling the DHT and downloading the found CIDs instantly). Our results show that no matter how "young" the data is, a portion of the CIDs are not instantly available and need multiple attempts to be retrieved.
File Types. Upon examining the files that we successfully retrieved, we found that IPFS is currently used primarily by NFT and video applications: about 50% of files stored in IPFS are NFT-related, followed by a large portion (e.g., 33%+ in terms of total file size) of video files, about half of which are used by services serving adult content or pirated movies.
Content Providers. By looking into which peers provided the content during our file retrievals, we found that the content providers are highly centralized. For example, 95% of retrieved files come from the top-50 providers, and many of these providers are located in datacenters, serving as either storage nodes or cloud nodes for data such as NFTs.
Performance Analysis. The file access performance shows high variation. Specifically, the average retrieval time of small files is dominated by their lookup time, which is about 4× that of large files on average, while the file retrieval throughput shows a decreasing trend on average as file size grows. We also find that the lookup time is highly dependent on the content providers: because a higher proportion of small files is stored on non-cloud nodes, which are more likely to be behind NAT, it takes longer to identify these providers.
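To make the provider-centralization figure above concrete (95% of retrieved files served by the top-50 providers), the following Python sketch shows how such a top-k share can be computed from retrieval records. It is an illustrative sketch, not the analysis code of this study: the function name `top_k_share`, the record layout, and the assumption that each retrieved file is attributed to exactly one provider peer are all ours.

```python
from collections import Counter
from typing import Iterable, Tuple


def top_k_share(file_providers: Iterable[Tuple[str, str]], k: int) -> float:
    """Fraction of retrieved files served by the k most frequent providers.

    file_providers holds one (cid, peer_id) pair per retrieved file.
    Assumption: each file is attributed to a single provider peer.
    """
    counts = Counter(peer for _cid, peer in file_providers)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    served_by_top = sum(n for _peer, n in counts.most_common(k))
    return served_by_top / total


# Toy records: peer "A" serves 3 of 5 files, "B" and "C" one each.
records = [("c1", "A"), ("c2", "A"), ("c3", "A"), ("c4", "B"), ("c5", "C")]
print(top_k_share(records, 1))  # 0.6
print(top_k_share(records, 2))  # 0.8
```

A share close to 1.0 for a small k, relative to the total number of peers, indicates that content provision is concentrated in a few nodes.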

CONCLUSION
IPFS has emerged as a pioneering distributed storage system for Web 3.0. Its early implementation has attracted many users and applications and garnered attention from the research community. While prior studies have characterized IPFS in terms of its design and implementation, geographical participation, and file storage and retrieval performance, this paper has conducted a study aiming to gain a better understanding of what content is currently stored on IPFS, what applications are actively using the content served via IPFS, and who is providing the services. Our findings unveil several trends that call for further deliberation and for mechanisms for improvement, so that IPFS can realize its envisioned goals in the future. By no means does our study offer a complete picture of IPFS, a system that continues to evolve actively. However, we hope the trends we have identified will provide valuable insights into the design, implementation, and optimization of a large, decentralized storage service in the era of the next-generation web.