I/O Access Patterns in HPC Applications: A 360-Degree Survey

The high-performance computing I/O stack has been complex due to multiple software layers, the inter-dependencies among these layers, and the different performance tuning options for each layer. In this complex stack, the definition of an “I/O access pattern” has been reappropriated to describe what an application is doing to write or read data from the perspective of different layers of the stack, often comprising a different set of features. It has become common to have to redefine what is meant when discussing a pattern in every new study, as no assumption can be made. This survey aims to propose a baseline taxonomy, harnessing the I/O community’s knowledge over the past 20 years. This definition can serve as a common ground for high-performance computing I/O researchers and developers to apply known I/O tuning strategies and design new strategies for improving I/O performance. We seek to summarize and bring a consensus to the multiple ways to describe a pattern based on common features already used by the community over the years.


INTRODUCTION
In High-Performance Computing (HPC), "I/O access pattern" or "I/O signature" is broadly used to express how an application is performing Input and Output (I/O) operations [21]. Although the term is broadly used, there is no globally accepted convention to describe which features define an I/O access pattern and how they differ based on the level in the HPC I/O software stack they are being used to describe. For instance, some studies do not consider temporal features as part of the description of an access pattern [22] but others do [27,219]. Moreover, some studies describe the access pattern as seen by high-level libraries [26,41,66], others do so by the I/O middleware [9,216], and still others do so by what the underlying file system is receiving [24,128,237]. Hence, it is a common practice to have to redescribe exactly what is meant when discussing the "I/O access pattern" in every new study, as no assumption can be made.
In this article, we aim to propose a baseline taxonomy, harnessing the I/O community's knowledge over the past 20 years, that researchers and application developers can use to define an application's I/O access pattern. This definition can serve as a common ground to apply known I/O tuning strategies as well as to design new strategies for improving I/O performance. The definitions we propose here seek to summarize and bring a consensus to the multiple ways to describe a pattern based on common features already used by the community over the years. They are not meant to be set in stone but to serve as a baseline consensus, which can be extended to accommodate new applications from different domains.
Besides solidifying a taxonomy based on common-ground features, our survey has practical applications for current and future research in the area. For instance, a plethora of existing optimization techniques that seek to improve applications' I/O performance rely on the definition of an access pattern. Regardless of the strategies they apply, they all work by modifying how an application is accessing its data (i.e., its I/O access pattern) to be more suited to the underlying layer of the I/O stack. Request aggregation, reordering, scheduling, and collective operations [44,59,60,113,197,216] are a few examples of techniques that optimization mechanisms apply at different layers of the I/O stack. In general, such optimizations improve the performance for a given system deployment and set of I/O patterns, but not for all. Moreover, they often rely on correctly applying such techniques to the workload represented by a set of access patterns.
As novel applications from diverse domains are harnessing HPC platforms and the systems are becoming more complex to handle more concurrent applications, it becomes paramount for systems that seek to auto-tune their parameters to accurately detect the I/O access patterns at runtime. Such detection allows them to make decisions and apply the set of optimization techniques that are specifically designed for the observed patterns. Having an established taxonomy with clearly defined features can help bridge the gap between describing the access patterns and mapping them to existing techniques, allowing, for instance, AI-based and automatic tuning mechanisms to navigate the complex parameter space to find which optimizations and configurations can be applied for an observed I/O access pattern.

Contributions.
Although the term I/O access pattern is used heavily in published literature, there has been no study that encompasses, discusses, and categorizes I/O access patterns to the breadth to which the term is used in HPC. White et al. [219] proposed a taxonomy for temporal I/O patterns of HPC jobs to aid in automatically detecting poorly performing jobs. Boito et al. [21] touched on several key points of the HPC I/O stack, including access pattern extraction. However, both suffer from the same issue of redefining what an access pattern means. Furthermore, those definitions do not encompass all features of I/O and their impact when glancing at the different layers of the parallel I/O stack. In this work, we seek to provide a broader taxonomy for I/O access patterns, taking into account temporal behavior and also integrating other commonly used features as understood by the community, and to describe how those patterns are represented, used, and transformed as we traverse the HPC I/O stack.
The remainder of the article is organized as follows. In Section 2, we discuss the traditional HPC I/O software stack. The discussion of access patterns is split into four sections representing the layers in the stack. In Section 3, we describe the common data models used by scientific applications. Section 4 discusses how those data models are represented by high-level I/O libraries. Section 5 approaches the translation of I/O accesses used by middleware libraries. Finally, in Section 6, we present the perspective of the file system. In Section 8, we present common I/O benchmarks and kernels used to exercise access patterns in different levels of the HPC I/O stack, and in Section 9, we describe popular tools to visualize those patterns by using profiling and log traces. In Section 10, we conclude this survey with a summary of our contributions and existing gaps, and we highlight opportunities for further R&D.

HPC I/O STACK
To support the I/O workloads from serial or parallel scientific applications, HPC systems provide a multi-layered software stack, as illustrated in Figure 1. Between the applications and storage hardware, the parallel I/O stack consists of high-level I/O libraries, middleware I/O libraries, optimization layers, and Parallel File Systems (PFS).
While traversing the stack, an access pattern is often reshaped via a series of data transformations originating from distinct abstractions and mappings between the data models used in the layers and the application of optimization techniques (e.g., scheduling, aggregation, and compression) before reaching the file system. Furthermore, some contextual information gets lost in this process. For instance, when requests arrive at the file system layer, the file system is unaware of which application or process the request originated from, or even if the request went through any data transformations and which ones. Consequently, an application may believe that it is accessing its data in one way, whereas something entirely different is happening in reality at the lowest layers of the stack.
High-level I/O libraries are used by applications to provide data models and file management abstractions that facilitate data portability and high performance. Examples of widely adopted libraries are HDF5 [202], NetCDF [119]/PnetCDF [121], and ADIOS [134]. Those libraries map the applications' data abstractions into files or objects and encode the data in portable file formats. These libraries allow users to add metadata to describe the data and their data structures. In addition to these high-level parallel I/O libraries, application domain-specific libraries also exist. Among these, ROOT [5] and FITS [80,217] serve the High-Energy Physics (HEP) and astronomy communities, respectively. Applications may also use MPI-IO (Message Passing Interface I/O) [46], POSIX I/O (Portable Operating System Interface I/O) [67], or STDIO (Standard Input and Output) interfaces directly to perform I/O to the file systems. We discuss these interfaces, their challenges, and their known impact on HPC I/O performance in Section 7.1.4.
In MPI-IO, a file is an ordered collection of typed data items. It presents a higher level of data abstraction than POSIX by allowing users to define data models that are natural to the application. It also supports defining complex data patterns for parallel write and read operations using independent and collective I/O calls. Furthermore, it allows taking advantage of optimization opportunities when using collective calls. In POSIX I/O, however, a file is viewed as a sequence of bytes. This interface allows transferring contiguous regions of bytes between the file and memory and non-contiguous regions of bytes from memory to a file by giving full, low-level control of the I/O operations. However, in the context of HPC, there is little in the interface that inherently supports parallel I/O. For instance, POSIX does not easily support collective access to files, leaving it to the programmer to coordinate access and ensure consistency. STDIO, in contrast, abstracts all file operations into operations on (input or output) streams of bytes. It comprises the C stdio.h family of functions [97], such as fopen(), fprintf(), and fscanf(). These I/O functions are commonly used in genomics and biology to store sequencing information in text format [182]. However, STDIO functions do not directly support random access to data files. Instead, the process relies on the programmer to create a stream, seek the position in the file, and then read/write bytes in sequence from/to the stream.
I/O forwarding [3], initially proposed for Blue Gene and later extended, seeks to reduce the number of (compute) nodes concurrently accessing the PFS servers by creating an additional transparent layer between the compute nodes and the data servers. Instead of the applications accessing the PFS directly, the I/O forwarding technique defines a set of I/O nodes that are responsible for receiving I/O requests from applications and forwarding them to the PFS in a controlled manner. This allows optimization techniques such as request scheduling, aggregation, and compression [2,16,159,190,232] to reshape the pattern and flow of I/O requests to better suit the underlying layers.
In large-scale systems, applications rely on PFS to provide a globally persistent shared storage infrastructure and a global namespace across many distributed storage servers to read and write data to files. A PFS comprises two types of servers with distinct roles: the data servers and the metadata servers. The latter handle information about the files (e.g., sizes and permissions) and their location in the system. Lustre [74,93], IBM Spectrum Scale (previously known as GPFS) [174], BeeGFS [87], and PVFS [36], among others, are commonly used PFS on large-scale HPC systems. To achieve high performance, these file systems harness parallelism by using data striping [188], which consists of partitioning the files into fixed-size chunks and distributing them across multiple storage nodes. Finally, the PFS servers provide a logical file system abstraction over diverse storage devices such as Hard Disk Drives (HDDs), Solid-State Drives (SSDs), or Redundant Array of Independent Disks (RAID).

Summary #1
The multi-layered software and hardware HPC I/O stack is complex. To access data in HPC systems, applications issue requests that, while traversing the I/O stack, are reshaped via a series of data transformations. These originate from distinct abstractions and mappings between the data models used in each layer combined with optimization techniques applied before reaching the file system and, eventually, the storage hardware.
In the following sections, we discuss the I/O access patterns observed in the HPC stack's layers, from application data models and their I/O requests percolating through the underlying layers until the file systems handle them.

APPLICATION DATA MODELS AND ACCESS PATTERNS
Scientific applications often use data abstractions provided by high-level libraries (e.g., HDF5, NetCDF, ADIOS) to express data structures in a way that is more natural to a problem and domain. HPC simulations often describe their data objects using multi-dimensional data or meshes, arbitrary subsets, points and curves, and key values [151,181]. Mesh data objects, in particular, can be further represented by structured rectilinear, non-uniform rectilinear, grid-less points, structured (curvilinear), arbitrary polyhedral, constructive solid geometry (CSG), unstructured zoo (UCD), and adaptive mesh refinement (AMR) meshes. In Figure 2, we show the most common of these high-level data models used by HPC applications.
For instance, physics simulations rely on finite element methods to discretize the simulated domain by splitting it into smaller elements. Numerical methods are then applied to solve differential equations on these elements. These methods often assume that the domain is divided into a structured or unstructured mesh of smaller, simpler elements. Structured meshes have some advantages over unstructured ones. They are simpler to use and require less memory, as their coordinates can be calculated rather than stored. In the case of unstructured meshes, however, computations are irregular, causing problems of indirect, non-strided (i.e., no regular gaps between successive data accesses), or non-contiguous access to memory [153]. Yet structured meshes lack the flexibility to represent complex shapes needed for some domains [14].
Uniform rectilinear meshes (Figure 2(a)) divide the computation domain into a set of rectangular cells and are regular both in topology and geometry. If points and cells are organized into a 2D plane, they are often used to express image data. Volume can be represented by arranging this mesh into multiple stacked planes. Rectilinear grids (Figure 2(b)) differ in their regularity, where the spacing between points may vary (in any of the axes), but the rows and columns are still parallel to the axes of the Cartesian coordinate system. Curvilinear (Figure 2(c)) or structured grids (also known as mapped mesh or body-fitted mesh) have the same topology as a rectilinear grid but allow more variation in the shape of the mesh, as they can be warped into any configuration without overlap or intersection. These grids do not use Cartesian grid lines but a curvilinear coordinate system where an array of point coordinates explicitly represents the geometry. Curvilinear grids provide a more compact memory footprint and are regular in topology but present irregular geometry. They are used for finite difference computations such as flow [64], heat transfer, and combustion simulations [31]. Unstructured grids (Figure 2(e)) are a tessellation that conforms to nearly any desired geometry. However, they require more information to be stored and recovered than structured grids, for instance, to express the neighbor connectivity list. Those meshes are used in seismic wave [65,91,225], fluid dynamics [79,184], and heat transfer [130,153] simulations.
The high-level data models can be stored in a file using two data layouts with regard to interleaving: Array of Structures (AoS) or Structure of Arrays (SoA), as depicted in Figure 3. A contiguous pattern in memory (SoA) means that each of the multiple arrays holds elements of the same basic data type, such as integer, float, or double. The non-contiguous pattern in memory, also referred to as an AoS or a derived data type, represents compound data types derived from basic data types. The former helps access adjacent work items in contiguous memory locations, whereas the latter is often more intuitive from the developer's perspective, as each structure is kept together. Once that data needs to be persisted in a file, it can use the same strategy as the one used to represent the data in memory or the opposite one. Table 1 depicts this by comparing in-memory and in-file representations when using HDF5, for instance, to store data in a contiguous or compound fashion. For illustration purposes, we restrict ourselves to 1D arrays. Nonetheless, this representation is also used by other I/O libraries and interfaces such as MPI-IO, where one can define data types to describe both memory and file layout.
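To make the two layouts concrete, the following minimal sketch packs the same data both ways using only Python's standard library. The record type (a 32-bit integer id plus a 64-bit float x) and the helper names are hypothetical, chosen for illustration only; the sketch does not reproduce how HDF5 or MPI-IO encode compound types.

```python
import struct

# Hypothetical particle records: (id: int32, x: float64), packed little-endian
# without padding, so each interleaved record occupies 12 bytes.
ids = [1, 2, 3]
xs = [0.5, 1.5, 2.5]

# Array of Structures: the fields of one record are adjacent (compound type).
aos = b"".join(struct.pack("<id", i, x) for i, x in zip(ids, xs))

# Structure of Arrays: all ids first, then all xs (contiguous per field).
soa = struct.pack("<3i", *ids) + struct.pack("<3d", *xs)


def read_x_from_aos(buf, n):
    """Strided access: 8 useful bytes out of every 12-byte record."""
    return [struct.unpack_from("<d", buf, k * 12 + 4)[0] for k in range(n)]


def read_x_from_soa(buf, n):
    """One contiguous run of n doubles that starts after the n int32 ids."""
    return list(struct.unpack_from(f"<{n}d", buf, 4 * n))
```

Both buffers hold the same 36 bytes of payload, but extracting a single field touches one contiguous region in the SoA buffer and n separate regions in the AoS buffer.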
As an application's I/O requests need to traverse the I/O stack to ultimately reach the storage system, its I/O pattern is reshaped and transformed by various existing optimization techniques (e.g., collective buffering and data sieving [9,44,136,199,216], request scheduling [6,16,20,22], and request aggregation [96,197,208]). Often these transformations are transparent to the end user. Thus, what the application believes it is doing might differ from what the other levels of the I/O stack perceive of the application's behavior. Due to that, information related to how the application is accessing its data can be lost throughout the stack. For instance, when requests arrive at a PFS, it is nearly impossible to determine which rank issued that request and whether it was initially contiguous or not. To clarify such an example, if I/O middleware libraries such as MPI-IO apply collective I/O optimization, only the aggregators will issue the requests to the PFS, and only they will know with which ranks they should exchange data about the request. To further complicate the situation, suppose a forwarding layer [2,98] (or any other transparent middleware) is present, possibly merging or scheduling and aggregating requests from multiple compute nodes. In that case, the PFS will not know which application rank originally issued the I/O request. Moreover, when those requests are forwarded to the PFS (using a forwarding layer), the latter will often not know from which compute nodes they originated. Only the I/O nodes will have such information to forward back the data.

Summary #2
Parallel applications rely on data models that are naturally mapped to a problem domain. To be stored in files, data must be transformed by intermediate layers of the HPC I/O stack. Thus, the features we can use to describe an I/O access pattern at the file system level are not the same as the view we have at a higher level in the I/O stack.

ACCESS PATTERNS IN HIGH-LEVEL I/O LIBRARIES
High-level I/O libraries allow HPC applications to express scientific simulation data more naturally instead of being constrained or caught up by system-specific details.HDF5, NetCDF/PnetCDF, ADIOS, ROOT, and FITS are examples of such libraries, each providing a set of APIs to express complex multi-dimensional data, contiguous and non-contiguous data, seeking to attain performance and portability, and also increase productivity.
HDF5 (Hierarchical Data Format, Version 5) is a well-known self-describing file format and an I/O library [202] that provides flexibility, extensibility, and portability. It is used widely in many science domains to manage various data models [28]. HDF5 uses the concept of dataspace objects to control data transfer when data is read or written. A dataspace defines the layout of the data (the organization in rows, columns, etc.) in a file and memory. Data is rearranged by the library when different layouts are used to represent a given dataset in memory and in a file. However, both source and destination are stored as contiguous blocks of storage with the elements ordered as defined by the dataspace. HDF5 allows an application to read or write to a portion of a dataset (partial I/O) by using hyperslabs and points. Hyperslabs are portions of datasets whose selection can be a logically contiguous collection of points in a dataspace or a regular pattern of points or blocks in a dataspace. Figure 4 illustrates four types of partial I/O in HDF5.
An HDF5 hyperslab can be viewed as a rectangular pattern defined by four arrays: offsets of the starting location for the hyperslab, the stride or number of elements to separate each element or block to be selected, the number of elements or blocks to select along each dimension, and the size of the blocks selected from the dataspace. Figure 5 depicts a hyperslab selection (left) in a dataset.
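The four arrays can be illustrated with a short pure-Python sketch that enumerates the coordinates a regular hyperslab would select. It does not call the HDF5 library; the function name hyperslab_indices and the example values are assumptions made for illustration, and the sketch assumes that the stride is at least as large as the block in every dimension.

```python
from itertools import product


def hyperslab_indices(start, stride, count, block):
    """Enumerate the element coordinates selected by a regular hyperslab.

    Along each dimension, `count` blocks of `block` consecutive elements are
    selected, with block origins `stride` elements apart, beginning at `start`.
    """
    per_dim = []
    for s, st, c, b in zip(start, stride, count, block):
        per_dim.append([s + i * st + j for i in range(c) for j in range(b)])
    # The selection is the Cartesian product of the per-dimension indices.
    return list(product(*per_dim))


# Illustrative 2D selection: 2 x 2 blocks, each block 3 rows by 2 columns.
sel = hyperslab_indices(start=(0, 1), stride=(4, 3), count=(2, 2), block=(3, 2))
```

In this example, rows 0-2 and 4-6 are combined with columns 1-2 and 4-5, so 24 elements are selected in four rectangular blocks.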
NetCDF provides scientific programmers with a self-describing and machine-independent portable format for storing array-oriented data [119]. NetCDF-4 is the current version of the classic NetCDF file format. NetCDF-4 supports parallel file access to both classic NetCDF and HDF5 files. Parallel I/O to NetCDF-4 formatted files is supported through the HDF5 library, whereas parallel I/O to classic NetCDF files is supported through PnetCDF. PnetCDF is a high-performance parallel I/O library for accessing NetCDF files that provides higher-level data structures (e.g., multi-dimensional arrays of typed data). The NetCDF-4 read and write API functions allow defining hyperslab parameters such as start and count vectors. For instance, the function nc_put_vara_int(), which writes an array of integer values to a variable, has arguments to specify the start index for each dimension of an array and a corresponding count specifying the edge lengths along each dimension of the block of data values to be written.
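To show how a start/count block relates to the byte ranges eventually touched, the sketch below computes the extents of such a block in a row-major array. It is an illustration under simplifying assumptions rather than the library's implementation: the function name vara_extents is hypothetical, offsets are relative to the beginning of the variable's data (any file header is ignored), and the variable is assumed to be stored contiguously with a fixed shape.

```python
from itertools import product


def vara_extents(shape, start, count, itemsize):
    """Return (byte_offset, byte_length) runs covered by a start/count block.

    Offsets are relative to the first element of a row-major array; runs that
    happen to be adjacent are merged into a single extent.
    """
    # Row-major byte strides: the last dimension varies fastest.
    strides = [itemsize] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]

    run = count[-1] * itemsize  # bytes that are contiguous along the last axis
    extents = []
    outer = [range(s, s + c) for s, c in zip(start[:-1], count[:-1])]
    for idx in product(*outer):
        off = sum(i * st for i, st in zip(idx, strides)) + start[-1] * strides[-1]
        if extents and extents[-1][0] + extents[-1][1] == off:
            extents[-1] = (extents[-1][0], extents[-1][1] + run)
        else:
            extents.append((off, run))
    return extents


# A 2 x 3 block of 4-byte integers inside a 4 x 6 variable: two separate runs.
partial_rows = vara_extents((4, 6), start=(1, 2), count=(2, 3), itemsize=4)
# Two full rows collapse into one contiguous extent.
full_rows = vara_extents((4, 6), start=(1, 0), count=(2, 6), itemsize=4)
```

The same logical request therefore becomes either one contiguous access or several strided ones, depending on how the block aligns with the fastest-varying dimension.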
ADIOS (Adaptable Input Output System) [134,139] provides an I/O abstraction framework for portable and scalable I/O to aid scientific applications when data transfer volumes exceed the capabilities of traditional file I/O. Unlike HDF5, ADIOS does not define a hierarchical model but rather sits on a layer of abstraction beneath such models. It also relies on self-describing data, stored in a binary-packed (.bp) format for rapid metadata extraction, but it can use different backend file storage formats such as HDF5 and NetCDF. ADIOS can also extract relevant information from large datasets, transporting and transforming groups of self-describing data variables and attributes across different media. This library uses an external metadata file in XML format to describe variables, types, and the path to take from memory to disk. It also has built-in optimization modules for buffering and scheduling [76]. The ADIOS-2 API allows specifying the start and count vectors for setting the offsets and dimensions for the MPI ranks, respectively.
ROOT [5] is an object-oriented C++ framework conceived in the HEP community, designed for storing and analyzing petabytes of data efficiently. ROOT has been used for storing over one exabyte of HEP events [140]. In ROOT, objects in memory undergo serialization and compression before reaching the binary representation in files. These are self-descriptive files composed of a header and data following a hierarchical directory format. ROOT can also use columnar representation for data in files, allowing I/O optimizations such as partial reading (i.e., reading only a subset of relevant columns), prefetching, and read-ahead to improve performance. For instance, HEP applications benefit from such file layouts when analyzing many statistically independent collision events.
FITS (Flexible Image Transport System) [80,217] is a standardized data format in astronomy. Initially conceived as a standard interchange format for digital images, FITS files are used as a working data format to store ASCII or binary tabular data, in addition to images and spectra. Files consist of a sequence of one or more Header and Data Units (HDUs). A header is composed of ASCII card images (usually read into a string array variable) that describe the content of the associated data unit, which might be a spectrum (vector), an image (array), or tabular data in ASCII or binary format (often read as a structure). Tabular data cannot appear in the first HDU, whereas image and vector data can be present in any HDU. The HDUs following the first (or primary) HDU are also known as extensions.

Summary #3
High-level I/O libraries present a layer of abstractions so that applications can easily map their data models to files. Several of these libraries are designed to provide portability of file formats as well as a self-describing feature that allows adding metadata to data. I/O access patterns of multi-dimensional data structures at this layer are designed to hide the complexity of converting data models to their file layouts.

ACCESS PATTERNS AT THE I/O MIDDLEWARE LAYER
Before reaching the persistent media, an application's request can go through a series of data transformations enabled by I/O optimizations. For instance, using MPI-IO, if a group of MPI ranks knows which parts of a file each rank is accessing, it becomes possible to merge these requests into a smaller number of larger and more contiguous accesses that span a large portion of the file. When applied at the client level, this optimization is described as two-phase I/O [53,199] with collective buffering and data sieving [200]. Such optimizations effectively change how the application issues its I/O requests; that is, they change its access pattern.
Collective buffering aims to reduce I/O time by making file accesses as large and as contiguous as possible, even if it requires additional communication between the ranks. In two-phase I/O, aggregator processes are responsible for carrying out the writes and reads. Each one manages a chunk of contiguous data from a subset of processes in a file. During the write process, an aggregator gathers data from a subset of processes into contiguous chunks in memory and writes the aggregated data to the file system. During reads, aggregators load part of the file and distribute smaller chunks of data to a subset of processes, as shown in Figure 6. For instance, ROMIO, which is a portable, high-performance implementation of MPI-IO, exposes two user-defined tuning options that can control the application of this technique: the number of processes that actually issue the I/O requests in the I/O phase (cb_nodes), often referred to as aggregators, and the maximum buffer size on each process (cb_buffer_size). These options help define the access pattern perceived by underlying layers of the I/O stack.
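The effect of the write path on the access pattern can be sketched with a small single-process simulation. This is not ROMIO's algorithm: the function two_phase_write is hypothetical, it only borrows the name cb_nodes for the number of aggregators, it splits the accessed file range evenly among them, and it assumes the ranks' pieces cover that range without holes (otherwise a read-modify-write step would be required).

```python
def two_phase_write(requests, cb_nodes):
    """Simulate the write side of two-phase I/O.

    requests: {rank: [(file_offset, data_bytes), ...]}
    Returns the (file_offset, data_bytes) writes issued by the aggregators.
    """
    pieces = sorted(p for plist in requests.values() for p in plist)
    lo = pieces[0][0]
    hi = max(off + len(data) for off, data in pieces)
    domain = -(-(hi - lo) // cb_nodes)  # ceiling division: bytes per aggregator

    writes = []
    for a in range(cb_nodes):
        d_lo = lo + a * domain
        d_hi = min(d_lo + domain, hi)
        if d_lo >= d_hi:
            continue  # more aggregators than data: this one stays idle
        buf = bytearray(d_hi - d_lo)
        # Communication phase: gather every piece overlapping this file domain.
        for off, data in pieces:
            s, e = max(off, d_lo), min(off + len(data), d_hi)
            if s < e:
                buf[s - d_lo:e - d_lo] = data[s - off:e - off]
        # I/O phase: one large contiguous write per aggregator.
        writes.append((d_lo, bytes(buf)))
    return writes


# Four ranks, each owning two interleaved 4-byte pieces of a 32-byte region.
reqs = {r: [(r * 4, bytes([65 + r]) * 4), (16 + r * 4, bytes([65 + r]) * 4)]
        for r in range(4)}
writes = two_phase_write(reqs, cb_nodes=2)
```

In this example, eight small interleaved requests from four ranks reach the file system as two contiguous 16-byte writes, which is the reshaping that the underlying layers observe.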
Data sieving is another optimization in MPI-IO aiming to reduce I/O latency by making as few requests to the PFS as possible. For read operations, when a process issues non-contiguous requests, instead of reading each piece of data separately, ROMIO reads a single contiguous chunk that ranges from the first to the last requested byte in the file into a temporary buffer in memory. ROMIO provides two user-defined parameters to control the buffer size for reads (ind_rd_buffer_size) and writes (ind_wr_buffer_size) [199]. If a user requests a large portion of the file that would not fit in the allocated memory, the ROMIO implementation performs the data sieving in parts delimited by the buffer size. The caveat of data sieving arises when there are large gaps between the requested regions, as the cost of reading the extra data can outweigh the benefit of issuing fewer requests.
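A minimal sketch of the read path, assuming non-overlapping requests and an in-memory file object, is shown below. The function sieved_read and its buffer_size argument are hypothetical stand-ins for the behavior described above and do not reproduce ROMIO's implementation.

```python
import io


def sieved_read(f, requests, buffer_size):
    """Serve non-contiguous (offset, length) reads with contiguous reads.

    Returns (list_of_bytes, number_of_reads_issued, wasted_bytes). The extent
    from the first to the last requested byte is read in windows of at most
    buffer_size bytes, and the requested pieces are copied out of each window.
    """
    lo = min(off for off, _ in requests)
    hi = max(off + n for off, n in requests)
    result = [bytearray(n) for _, n in requests]
    reads = 0
    for w_lo in range(lo, hi, buffer_size):
        f.seek(w_lo)
        window = f.read(min(buffer_size, hi - w_lo))
        reads += 1
        w_hi = w_lo + len(window)
        for k, (off, n) in enumerate(requests):
            s, e = max(off, w_lo), min(off + n, w_hi)
            if s < e:
                result[k][s - off:e - off] = window[s - w_lo:e - w_lo]
    # Bytes transferred but never requested (valid for non-overlapping requests).
    wasted = (hi - lo) - sum(n for _, n in requests)
    return [bytes(r) for r in result], reads, wasted


f = io.BytesIO(bytes(range(100)))
reqs = [(10, 4), (30, 4), (50, 4)]
data, reads, wasted = sieved_read(f, reqs, buffer_size=64)
```

Here three 4-byte requests are served by a single 44-byte read, of which 32 bytes are discarded. Widening the gaps between the requests increases the discarded portion while the number of avoided requests stays the same, which is the trade-off noted above.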

Summary #4
The middleware layer provides opportunities to apply optimization techniques that transform the data accesses to be more suitable for the underlying file system. Collective buffering and data sieving are two solutions available in MPI-IO to improve data access by reshaping the I/O access pattern.

ACCESS PATTERNS AT THE FILE SYSTEM LAYER
Large-scale HPC systems use PFS to provide a persistent shared storage infrastructure, as discussed in Section 2. A PFS is deployed over a set of dedicated nodes and offers a shared namespace, so applications can seamlessly access remote files. They harness parallelism by breaking the files into chunks or stripes and distributing them across multiple storage nodes to achieve high performance. This operation is often referred to as data striping [188]. Figure 7(a) illustrates how a file is striped among multiple storage servers, called Object Storage Targets (OSTs) in Lustre.
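As a simplified illustration of striping, the sketch below maps a file offset to a storage target under a plain round-robin layout that starts at target 0. The function locate and its parameter names are assumptions made for illustration; real file systems may choose starting targets and layouts differently.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a file offset to (target_index, offset_within_target_object).

    Assumes fixed-size stripes distributed round-robin over stripe_count
    targets, with the first stripe placed on target 0.
    """
    stripe = offset // stripe_size  # index of the stripe holding the byte
    target = stripe % stripe_count  # round-robin placement
    object_offset = (stripe // stripe_count) * stripe_size + offset % stripe_size
    return target, object_offset


KB = 1024
first_byte = locate(0, stripe_size=64 * KB, stripe_count=4)
sixth_stripe = locate(5 * 64 * KB + 10, stripe_size=64 * KB, stripe_count=4)
```

With four targets, the sixth stripe wraps around to target 1 and lands in the second stripe-sized slot of that target's object.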
As stripes can be located in different storage targets, to complete a write/read operation, the PFS might need to access multiple targets. Unaligned requests can also require access to multiple OSTs to complete an operation and introduce inefficiencies [105,127,238] due to false data sharing. Figure 7(b) depicts such a scenario. For instance, consider an application issuing 64-KB requests to a file stored with a stripe size of 64KB. If the first 136KB of the file is used for some header representation, all data accesses are shifted by the header. Because 136KB is not a multiple of the stripe size, every request then straddles a stripe boundary. Thus, instead of issuing a single call to a single OST to write/read the data, the PFS client will need to break the request. For instance, in Figure 7(b), to complete the second request (in pink), two targets (OST 2 and 3) should be contacted to access non-contiguous regions of the file stripe. It is easy to extrapolate the impact of misaligned requests on larger scales.
Furthermore, because of this centralized shared infrastructure, for clients to access the OSTs, they need to go over the network, which could introduce overhead and contention, especially if the requests are small and bursty. The previous example could cause a lot of smaller requests (each smaller than the original 64KB) to be issued because of the misalignment. Moreover, in the file system servers, contiguous data accesses usually yield higher I/O performance than non-contiguous ones [231] for both HDDs and SSDs. Zimmer et al. [242], among others, confirm that small and random request patterns negatively impact file system performance. Therefore, applications observe benefits when accessing a file by issuing fewer requests, reducing the high I/O latency.
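The arithmetic behind this example can be checked with a short sketch that splits a request at stripe boundaries. The function split_request is hypothetical and assumes a round-robin layout over four targets starting at target 0, so the target indices it reports are illustrative and need not match the OST numbering of Figure 7(b).

```python
def split_request(offset, length, stripe_size, stripe_count):
    """Split one request into (target_index, bytes) pieces at stripe boundaries."""
    pieces = []
    end = offset + length
    while offset < end:
        stripe = offset // stripe_size
        # Bytes left in the current stripe, capped by the end of the request.
        n = min(end, (stripe + 1) * stripe_size) - offset
        pieces.append((stripe % stripe_count, n))
        offset += n
    return pieces


KB = 1024
# Aligned 64-KB request: a single piece on a single target.
aligned = split_request(0, 64 * KB, stripe_size=64 * KB, stripe_count=4)
# Same request shifted by a 136-KB header: 136 mod 64 = 8, so it is cut in two.
shifted = split_request(136 * KB, 64 * KB, stripe_size=64 * KB, stripe_count=4)
```

Under these assumptions, the shifted request is served as a 56-KB piece followed by an 8-KB piece on the next target, doubling the number of requests for the same amount of data.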
Another pivotal aspect that has a direct impact on performance when discussing access patterns at the file system layer is the metadata accesses. In Unix-based operating systems, metadata is stored in an index node (i-node) comprising information about ownership, permission, the object's type (e.g., file or directory), size, and modified timestamp [170,192]. Furthermore, since PFS tend to rely on POSIX I/O semantics (which were not conceived with parallel accesses in mind), the scalability of metadata accesses is often impaired. For instance, serialization is expected to happen in scenarios where a large number of files are created by multiple processes in a single directory. This is a common pattern observed in HPC applications [1,13,51,205,226].
Moreover, because these PFS tend to adopt the concept of data striping (to allow parallel access and improve performance), before accessing data, the PFS client must fetch permissions and obtain the file layout (including striping locations and sizes) from one of the metadata servers. In the Lustre PFS, a metadata service provides the index, or namespace, for a Lustre file system. The metadata content is stored in volumes called Metadata Targets (MDTs). Since most basic operations involve metadata, it is paramount to ensure scalability of metadata accesses. For instance, prior to Lustre 2.4, only a single MDT could be used to store metadata. The Lustre 2.4 release introduced the concept of DNE (Distributed Namespace Environment), where the metadata workload can be distributed across multiple MDTs, which are usually spread across multiple metadata servers. Nonetheless, metadata servers are often fewer than data servers, if not centralized into a single server, to avoid complex cache coherence issues and overheads. Consequently, an application creating or accessing a large number of files might be limited by metadata, possibly impacting other applications in the system due to the shared nature of the metadata servers.
Different approaches [129,145,164,166,176,221] have been proposed to tackle metadata issues covering how to handle, scale, and index metadata efficiently.For instance, Liao et al. [129] present a metadata management system that uses a database to record the information of datasets and manages metadata while providing a suitable I/O interface.Paul et al. [166] propose a metadata indexing and search tool specifically designed for large-scale HPC storage systems.Their solution relies on using an in-tree design with a parallel leveled partitioning approach to partition the file system namespace into disjoint subtrees.They maintain an internal metadata index database that uses a two-level database sharding technique to increase indexing and querying performance, combined with a changelog-based approach to keep track of the metadata changes and reindex the file system.Wu et al. [221] proposed StageFS, a PFS optimized for SSD-based clusters.StageFS stores both the metadata and small files in LSM trees for fast indexing.Seeking to avoid frequent small writes, StageFS uses buffering to better utilize the bandwidth of SSD devices.They demonstrate up to 21.28× performance improvements in metadata operations compared to state-of-the-art solutions.
Due to the shared nature of these storage deployments, multiple concurrent applications submitting a large number of metadata operations simultaneously can easily saturate the shared PFS metadata resources. On that front, MetaFS, proposed by Shaffer and Thain [176], seeks to address the bursts of metadata activity during program loading. It indexes the static metadata content of applications and delivers it in bulk to execution nodes, where it can be cached and queried, essentially trading metadata activity for data transfer. Their approach achieved order-of-magnitude decreases in metadata load on the shared file system. On the same front, Macedo et al. [145] present a storage middleware that enables system administrators to proactively control and ensure QoS over metadata workflows in HPC storage systems. Their solution seeks to avoid saturating the shared metadata resources, which could lead to unresponsiveness of the storage backend and overall performance degradation.

Summary #5
Due to its shared nature, a PFS receives interleaved requests from multiple concurrently running applications. Thus, the I/O access pattern seen by these storage targets can present few resemblances to what the application initially issued. Furthermore, metadata requests play a pivotal role in scalability and performance due to the centralized characteristics of such systems.

ACCESS PATTERN TAXONOMY
HPC applications issue their I/O requests to a file system in diverse ways, depending on how their data was modeled and coded. They also tend to present a consistent I/O behavior, with a few access patterns being repeated multiple times over an extensive period [35,61,62,70,89,137]. A better understanding of such patterns and what optimizations are suited for each one can lead to performance improvement on the application side and when considering the system as a whole. Based on that, some features can be used together to describe the application's access pattern. Although there is no globally accepted convention to describe which elements or features define an access pattern, researchers in the HPC I/O area often examine a common subset of factors or parameters: for instance, the file approach (single file or shared file), the number of requests, their sizes, and the spatial locality in the file [16,34,137,138,227]. However, other features such as temporal behavior, intensity or burstiness, and overlapping accesses are considered for specific applications or optimization techniques [16,61,187,214,228,229,234]. The access pattern has a direct impact on achieved performance, which justifies the different research efforts put into optimizing data access [30,81,114,138,231].
We seek to provide a taxonomy for I/O access patterns based on collective understanding from the community and its usage over time.We believe that this formalization is helpful to the scientific community, as applications often observe poor I/O performance due to bottlenecks in the system that could be a result of the lack of translation between metric collection, bottleneck detection, and optimization solutions.A defined and globally accepted taxonomy will aid in translating metrics into patterns and guide end users on how to harness the various existing optimization techniques to improve I/O application performance.Figure 8 uses a node-link hierarchical tree diagram of classes positioned in polar coordinates to describe the taxonomy from different layers of the HPC I/O stack.
Furthermore, besides the features used in each layer, an I/O access pattern can be observed from different scopes.Yin et al. [231] classify the access pattern as local, global, or system-wide.The local pattern describes an application's behavior in the context of a process or task, whereas the global pattern describes it at the application level, considering all processes and tasks.However, the system-wide pattern describes the patterns of the diverse concurrent applications when using the shared storage infrastructure or I/O nodes.The local access pattern information is usually employed to identify and apply optimizations on the client side.In contrast, the global access pattern is more suitable for I/O middleware, the forwarding layer, or file system servers since it has an overview of the application's data accesses.The system-wide pattern can also be used in the data servers [23,118,169,187,236] and forwarding layer [2,16,20,159,232] to coordinate accesses and optimize I/O performance of the whole system.

Access Pattern Features
In this section, we discuss the features often used to describe an I/O access pattern and how they are used in the I/O stack. We classify the patterns based on I/O operations, synchronicity, file approach, spatial locality, interfaces, consistency, and temporal behavior. Section 7.2 groups these features based on community usage over the years.

Fig. 8. Taxonomy of features used to describe an I/O access pattern at different layers of the HPC I/O software stack: application side, high-level I/O libraries, middleware layer, and file system. Some features are repeated as they are meaningful across layers of the stack, whereas others are intrinsic to a particular layer.

Operation.
We can broadly classify the I/O operations as writes and reads. For append operations, the file offset is first positioned at the end of the file using a seek operation, then a write operation appends the data. The modification of the file offset and the write operation is performed as a single atomic step.
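The append behavior described above can be illustrated with a minimal Python sketch. It assumes a POSIX-like system where the `O_APPEND` open flag makes the kernel position the offset at end-of-file and write in one step; the helper names `append_record` and `demo_append` are hypothetical and exist only for illustration.

```python
import os
import tempfile


def append_record(path: str, payload: bytes) -> int:
    """Append payload to path and return the file size afterwards."""
    # With O_APPEND, the offset is moved to end-of-file and the data is
    # written as a single step, so no explicit seek is issued here.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


def demo_append() -> bytes:
    """Append two records through independent opens and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "log.bin")
        append_record(path, b"first;")
        append_record(path, b"second;")
        with open(path, "rb") as f:
            return f.read()
```

Each call opens the file independently, yet the second record lands after the first because the positioning is handled by the append mode rather than by the caller.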

File Approach.
There are various scenarios for executing parallel I/O depending on how many processes (MPI ranks) are performing I/O and on how many files are accessed by the processes. In the first scenario, each process of an application issues its operations to an individual file, which is called the file-per-process approach, as shown in Figure 9(a). This scenario is represented by having multiple files and multiple writers/readers. When the number of processes is too large, instead of accessing a file per process, data can be aggregated to a small subset of processes, and they can access a smaller number of files. This is called the subfiling approach [28,29]. Although that might achieve performance by harnessing the parallelism inherited from having multiple data servers, future use of those files for post-processing or analysis will have to access those multiple files to get the required data, as it is scattered. The scalability of this approach is limited when handling metadata operations for extreme-scale applications.
In the second scenario, all processes share a common file (shared file). We can further distinguish such a scenario based on the number of writers. At the opposite extreme of the file-per-process, a single writer (commonly rank 0) receives data from many or all ranks (typically using collective MPI calls), rearranges it, and writes it to a single shared file, as depicted in Figure 9(b). The performance of this approach is limited by the memory available in the aggregator node (to receive and handle the data from the entire application); in addition, it cannot utilize the total available bandwidth to the storage servers effectively. Instead of a single writer, we can have a subset of ranks that aggregate and issue the I/O operations, as in Figure 9(d). This strategy implies communication between each rank and its aggregator, and the latter also has an additional data rearrangement step before dispatching the requests to the storage system. Finally, another known approach is having all ranks write their data to the file in a pre-defined non-overlapping location, avoiding inter-rank communication but relying on implicit coordination, as illustrated in Figure 9(c). The performance of this approach is also limited, as there is no coordination or aggregation of I/O requests between ranks on the same compute node.
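To make the file approaches above concrete, the following Python sketch maps a rank to the file it targets and the offset at which its first block would start. It is only an illustration under simplifying assumptions: every rank writes one fixed-size block, and the file naming scheme and function names are hypothetical rather than taken from any particular application or library.

```python
from typing import Tuple


def file_per_process(rank: int, block: int) -> Tuple[str, int]:
    """Each rank owns its own file and starts writing at offset 0."""
    return (f"out.{rank:05d}.dat", 0)


def shared_file(rank: int, block: int) -> Tuple[str, int]:
    """All ranks write to one file at pre-defined, non-overlapping offsets."""
    return ("out.dat", rank * block)


def subfiling(rank: int, block: int, ranks_per_file: int) -> Tuple[str, int]:
    """Ranks are grouped; each group shares one file (a subfile)."""
    group = rank // ranks_per_file   # which subfile this rank belongs to
    local = rank % ranks_per_file    # position of the rank inside its group
    return (f"out.sub{group:03d}.dat", local * block)
```

With 8 ranks and `ranks_per_file=4`, the subfiling variant produces two files instead of eight (file-per-process) or one (shared file), which is the trade-off between metadata load and contention on a single file that the text discusses.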

Spatial Locality.
The spatial locality or spatiality refers to the file offsets between consecutive I/O accesses. Typical spatial access patterns are contiguous, strided, or random. This feature directly impacts I/O performance because the storage infrastructure (at hardware and software levels) is affected by the sequentiality of the requests [21]. For instance, file systems can cache or prefetch data when they predict a regular pattern to avoid the costly seek operations between consecutive I/O requests, thereby improving the I/O performance of HPC applications.
We can define the spatial locality of an I/O request by its file offset $\mathit{off}_i$ and its size $\mathit{size}_i$, where $i$ identifies the $i$-th request. If the access to a file is sequential, each process $p$ accesses contiguous chunks of the file (Figure 10(a)) and the relation $\mathit{off}_{p,i+1} = \mathit{off}_{p,i} + \mathit{size}_{p,i}$ holds for all subsequent requests. However, in a strided (1D, 2D, nD) pattern, each process accesses portions of the data with a fixed-size gap (or stride) between them (Figure 10(b)). The file pointer is incremented by the same amount between each request (i.e., the stride), hence $\mathit{off}_{p,i+1} = \mathit{off}_{p,i} + \mathit{stride}_i$, where $\mathit{stride}_i = \sum_{p} \mathit{size}_{p,i}$ is often a constant. Furthermore, strided accesses are common when accessing shared files. For the file-per-process approach, it is fairly common for a file to be accessed contiguously. Despite random access (Figure 10(c)) being less common for traditional HPC [177], novel workloads from machine learning applications present such behavior [43,55,240,241], often due to shuffling data between iterations and epochs, which usually results in a large number of concurrent data writes to the file system.
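The relations above suggest a simple way to label the requests of a single process. The Python sketch below is a minimal illustration, not a method taken from the surveyed works: it assumes a list of (offset, size) pairs in issue order, and the function name `classify_spatial` is hypothetical.

```python
from typing import List, Tuple


def classify_spatial(requests: List[Tuple[int, int]]) -> str:
    """Label one process's (offset, size) requests, given in issue order."""
    if len(requests) < 2:
        return "contiguous"
    pairs = list(zip(requests, requests[1:]))
    # Contiguous: every request starts exactly where the previous one ended,
    # i.e., off[i+1] == off[i] + size[i].
    if all(nxt_off == off + size for (off, size), (nxt_off, _) in pairs):
        return "contiguous"
    # Strided: the offset advances by the same positive amount every time,
    # i.e., off[i+1] == off[i] + stride with a constant stride.
    strides = {nxt_off - off for (off, _), (nxt_off, _) in pairs}
    if len(strides) == 1 and next(iter(strides)) > 0:
        return "strided"
    return "random"
```

For example, four ranks that each write 4-byte records round-robin into a shared file would each produce offsets 16 bytes apart, which this sketch labels as strided, matching the case where the stride equals the sum of the request sizes across processes.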

Interfaces.
Interfaces seek to provide a convenient and easy-to-use way to access resources. In the context of I/O, these have an important role when accessing files (locally or remotely) by defining APIs and semantics. Furthermore, in HPC, these interfaces should strive to balance usability and high performance, often divergent goals. We can consider three main interfaces that are used directly by applications or high-level libraries to express their access patterns: POSIX I/O, MPI-IO, and STDIO. High-level I/O libraries provide various APIs that simplify mapping data models at the application level with MPI-IO and POSIX interfaces. For instance, HDF5 uses the MPI-IO interface for parallel I/O and POSIX I/O for sequential applications. ADIOS and PnetCDF use MPI-IO and POSIX similarly. Hereafter, we briefly discuss each interface and summarize their opportunities and challenges in the context of HPC in Table 2.

POSIX I/O. POSIX (Portable Operating System Interface) is a set of standards defined by IEEE to maintain compatibility among diverse operating systems, allowing an application to obtain basic services from an operating system. POSIX also defines an I/O API used to interact with the file system. Its I/O interface was first introduced in 1988 in the POSIX.1 specification, and it was designed for local file system accesses. POSIX.1b [92] introduced asynchronous and synchronous behaviors. Despite the fact that it was designed for local file systems that used to support sequential applications, POSIX is employed by a wide range of applications due to its portability. However, the portability of POSIX comes with a price when used in HPC. The POSIX semantics define what is and is not guaranteed when its API is used. For instance, it specifies that write operations must be strongly consistent; that is, a write() call is required to block the application execution until the system can guarantee that any following read() calls will actually read the data that was just written. In the case of HPC, these strict requirements introduce complexity for distributed and parallel file systems where remote processes are unaware of what local processes might be modifying in a file and vice versa. HPC centers often provide POSIX-based parallel file systems (e.g., Lustre and GPFS), which adhere to strong consistency semantics forcing sequential accesses [209]. The required semantics force many parallel file systems to implement distributed locking mechanisms to ensure consistency, thereby penalizing I/O accesses at a large scale. However, modern HPC applications often do not require such strong consistency guarantees [132,209].
Since POSIX was not designed specifically for HPC applications, it may also impose a burden on the end users. For instance, it is possible to use the shared-file parallel I/O approach. However, the complexity of coordinating parallel accesses, buffering, and flushing is explicitly delegated to the end user. Furthermore, as files are viewed as opaque byte streams, applications are unable to express or hint to the file system about how their data is organized. Such information is essential for data placement strategies and for optimizations. For example, the MPI-IO interface uses such information to express complex accesses and attain high performance. Nonetheless, there were some efforts that sought to extend POSIX I/O to account for HPC needs. Vilayannur et al. [206] proposed a POSIX extension to support shared file descriptors/group open, lazy metadata attributes, non-contiguous read/write interfaces, and bulk metadata operations. Such efforts have not been integrated into major storage solutions yet.

MPI-IO. MPI-IO [68] was proposed as an extension to the MPI standard, defining I/O operations by reusing the message passing concepts of MPI. Writing to a file is like sending a message, and reading from a file is like receiving a message. MPI-IO provides a high-level interface to describe the data partitioning among processes, and a collective interface to describe transfers of global data structures between process memories and files. In addition, it supports asynchronous I/O operations. As a result, MPI-IO allows computation to be overlapped with I/O and enables optimization of physical file layout on storage devices [47]. Furthermore, MPI-IO's semantics differ from POSIX's semantics, relaxing some consistency requirements, while offering an atomic mode for applications that rely on stricter semantics.
To express flexible I/O access patterns that are natural to the application, MPI-IO relies on MPI-derived data types. These are used to represent how data is laid out in the memory and also in the file. Furthermore, there are three orthogonal features to data access in MPI-IO: positioning (explicit offset or implicit file pointer), synchronization (blocking or non-blocking), and coordination (independent or collective). All are expressed using file pointers (individual or shared).
The MPI-IO interface is implemented on top of a portable abstract-device interface for parallel I/O called ADIO [198], which can be optimized for various file systems. ADIO itself is not intended to be used directly by application programmers but rather as an internal layer in the implementation of other user-level I/O interfaces. For instance, ROMIO [201] is a high-performance, portable implementation of MPI-IO optimized for non-contiguous access patterns, which are common in parallel applications. It relies on the portability of ADIO to be used with any MPI implementation (ROMIO is often included as a part of several MPI implementations, e.g., MPICH, Cray MPI, and OpenMPI).
As MPI-IO is layered atop POSIX, it generates complex I/O access patterns. The pattern that reaches the file system may greatly differ from what was initially expressed in the scientific application code due to optimizations and transformations (e.g., collective buffering and data sieving [53,199]) as requests traverse the I/O stack.

STDIO. The standard I/O library (STDIO) provides a simple and buffered stream I/O interface. It abstracts all file operations into operations on streams of bytes. STDIO comprises the C stdio.h family of functions [97] (e.g., fopen, fprintf, and fscanf). However, STDIO functions do not directly support random access to data. In such cases, the application must open a stream, seek to the desired location in the file, and then write/read bytes in sequence from the stream.
Recently, STDIO has been increasingly used for HPC workloads [144,182], especially for genomics and biology production applications that rely on I/O functions to store sequencing information in text format.Analysis of traces from supercomputer facilities confirmed the noticeably increasing use of STDIO across supercomputer platforms and for a wide range of science domains [17].The study also revealed that although STDIO can obtain high bandwidths for some transfer sizes, it consistently delivers lower performance than POSIX does across various transfer sizes in Cori (NERSC) and Summit (OLCF) supercomputers, indicating overall poor I/O performance.

I/O Mode.
The I/O mode refers to how parallel processes (MPI ranks) access a file: each rank individually or collectively (by a subset of all ranks). Collective operations are readily available in interfaces such as MPI-IO, and these operations provide a big picture of the overall data movement across ranks. These functions require all processes that collectively open the same file to participate in the calls, thus allowing optimizations such as collective buffering and data sieving [199] to improve performance by building larger and contiguous accesses to the underlying storage system.
The I/O mode can directly transform the access pattern perceived by the underlying layer when using collective operations. Instead of each rank issuing its individual operations, the aggregate file access region targeted by a collective I/O call is divided among the aggregators into non-overlapping regions (file domains). In the communication phase, all ranks send their I/O requests to the aggregators based on their file domain. In the I/O phase, aggregators issue the requests to the system. Hence, aggregators effectively merge requests into larger and contiguous ones before percolating to the POSIX layer or the file system.

Synchronicity.
Synchronous or blocking I/O routines do not return until the I/O operation has completed. In contrast, asynchronous or non-blocking I/O operations allow applications to hide the cost associated with I/O operations by overlapping it with computation or communication steps, allowing the application to progress. The latter is becoming popular among scientific applications to access large amounts of data and improve user-perceived performance. POSIX and MPI-IO provide asynchronous APIs to write and read data from files. The POSIX standard provides the aio_* calls, whereas MPI-IO has MPI_File_i* calls for independent and collective I/O operations. Some high-level I/O libraries, such as ADIOS and HDF5, also expose those non-blocking interfaces [195]. In contrast, data management systems such as PDC (Proactive Data Containers) [194] offer asynchronous data movement to and from their server nodes through network data transfer. Novel object storage file systems such as DAOS (Distributed Asynchronous Object Storage) [124] were built around the asynchronous concept to deliver performance. The synchronicity feature will help shape the temporal behavior of the application's access pattern.
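The idea of hiding I/O cost behind computation can be sketched in a few lines of Python. This is only an emulation of the concept with a background thread; it does not use the POSIX aio_* or MPI_File_i* calls mentioned above, and the function names (`checkpoint`, `compute_step`, `run`) are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List


def checkpoint(path: str, data: bytes) -> int:
    """Blocking write of one checkpoint; returns the number of bytes written."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return len(data)


def compute_step(step: int) -> int:
    """Stand-in for the application's computation phase."""
    return sum(i * i for i in range(1000)) + step


def run(steps: int = 3) -> List[int]:
    """Overlap the write of step k with the computation of step k+1."""
    written = []
    with tempfile.TemporaryDirectory() as tmp, ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for step in range(steps):
            result = compute_step(step)           # runs while the previous write is in flight
            if pending is not None:
                written.append(pending.result())  # wait for the previous write to complete
            data = result.to_bytes(8, "little")
            pending = pool.submit(checkpoint, os.path.join(tmp, f"ckpt{step}.bin"), data)
        if pending is not None:
            written.append(pending.result())      # drain the last outstanding write
    return written
```

The wait on `pending.result()` plays the role of the completion call that an asynchronous interface requires before a buffer can be reused; where that wait is placed determines how much of the I/O time is actually hidden.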

Temporal Behavior.
Toward the automatic detection of poorly performing HPC jobs, Buneci and Reed [27] generated temporal signatures containing performance features from time-series metrics to group applications into two groups: those that performed as expected and those that did not. They combine high-level states provided by users, based on previous executions, with low-level metrics to detect factors affecting performance. Although their approach uses two I/O metrics to build the signature, they do not focus on that; instead, they focus on the combination of CPU, memory, disk, and network usage. Dorier et al. [61,62] proposed Omnisc'IO, which builds a grammar-based model of any HPC application I/O behavior to predict future use. They seek to predict when I/O operations will occur (that is, the inter-arrival time between requests) and where and how much data will be accessed (the offset and size within the file). To make time-related predictions, Omnisc'IO stores statistics such as minimum and maximum observed time between requests, the average, and variance and relies on weighted inter-arrival average time to react to changes. From those, they can anticipate whether an operation will immediately follow the current one in a predictable amount of time or whether the time before the subsequent operations is more unpredictable.
White et al. [219] placed a particular focus on I/O by proposing a taxonomy for temporal I/O patterns of HPC jobs to aid in automatically detecting poorly performing jobs. They describe the design of a simple heuristic classification algorithm that categorized jobs based on a very coarse measure of when most of the I/O occurred. The authors observed a small number of common I/O access patterns: primary I/O usage near the start of the job, main I/O usage near the end of the job, I/O activity at the beginning and end but not during the job, low I/O at the start or end but high in the middle, regular activity throughout the job, and regular periodic I/O activity.
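A coarse classification of this kind can be sketched as follows. This is not the algorithm of White et al.; it is a hypothetical heuristic that splits a job's timeline into thirds and compares the share of I/O volume in each, with thresholds (`dominant`, `edges`) chosen arbitrarily for illustration. Separating steady from periodic activity would require an additional frequency analysis, which is omitted here.

```python
from typing import List


def classify_temporal(io_per_bin: List[float], dominant: float = 0.6,
                      edges: float = 0.8) -> str:
    """Label a job by where most of its I/O volume falls in time."""
    total = sum(io_per_bin)
    n = len(io_per_bin)
    if total == 0 or n < 3:
        return "no significant I/O"
    third = n // 3
    head = sum(io_per_bin[:third]) / total       # share in the first third
    tail = sum(io_per_bin[n - third:]) / total   # share in the last third
    middle = 1.0 - head - tail                   # share in the remaining bins
    if head >= dominant:
        return "I/O near start"
    if tail >= dominant:
        return "I/O near end"
    if middle >= dominant:
        return "I/O in the middle"
    if head + tail >= edges:
        return "I/O at start and end"
    return "I/O throughout"
```

The input is assumed to be the volume of data moved per fixed-length time bin, for example as could be derived from a job-level I/O profile.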

Consistency.
When checking for overlapping I/O patterns, Wang et al. [211] consider the consistency of I/O operations. They seek to understand whether or not a process ever writes/reads to the same part of a file more than once, whether multiple processes write/read the same part of a file, and the order in which operations occur at a given offset. For that, they consider read after read (RAR), write after write (WAW), read after write (RAW), and write after read (WAR) metrics to compose the pattern. The consistency policy used by an application can also aid in determining whether caching techniques are feasible or not.
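The four overlap metrics can be counted from an ordered trace with a short Python sketch. This is an illustrative simplification rather than the analysis of Wang et al.: it ignores which process issued each access, compares every pair of accesses (quadratic cost), and uses a hypothetical trace format of (operation, offset, size) tuples.

```python
from collections import Counter
from typing import List, Tuple

Access = Tuple[str, int, int]  # (op, offset, size) with op being "R" or "W"


def overlap_counts(trace: List[Access]) -> Counter:
    """Count RAR, WAW, RAW, and WAR pairs among overlapping accesses."""
    counts: Counter = Counter()
    for j, (op_j, off_j, size_j) in enumerate(trace):
        for op_i, off_i, size_i in trace[:j]:
            # Two byte ranges overlap if each one starts before the other ends.
            if off_i < off_j + size_j and off_j < off_i + size_i:
                # Label reads "<later op> After <earlier op>", e.g. "RAW".
                counts[f"{op_j}A{op_i}"] += 1
    return counts
```

A trace with no WAW or RAW pairs across processes is one hint that an application may tolerate relaxed consistency semantics, in line with the discussion above.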

Community-Based Usage Survey
Seeking to understand how I/O access patterns are approached and used by the broad HPC community to describe their applications, we filtered the ACM Digital Library and IEEE Xplore considering a 20-year window, covering publications between 2000 and 2021 that mention the following: "I/O access pattern," "I/O characterization," "I/O characteristic," or "I/O signature." The text should also refer to "HPC" for any of the terms. After filtering for conferences and journal publications, that search yielded 74 results in ACM and 161 in IEEE. In Figures 11 and 12, we show our queries used in ACM Digital Library and IEEE Xplore, respectively. We classify these papers based on the features used by the authors to describe the I/O access pattern at multiple levels of the I/O stack, as illustrated in Figure 13. Our methodology consisted of looking for common pre-defined keywords in the entire text that are used to describe each I/O access pattern feature. Each manuscript was pre-filtered and manually inspected to avoid false positives. We defined a set of tags corresponding to each feature and its usage. Due to the selection approach and the broad use of the term in correlated areas (e.g., memory), some of the selected papers were not, in fact, relevant to this survey. Therefore, they were later excluded from the analysis. In the end, we considered 146 papers for the analysis presented in this section (62.13% of the 235 results). Specifically, we used the following criteria to filter the initial set of papers:
• The keyword should have been used in the experiments or considered in the proposed technique or solution.
• Merely citing or using a keyword does not make the paper fall into that classification (e.g., mentioning HDF5 as an interface for IOR does not make it fall into the HDF5 category unless it was used in the evaluation).
• If the feature is used solely to describe related work, that does not make the paper be marked in that category.
• Some of those keywords' definitions are overloaded to describe features outside the I/O realm, for instance, memory, communication, or even computing (e.g., synchronous/asynchronous). In such cases, they were not considered as relevant in the context of the I/O access pattern.
• To avoid bias in classification, if the authors did not clearly state a feature, we assume that they did not consider that (unless it is evident from context); in doubt, we assume that it is not used.
In Figure 13, we summarize our findings. When discussing access patterns, 122 (i.e., 83.56%) of the papers cover data operations, whereas only 48 (32.89%) consider metadata operations. However, these do not go into the depth of describing their metadata I/O patterns in detail. Regarding features, operation (96.57%), request size (67.81%), and spatial locality (65.07%) are the ones taken into account by the majority of the research papers. Despite the file approach being strongly related to the spatial locality, the former is not explicitly addressed in 58.90% of the surveyed papers. Collectiveness and synchronicity are the least targeted features when discussing or describing access patterns, and both are related to I/O optimization techniques. In Figure 14, we depict the intersecting sets of features used in the surveyed research papers, considering the seven features detailed in Figure 13. To summarize all intersections and their distributions, we rely on the UpSet plot, a state-of-the-art visualization technique for the quantitative analysis of intersecting sets and their properties [45,120]. From this, it becomes clear that not all relevant features are properly described and considered in most of the works, and which ones are commonly considered together (e.g., operation, request size, and spatial locality). Table 3 maps these different features and references those works so that readers can find detailed information on how the features were employed in practice under diverse scenarios and science domains.
We also grouped the research papers according to the interface and high-level libraries they use. For interfaces, the majority (57.53%) used MPI-IO in their experiments or explicitly considered that interface when discussing the applicability of the proposed solution or optimization techniques. POSIX follows with 47.26%, and STDIO is targeted by merely 3.42% of the surveyed papers.

Table 4. Summary of Access Pattern Features Exercised by Each Benchmark and I/O Kernel
The check in orange indicates that h5bench does support asynchronous operations; however, it requires the HDF5 ASYNC VOL Connector [195] to be available and enabled.
However, Bez et al. [18] highlight a widespread use of STDIO across a wide range of science domains in HPC applications on both Summit (OLCF) and Cori (NERSC) supercomputers, suggesting a possible new trend due to the shift from traditional numerical simulations to AI/machine learning applications for training and inference while processing and producing ever-increasing amounts of scientific data. Regarding high-level libraries, the majority do not explicitly acknowledge using a particular library, although HDF5 is used by 27.40%.

Summary #6
The community has been using common features (e.g., operation, size, and spatiality) to describe an I/O access pattern, with additional information depending on the targeted layer or optimization context. Furthermore, metadata access is often not as detailed as data access.

EXERCISING I/O ACCESS PATTERNS
This section briefly covers existing benchmarks and I/O kernels that are often used in scientific I/O research to exercise access patterns. We describe the features that benchmarks use to represent I/O accesses and the I/O workload characteristics of different scientific application kernels.
Table 4 summarizes the benchmarks and I/O kernels used by the community to exercise the HPC I/O stack under diverse data workloads.We group the tools by their representation (exclusively synthetic workloads or extracted as a representative I/O kernel of an application), focus (data or metadata), operation (write or read), support to set the request size, mode (independent or collective operations), temporal behavior, file approach (shared file or file-per-process), and synchronicity (synchronous or asynchronous requests).We also describe the supported I/O interfaces.
IOR [88] is an I/O benchmark to test the performance of parallel storage systems using various interfaces and access patterns.It supports different interfaces or APIs (POSIX, MPI-IO, HDF5, HDFS, S3, NCMPI, IME, MMAP, or RADOS).IOR is flexible enough to express patterns by configuring the operation, the contiguous bytes to write per task (block size), transfer size, number of segments, whether it uses collective or individual operations (where applicable), and whether each task writes to its own file or a shared file.
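To illustrate how these parameters combine into an access pattern, the Python sketch below computes the offsets one task would touch. It assumes a segmented shared-file layout in which each segment holds one block per task, which is how this layout is commonly described for IOR; it is not taken from the IOR source code, and the function and parameter names are hypothetical.

```python
from typing import List


def ior_like_offsets(rank: int, num_tasks: int, block_size: int,
                     transfer_size: int, segments: int,
                     shared_file: bool = True) -> List[int]:
    """Offsets touched by one task, one entry per transfer."""
    offsets = []
    transfers_per_block = block_size // transfer_size
    for seg in range(segments):
        if shared_file:
            # Each segment contains one block per task, ordered by rank.
            base = (seg * num_tasks + rank) * block_size
        else:
            # File-per-process: the task's blocks follow each other in its own file.
            base = seg * block_size
        for t in range(transfers_per_block):
            offsets.append(base + t * transfer_size)
    return offsets
```

Under these assumptions, a task's accesses are contiguous within a block and strided across segments in the shared-file case, whereas they are fully contiguous in the file-per-process case.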
MADbench2 [25] is an I/O kernel extracted from the MADspec application. MADbench2 allows testing the integrated performance of the I/O, communication, and calculation subsystems of massively parallel architectures under the stresses of a real scientific application. It is derived directly from a large-scale cosmic microwave background data analysis package. It calculates the maximum likelihood angular power spectrum of the cosmic microwave background radiation from a noisy pixelized map of the sky and its pixel-pixel noise correlation matrix. MADbench2 has a regular mode, in which the full code is executed, and an I/O mode where all calculation/communication is replaced with busy work. The kernel has three component functions, each with different access patterns, named S, W, and C. In S, there are $N_{bin}$ writes, each of $N_{pix}^2$ bytes, on $N_p$ processors. In W, there are $N_{bin}$ reads, each of $N_{pix}^2$ bytes, on $N_p$ processors and $N_{bin}$ writes, each of $N_{pix}^2$ bytes, on $N_p/N_{gang}$ processors. In C, there are $N_{bin}^2/N_{gang}$ reads, each of $N_{pix}^2$ bytes, on $N_p/N_{gang}$ processors. Here, $N_p$ defines the number of processes and $N_{pix}$ sets the size of the pseudo-data, in which all component matrices have $N_{pix} \times N_{pix}$ elements. $N_{bin}$ sets the size of the pseudo-dataset composed of $N_{bin}$ component matrices. Finally, $N_{gang}$ sets the level of gang parallelism, allowing MADbench2 to run as a single or multi-gang. In the former, all matrix operations are carried out by being distributed over all processors. The kernel can use the POSIX or MPI-IO interfaces to synchronously or asynchronously issue its I/O operations to a unique or shared file.
IFER is a microbenchmark similar to IOR but instead seeks to provide insights on I/O contention [227].It splits the ranks into two groups running on two separate sets of nodes to emulate two competing applications.Each group of processes executes a series of collective I/O operations following a pre-defined pattern.Although IFER only provides support for write requests to a shared file by application, it considers two patterns: contiguous and 1D strided.IFER also relies on two additional parameters: the block size, which represents the contiguous bytes to write per process, and the block count.The number of blocks will be continuously written per process in the contiguous pattern.For the strided pattern, the blocks are distributed along the file depending on their offsets.Because its original goal was to study I/O interference, IFER allows users to define the inter-arrival time between the I/O phases.
The S3D I/O kernel [39] is a continuum-scale first principles direct numerical simulation code that solves the compressible governing equations of mass continuity, momenta, energy, and mass fractions of chemical species, including chemical reactions.It creates N checkpoints at regular intervals, where it writes 3D and 4D arrays of doubles into a newly created file.All 3D arrays are partitioned among the MPI processes using block partitioning in all x-y-z dimensions, whereas the fourth dimension (the most significant one) is not partitioned.The kernel can be configured to use PnetCDF blocking or non-blocking APIs.For the latter, a checkpoint has four non-blocking write calls, one per variable, followed by a call to wait and flush the write requests [128].
NAS BT-IO [155] is a benchmark based on the block tridiagonal (BT) problem of the NAS Parallel Benchmarks (NPB). Each rank is responsible for multiple Cartesian subsets of the dataset, whose number increases as the square root of the number of ranks participating in the computation. The entire solution, consisting of five double-precision words per mesh point, must be written to a file every five timesteps. In the end, all data belonging to a single timestep must be stored in the same file and must be sorted by vector component, x, y, and z-coordinates.
S3aSim [42] is an I/O kernel based on a sequence similarity search framework. It uses a master-slave parallel programming model with database segmentation, mimicking the mpiBLAST [52] access pattern. Given input query sequences, S3aSim divides the database sequences into fragments. Workers request a query and fragment information from the master and search the query against the assigned database fragment. The results are sent to the master to be sorted and then written to a single shared file. Without synchronizing after every query, this application uses individual I/O operations to write data to a single shared file.
Parallel I/O Kernels (https://github.com/hpc-io/PIOK) provides the parallel I/O portion of various scientific simulation codes that use HDF5. These kernels have been expanded with h5bench [122] to cover a variety of HDF5 I/O patterns.
HACC-IO [207] is a kernel extracted from the HACC (Hardware Accelerated Cosmology Code) cosmology framework (Gordon Bell Award Finalist 2012, 2013). It uses N-body methods to simulate collisionless fluids under the influence of gravity. The kernel includes the checkpoint, restart, and analysis outputs produced by the simulation. Hence, it is quite I/O intensive. It also supports both POSIX and MPI-IO (with independent and collective operations) interfaces. Regarding the file approach, HACC-IO can be configured to write to a single shared file, a file-per-process, or a mix of both (i.e., file per group). It only takes as an input argument the number of particles (n), where each particle is composed of seven 4-byte floats, an 8-byte integer, and a 2-byte integer. Thus, each process writes/reads n × 38 bytes.
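The 38 bytes per particle follow directly from the field sizes: 7 × 4 + 8 + 2 = 38. As a quick check, the sketch below uses Python's struct module with a packed little-endian layout. The field order and the helper name are our assumptions; the struct is used only to total the sizes and does not describe how HACC-IO lays the variables out in the file.

```python
import struct

# Seven 4-byte floats, one 8-byte integer, one 2-byte integer.
# "<" selects standard sizes with no alignment padding.
PARTICLE = struct.Struct("<7fqh")


def hacc_io_bytes(n_particles):
    """Bytes written (or read) by one process for n_particles (hypothetical helper)."""
    return n_particles * PARTICLE.size


if __name__ == "__main__":
    print(PARTICLE.size)
    print(hacc_io_bytes(1_000_000))
```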
MACSio (Multi-purpose, Application-Centric, Scalable I/O Proxy Application) [151] was built for I/O performance testing and evaluation of tradeoffs in data models, I/O library interfaces, and parallel I/O paradigms for multi-physics HPC applications. It differs from other benchmarks in the sense that it actually constructs and marshals data as real data objects commonly used in scientific computing applications. Hence, MACSio allows closely mimicking I/O workloads from the multi-physics domain, where data object distribution and composition vary within and across parallel processes. It also supports representing the data using multiple file approaches (segmented and strided single shared file, multiple independent files, or file-per-process), using independent and collective operations.
MPI Tile I/O [171] is a benchmark suited to test the performance of an underlying MPI-IO and file system implementation under a non-contiguous access workload. It logically divides a data file into a dense 2D set of tiles based on the number of tiles in the x and y dimensions. It allows the end user to configure the number of elements in each tile dimension and the size of an element. It can express overlapping elements by defining how many of them are shared between adjacent tiles in each dimension. MPI Tile I/O supports collective I/O and allows fine-tuning of the list of nodes involved in the aggregation.
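The non-contiguous nature of a tiled access can be seen by listing the byte extents that a single tile touches in a row-major file. The sketch below is an illustration under our own assumptions (row-major layout, hypothetical function and parameter names); it is not derived from the benchmark's source code.

```python
def tile_row_extents(tile_x, tile_y, nr_tiles_x, sz_tile_x, sz_tile_y,
                     overlap_x=0, overlap_y=0, elem_size=8):
    """Return (byte offset, byte length) for each row of one tile."""
    # Width of the whole dataset in elements; adjacent tiles share overlap_x columns.
    width = nr_tiles_x * sz_tile_x - (nr_tiles_x - 1) * overlap_x
    # Element coordinates of the tile's upper-left corner.
    x0 = tile_x * (sz_tile_x - overlap_x)
    y0 = tile_y * (sz_tile_y - overlap_y)
    # Each tile row is one contiguous run; consecutive rows are `width` elements apart.
    return [(((y0 + row) * width + x0) * elem_size, sz_tile_x * elem_size)
            for row in range(sz_tile_y)]


if __name__ == "__main__":
    # A 2 x 2 grid of 4 x 4 tiles with 8-byte elements: tile (1, 0).
    for extent in tile_row_extents(1, 0, nr_tiles_x=2, sz_tile_x=4, sz_tile_y=4):
        print(extent)
```

Each tile thus issues one small contiguous run per row, separated by a gap equal to the rest of the dataset row, which is the workload that collective I/O aggregation is meant to help with.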
Toward emulating scientific deep learning workloads that are becoming popular on HPC systems, DLIO [55] provides an I/O benchmark suite. DLIO supports various scientific deep learning applications, including Neutrino and Cosmic Tagging with UNet, FFN (Distributed Flood Filling Networks), CNN (Convolutional Neural Networks), CosmoFlow for cosmology datasets, FRNN (Fusion Recurrent Neural Net), and CANDLE (Cancer Distributed Learning Environment). DLIO allows reading data from different file formats and APIs, such as HDF5, CSV, and tfrecord formats.

Summary #7
A plethora of benchmarks and I/O kernels are available to the community to exercise access patterns at different layers of the stack. There is not a single one that encompasses all features; however, when combined, they cover distinct features, interfaces, and application data models.

PROFILING AND VISUALIZING I/O ACCESS PATTERNS
Darshan [34] is a popular tool to collect I/O profiling information from applications in a lightweight manner. Darshan aggregates I/O profile information to provide valuable insights while adding minimal overhead and without perturbing application behavior. It also provides an extended tracing module (DXT) [223] to obtain a fine-grained view of the application behavior to understand I/O performance issues. Once enabled, DXT collects detailed traces from the POSIX and MPI-IO layers, reporting the operation (write/read), the rank that issued the call, the segment, the offset in the file, and the size of each request. It also captures the start and end timestamps of all operations issued by each rank.
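The fields listed above are enough to derive simple spatial features of an access pattern. The following sketch models one trace entry as a plain record and measures how often a rank's next request starts where the previous one ended. The class and function names are hypothetical and this is not Darshan's API; it only assumes that the fields named in the text are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DxtRecord:
    """One traced request, holding the fields described in the text."""
    op: str        # "write" or "read"
    rank: int      # rank that issued the call
    segment: int   # index of the request within the rank's trace
    offset: int    # offset in the file, in bytes
    length: int    # size of the request, in bytes
    start: float   # start timestamp, in seconds
    end: float     # end timestamp, in seconds


def fraction_consecutive(records, rank):
    """Fraction of a rank's requests that begin exactly where the previous one ended."""
    mine = sorted((r for r in records if r.rank == rank), key=lambda r: r.start)
    if len(mine) < 2:
        return 0.0
    hits = sum(1 for prev, cur in zip(mine, mine[1:])
               if cur.offset == prev.offset + prev.length)
    return hits / (len(mine) - 1)


if __name__ == "__main__":
    trace = [
        DxtRecord("write", 0, 0, 0, 4, 0.0, 0.1),
        DxtRecord("write", 0, 1, 4, 4, 0.2, 0.3),
        DxtRecord("write", 0, 2, 8, 4, 0.4, 0.5),
    ]
    print(fraction_consecutive(trace, rank=0))
```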
Recorder [211] is a multi-level I/O tracing framework to capture I/O function calls at multiple levels of the I/O stack, including HDF5, MPI-IO, and POSIX I/O. As a shared library, it requires no modification or recompilation of the application and allows users to control tracing levels. Recorder uses function interposition to intercept I/O calls and captures their timestamps, function names, and all parameters.
TAU (Tuning and Analysis Utilities) [178] is an integrated toolkit for performance instrumentation, measurement, and analysis. It can capture file I/O (serial and parallel), communication, memory, and CPU. Regarding I/O, TAU can handle profiling and tracing, observing inclusive (including all child regions) and exclusive (for a region only) measurements. It uses library wrapping to characterize I/O performance, which helps automate the instrumentation of external I/O packages and libraries. Thus, TAU can capture POSIX and MPI-IO and instrument libraries such as HDF5.
IOPin [107] proposes a dynamic instrumentation framework to understand the complex interactions across different I/O layers from applications to the underlying PFS. It leverages the lightweight binary instrumentation of Pin in probe mode to instrument applications and components of the I/O stack, providing a hierarchical view for parallel I/O. Their implementation supports the MPI library and PVFS. Their approach traces and instruments only the process that has been identified by Pin to have the maximum I/O latency. This dynamic instrumentation reduces the overhead and focuses on detecting only one critical I/O path that affects performance in the stack. The metrics provided by IOPin include latency, disk throughput, number of requests from client to server, and number of disk accesses for each request. However, it does not provide a characterization of each I/O request.
ScalaIOTrace [142] is a multi-level I/O tracing tool based on ScalaTrace [158], an MPI communication tracing framework for parallel applications. ScalaIOTrace supports both MPI-IO and POSIX I/O interposition. MPI-IO tracing relies on the MPI profiling layer (PMPI) to intercept and collect MPI calls. POSIX calls, in turn, are captured via wrappers using GNU link-time entry interpositioning with domain-specific parameter compression, similar to PMPI. This tracing tool captures I/O events as singletons, vectors, and regular section descriptors to describe the application behavior. Those are stored in a single, lossless, and order-preserving trace file. Their goal is to generate a trace that can be extrapolated to target node counts and replayed to assess I/O scalability.
Score-P [109] is a measurement tool suite for profiling and event tracing of HPC applications. The instrumentation allows users to insert measurement probes in their codes to collect performance-related data when triggered by linking against several provided runtime libraries for serial execution, OpenMP or MPI parallelism, or hybrid combinations. It also allows selective filtering in both profiling and tracing mode to restrict the recording to specific regions. For I/O operations, Score-P can collect data on POSIX I/O (e.g., open/close), POSIX asynchronous I/O (e.g., aio_read/aio_write), STDIO (e.g., fopen/fclose), and MPI-IO calls. Visualizing Score-P output files using Periscope [12], Scalasca [73], TAU [154], and Vampir [108] is possible. Periscope is an online profiling analysis that evaluates performance properties and tests hypotheses about typical performance problems. Scalasca allows post-mortem analysis of event traces and automatically detects performance-critical situations. It is also possible to use the TAU visualization toolset to correlate performance data collected with Score-P or Vampir, which works as a post-mortem interactive event trace visualization software.
Fig. 15. Two concurrent IOR instances using DXT Explorer. We depict the access pattern from the high-level library (HDF5) and their corresponding transformations until they reach the POSIX layer and the underlying OSTs in Lustre.
DXT Explorer [19] is an interactive web-based log analysis tool to visualize Darshan DXT traces and help in understanding the I/O behavior of applications. The tool adds an interactive component to Darshan trace analysis that can aid researchers, developers, and end users to visually inspect their applications' I/O behavior, zoom in on areas of interest, and have a clear picture of where the I/O problem is.

Gaps in Visualizing Access Pattern Transformations
As discussed in Section 3, the way the application issues its I/O requests will differ from what the intermediate layers and the file system actually perceive. To illustrate the transformations an application's I/O requests undergo as they traverse the stack, we use Darshan traces and DXT Explorer to visualize the I/O access pattern at different levels.
Figure 15 depicts such transformations, and Figure 16 zooms in on the first 2 seconds of the experiment reported in Figure 15. In the experiment in Figure 15, we have two instances of IOR (one in red and another in blue). We executed them on two non-overlapping sets of 16 compute nodes, with 8 ranks per node, totaling 128 ranks per instance. We configured IOR to write 10 iterations of one segment with a 32-MB block size using 4-MB transfer sizes to a shared file using the HDF5 API and collective MPI operations. Both instances were started simultaneously. We collected profile and tracing data using Darshan. As Darshan Extended Tracing does not yet capture fine-grained information about high-level libraries, such as HDF5, we rely upon manually instrumenting the code to collect timestamps before performing the write operations and after the dataset is completely written to a given file. We condensed both plot facets as all ranks collectively issued the I/O calls to the MPI-IO layer. We represented these collective calls by the star symbol on the y-axis.
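The data volume implied by this configuration can be derived with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and we assume that MB denotes binary mebibytes (2^20 bytes).

```python
MIB = 1024 * 1024  # assumption: "MB" in the configuration means 2**20 bytes


def ior_volume(nodes, ranks_per_node, block_mib, transfer_mib,
               iterations, segments=1):
    """Derive rank count, transfers per rank, and bytes written by one IOR instance."""
    ranks = nodes * ranks_per_node
    transfers_per_rank = segments * block_mib // transfer_mib
    bytes_per_iteration = ranks * segments * block_mib * MIB
    return {
        "ranks": ranks,
        "transfers_per_rank_per_iteration": transfers_per_rank,
        "bytes_per_iteration": bytes_per_iteration,
        "bytes_total": bytes_per_iteration * iterations,
    }


if __name__ == "__main__":
    for key, value in ior_volume(nodes=16, ranks_per_node=8, block_mib=32,
                                 transfer_mib=4, iterations=10).items():
        print(key, value)
```

Under these assumptions, each rank issues eight 4-MB transfers per iteration, and each instance writes 4 GiB per iteration.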
As far as the application is concerned, its data in memory is a 1D dataset represented by HDF5. HDF5 will define a hyperslab based on the start offset, count, stride, and block to access the data. A hyperslab represents a portion of a dataset that can be a logically contiguous collection of points in a dataspace or a regular pattern of points or blocks in a dataspace. In our experiments, for a shared file, IOR defines the start offset as offset modulo segmentSize, count as 1, and a stride and a block equal to the transfer size (i.e., 4 MB). However, once the requests reach the MPI-IO layer, they are further broken down by the four collective aggregators into a larger number of 1-MB POSIX requests, considering the underlying PFS striping configuration before sending them to each storage device. We have defined Lustre to use 1-MB stripes over eight servers to make it easier to visualize. Once we delve into lower levels of the I/O stack, we lose contextual information from the applications and start to observe the effect of natural interference in this shared storage infrastructure. For instance, if we glance at one OST, the requests arrive at the storage servers in an interleaved fashion, coming from the two applications that the file system is unaware of. At this point, the original contiguous requests issued by the application using 4-MB requests arrive at the server much smaller (in 1-MB requests) and with a different spatiality (non-contiguous).
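The way a single 4-MB request ends up as several 1-MB pieces on different servers can be sketched with a round-robin striping model. This is a simplified illustration: the function name is hypothetical, the target index is relative to the file's own layout (not a physical OST number), and the sketch ignores the role of the collective aggregators, which decide which process issues each piece.

```python
MIB = 1024 * 1024


def split_by_stripe(offset, length, stripe_size=MIB, stripe_count=8):
    """Split one request into (target index, file offset, length) pieces."""
    chunks = []
    end = offset + length
    while offset < end:
        stripe_index = offset // stripe_size
        # A piece never crosses a stripe boundary.
        chunk_end = min(end, (stripe_index + 1) * stripe_size)
        # Assumption: stripes are assigned to targets round-robin.
        chunks.append((stripe_index % stripe_count, offset, chunk_end - offset))
        offset = chunk_end
    return chunks


if __name__ == "__main__":
    # One 4-MB transfer starting 6 MB into the shared file.
    for target, off, size in split_by_stripe(6 * MIB, 4 * MIB):
        print(f"target {target}: offset {off // MIB} MB, {size // MIB} MB")
```

In this model, a request that the application sees as one contiguous 4-MB write is served by four different targets, and each target receives pieces that are 8 MB apart in the file, which matches the loss of contiguity described above.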
Furthermore, it is essential to highlight the inter-application interference caused by other applications sharing those data servers. Figure 15 clearly depicts how two identical applications that started simultaneously begin to diverge in time toward the end of our experiment. Such observation also highlights the importance of taking into account temporal features when discussing access patterns.

Summary #8
Different tools extract and visualize I/O access patterns, from coarse-grained profilers to fine-grained traces, as I/O requests pass through the stack. However, we could not find a complete solution that allows observing patterns and all of their transformations in the context of each layer. Because of the complexity of the current stack, this gap makes it harder to identify the root causes of bottlenecks.

CONCLUSION
The HPC I/O stack has been complex due to multiple layers of hardware and software, their various tuning options, and inter-dependencies among the layers. This survey extensively discussed the overloaded "I/O access pattern" terminology used to describe how accesses are done from the major layers of the HPC I/O stack, covering the high-level models used by scientific applications, and how those are represented by high-level I/O libraries and translated by middleware libraries before reaching the PFS. We have also highlighted I/O benchmarks and kernels employed to exercise access patterns in different levels, alongside existing tools to visualize those patterns using profiling and tracing.
Harnessing the I/O community's knowledge over the past 20 years, we surveyed 146 papers from ACM Digital Library and IEEE Xplore to propose a baseline taxonomy that could define an application's I/O access patterns. Our effort targets bringing a consensus to the varying ways to describe a pattern based on features already used by the community over the years, serving as a common ground among end users, application developers, and system administrators when discussing, proposing, and applying I/O tuning strategies to improve I/O performance.
Furthermore, the existing I/O stack exposes a plethora of tunable parameters and enables different, often complementary, optimization techniques to improve performance. However, there is little to no guidance to developers and end users on how and when to apply them. Besides the lack of awareness that those options are available and could help for a particular set of access patterns, to the best of our knowledge, there is no single set of instructions for choosing a set of tuning parameters. Reaching a list of best practices, even for a single system, is challenging due to various factors affecting I/O performance. Finally, not having a common ground to identify and refer to access patterns adds to this complexity and makes it difficult to map I/O access patterns to their performance behaviors and then to optimization strategies.
As the HPC platforms become more complex and specialized to host novel applications from machine learning to scientific workflows, it becomes paramount for those systems that seek to auto-tune their parameters to accurately detect the I/O access patterns at runtime. An established taxonomy can help bridge the gap among metric collection, access patterns representation, and the application of AI-based and automatic tuning mechanisms to navigate the complex parameter space, seeking optimizations and configurations to apply for an observed application workload.
Consequently, despite having tools to collect profiles and metrics about I/O performance and features that can be used to describe the application's access patterns at different layers of the HPC I/O stack, there are still gaps between visualizing and understanding what the application is doing, identifying the bottlenecks, and correctly reshaping its pattern to perform better in the system. Reporting and automatically mapping performance problems into actionable items based on the observed pattern require novel tools, models, and further R&D.

Fig. 1. The traditional HPC I/O software stack that includes several layers of libraries between applications and storage hardware.

Fig. 2. Representation of common high-level data models used by HPC applications in different science domains.

Fig. 3. Data layout in a file for AoS or SoA.

Fig. 4. A visual representation of common hyperslab selections between memory and file representations for partial I/O in HDF5.

Fig. 5. Example of a hyperslab selection for partial I/O operations using the HDF5 library.

Fig. 7. Aligned and misaligned requests to a PFS with file striping.

Fig. 10. Spatial locality of I/O requests in the file.

Fig. 11. ACM Digital Library query and filter parameters used in this survey.

Fig. 12.

Fig. 13. I/O access pattern features, interfaces, and libraries in ACM Digital Library and IEEE Xplore papers.

Fig. 14. Analysis of the set of features used to describe an I/O access pattern in ACM Digital Library and IEEE Xplore research papers.
h5bench [122] is a set of HDF5 I/O kernels representing I/O patterns that are commonly used in HDF5 applications on HPC systems. It provides a framework to test, exercise, and tune I/O performance using novel features introduced in HDF5 and to understand how the library performs on different machines under such I/O workloads. It measures I/O performance from various aspects, including the raw and observed I/O time and rate.

Fig. 16. I/O requests from the two concurrent IOR instances (one in red and another in blue) as they arrive in each of the eight Lustre storage servers. We zoom in on the first 2 seconds of the experiment reported in Figure 15.

Table 1. In-Memory Data Structure and In-File Data Layout Mappings

Table 2. Opportunities and Challenges Presented by Each I/O Interface When Used in the Context of HPC