Resources from large-scale high-performance computing (HPC) platforms and large data centers are shared between concurrent applications. Users submit jobs to a batch scheduler, and this resource manager assigns computing nodes to applications according to their requested processing power. In these architectures, access to persistent data happens through a shared I/O infrastructure, including a parallel file system (PFS) deployed over a set of dedicated servers. Unlike processing power, and despite being shared, this data access resource is NOT arbitrated. Hence each application, together with the I/O libraries it uses, works to achieve its own peak I/O performance without considering the interference imposed on other concurrent applications. This often results in contention in the access to the shared infrastructure, which decreases global I/O performance, lengthens application execution times, and therefore wastes expensive computing resources.
While contention is an important issue for global I/O performance, the individual performance obtained by applications depends strongly on their access patterns and on the optimization techniques they use. An example is the use of collective operations from the MPI-IO library, which improve performance in many cases but decrease it in others. Moreover, their success requires the correct tuning of parameters such as the buffer size and the number of aggregators. Despite some heuristics implemented in the library, most of the responsibility for deciding whether to use collective operations (and for choosing the parameters) still belongs to developers and users. That is a problem because parallel I/O performance depends on a large number of variables, and explaining it is usually not a trivial task. We argue it is not reasonable to ask users and developers to be proficient in parallel I/O in order to achieve high performance. One of the main arguments for this is the fact that I/O access patterns known to have poor performance - such as generating small sparse requests - are still frequently observed in large production machines.
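As an illustration of the tuning knobs involved, the sketch below shows how collective buffering parameters are typically passed to MPI-IO through hints. The file name and hint values are placeholders, not recommendations: suitable values depend on the application's access pattern and on the system.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Hints controlling collective buffering (two-phase I/O); values are
       illustrative and must normally be tuned per application and system. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");   /* force collective buffering */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MiB aggregation buffer  */
    MPI_Info_set(info, "cb_nodes", "4");              /* number of aggregators      */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int buf[1024] = {0};
    /* Each rank writes a contiguous block at its own offset; the collective
       call lets the aggregators merge the small per-rank requests. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(buf),
                          buf, 1024, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```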
Hence, the main objective of this project is to provide a data management layer that works in the context of the whole machine. A middleware - the data manager - will be responsible for all data accesses from applications and will work towards two goals:
- improving global metrics of performance by avoiding contention (a toy illustration of this arbitration idea follows this list);
- improving individual application performance by adapting the used libraries, parameters, and optimization techniques.
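The following is a hypothetical sketch of the arbitration idea only - a coordinator granting exclusive I/O windows to applications - and not the project's actual design; the policy, data structures, and numbers are invented for illustration.

```c
#include <stdio.h>

#define MAX_APPS 4

typedef struct {
    int  app_id;
    long bytes_pending; /* outstanding I/O volume for this application */
} AppState;

/* Pick the next application allowed to access the shared PFS: here, the one
   with the most pending data (fairness-oriented policies would also work). */
static int next_app(const AppState apps[], int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (apps[i].bytes_pending > 0 &&
            (best < 0 || apps[i].bytes_pending > apps[best].bytes_pending))
            best = i;
    return best;
}

int main(void) {
    AppState apps[MAX_APPS] = {
        {0, 1 << 20}, {1, 4 << 20}, {2, 0}, {3, 2 << 20}
    };
    int a;
    /* Serializing access windows avoids interference between applications. */
    while ((a = next_app(apps, MAX_APPS)) >= 0) {
        printf("grant I/O window to application %d (%ld bytes)\n",
               apps[a].app_id, apps[a].bytes_pending);
        apps[a].bytes_pending = 0; /* assume the window drains its queue */
    }
    return 0;
}
```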
To achieve these goals, the data manager requires information about applications that is not typically available in the stateless HPC I/O stack.
The research conducted in the first 10 months of the project focused on providing the groundwork required for the data management layer. That included:
- proposing a reinforcement learning approach to dynamically tune the I/O stack to applications while they are executing (a toy sketch of this idea follows this list);
- proposing a pattern matching approach to classify applications' access patterns at run time, without requiring prior information about the applications, the system, or even the techniques being tuned (also illustrated after this list);
- studying traces from large-scale machines to characterize the I/O workload and generate benchmarks, and proposing a methodology to periodically extract workload information from such traces.
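A minimal sketch of the reinforcement learning idea mentioned above, reduced to an epsilon-greedy multi-armed bandit over a discrete set of configurations; the project's actual approach, state definition, and reward are described in the published papers, and the bandwidth numbers below are made up.

```c
#include <stdio.h>
#include <stdlib.h>

#define N_CHOICES 3   /* e.g. {no collective buffering, 4 MiB, 16 MiB} */
#define EPSILON   0.1
#define STEPS     100

/* Hypothetical reward: observed I/O bandwidth for the chosen configuration. */
static double observe_bandwidth(int choice) {
    double base[N_CHOICES] = {100.0, 180.0, 150.0}; /* MB/s, made up */
    return base[choice] + (rand() % 20) - 10;       /* noisy measurement */
}

int main(void) {
    double value[N_CHOICES] = {0};
    int    count[N_CHOICES] = {0};

    for (int t = 0; t < STEPS; t++) {
        int choice;
        if ((double)rand() / RAND_MAX < EPSILON) {
            choice = rand() % N_CHOICES;            /* explore */
        } else {
            choice = 0;                             /* exploit best estimate */
            for (int i = 1; i < N_CHOICES; i++)
                if (value[i] > value[choice]) choice = i;
        }
        double r = observe_bandwidth(choice);
        count[choice]++;
        value[choice] += (r - value[choice]) / count[choice]; /* running mean */
    }
    for (int i = 0; i < N_CHOICES; i++)
        printf("choice %d: estimated bandwidth %.1f MB/s (%d trials)\n",
               i, value[i], count[i]);
    return 0;
}
```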
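Similarly, a toy sketch of run-time access pattern classification, here simplified to nearest-neighbor matching of observed metrics against reference profiles; the features, profiles, and labels are invented for the example.

```c
#include <stdio.h>
#include <math.h>

#define N_FEATURES 2  /* e.g. avg. request size (KiB), sequential-access ratio */
#define N_PROFILES 3

static const double profiles[N_PROFILES][N_FEATURES] = {
    {4.0,    0.1},   /* small, sparse requests    */
    {64.0,   0.9},   /* medium, mostly sequential */
    {1024.0, 1.0},   /* large contiguous writes   */
};
static const char *labels[N_PROFILES] = {
    "small-sparse", "medium-sequential", "large-contiguous"
};

/* Return the profile with the smallest squared distance to the observation. */
static int classify(const double obs[N_FEATURES]) {
    int best = 0;
    double best_d = INFINITY;
    for (int p = 0; p < N_PROFILES; p++) {
        double d = 0;
        for (int f = 0; f < N_FEATURES; f++) {
            double diff = obs[f] - profiles[p][f];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = p; }
    }
    return best;
}

int main(void) {
    double observed[N_FEATURES] = {48.0, 0.8}; /* metrics gathered at run time */
    printf("detected pattern: %s\n", labels[classify(observed)]);
    return 0;
}
```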
Results were published in 6 scientific papers, all available online on HAL, and the generated data sets were published on Zenodo for open access.
The conducted research extended the state of the art by proposing generic access pattern classification and tuning strategies, which are expected to work in different scenarios. That allows for improving application performance, which:
- improves the usage of expensive large-scale machines;
- speeds up the execution of scientific simulations and hence of the scientific process, allowing for advances in many fields.
More info: https://francielizanon.github.io/.