Storage management for Big Data analytics in large, multi-tenant clusters is a complex and time-consuming task. Today, many companies and organizations suffer from a lack of automation in their daily Big Data analytics management operations, which hinders their competitiveness and efficiency. This is especially true for heterogeneous workloads and variable, unanticipated tenant requirements that must be satisfied within stringent time limits. In response to these challenges, Software-Defined Storage (SDS) has recently become a prime candidate to simplify storage management in the cloud.
The main objective of IOStack is to create a Software-Defined Storage toolkit for Big Data on top of the OpenStack platform. IOStack will enable efficient execution of virtualized analytics applications over virtualized storage resources thanks to flexible, automated, and low-cost data management models based on SDS. To achieve this general goal, IOStack focuses on the following objectives:
G-1. Storage and compute disaggregation and virtualization. Virtualizing data analytics to reduce costs implies disaggregation of existing hardware resources. This requires the creation of a virtual model for compute, storage, and networking that allows orchestration tools to manage resources efficiently. For the orchestration layer, it is essential to provide policy-based provisioning tools so that virtual components for the analytics platform are provisioned according to a set of Quality of Service (QoS) policies.
G-2. SDS Services for Analytics. The objective is to define, design, and build a stack of SDS data services enabling virtualized analytics with improved performance and usability. Among these services we include native object store analytics that will allow running analytics close to the data without taxing initial migration, data reduction services that will be optimized for the special requirements posed by virtualized analytics platforms, and specialized persistent caching mechanisms, advanced prefetching, and data placement.
G-3. Orchestration and deployment of Big Data analytics services. The objective is to design and build efficient deployment strategies for virtualized analytics-as-a-service instances. In particular, the focus of this work is on data-intensive scalable computing (DISC) systems such as Apache Hadoop and Apache Spark, which enable users to define both batch and latency-sensitive analytics. This objective includes the design of scalable algorithms that strive to optimize a service-wide objective function (e.g., maximize performance or minimize cost) under heterogeneous workloads.
Directly related to these general objectives, we outline three main software outcomes:
S-1. Create an open SDS toolkit and APIs targeting virtualized data analytics in OpenStack. We will devise a set of novel SDS APIs providing full control of the logical and physical infrastructure required to launch a scalable data analytics platform. This implies contributions and extensions to OpenStack Cinder for managing virtual storage, OpenStack Nova for adaptations of the compute scheduler, and OpenStack Swift for virtualizing object storage.
S-2. Implement SDS services for analytics on top of the previous SDS APIs. We will demonstrate extensions to OpenStack Swift in order to offer true native object store analytics. In particular, we will design and implement a specialized SDS service able to manage computation close to the actual data thanks to Storlets embedded in the object store. In addition, end-users will be able to define QoS policies to instruct the SDS controller to deploy such data services and optimize data flows in analytic experiments based on data reduction and caching techniques.
S-3. Implement efficient deployment strategies on top of Docker. This objective clearly leverages the entire infrastructure created in the aforementioned objectives (SDS APIs and data services). We will create
IOStack activities carried out during this first reporting period (from January 2015 to June 2016) have been aligned with the project objectives and directed toward achieving the project outcomes. Efforts have been devoted to the analysis and design of the software architecture of the IOStack SDS toolkit, as well as to the development and integration of an early prototype of all IOStack components.
The main results produced during this reporting period can be summarized as follows:
- Design and implementation of Konnector: an SDS framework for block storage (OpenStack Cinder) that enables the interception of storage flows from block volumes in order to optimize storage workloads of analytics applications. Currently, Konnector enables block volumes to be intercepted by combinations of various specialized storage filters built in this project, such as cache, deduplication and compression. Source code publicly available at: https://github.com/iostackproject/Konnector
- Design and implementation of Crystal: the first open, extensible and meta-programmable SDS framework for object storage (OpenStack Swift) that provides simplified policy-based storage management to system administrators. Policies in Crystal automate the enforcement of storage filters (e.g., Storlets) on data objects, which may provision arbitrary services to tenants and containers (compression, data transformations, encryption, bandwidth SLOs). It is available as open source at http://crystal-sds.org
- Design and implementation of Zoe: a general-purpose cluster manager and scheduler that is able to deploy and schedule analytics applications (clusters) that use a variety of large-scale computing frameworks (e.g., Spark, MPI, Hadoop). Zoe exploits Docker containers and Swarm to deploy the applications. It is available as open source at http://zoe-analytics.eu/
- All the IOStack main software outcomes (Crystal, Konnector, Zoe) are presented to the administrator in a simple and integrated interface that extends OpenStack Horizon.
- Spark push-down mechanism: We augmented Spark with the ability to delegate computations related to SQL queries to the storage cluster (OpenStack Swift via Crystal/Storlets). This enables use case companies (GridPocket) to perform queries over their data much faster.
- Proposal and implementation of a new threading model for OpenStack Swift that can increase I/O performance and removes the need for Swift's performance tuning parameters (workers and threads). Thanks to this model, we implemented a bandwidth differentiation filter that relies on kernel I/O priorities and can avoid interference from other I/O processes while maintaining the requested bandwidth. It is available as open source at https://github.com/iostackproject/IO-Bandwidth-Differentiation/tree/feature/hard-limiter.
- Extensions to the OpenStack Storlet framework: In IOStack we have open-sourced and improved Storlets, which execute compute tasks on object storage (OpenStack Swift). Storlets is a core component of the filter framework of Crystal. The source code is publicly available at: https://github.com/openstack/storlets.
- Design and development of SDGen: a novel benchmarking tool that generates synthetic data resembling real data in terms of compression times and ratios. This tool enables realistic synthetic experiments in IOStack. Open source at: https://github.com/iostackproject/SDGen
- Establishment of a reference IOStack testbed at the Arctur data center with an integrated version of the toolkit. This testbed includes compute server nodes, storage server nodes, a self-service virtualized environment to support virtual machines, and a full networking stack for private and public networks.
- Collection of use-case traces and datasets to drive our experiments with real world workloads.
- Publication of research papers in high-level conferences/journals such as USENIX FAST'15, ACM IMC'15, IEEE Internet Computing, etc.
- Dissemination of the pr
Despite being only midway through the project, IOStack has already achieved important milestones that can have a significant impact on the research, industry, and open source communities.
First, we have built the first true SDS platform for OpenStack, the most important open source cloud community. That is, IOStack is designed to separate the control and data planes of the system, as well as to implement the concepts of storage policies and filters. This design provides flexibility and extensibility, which are key to attracting the open source community. Furthermore, our SDS platform embraces both block and object storage, and it is integrated within the OpenStack Horizon dashboard to ease its adoption by real companies.
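The separation between a control plane that holds policies and a data plane that applies filters can be illustrated with a minimal Python sketch. It is not the Crystal or Konnector API: the `Controller` class, the `FILTERS` registry, the `put_object` function, and the filter names are all hypothetical, and each filter is simplified to a function from bytes to bytes.

```python
import zlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Assumption: a storage filter is modeled as a plain bytes -> bytes function.
Filter = Callable[[bytes], bytes]

# Hypothetical filter registry; names are illustrative, not IOStack identifiers.
FILTERS: Dict[str, Filter] = {
    "compression": zlib.compress,
    "uppercase": bytes.upper,  # stand-in for an arbitrary data transformation
}


@dataclass
class Controller:
    """Control plane: maps a (tenant, container) target to an ordered filter list."""
    policies: Dict[Tuple[str, str], List[str]] = field(default_factory=dict)

    def set_policy(self, tenant: str, container: str, filter_names: List[str]) -> None:
        unknown = [name for name in filter_names if name not in FILTERS]
        if unknown:
            raise ValueError(f"unknown filters: {unknown}")
        self.policies[(tenant, container)] = list(filter_names)

    def filters_for(self, tenant: str, container: str) -> List[str]:
        # Targets without a policy get no filters.
        return self.policies.get((tenant, container), [])


def put_object(controller: Controller, tenant: str, container: str, data: bytes) -> bytes:
    """Data plane: apply the filters selected by the control plane, in order."""
    for name in controller.filters_for(tenant, container):
        data = FILTERS[name](data)
    return data


controller = Controller()
controller.set_policy("tenant-a", "logs", ["uppercase", "compression"])
stored = put_object(controller, "tenant-a", "logs", b"abc" * 100)
untouched = put_object(controller, "tenant-b", "images", b"abc")
```

In this sketch, changing a policy only touches the controller's table; the data path code stays the same, which is the kind of extensibility the control/data plane split is meant to provide.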
Second, the IOStack platform is already benefiting our use case companies in the management of Big Data analytics. For instance, thanks to the Spark push-down mechanism and the filters in Crystal that execute computation close to the data, GridPocket is already executing SparkSQL queries considerably faster than before (e.g., 30%-60% faster), which yields important cost savings.
Moreover, the analytics deployment framework, namely Zoe, is already providing fast and automated analytics deployments to our use case companies (Idiada, GridPocket). Zoe has greatly simplified the way these companies used to work, making their daily management of compute clusters more efficient and less time-consuming. Furthermore, datacenter providers such as Arctur can now provide their customers with virtualized analytics services in a simplified manner.
In the following stages of the project we aim to consolidate the platform, enable new cross-layer strategies among storage and compute components, and promote IOStack to attract new use case companies.
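The idea behind push-down can be sketched in a few lines of Python: the selection and projection of a query run next to the data, so that only matching rows and requested columns travel to the compute cluster. This is an illustration under stated assumptions, not the IOStack implementation; the function `storage_side_filter`, the CSV layout, and the sample values are all hypothetical.

```python
import csv
import io
from typing import Callable, Dict, List


def storage_side_filter(raw_csv: str, columns: List[str],
                        predicate: Callable[[Dict[str, str]], bool]) -> str:
    """Hypothetical storage-side step: keep matching rows and requested columns only."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if predicate(row):
            writer.writerow({c: row[c] for c in columns})
    return out.getvalue()


# Made-up sample object stored in the object store.
raw = "meter_id,city,kwh\n1,Nice,3.5\n2,Paris,7.1\n3,Nice,9.0\n"

# Roughly: SELECT meter_id, kwh FROM readings WHERE city = 'Nice'
reduced = storage_side_filter(raw, ["meter_id", "kwh"],
                              lambda row: row["city"] == "Nice")

print(f"bytes without push-down: {len(raw)}, with push-down: {len(reduced)}")
```

The compute side then receives `reduced` instead of `raw`. The benefit depends on how selective the query is: a predicate that matches almost every row transfers almost as much data as before.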
More info: http://www.iostack.eu.