
Periodic Reporting for period 1 - BigDataStack (High-performance data-centric stack for big data applications and operations)

Teaser

Current analytics frameworks exploit several underlying infrastructure management systems which, however, have not been designed and implemented in a “big data” context. Instead, they emphasise the computational needs and aspects of the applications and services to be...

Summary

Current analytics frameworks exploit several underlying infrastructure management systems which, however, have not been designed and implemented in a “big data” context. Instead, they emphasise the computational needs and aspects of the applications and services to be deployed. BigDataStack tackles this challenge through a set of offerings for different stakeholders:
• Data-driven infrastructure management system for infrastructure providers: A frontrunner system that bases all infrastructure management decisions on the data aspects affecting the provision of resources, ensuring that resource management is fully efficient and optimized for data operations.
• Data as a service for data providers and decision makers: It promotes automation and quality, and ensures that the provided data are meaningful and fit for purpose through approaches for data cleaning, modelling and efficient storage. Seamless data analytics is realized in a holistic fashion across multiple data stores.
• Dimensioning workbench for application engineers: The workbench facilitates the identification of an application's data-related properties and data needs in order to predict the required underlying resources.
• Process modelling and optimization framework for business analysts: It allows flexible, functionality-based modelling of processes, which are mapped to concrete analytics. The analytics outcomes provide feedback to business analysts towards overall process optimization.
• Data toolkit for data scientists: An environment to ingest analytics functions and to specify preferences and constraints for them, which are exploited by the infrastructure management system for resource and data management.

Work performed

Towards the realization of the BigDataStack objectives, the following activities have been performed:
• Analysis of the application, user and technical requirements to drive the architecture specification and the design of individual components.
• State-of-the-art (SotA) analysis both at the beginning of the project and in M11.
• Identification and description of BigDataStack capabilities and core functionalities following the requirements and SotA analysis.
• Compilation of the architecture including cross-layer topics and information flows to ensure coherence.
• Design and implementation:
- Data-driven infrastructure management incorporating approaches for resource provisioning, efficient deployment of services, complete monitoring and runtime resource adaptations.
- Data as a service solution enabling storage in different data stores, analysis in a seamless way across the stores, and provision of services for data quality assessment and data skipping to optimize the performance of analytics.
- Process modelling framework incorporating process mapping to support analysts in specifying high-level workflows that reflect specific data analytics pipelines.
- Data toolkit allowing scientists to ingest their analytics tasks and set preferences and objectives in analytics pipelines.
- Dimensioning workbench that provides resource estimates for application and data services considering the workloads and data operations.
- Environment serving as a single point for the visualization and management of all BigDataStack elements: process modelling, data toolkit, benchmarking and deployment patterns, triple monitoring, runtime adaptations, and analytics outcomes.
• Definition and analysis of several use case scenarios, mapped to different functionalities and components of BigDataStack, utilizing actual datasets.
• Compilation of an integration plan driving the integration activities that led to integrated prototypes conforming to the overall architecture.
• Creation of the public website, the dissemination materials for the project, and the dissemination and communication strategy.
• Definition of the overall exploitation strategy and of initial plans from all partners.

Final results

During the reporting period, the research activities conducted have advanced the state of the art in several areas, as highlighted below:
• Data-driven infrastructure management, including components (such as Kuryr and the NVME-MDEV kernel driver) that provide significant performance improvements (9 times better) and that have already been contributed upstream.
• Optimized deployment of big data applications and data services through a recommendation engine that delivers deployment-pattern recommendations achieving an NDCG of 0.5582, significantly more effective than other deployment recommendation baselines (see the NDCG sketch after this list).
• Dynamic orchestration exploiting reinforcement learning to optimize the orchestration of services, yielding 10 to 25% higher precision while being 25% less computationally expensive (see the Q-learning sketch after this list).
• Triple monitoring engine that monitors, besides the underlying resources, the applications, data services and operations through an innovative federated model: a central engine accesses metrics collected by secondary instances deployed specifically to monitor individual data services. Storage is optimized based on metrics usage, with a performance improvement of about 90%.
• Runtime evaluation of the delivered performance through technologies that consider both the metric to evaluate and a series of thresholds / checkpoints of increasing criticality, to better control the evolution of the indicator (see the checkpoint sketch after this list).
• Probabilistic, domain-agnostic error detection on various datasets, with a unique approach based on “templates” that characterize dataset tuples in terms of quality, achieving an overall data quality assessment AUC score of 0.94 (see the error-detection sketch after this list).
• Adaptable distributed storage, with adaptations addressing the splitting of a logical query into multiple queries running across many distributed storage servers, the dynamic splitting of regions (when they grow too big to be managed efficiently) and the dynamic migration of regions (see the region-split sketch after this list).
• Seamless analytics framework to manage data on top of heterogeneous data stores without requiring a new query language to retrieve data spanning the stores. The framework keeps track of split points indicating which part of each table is visible to a transaction, while moving historic data slices from the database to the object store and preserving the ACID properties (see the split-point sketch after this list).
• Data skipping technology that enables Spark to reduce the data ingested by SQL-based jobs. It was enhanced to let developers define their own skipping metadata types in a flexible way, and it is the first to natively support arbitrary data types (e.g. geospatial, time-series or genomic data) and skipping for queries with User Defined Functions. The skipping component has been integrated into the IBM Cloud SQL Query service (see the data-skipping sketch after this list).
• Application dimensioning considering data workloads to dynamically estimate the corresponding resource needs. The workbench enables service stress testing on a variety of execution platforms in a plug-in manner, as well as sequential or parallel execution of the benchmarking tests.
• Process modelling and analytics, allowing the specification of high-level abstract processes as analytics workflows, their mapping to specific analytics algorithms based on a meta-learning approach, and the provision of recommendations at graph design time.
• Data toolkit providing the means to continuously check interdependency rules during the design and deployment of a valid analytics graph, and to enable the specification of requirements concerning both end-to-end graph objectives and the parameters of each specific analytics algorithm / task (see the graph-validation sketch after this list).
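
To make the NDCG figure cited above concrete, the following minimal Python sketch computes NDCG for one ranked list of recommendations. The relevance judgements are invented for illustration and do not come from the project's evaluation.

    import math

    # Minimal NDCG computation; the relevance scores below are invented.
    def dcg(relevances):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

    def ndcg(ranked_relevances):
        ideal = dcg(sorted(ranked_relevances, reverse=True))
        return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

    # Relevance of deployment patterns in the order the engine ranked them:
    print(round(ndcg([1, 0, 2, 1, 0]), 4))  # ~0.7763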
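The reinforcement-learning-based orchestration can be illustrated with a minimal tabular Q-learning loop. Everything here (states, candidate nodes, reward model) is an invented assumption; the actual BigDataStack orchestrator is far more elaborate.

    import random

    ACTIONS = ["node-a", "node-b", "node-c"]   # hypothetical placement targets
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2      # learning rate, discount, exploration
    q = {}                                     # (state, action) -> estimated value

    def best(state):
        return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

    def choose(state):
        # Epsilon-greedy: explore occasionally, otherwise exploit.
        return random.choice(ACTIONS) if random.random() < EPSILON else best(state)

    def reward(load, node):
        # Toy reward: each load level has one well-suited node.
        return 1.0 if {"low": "node-a", "medium": "node-b", "high": "node-c"}[load] == node else -0.1

    for _ in range(2000):
        load = random.choice(["low", "medium", "high"])   # observed cluster state
        action = choose(load)
        r = reward(load, action)
        # Q-learning update (the toy transition keeps the same state).
        old = q.get((load, action), 0.0)
        q[(load, action)] = old + ALPHA * (r + GAMMA * q.get((load, best(load)), 0.0) - old)

    print({s: best(s) for s in ("low", "medium", "high")})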
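The checkpoint-based runtime evaluation can be sketched as follows; the metric, threshold values and level names are illustrative assumptions, not the project's actual configuration.

    # Checkpoints of increasing criticality for a single indicator
    # (here: a hypothetical response time in milliseconds).
    CHECKPOINTS = [(200.0, "warning"), (500.0, "critical"), (1000.0, "violation")]

    def evaluate(metric_value):
        """Return the most severe level whose threshold the metric exceeds."""
        level = "ok"
        for threshold, name in CHECKPOINTS:
            if metric_value > threshold:
                level = name
        return level

    for sample in (120.0, 320.0, 1500.0):
        print(sample, "->", evaluate(sample))   # ok, warning, violation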
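Probabilistic error detection can be illustrated generically: score each tuple by how likely its values are under simple per-column frequency estimates, and flag low-probability tuples. This sketch conveys the general idea only; the project's template-based approach is more sophisticated, and the data below are invented.

    from collections import Counter

    rows = [("red", "apple"), ("red", "apple"), ("green", "apple"),
            ("red", "apple"), ("red", "banana"), ("blue", "car")]

    # Empirical value frequencies per column.
    freqs = [Counter(col) for col in zip(*rows)]

    def tuple_score(row):
        """Product of per-column empirical probabilities (independence assumption)."""
        score = 1.0
        for value, freq in zip(row, freqs):
            score *= freq[value] / len(rows)
        return score

    flagged = [r for r in rows if tuple_score(r) < 0.05]
    print(flagged)   # ('blue', 'car') stands out as a likely error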
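The region-split adaptation of the distributed storage can be sketched as splitting a region's key range at the median once it grows past a size limit, handing one half off for migration. The data structures and the threshold are simplified assumptions.

    MAX_REGION_KEYS = 4   # toy limit; real systems use byte/row thresholds

    class Region:
        """A contiguous, sorted key range held by one storage server."""
        def __init__(self, keys):
            self.keys = sorted(keys)

        def maybe_split(self):
            # Split at the median key when the region grows too large;
            # the returned half would be migrated to another server.
            if len(self.keys) <= MAX_REGION_KEYS:
                return None
            mid = len(self.keys) // 2
            right = Region(self.keys[mid:])
            self.keys = self.keys[:mid]
            return right

    region = Region([3, 1, 9, 7, 5, 11])
    new_region = region.maybe_split()
    print(region.keys, new_region.keys)   # [1, 3, 5] [7, 9, 11]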
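The split-point mechanism of the seamless analytics framework can be sketched as each table tracking a timestamp below which rows have been migrated to the object store, with a read transparently combining both sides. Table names, fields and the timestamp scheme are invented for illustration.

    # Per-table split point: rows with ts < split point live in the object
    # store; newer rows remain in the operational database (assumed layout).
    SPLIT_POINTS = {"orders": 1000}

    def seamless_query(table, predicate, db_rows, object_store_rows):
        """Answer a query across both stores without a new query language."""
        sp = SPLIT_POINTS[table]
        historic = [r for r in object_store_rows if r["ts"] < sp and predicate(r)]
        recent = [r for r in db_rows if r["ts"] >= sp and predicate(r)]
        return historic + recent

    db = [{"ts": 1500, "amount": 70}]
    cold = [{"ts": 400, "amount": 90}]
    print(seamless_query("orders", lambda r: r["amount"] > 50, db, cold))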
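Data skipping rests on a simple idea: keep lightweight metadata (e.g. per-column min/max) for each storage object and consult it before reading. The sketch below is a toy min/max version only; the actual component plugs into Spark and supports user-defined metadata types, as noted above.

    # Min/max metadata per storage object (invented example values).
    OBJECTS = [
        {"path": "part-0", "min_temp": -5, "max_temp": 10},
        {"path": "part-1", "min_temp": 20, "max_temp": 35},
    ]

    def objects_for_range(objects, low, high):
        """Keep only objects whose min/max range can overlap [low, high]."""
        return [o for o in objects if o["max_temp"] >= low and o["min_temp"] <= high]

    # A query for temp > 15 only needs part-1; part-0 is skipped unread.
    print([o["path"] for o in objects_for_range(OBJECTS, 15, float("inf"))])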
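Finally, the interdependency checking performed by the data toolkit can be illustrated as validating that every input of every task in an analytics graph is produced by some task or external source. The graph representation below is an invented simplification.

    # Each task declares the datasets it consumes and produces (assumed model).
    TASKS = {
        "clean": {"inputs": ["raw"], "outputs": ["clean_data"]},
        "train": {"inputs": ["clean_data"], "outputs": ["model"]},
        "score": {"inputs": ["model", "clean_data"], "outputs": ["predictions"]},
    }
    EXTERNAL_INPUTS = {"raw"}   # provided from outside the graph

    def validate(tasks):
        """Flag inputs that no task produces and no external source provides."""
        produced = {out for t in tasks.values() for out in t["outputs"]}
        problems = []
        for name, t in tasks.items():
            for inp in t["inputs"]:
                if inp not in produced and inp not in EXTERNAL_INPUTS:
                    problems.append((name, inp))
        return problems

    print(validate(TASKS))   # [] -> the graph's interdependencies hold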

Website & more info

More info: http://www.bigdatastack.eu/.