In the past few decades, aggressive technology scaling was the main driving force behind the explosive performance improvement of digital systems, which radically reshaped the ways we work, entertain and communicate. New paradigms such as cloud computing and the Internet of Things (IoT) enable the intelligent interconnection of billions of devices. These devices will generate huge volumes of data (exabytes) that need to be processed and analyzed in centralized or decentralized data-centers close to the users. The analysis of such data is expected to lead to new scientific discoveries and new applications that will further improve our everyday life. However, such paradigms are under threat, since the “free ride” of technology scaling seems to be coming to an end: nano-scale circuits are becoming very prone to failures due to increased static and dynamic variations of circuit parameters. The adoption of pessimistic margins to address such variations, along with stagnant voltage scaling, has elevated power to a prime design challenge.
To substantially improve energy-efficiency, there is a need to design new error-resilient server ecosystems that deal with the increased hardware variability in a more intelligent way than the conventional pessimistic paradigms. The UniServer project turns the tables and puts forth the following question: why allow the worst-case operating margins of fabricated chips to artificially constrain the performance and energy of today's systems? The reality is that each manufactured processor and each memory module is inherently different and falls into a distinct performance bin, meaning that each chip has different capabilities in terms of energy-efficiency and performance. According to the overall UniServer design target, the computing industry needs to see such heterogeneity not as a problem but as an opportunity to improve energy-efficiency, especially in next-generation servers. ‘Functional heterogeneity’ has already been adopted in embedded systems and in servers with hybrid CPU/GPU/accelerator architectures. It is therefore now time to also expose the ‘intrinsic heterogeneity’, harness it and use it to our advantage by redesigning the hardware and software to improve energy-efficiency or performance, which is essential for realizing the microservers needed to support the imminent IoT revolution. Based on this observation, the UniServer approach plans to replace the existing conservative margins with the real capabilities of each individual core and memory array. This will enable us to exceed the energy and performance scaling boundaries currently adopted in servers.
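As a rough illustration of why replacing a single worst-case margin with per-core operating points can save energy, the following minimal sketch compares dynamic power at an assumed nominal voltage with dynamic power at an assumed per-core characterized voltage. It relies only on the standard approximation that dynamic power scales with the square of the supply voltage at a fixed frequency. All voltages, core names and the residual guard band are hypothetical placeholders, not UniServer measurements, and the function names are invented for this example.

```python
# Hypothetical illustration: all voltages below are made-up placeholders,
# not UniServer measurements.
NOMINAL_MV = 980  # assumed worst-case (guard-banded) supply voltage, in millivolts

# Assumed per-core minimum safe voltages found by characterization (millivolts).
CHARACTERIZED_VMIN_MV = {"core0": 900, "core1": 920, "core2": 885, "core3": 910}

SAFETY_GUARD_MV = 10  # assumed small residual guard band kept on top of Vmin


def operating_voltage_mv(vmin_mv, guard_mv=SAFETY_GUARD_MV, nominal_mv=NOMINAL_MV):
    """Per-core operating point: characterized Vmin plus guard, capped at nominal."""
    return min(vmin_mv + guard_mv, nominal_mv)


def relative_dynamic_power(voltage_mv, nominal_mv=NOMINAL_MV):
    """Dynamic power relative to nominal, using the P ~ V^2 approximation."""
    return (voltage_mv / nominal_mv) ** 2


def per_core_savings(vmin_table):
    """Map each core to its estimated fractional dynamic-power saving."""
    return {
        core: 1.0 - relative_dynamic_power(operating_voltage_mv(vmin))
        for core, vmin in vmin_table.items()
    }


if __name__ == "__main__":
    for core, saving in per_core_savings(CHARACTERIZED_VMIN_MV).items():
        print(f"{core}: ~{saving:.1%} lower dynamic power than at nominal voltage")
```

The sketch ignores static power, frequency changes and error-handling overheads, so it should be read only as an indication of how per-chip margins translate into per-chip savings.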
During this period, the work focused on enabling the project and the targeted technologies: setting up the project management infrastructure, preparing and distributing the world’s first 64-bit ARM-based Server-on-Chip (i.e. the X-Gene2 board) to the partners, demystifying its characteristics and capabilities, and defining the initial interfaces at the firmware and system-software levels between the Hypervisor and OpenStack.
In particular, the cores and memories have already been characterised under various conditions, with results for cores, caches and DRAM showing significant design margins that can be exploited within the UniServer concept. By month 18, the Hardware Exposure Interface (HEI) and the error handling procedures had been defined and implemented on the first prototype. A first beta version of the HealthLog and StressLog monitors has also been implemented, while the interface between the Predictor and the other software components of the UniServer platform has been defined and is being ported to the initial prototype.
We have also quantified the intensity of hypercall and system call use at the hypervisor level and built a fault injection infrastructure, which is already being used to identify the impact of potential faults on various structures of the system software. Our analysis shows the steps necessary to enable intelligent, selective protection, as well as the sensitivity of different data structures and code modules of the hypervisor at both the user and kernel level. To this end, we have started implementing mechanisms to increase the resilience of the hypervisor against CPU faults (migration of sensitive system code to reliable cores) and memory faults (support for heterogeneous-reliability memory through different memory zones). Resilience mechanisms and enhanced monitoring capabilities have also been defined and enabled at the OpenStack layer. During this period, all applications were collected and ported to the UniServer board, and initial results are being collected against success metrics that were also defined in this period. The project ideas and results have appeared in numerous publications in top-tier venues and were disseminated through two organized workshops, numerous talks, the project website and social media channels.
Overall, based on estimates made by the UniServer consortium, the targeted software and hardware ecosystem could improve the energy efficiency of running IoT and Big Data applications by 31x by 2019. The described ecosystem and its novel technologies could be integrated into classical high-end servers, as well as into newly introduced platforms with server-like characteristics that are based on embedded processors, referred to as micro-servers. Enriching current servers and micro-servers with the hardware and software technologies described above will help empower the next generation of data-centers, not only in the cloud but also at the edge.
Besides addressing the power and variability challenge, the envisioned ecosystem also contributes to sustainability and programmability and addresses privacy and security concerns by running services at the Edge as a complement to the Cloud. Services running at the Edge relieve the public network of the Big Data burden and at the same time ensure the quality of service required by latency-sensitive IoT services. The complete software ecosystem also allows cloud and edge data-centers to be administered seamlessly, reducing the programming effort that would otherwise be required to port a service to specialized hardware in the cloud. Finally, the ability of edge resources to provide a complete service within a home or on the premises of a small enterprise naturally lends itself to improved privacy, since the data do not need to be communicated through the public network or reside in third-party data-centers.
Overall, the realization of the envisioned error-resilient, energy-efficient ecosystem faces many challenges, as detailed above, since radically new technologies need to be developed jointly by hardware and software developers. The UniServer consortium brings together a team of academic institutions and world-leading industrial partners that are actively working towards realizing and evaluating the benefits of this vision, which already shows potential to break the conventional pessimistic limits on performance and energy-efficiency.
More info: http://www.uniserver2020.eu/.