Today, we are witnessing the increasing use of cloud and virtualization technologies, which are a prominent way for the industry to develop mixed-criticality systems (MCSs) and reduce SWaP-C factors (size, weight, power, and cost) by flexibly consolidating multiple critical and non-critical software components on the same System-on-a-Chip (SoC). Unfortunately, using virtualization raises several issues in assessing isolation aspects, especially temporal behaviors, which must be evaluated to comply with safety-related standards (e.g., EN50128 in the railway domain). This study proposes a systematic approach for verifying temporal isolation properties in virtualized MCSs to characterize and mitigate timing failures, a fundamental aspect of dependability. In particular, as proof of the effectiveness of our proposal, we exploited the real-time flavor of the Xen hypervisor to deploy a virtualized 2-out-of-2 MCS scenario provided within an academic-industrial partnership in the railway domain. The results point out that virtualization overhead must be carefully tuned in a real industrial scenario according to the features provided by the specific hypervisor solution. Further, we identify a set of directions toward employing virtualization in industry in the context of ARM-based mixed-criticality systems.
This work proposes a stochastic characterization of resilient 5G architectures, where attributes such as performance and availability play a crucial role. As regards performance, we focus on the delay associated with the Packet Data Unit session establishment, a 5G procedure recognized as critical for its impact on the Quality of Service and Experience of end-users. To formally characterize this aspect, we employ the non-product-form queueing networks framework where: i) the main nodes of a 5G architecture are realistically modeled as G/G/m queues, which do not admit analytical solutions; ii) the decomposition method, useful to capture subtle quantities involved in the chain of interconnected 5G nodes, is conveniently customized. The results of the performance characterization constitute the input of the availability modeling, where we design a hierarchical scheme to characterize the probabilistic failure/repair behavior of 5G nodes by combining two formalisms: i) Reliability Block Diagrams, useful to capture the high-level interconnections between nodes; ii) Stochastic Reward Networks, to model the internal structure of each node. The final result is an optimal resilient 5G setting that fulfills both a performance constraint (e.g., a temporal threshold) and an availability constraint (e.g., the so-called five nines) at the minimum cost, namely, with the smallest number of redundant elements. The theoretical part is complemented by an empirical assessment carried out through Open5GS, a 5G testbed that we have deployed to realistically estimate the main performance and availability metrics.
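As a rough, self-contained illustration of the per-node delay model mentioned above, the Python sketch below applies the Allen-Cunneen approximation for G/G/m queues, a standard ingredient of decomposition-based analyses of non-product-form networks; the node names (AMF, SMF, UPF), arrival/service rates, and variability coefficients are illustrative assumptions, not values taken from the study.

```python
from math import factorial

def erlang_c(m, rho):
    """Probability that an arriving job must wait in an M/M/m queue,
    with per-server utilization rho = lam / (m * mu)."""
    a = m * rho
    num = a**m / (factorial(m) * (1 - rho))
    den = sum(a**k / factorial(k) for k in range(m)) + num
    return num / den

def gg_m_waiting_time(lam, mu, m, ca2, cs2):
    """Allen-Cunneen approximation: Wq(G/G/m) ~= Wq(M/M/m) * (ca2 + cs2) / 2,
    where ca2 and cs2 are the squared coefficients of variation of
    inter-arrival and service times."""
    rho = lam / (m * mu)
    assert rho < 1, "the node must be stable"
    wq_mmm = erlang_c(m, rho) / (m * mu - lam)
    return wq_mmm * (ca2 + cs2) / 2.0

# Illustrative values only (rates per second): a chain of 5G core nodes
# traversed by a session establishment request; the end-to-end delay is the
# sum of per-node waiting and service times.
nodes = [  # (lam, mu, servers, ca2, cs2)
    (80.0, 30.0, 4, 1.2, 0.9),   # e.g., AMF
    (80.0, 50.0, 2, 1.1, 1.4),   # e.g., SMF
    (80.0, 25.0, 5, 1.3, 1.0),   # e.g., UPF
]
delay = sum(gg_m_waiting_time(l, mu, m, ca2, cs2) + 1.0 / mu
            for (l, mu, m, ca2, cs2) in nodes)
print(f"approximate end-to-end establishment delay: {delay * 1e3:.1f} ms")
```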
The evolution of industrial environments makes reconfigurability and flexibility key requirements to rapidly adapt to changeable market needs. Computing paradigms like Edge/Fog computing are able to provide the required flexibility and scalability while guaranteeing low latencies and response times. Orchestration systems play a key role in these environments, enforcing automatic management of resources and workloads’ lifecycle, and drastically reducing the need for manual interventions. However, they do not currently meet industrial non-functional requirements, such as real-timeliness, determinism, reliability, and support for mixed-criticality workloads. In this paper, we present k4.0s, an orchestration system for Industry 4.0 (I4.0) environments, which enables the support for real-time and mixed-criticality workloads. We highlight through experiments the need for novel monitoring approaches and propose a workflow for selecting monitoring metrics, which depends on both workload requirements and hosting node guarantees. We introduce new abstractions for the components of a cluster in order to enable criticality-aware monitoring and orchestration of real-time industrial workloads. Finally, we design an orchestration system architecture that reflects the proposed model, introducing new components and prototyping a Kubernetes-based implementation, taking the first steps toward a fully I4.0-enabled orchestration system.
Technological advances in embedded systems and the advent of fog computing have led to improved quality of service for cyber-physical system applications. In fact, the deployment of such applications on powerful and heterogeneous embedded systems, such as multiprocessor systems-on-chip (MPSoCs), allows them to meet latency requirements and real-time operation. Highly relevant to the industry and to our reference case study, the challenging field of nuclear fusion deploys the aforementioned applications, involving high-frequency control with hard real-time and safety constraints. The use of fog computing and MPSoCs is promising to achieve safety, low latency, and timeliness of such control. Indeed, on the one hand, applications designed according to fog computing distribute computation across hierarchically organized and geographically distributed edge devices, enabling timely anomaly detection during high-frequency sampling of time series; on the other hand, MPSoCs allow leveraging fog computing and integrating monitoring by deploying tasks on a flexible platform suited for mixed-criticality software, leading to so-called mixed-criticality systems (MCSs). However, the integration of such software on the same MPSoC opens challenges related to predictability and reliability guarantees, as tasks interfering with each other when accessing the same shared MPSoC resources may introduce non-deterministic latency, possibly leading to failures on account of deadline overruns. Addressing the design, deployment, and evaluation of MCSs on MPSoCs, we propose a model-based system development process that facilitates the integration of real-time and monitoring software on the same platform by means of a formal notation for modeling the design and deployment of MPSoCs. The proposed notation allows developers to leverage embedded hypervisors for monitoring real-time applications and guaranteeing predictability by isolation of hardware resources. Providing evidence of the feasibility of our system development process and evaluating the industry-relevant class of nuclear fusion applications, we experiment with a safety-critical case study in the context of the ITER nuclear fusion reactor. Our experimentation involves the design and evaluation of several prototypes deployed as MCSs on a virtualized MPSoC, showing that deployment choices linked to monitor placement and virtualization configurations (e.g., resource allocation, partitioning, and scheduling policies) can significantly impact the predictability of MCSs in terms of Worst-Case Execution Times and other related metrics.
In modern telecommunication networks, services are provided through Service Function Chains (SFC), where network resources are implemented by leveraging virtualization and containerization technologies. In particular, the possibility of easily adding or removing network resources has prompted service providers to redefine some concepts, including performance and availability. In line with this new trend, we propose a performability study of a multi-provider containerized IP Multimedia Subsystem (cIMS), an SFC-like infrastructure used in the core part of 4G/5G networks to handle multimedia sessions. On the one hand, performance issues are tackled by modeling each cIMS node in terms of a G/G/m queueing system to derive the Call Setup Delay (CSD), a performance metric related to the user-end experience in multimedia communications. On the other hand, availability issues are addressed through the Multi-State System (MSS) formalism, to take into account different performance rates of the system. Then, we devise an algorithm called PE-MUGF (Performability Evaluation through Multidimensional Universal Generating Function) to identify the minimum-redundancy cIMS configuration which meets given performance and availability targets at the same time. Finally, an extensive experimental analysis based on Clearwater, a containerized IMS testbed, allows us to estimate most of the system parameters, whose robustness is evaluated through a sensitivity analysis.
In the railway domain, rolling stock maintenance affects service operation time and efficiency. Minimizing train unavailability is essential for reducing capital loss and operational costs. To this aim, prediction of failures of rolling stock equipment is crucial to proactively trigger proper maintenance activities. Indeed, predictive maintenance is a golden example of the digital transformation within Industry 4.0, which affects several engineering processes in the railway domain. Nowadays, it may leverage artificial intelligence and machine learning (ML) algorithms to forecast failures and schedule the optimal time for maintenance actions. Generally, rail systems deteriorate gradually over time or fail directly, leading to data that vary extremely slowly. ML approaches for predictive maintenance should consider this type of data to accurately predict and forecast failures. This paper proposes a methodology based on Long Short-Term Memory (LSTM) deep learning algorithms for predictive maintenance of railway rolling stock equipment. The methodology allows us to properly learn long-term dependencies in gradually changing data, and to both predict and forecast failures of rail equipment. In the framework of an academic-industrial partnership, the methodology is evaluated on a train traction converter cooling system, demonstrating its applicability and benefits. The results show that it outperforms state-of-the-art methods, reaching a failure prediction and forecasting accuracy over 99%, with a false alarm rate of ~0.4% and a mean absolute error on the order of 10^-4, respectively.
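The sketch below gives a minimal idea of the kind of LSTM-based failure predictor such a methodology relies on, assuming a univariate sensor signal split into sliding windows labeled by whether a failure occurs within a prediction horizon; the window length, layer sizes, and training settings are illustrative assumptions and do not reproduce the published model.

```python
import numpy as np
import tensorflow as tf

# Illustrative only: sliding windows over a slowly varying sensor signal
# (e.g., coolant temperature), labeled 1 when a failure occurs within the
# prediction horizon. Window length and layer sizes are assumptions.
WINDOW = 60

def make_windows(series, labels, window=WINDOW):
    """Turn a 1-D series into (samples, window, 1) tensors and aligned labels."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = labels[window:]
    return X[..., np.newaxis], y

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, 1)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # failure within horizon?
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# With real data one would do, e.g.:
# X_train, y_train = make_windows(train_series, train_labels)
# model.fit(X_train, y_train, epochs=20, batch_size=128, validation_split=0.1)
```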
Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not accompanied by any explicit failure notification, hindering timely detection and recovery. In this work, we propose an approach to run-time failure detection tailored for monitoring multi-tenant and concurrent cloud computing systems. The approach uses a non-intrusive form of event tracing, without manual changes to the system’s internals to propagate session identifiers (IDs), and builds a set of lightweight monitoring rules from fault-free executions. We evaluated the effectiveness of the approach in detecting failures in the context of the OpenStack cloud computing platform, a complex and “off-the-shelf” distributed system, by executing a campaign of fault injection experiments in a multi-tenant scenario. Our experiments show that the approach detects failures with an F1 score (0.85) and accuracy (0.77) higher than the ones provided by the OpenStack failure logging mechanisms (0.53 and 0.50) and two non-session-aware run-time verification approaches (both lower than 0.15). Moreover, the approach significantly decreases the average time to detect failures at run-time (~114 seconds) compared to the OpenStack logging mechanisms.
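To convey the flavor of session-aware monitoring rules learned from fault-free executions, the sketch below learns rules of the form "if event A occurs in a session, event B must follow in the same session" and flags sessions that violate them; it is a simplified stand-in for the published approach, and the OpenStack-like event names are illustrative.

```python
from collections import defaultdict

# A trace is a list of (session_id, event) pairs; a rule (A, B) means:
# "if A occurs in a session, B must occur later in the same session".

def sessions(trace):
    by_sid = defaultdict(list)
    for sid, event in trace:
        by_sid[sid].append(event)
    return by_sid

def learn_rules(fault_free_trace):
    rules = None
    for events in sessions(fault_free_trace).values():
        # rules satisfied by this session: B follows some occurrence of A
        session_rules = {(a, b)
                         for i, a in enumerate(events)
                         for b in set(events[i + 1:])}
        if rules is None:
            rules = session_rules
        else:
            # keep a rule only if this session does not contradict it
            rules = {(a, b) for (a, b) in rules
                     if a not in events or (a, b) in session_rules}
    return rules or set()

def detect_failures(rules, runtime_trace):
    failing = set()
    for sid, events in sessions(runtime_trace).items():
        for a, b in rules:
            if a in events and b not in events[events.index(a) + 1:]:
                failing.add(sid)
    return failing

training = [("s1", "create_vm"), ("s1", "attach_volume"), ("s1", "vm_active"),
            ("s2", "create_vm"), ("s2", "vm_active")]
rules = learn_rules(training)
print(detect_failures(rules, [("s3", "create_vm"), ("s3", "attach_volume")]))
# {'s3'}: the session never reached 'vm_active', a likely silent failure
```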
Nowadays, most telecommunication services adhere to the Service Function Chain (SFC) paradigm, where network functions are implemented via software. In particular, container virtualization is becoming a popular approach to deploy network functions and to enable resource slicing among several tenants. The resulting infrastructure is a complex system composed of a huge number of containers implementing different SFC functionalities, along with different tenants sharing the same chain. The complexity of such a scenario led us to evaluate two critical metrics: the steady-state availability (the probability that a system is functioning in the long run) and the latency (the time between a service request and the pertinent response). Consequently, we propose a latency-driven availability assessment for multi-tenant service chains implemented via Containerized Network Functions (CNFs). We adopt a multi-state system to model single CNFs and the queueing formalism to characterize the service latency. To efficiently compute the availability, we develop a modified version of the Multidimensional Universal Generating Function (MUGF) technique. Finally, we solve an optimization problem to minimize the SFC cost under an availability constraint. As a relevant example of SFC, we consider a containerized version of the IP Multimedia Subsystem, whose parameters have been estimated through fault injection techniques and load tests.
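A minimal sketch of the Universal Generating Function idea underlying the MUGF technique is given below: each CNF is described by a u-function mapping performance rates to probabilities, u-functions are combined with series/parallel structure operators, and availability is the probability that the resulting rate covers the demand; the example components and rates are illustrative, and the multi-tenant (multidimensional) extension is not shown.

```python
from itertools import product

# A u-function is a dict {performance_rate: probability} for one component.

def compose(u1, u2, op):
    """Combine two u-functions with a structure operator
    (min for series, sum for parallel redundancy)."""
    out = {}
    for (g1, p1), (g2, p2) in product(u1.items(), u2.items()):
        g = op(g1, g2)
        out[g] = out.get(g, 0.0) + p1 * p2
    return out

def availability(u, demand):
    """Probability that the system performance rate covers the demand."""
    return sum(p for g, p in u.items() if g >= demand)

# Illustrative CNF models: rate = requests/s the CNF can sustain, with probabilities.
cnf_a = {0: 0.01, 500: 0.04, 1000: 0.95}   # e.g., a P-CSCF replica
cnf_b = {0: 0.02, 800: 0.98}               # e.g., an S-CSCF replica

# Two replicas of cnf_a in parallel, then in series with cnf_b.
parallel_a = compose(cnf_a, cnf_a, lambda x, y: x + y)
chain = compose(parallel_a, cnf_b, min)
print(f"A(demand=800 req/s) = {availability(chain, 800):.5f}")
```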
In this work, we present a novel fault injection solution (ThorFI) for virtual networks in cloud computing infrastructures. ThorFI is designed to provide non-intrusive fault injection capabilities for a cloud tenant, and to isolate injections from interfering with other tenants on the infrastructure. We present the solution in the context of the OpenStack cloud management platform, and release this implementation as open-source software. Finally, we present two relevant case studies of ThorFI, in an NFV IMS and in a high-availability cloud application, respectively. The case studies show that ThorFI can enhance functional tests with fault injection, as in 4%-34% of the test cases the IMS is unable to handle faults; and that despite redundancy in virtual networks, faults in one virtual network segment can propagate to other segments, and can affect the throughput and response time of the cloud application as a whole, by about 3 times in the worst case.
Virtualization is gaining traction in the industry as it promises a flexible way to integrate, manage, and re-use heterogeneous software components with mixed-criticality levels, on a shared hardware platform, while obtaining isolation guarantees. This work surveys the state-of-the-practice of real-time virtualization technologies by discussing common issues in the industry. In particular, we analyze how different virtualization approaches and solutions can impact isolation guarantees and testing/certification activities, and how they deal with dependability challenges. The aim is to highlight current industry trends and support industrial practitioners in choosing the most suitable solution according to their application domains.
Identifying the failure modes of cloud computing systems is a difficult and time-consuming task, due to the growing complexity of such systems, and the large volume and noisiness of failure data. This paper presents a novel approach for analyzing failure data from cloud systems, in order to relieve human analysts from manually fine-tuning the data for feature engineering. The approach leverages Deep Embedded Clustering (DEC), a family of unsupervised clustering algorithms based on deep learning, which uses an autoencoder to optimize data dimensionality and inter-cluster variance. We applied the approach in the context of the OpenStack cloud computing platform, both on the raw failure data and in combination with an anomaly detection pre-processing algorithm. The results show that the performance of the proposed approach, in terms of purity of clusters, is comparable to, or in some cases even better than, manually fine-tuned clustering, thus avoiding the need for deep domain knowledge and reducing the effort to perform the analysis. In all cases, the proposed approach provides better performance than unsupervised clustering when no feature engineering is applied to the data. Moreover, the distribution of failure modes from the proposed approach is closer to the actual frequency of the failure modes.
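The sketch below illustrates only the pretraining-plus-clustering stage behind Deep Embedded Clustering: an autoencoder compresses per-experiment feature vectors and clusters are formed in the latent space; full DEC additionally refines the encoder and cluster assignments jointly, and the dimensions and placeholder data here are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans

# Illustrative dimensions: 300 raw features per experiment, 10 latent features,
# 8 candidate failure modes.
INPUT_DIM, LATENT_DIM, N_CLUSTERS = 300, 10, 8

inputs = tf.keras.Input(shape=(INPUT_DIM,))
encoded = tf.keras.layers.Dense(128, activation="relu")(inputs)
latent = tf.keras.layers.Dense(LATENT_DIM, name="latent")(encoded)
decoded = tf.keras.layers.Dense(128, activation="relu")(latent)
outputs = tf.keras.layers.Dense(INPUT_DIM)(decoded)

autoencoder = tf.keras.Model(inputs, outputs)   # reconstructs the raw features
encoder = tf.keras.Model(inputs, latent)        # maps features to latent space
autoencoder.compile(optimizer="adam", loss="mse")

# X: one feature vector per fault injection experiment (placeholder data here).
X = np.random.rand(1000, INPUT_DIM).astype("float32")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

# Cluster the latent representations to group experiments into failure modes.
latent_X = encoder.predict(X, verbose=0)
failure_modes = KMeans(n_clusters=N_CLUSTERS, n_init=10).fit_predict(latent_X)
```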
Virtualization technology is nowadays adopted in security-critical embedded systems to achieve higher performance and more design flexibility. However, it also comes with new security threats, where attackers leverage timing covert channels to exfiltrate sensitive information from a partition using a trojan. This paper presents a novel approach for the experimental assessment of timing covert channels in embedded hypervisors, with a case study on security assessment of a commercial hypervisor product (Wind River VxWorks MILS), in cooperation with a licensed laboratory for the Common Criteria security certification. Our experimental analysis shows that it is indeed possible to establish a timing covert channel, and that the approach is useful for system designers for assessing that their configuration is robust against this kind of information leakage.
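As a toy, process-level analogue of the timing covert channel assessed in the paper (which targets hypervisor partitions), the sketch below shows a "trojan" thread encoding bits by either loading or idling the CPU in fixed slots while an "observer" thread decodes them from how much work it completes per slot; slot length, workload, and message are arbitrary assumptions.

```python
import threading
import time

SLOT = 0.2                          # seconds per bit (arbitrary)
MESSAGE = [1, 0, 1, 1, 0, 0, 1, 0]  # bits the trojan tries to exfiltrate

def trojan(start):
    time.sleep(max(0.0, start - time.perf_counter()))
    for i, bit in enumerate(MESSAGE):
        slot_end = start + (i + 1) * SLOT
        if bit:
            while time.perf_counter() < slot_end:
                pass                                  # contend for the CPU
        else:
            time.sleep(max(0.0, slot_end - time.perf_counter()))

def observer(start, samples):
    time.sleep(max(0.0, start - time.perf_counter()))
    for i in range(len(MESSAGE)):
        slot_end = start + (i + 1) * SLOT
        work = 0
        while time.perf_counter() < slot_end:
            sum(range(1000))                          # fixed unit of work
            work += 1
        samples.append(work)

samples = []
start = time.perf_counter() + 0.2
threads = [threading.Thread(target=trojan, args=(start,)),
           threading.Thread(target=observer, args=(start, samples))]
for t in threads: t.start()
for t in threads: t.join()

# Less completed work in a slot means contention, i.e., the trojan sent a 1.
threshold = (max(samples) + min(samples)) / 2
decoded = [1 if s < threshold else 0 for s in samples]
print("sent:   ", MESSAGE)
print("decoded:", decoded)
```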
Cloud computing systems fail in complex and unforeseen ways, due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are yet unknown. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system, to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes with a low computational cost.
Protocol violation bugs in storage device drivers are a critical threat for data integrity, since these bugs can silently corrupt the commands and data flowing between the OS and storage devices. Due to their nature, these bugs are notoriously difficult to find by traditional testing. In this paper, we propose a run-time monitoring approach for storage device drivers, in order to detect I/O protocol violations that would otherwise silently escalate in corruptions of users' data. The monitoring approach detects violations of I/O protocols by automatically learning a reference model from failure-free execution traces. The approach focuses on selected portions of the storage controller interface, in order to achieve a good trade-off in terms of low performance overhead and high coverage and accuracy of failure detection. We assess these properties on three real-world storage device drivers from the Linux kernel, through fault injection and stress tests. Moreover, we show that the monitoring approach only requires few minutes of training workload, and that it is robust to differences between the operational and the training workloads.
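A minimal sketch of the underlying idea, learning a reference model of allowed command transitions from failure-free traces and flagging unseen transitions at run time, is shown below; it is not the published monitor, and the command names are illustrative placeholders loosely inspired by ATA/AHCI traffic.

```python
from collections import defaultdict

def learn_transitions(traces):
    """traces: iterable of command sequences observed at the driver/controller
    interface during failure-free runs; returns the allowed-transition model."""
    allowed = defaultdict(set)
    for trace in traces:
        for prev_cmd, next_cmd in zip(trace, trace[1:]):
            allowed[prev_cmd].add(next_cmd)
    return allowed

def monitor(allowed, live_trace):
    """Yield (position, prev, next) for transitions outside the learned model."""
    for i, (prev_cmd, next_cmd) in enumerate(zip(live_trace, live_trace[1:])):
        if next_cmd not in allowed.get(prev_cmd, set()):
            yield i, prev_cmd, next_cmd

# Illustrative training traces (command names are placeholders).
training = [
    ["CMD_IDENTIFY", "CMD_READ_DMA", "IRQ_COMPLETE", "CMD_WRITE_DMA", "IRQ_COMPLETE"],
    ["CMD_IDENTIFY", "CMD_WRITE_DMA", "IRQ_COMPLETE", "CMD_FLUSH", "IRQ_COMPLETE"],
]
model = learn_transitions(training)
violations = list(monitor(model, ["CMD_IDENTIFY", "CMD_READ_DMA", "CMD_WRITE_DMA"]))
print(violations)   # read not completed before the next write -> flagged
```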
Network Function Virtualization (NFV) envisions the use of cloud computing and virtualization technology to reduce costs and innovate network services. However, this paradigm shift poses the question whether NFV will be able to fulfill the strict performance and dependability objectives required by regulations and customers. Thus, we propose a dependability benchmark to support NFV providers in making informed decisions about which virtualization, management, and application-level solutions can achieve the best dependability. We define in detail the use cases, measures, and faults to be injected. Moreover, we present a benchmarking case study on two alternative, production-grade virtualization solutions, namely VMware ESXi/vSphere (hypervisor-based) and Linux/Docker (container-based), on which we deploy an NFV-oriented IMS system. Despite the promise of higher performance and manageability, our experiments suggest that the container-based configuration can be less dependable than the hypervisor-based one, and point out which faults NFV designers should address to improve dependability.
Emerging industrial applications require highly reconfigurable and flexible environments to effectively respond to dynamic market demands while complying with rigorous nonfunctional requirements. The orchestration of virtualized industrial components over edge/fog computing infrastructures allows the required reconfigurability to be achieved by enforcing automated management of hardware resources and applications. However, existing orchestration systems fall short of meeting crucial nonfunctional requirements, such as determinism, reliability, and application/criticality awareness, preventing their use in industrial environments. In this short paper, we conduct a preliminary timing analysis on failover mechanisms used in Kubernetes in order to identify sources of nondeterminism, which are fundamental to be mitigated in an industrial mixed-criticality scenario.
Nowadays, industries are looking into virtualization as an effective means to build safe applications, thanks to the isolation it can provide among virtual machines (VMs) running on the same hardware. In this context, a fundamental issue is understanding to what extent the isolation is guaranteed, despite possible (or induced) problems in the virtualization mechanisms. Uncovering such isolation issues is still an open challenge, especially for hardware-assisted virtualization, since the search space should include all the possible VM states (and the linked hypervisor state), which is prohibitive. In this paper, we propose IRIS, a framework to record (learn) sequences of inputs (i.e., VM seeds) from the real guest execution (e.g., OS boot), replay them as-is to reach valid and complex VM states, and finally use them as valid seeds to be mutated, enabling fuzzing solutions for hardware-assisted hypervisors. We demonstrate the accuracy and efficiency of IRIS in automatically reproducing valid VM behaviors, with no need to execute guest workloads. We also provide a proof-of-concept fuzzer, based on the proposed architecture, showing its potential on the Xen hypervisor.
Railway signaling systems provide numerous critical functions at different safety levels, to correctly implement the entire transport ecosystem. Today, we are witnessing the increasing use of cloud and virtualization technologies in such mixed-criticality systems, with the main goal of reducing costs and improving reliability, while providing orchestration capabilities. Unfortunately, virtualization introduces several issues in assessing temporal isolation, which is critical for safety-related standards like EN50128. In this short paper, we envision leveraging the real-time flavor of a general-purpose hypervisor, like Xen, to build the Railway Signaling as a Service (RSaaS) systems of the future. We provide a preliminary background, highlighting the need for a systematic evaluation of temporal isolation to demonstrate the feasibility of using general-purpose hypervisors in the safety-critical context for certification purposes.
Time-predictable edge cloud is seen as the answer to many arising needs in Industry 4.0 environments, since it is able to provide flexible, modular, and reconfigurable services with low latency and reduced costs. Orchestration systems are becoming the core component of clouds since they take decisions on the placement and lifecycle of software components. Current solutions are starting to introduce real-time container support for time predictability; however, these approaches lack determinism as well as support for workloads requiring multiple levels of assurance/criticality. In this paper, we present k4.0s, an orchestration model for real-time and mixed-criticality environments, including timeliness, criticality, and network requirements. The model leverages new abstractions for nodes and jobs, e.g., node assurance, and requires novel monitoring strategies. We sketch an implementation of the proposal based on Kubernetes, and present experiments motivating the need for node assurance levels and adequate monitoring.
We advance a performability assessment of a multi-tenant containerized IP Multimedia Subsystem (cIMS), i.e., one and the same infrastructure is shared among different providers (or tenants). Specifically, we: i) model each cIMS node (a.k.a. Containerized Network Function - CNF) through the Multi-State System (MSS) formalism to capture the dimensionality of the multi-tenant arrangement, and characterize each tenant through queueing theory attributes to capture latency-dependent performance aspects; ii) carry out an availability analysis of cIMS by means of an extended version of the Universal Generating Function (UGF) technique, dubbed Multidimensional UGF (MUGF); iii) solve an optimization problem to retrieve the cIMS deployment minimizing costs while guaranteeing high availability requirements. The whole assessment is supported by an experiment based on the containerized IMS platform Clearwater, which we deploy to derive some realistic system parameters by means of fault injection techniques.
Real-time containers are a promising solution to reduce latencies in time-sensitive cloud systems. Recent efforts are emerging to extend their usage in industrial edge systems with mixed-criticality constraints. In these contexts, isolation becomes a major concern: a disturbance (such as timing faults or unexpected overloads) affecting a container must not impact the behavior of other containers deployed on the same hardware. In this paper, we propose a novel architectural solution to achieve isolation in real-time containers, based on real-time co-kernels, hierarchical scheduling, and time-division networking. The architecture has been implemented on Linux patched with the Xenomai co-kernel, extended with a new hierarchical scheduling policy, named SCHED_DS, and integrating the RTNet stack. Experimental results are promising in terms of overhead and latency compared to other Linux-based solutions. More importantly, the isolation of containers is guaranteed even in the presence of severe co-located disturbances, such as faulty tasks (consuming more CPU time than declared) or high CPU, network, or I/O stress on the same machine.
Software bugs in cloud management systems often cause erratic behavior, hindering detection and recovery of failures. As a consequence, the failures are not timely detected and notified, and can silently propagate through the system. To face these issues, we propose a lightweight approach to runtime verification, for monitoring and failure detection of cloud computing systems. We performed a preliminary evaluation of the proposed approach in the OpenStack cloud management platform, an “off-the-shelf” distributed system, showing that the approach can be applied with high failure detection coverage.
In this paper, we present a new fault injection tool (ProFIPy) for Python software. The tool is designed to be programmable, in order to enable users to specify their software fault model, using a domain-specific language (DSL) for fault injection. Moreover, to achieve better usability, ProFIPy is provided as software-as-a-service and supports the user through the configuration of the faultload and workload, failure data analysis, and full automation of the experiments using container-based virtualization and parallelization.
In order to plan for failure recovery, the designers of cloud systems need to understand how their system can potentially fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming, due to the large volume of events, non-determinism, and reuse of third-party components. To address these issues, we propose a novel approach that joins fault injection with anomaly detection to identify the symptoms of failures. We evaluated the proposed approach in the context of the OpenStack cloud computing platform. We show that our model can significantly improve the accuracy of failure analysis in terms of false positives and negatives, with a low computational cost.
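The sketch below illustrates the pairing of fault injection with anomaly detection: a detector is fit on features of fault-free runs and then applied to a fault-injected run, so that anomalous time windows are reported as candidate failure symptoms; the feature choice, the IsolationForest detector, and the numbers are illustrative assumptions rather than the published models.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Illustrative per-window features: [events/s, error log lines, mean latency (s)].
rng = np.random.default_rng(0)
fault_free = rng.normal(loc=[50, 10, 0.2], scale=[5, 2, 0.05], size=(200, 3))

# Fit the anomaly detector on fault-free behavior only.
detector = IsolationForest(contamination=0.01, random_state=0).fit(fault_free)

# Feature vectors extracted from a fault-injected run, one per time window.
injected_run = np.array([
    [52, 11, 0.22],    # nominal window
    [49,  9, 0.19],    # nominal window
    [ 3, 40, 2.50],    # window after the injected fault: likely a symptom
])
symptoms = np.where(detector.predict(injected_run) == -1)[0]
print(f"anomalous windows (candidate failure symptoms): {symptoms}")
```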
Many research areas in software engineering, such as mutation testing, automatic repair, fault localization, and fault injection, rely on empirical knowledge about recurring bug-fixing code changes. Previous studies in this field focus on what has been changed due to bug-fixes, such as in terms of code edit actions. However, such studies did not consider where the bug-fix change was made (i.e., the context of the change), but knowing about the context can potentially narrow the search space for many software engineering techniques (e.g., by focusing mutation only on specific parts of the software). Furthermore, most previous work on bug-fixing changes focused on C and Java projects, but there is little empirical evidence about Python software. Therefore, in this paper we perform a thorough empirical analysis of bug-fixing changes in three OpenStack projects, focusing on both the what and the where of the changes. We observed that recurring change patterns are not oblivious to the surrounding code, but tend to occur in specific code contexts.
Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not timely detected and notified; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which calls for more thorough run-time checks and fault containment.
We argue for novel techniques to understand how cloud systems can fail, by enhancing fault injection with distributed tracing and anomaly detection techniques.
Bugs affecting storage device drivers include the so-called protocol violation bugs, which silently corrupt data and commands exchanged with I/O devices. Protocol violations are very difficult to prevent, since testing device drivers is notoriously difficult. To address them, we present a monitoring approach for device drivers (MoIO) to detect I/O protocol violations at run-time. The approach infers a model of the interactions between the storage device driver, the OS kernel, and the hardware (the device driver protocol) by analyzing execution traces. The model is then used as a reference for detecting violations in production. The approach has been designed to have a low overhead and to overcome the lack of source code and protocol documentation. We show that the approach is feasible and effective by applying it to the SATA/AHCI storage device driver of the Linux kernel, and by performing fault injection and long-running tests.
Network Function Virtualization (NFV) is an emerging solution that aims at improving the flexibility, the efficiency and the manageability of networks, by leveraging virtualization and cloud computing technologies to run network appliances in software. However, the “softwarization” of network functions raises reliability concerns, as they will be exposed to faults in commodity hardware and software components. In this paper, we propose a methodology for the dependability evaluation and benchmarking of NFV Infrastructures (NFVIs), based on fault injection. We discuss the application of the methodology in the context of a virtualized IP Multimedia Subsystem (IMS), and the pitfalls in the design of a reliable NFVI.
The next-generation Industrial Internet of Things (IIoT) requires smart devices featuring rich connectivity, local intelligence, and autonomous behavior. We review representative solutions, highlighting aspects that are the most relevant for integration in IIoT solutions.
In this paper, we propose the concept of Partitioned Containers, born from the convergence between containers and partitioning hypervisors. While the former is a key virtualization technology for cloud platforms, the latter is attracting interest in industrial settings, since they can provide strong isolation between applications, as mandated by safety standards. Our idea is to combine the advantages of both, fostering the adoption of cloud technologies in industrial settings toward a “container-everywhere” vision. The paper proposes an architecture and the challenges for the realization of the concept.
Orchestration systems are becoming a key component to automatically manage distributed computing resources in many fields with criticality requirements like Industry 4.0 (I4.0). However, they are mainly linked to OS-level virtualization, which is known to suffer from reduced isolation. In this paper, we propose RunPHI with the aim of integrating partitioning hypervisors, as a solution for assuring strong isolation, with OS-level orchestration systems. The purpose is to enable container orchestration in mixed-criticality systems with isolation requirements through partitioned containers.
Partitioning hypervisor solutions are becoming increasingly popular, to ensure stringent security and safety requirements related to isolation between co-hosted applications and to make more efficient use of available hardware resources. However, assessment and certification of isolation requirements remain a challenge and it is not trivial to understand what and how to test to validate these properties. Although the high-level requirements to be verified are mentioned in the different security- and safety-related standards, there is a lack of precise guidelines for the evaluator. This guidance should be comprehensive, generalizable to different products that implement partitioning, and tied specifically to lower-level requirements. The goal of this work is to provide a systematic framework that addresses this need.
Nowadays, a feature-rich automotive vehicle offers several technologies to assist the driver during the trip and to provide an engaging infotainment system for the other passengers, too. Consolidating worlds at different criticalities is a welcome challenge for car manufacturers, which have recently tried to leverage virtualization technologies due to reduced maintenance, deployment, and shipping costs. For this reason, more and more mixed-criticality systems are emerging, trying to assure compliance with the ISO 26262 Road Vehicle Safety standard. In this short paper, we provide a preliminary investigation of the certification capabilities of Jailhouse, a popular open-source partitioning hypervisor. To this aim, we propose a testing methodology and showcase the results, pointing out when the software gets to a faulting state, deviating from its expected behavior. The ultimate goal is to picture the right direction for the hypervisor towards a potential certification process.
A promising approach for designing critical embedded systems is based on virtualization technologies and multi-core platforms. These enable the deployment of both real-time and general-purpose systems with different criticalities on a single host. Integrating virtualization while also meeting the real-time and isolation requirements is non-trivial, and poses significant challenges especially in terms of certification. In recent years, researchers proposed hardware-assisted solutions to face issues coming from virtualization, and more recently the use of Operating System (OS) virtualization as a more lightweight approach. Industries are hampered in leveraging this latter type of virtualization, despite the clear benefits it introduces, such as reduced overhead, higher scalability, and effortless certification, since there is still a lack of approaches to address its drawbacks. In this position paper, we propose the usage of Intel's CPU security extension, namely SGX, to enable the adoption of enclaves based on unikernels, a flavor of OS-level virtualization, in the context of real-time systems. We present the advantages of leveraging both the SGX isolation and the unikernel features in order to meet the requirements of safety-critical real-time systems and ease the certification process.
This paper presents the design of ADaRTA, an aging detection and rejuvenation tool for Android. The tool is a software agent which i) performs selective monitoring of system processes and of trends in system performance indicators; ii) detects the aging state and estimates the time-to-aging-failure, through heuristic rules; iii) schedules and applies rejuvenation, based on the estimated time-to-aging-failure. The agent's rules and parameters have been defined for ease of configuration and tuning by device designers. A stress testing experiment is discussed, showing ADaRTA’s configurability for the device under test, and its ability to detect the aging state and prevent the device from entering a failure state.
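A minimal sketch of the heuristic idea, fitting a linear trend to a monitored indicator, estimating the time-to-aging-failure from the degradation slope, and scheduling rejuvenation with a safety margin, is shown below; the threshold, margin, and sample values are illustrative assumptions and do not reproduce ADaRTA's actual rules.

```python
import numpy as np

FAILURE_THRESHOLD_MB = 50.0      # assumed: device misbehaves below this free memory
SAFETY_MARGIN = 0.8              # rejuvenate well before the estimated failure time

def time_to_aging_failure(timestamps_s, free_mem_mb):
    """Estimate seconds until the indicator crosses the failure threshold,
    or None if no degrading trend is detected."""
    slope, _ = np.polyfit(timestamps_s, free_mem_mb, deg=1)   # MB per second
    if slope >= 0:
        return None
    current = free_mem_mb[-1]
    return (current - FAILURE_THRESHOLD_MB) / -slope

# Samples collected every 10 minutes over 6 hours (illustrative values).
t = np.arange(0, 6 * 3600, 600)
mem = 400.0 - 0.01 * t + np.random.default_rng(1).normal(0, 3, t.size)

ttaf = time_to_aging_failure(t, mem)
if ttaf is not None:
    print(f"estimated time to aging failure: {ttaf / 3600:.1f} h; "
          f"rejuvenation scheduled in {SAFETY_MARGIN * ttaf / 3600:.1f} h")
```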
The analysis of fault injection experiments can be a cumbersome task. These experiments can generate large volumes of data (e.g., message traces), which a human analyst needs to inspect to understand the behavior of the system under failure. This paper introduces the FailViz tool for visualizing fault injection experiments, which points out relevant events for interpreting the failures. We also present a motivating example in the context of OpenStack, and point out future research directions.
Network Function Virtualization (NFV) is an emerging networking paradigm that offers new ways of creating, deploying, and managing networking services, by turning physical network functions into virtualized ones. The NFV paradigm heavily relies on cloud computing and virtualization technologies to provide carrier-grade services. The certification process of NFV systems is an open and critical question to ensure that the delivered network service provides specific guarantees about performance and dependability. In this paper, we propose potential guidelines for evaluating the reliability of NFV Infrastructures (NFVIs), with the aim of verifying whether NFVIs satisfy their reliability and performance requirements even in the presence of faults. The guidelines are described as a set of key practices to be followed, in terms of inputs, activities, and outputs. These practices are intended to be conducted by companies that want to evaluate the reliability of their NFVI against quantitative performance, availability, and fault tolerance objectives, and to get precise feedback on how to improve its fault tolerance.
Nowadays, Cloud Computing is a fundamental paradigm that provides computational resources as a service, on which users heavily rely. Cloud computing infrastructures behave as an ecosystem, where several actors play a crucial role. Unfortunately, Cloud Computing Ecosystems (CCEs) are often affected by outages, such as those experienced by Amazon Web Services in recent years, that result from component faults that propagate through the whole CCE. Thus, there is still a need for approaches to improve CCEs' reliability. This paper discusses both existing approaches and open challenges for the dependability evaluation of CCEs, and the need for novel techniques and methodologies to prevent fault propagation within CCEs as a whole.
Network Function Virtualization (NFV) is an emerging solution that aims at improving the flexibility, the efficiency and the manageability of networks, by leveraging virtualization and cloud computing technologies to run network appliances in software. Nevertheless, the "softwarization" of network functions imposes software reliability concerns on future networks, which will be exposed to software issues arising from virtualization technologies. In this paper, we discuss the challenges for reliability in NFVIs, and present an industrial research project on their reliability assurance, which aims at developing novel fault injection technologies and systematic guidelines for this purpose.
The lack of tools that can fit in existing development practices and processes hampers the adoption of Software Fault Injection (SFI) in real-world projects. This paper presents an ongoing work towards an SFI tool integrated in the Eclipse IDE, and designed for usability.
A novel method and related system for injecting faults into a software application using a Domain Specific Language (DSL).
Network Function Virtualization (NFV) is an emerging networking paradigm that aims to reduce costs and time-to-market, improve manageability, and foster competition and innovative services. NFV exploits virtualization and cloud computing technologies to turn physical network functions into Virtualized Network Functions (VNFs), which will be implemented in software, and will run as Virtual Machines (VMs) on commodity hardware located in high-performance data centers, namely Network Function Virtualization Infrastructures (NFVIs). The NFV paradigm relies on cloud computing and virtualization technologies to provide carrier-grade services, i.e., the ability of a service to be highly reliable and available, with fast and automatic failure recovery mechanisms. The availability of many virtualization solutions for NFV poses the question of which virtualization technology should be adopted for NFV, in order to fulfill the requirements described above. Currently, there are limited solutions for analyzing, in quantitative terms, the performance and reliability tradeoffs, which are important concerns for the adoption of NFV. This thesis deals with the assessment of the reliability and of the performance of NFV systems. It proposes a methodology, which includes context, measures, and faultloads, to conduct dependability benchmarks in NFV, according to the general principles of dependability benchmarking. To this aim, a fault injection framework has been designed and implemented for the virtualization technologies used as case studies in this thesis. This framework is successfully used to conduct an extensive experimental campaign, where we compare two candidate virtualization technologies for NFV adoption: the commercial, hypervisor-based virtualization platform VMware vSphere, and the open-source, container-based virtualization platform Docker. These technologies are assessed in the context of a high-availability, NFV-oriented IP Multimedia Subsystem (IMS). The analysis of experimental results reveals that i) fault management mechanisms are crucial in NFV, in order to provide accurate failure detection and start the subsequent failover actions, and ii) fault injection proves to be a valuable way to introduce uncommon scenarios in the NFVI, which can be fundamental to provide a highly reliable service in production.