Publications

Journal papers

Criticality-Aware Monitoring and Orchestration for Containerized Industry 4.0 Environments

Barletta, M., Cinque, M., Luigi De Simone, and Della Corte, R.
ACM Transactions on Embedded Computing Systems, Early Access

Abstract

The evolution of industrial environments makes reconfigurability and flexibility key requirements to rapidly adapt to changeable market needs. Computing paradigms such as Edge/Fog computing can provide the required flexibility and scalability while guaranteeing low latencies and response times. Orchestration systems play a key role in these environments, enforcing automatic management of resources and workloads’ lifecycle, and drastically reducing the need for manual interventions. However, they do not currently meet industrial non-functional requirements, such as real-timeliness, determinism, reliability, and support for mixed-criticality workloads. In this paper, we present k4.0s, an orchestration system for Industry 4.0 (I4.0) environments, which enables support for real-time and mixed-criticality workloads. We highlight through experiments the need for novel monitoring approaches and propose a workflow for selecting monitoring metrics, which depends on both workload requirements and hosting node guarantees. We introduce new abstractions for the components of a cluster in order to enable criticality-aware monitoring and orchestration of real-time industrial workloads. Finally, we design an orchestration system architecture that reflects the proposed model, introducing new components and prototyping a Kubernetes-based implementation, taking the first steps toward a fully I4.0-enabled orchestration system.

Multi-Provider IMS Infrastructure with Controlled Redundancy: A Performability Evaluation

Luigi De Simone, M. Di Mauro, M. Longo, R. Natella, F. Postiglione
IEEE Transactions on Network and Service Management, Early Access

Abstract

In modern telecommunication networks, services are provided through Service Function Chains (SFC), where network resources are implemented by leveraging virtualization and containerization technologies. In particular, the possibility of easily adding or removing network resources has prompted service providers to redefine some concepts including performance and availability. In line with this new trend, we propose a performability study of a multi-provider containerized IP Multimedia Subsystem (cIMS), an SFC-like infrastructure used in the core part of 4G/5G networks to handle multimedia sessions. On the one hand, performance issues are tackled by modeling each cIMS node in terms of a G/G/m queueing system to derive the Call Setup Delay (CSD), a performance metric related to the end-user experience in multimedia communications. On the other hand, availability issues are addressed through the Multi-State System (MSS) formalism, to take into account different performance rates of the system. Then, we devise an algorithm called PE-MUGF (Performability Evaluation through Multidimensional Universal Generating Function) to identify the minimum-redundancy cIMS configuration which meets given performance and availability targets at the same time. Finally, an extensive experimental analysis based on Clearwater, a containerized IMS testbed, allows us to estimate most of the system parameters, whose robustness is evaluated through a sensitivity analysis.

LSTM-based Failure Prediction for Railway Rolling Stock Equipment

Luigi De Simone, E. Caputo, M. Cinque, A. Galli, V. Moscato, S. Russo, G. Cesaro, V. Criscuolo, G. Giannini
Elsevier Expert Systems with Applications (ESWA), Volume 222, 15 July 2023, 119767

Abstract

In the railway domain, rolling stock maintenance affects service operation time and efficiency. Minimizing train unavailability is essential for reducing capital loss and operational costs. To this aim, prediction of failures of rolling stock equipment is crucial to proactively trigger proper maintenance activities. Indeed, predictive maintenance is a golden example of the digital transformation within Industry 4.0, which affects several engineering processes in the railway domain. Nowadays, it may leverage artificial intelligence and machine learning algorithms to forecast failures and schedule the optimal time for maintenance actions. Generally, rail systems deteriorate gradually over time or fail directly, leading to data that vary extremely slowly. Indeed, ML approaches for predictive maintenance should consider this type of data to accurately predict and forecast failures. This paper proposes a methodology based on Long Short-Term Memory deep learning algorithms for predictive maintenance of railway rolling stock equipment. The methodology allows us to properly learn long-term dependencies for gradually changing data, and to both predict and forecast failures of rail equipment. In the framework of an academic-industrial partnership, the methodology is evaluated on a train traction converter cooling system, demonstrating its applicability and benefits. The results show that it outperforms state-of-the-art methods, reaching a failure prediction and forecasting accuracy over 99%, with a false alarm rate of ~0.4% and a mean absolute error in the order of 10^-4, respectively.

Run-time failure detection via non-intrusive event analysis in a large-scale cloud computing platform

Cotroneo, D., Luigi De Simone, Liguori, P., Natella, R.
Elsevier Journal of Systems and Software (JSS), 111611.

Abstract

Cloud computing systems fail in complex and unforeseen ways due to unexpected combinations of events and interactions among hardware and software components. These failures are especially problematic when they are silent, i.e., not accompanied by any explicit failure notification, hindering timely detection and recovery. In this work, we propose an approach to run-time failure detection tailored for monitoring multi-tenant and concurrent cloud computing systems. The approach uses a non-intrusive form of event tracing, without manual changes to the system’s internals to propagate session identifiers (IDs), and builds a set of lightweight monitoring rules from fault-free executions. We evaluated the effectiveness of the approach in detecting failures in the context of the OpenStack cloud computing platform, a complex and “off-the-shelf” distributed system, by executing a campaign of fault injection experiments in a multi-tenant scenario. Our experiments show that the approach detects failures with an F1 score (0.85) and accuracy (0.77) higher than the ones provided by the OpenStack failure logging mechanisms (0.53 and 0.50) and two non-session-aware run-time verification approaches (both lower than 0.15). Moreover, the approach significantly decreases the average time to detect failures at run-time (~114 seconds) compared to the OpenStack logging mechanisms.

A Latency-Driven Availability Assessment for Multi-Tenant Service Chains

Luigi De Simone, M. Di Mauro, R. Natella, F. Postiglione
IEEE Transactions on Services Computing (Early Access), 2022

Abstract

Nowadays, most telecommunication services adhere to the Service Function Chain (SFC) paradigm, where network functions are implemented via software. In particular, container virtualization is becoming a popular approach to deploy network functions and to enable resource slicing among several tenants. The resulting infrastructure is a complex system composed of a huge number of containers implementing different SFC functionalities, along with different tenants sharing the same chain. The complexity of such a scenario leads us to evaluate two critical metrics: the steady-state availability (the probability that a system is functioning in the long run) and the latency (the time between a service request and the pertinent response). Consequently, we propose a latency-driven availability assessment for multi-tenant service chains implemented via Containerized Network Functions (CNFs). We adopt a multi-state system to model single CNFs and the queueing formalism to characterize the service latency. To efficiently compute the availability, we develop a modified version of the Multidimensional Universal Generating Function (MUGF) technique. Finally, we solve an optimization problem to minimize the SFC cost under an availability constraint. As a relevant example of SFC, we consider a containerized version of IP Multimedia Subsystem, whose parameters have been estimated through fault injection techniques and load tests.

ThorFI: A Novel Approach for Network Fault Injection as a Service

D. Cotroneo, Luigi De Simone, R. Natella
Elsevier Journal of Network and Computer Applications, 2022

Abstract

In this work, we present a novel fault injection solution (ThorFI) for virtual networks in cloud computing infrastructures. ThorFI is designed to provide non-intrusive fault injection capabilities for a cloud tenant, and to isolate injections from interfering with other tenants on the infrastructure. We present the solution in the context of the OpenStack cloud management platform, and release this implementation as open-source software. Finally, we present two relevant case studies of ThorFI, respectively on an NFV IMS and on a high-availability cloud application. The case studies show that ThorFI can enhance functional tests with fault injection, as in 4%-34% of the test cases the IMS is unable to handle faults; and that despite redundancy in virtual networks, faults in one virtual network segment can propagate to other segments, and can affect the throughput and response time of the cloud application as a whole, by about 3 times in the worst case.

Virtualizing Mixed-Criticality Systems: A Survey on Industrial Trends and Issues

M. Cinque, D. Cotroneo, Luigi De Simone, S. Rosiello
Future Generation Computer Systems (FGCS), 2021

Abstract

Virtualization is gaining traction in the industry as it promises a flexible way to integrate, manage, and re-use heterogeneous software components with mixed-criticality levels, on a shared hardware platform, while obtaining isolation guarantees. This work surveys the state-of-the-practice of real-time virtualization technologies by discussing common issues in the industry. In particular, we analyze how different virtualization approaches and solutions can impact isolation guarantees and testing/certification activities, and how they deal with dependability challenges. The aim is to highlight current industry trends and support industrial practitioners in choosing the most suitable solution according to their application domains.

Software micro-rejuvenation for Android mobile systems

D. Cotroneo, Luigi De Simone, R. Natella, R. Pietrantuono, S. Russo
Journal of Systems and Software (JSS), 2021

Enhancing the Analysis of Software Failures in Cloud Computing Systems with Deep Learning

D. Cotroneo, Luigi De Simone, P. Liguori, R. Natella
Journal of Systems and Software (JSS), 2021

Abstract

Identifying the failure modes of cloud computing systems is a difficult and time-consuming task, due to the growing complexity of such systems, and the large volume and noisiness of failure data. This paper presents a novel approach for analyzing failure data from cloud systems, in order to relieve human analysts from manually fine-tuning the data for feature engineering. The approach leverages Deep Embedded Clustering (DEC), a family of unsupervised clustering algorithms based on deep learning, which uses an autoencoder to optimize data dimensionality and inter-cluster variance. We applied the approach in the context of the OpenStack cloud computing platform, both on the raw failure data and in combination with an anomaly detection pre-processing algorithm. The results show that the performance of the proposed approach, in terms of purity of clusters, is comparable to, or in some cases even better than manually fine-tuned clustering, thus avoiding the need for deep domain knowledge and reducing the effort to perform the analysis. In all cases, the proposed approach provides better performance than unsupervised clustering when no feature engineering is applied to the data. Moreover, the distribution of failure modes from the proposed approach is closer to the actual frequency of the failure modes.

Timing Covert Channel Analysis of the VxWorks MILS Embedded Hypervisor under the Common Criteria Security Certification

D. Cotroneo, Luigi De Simone, R. Natella
Elsevier Computers & Security (COSE), 2021

Abstract

Virtualization technology is nowadays adopted in security-critical embedded systems to achieve higher performance and more design flexibility. However, it also comes with new security threats, where attackers leverage timing covert channels to exfiltrate sensitive information from a partition using a trojan. This paper presents a novel approach for the experimental assessment of timing covert channels in embedded hypervisors, with a case study on security assessment of a commercial hypervisor product (Wind River VxWorks MILS), in cooperation with a licensed laboratory for the Common Criteria security certification. Our experimental analysis shows that it is indeed possible to establish a timing covert channel, and that the approach is useful for system designers for assessing that their configuration is robust against this kind of information leakage.

Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems

D. Cotroneo, Luigi De Simone, P. Liguori, R. Natella
IEEE Transactions on Dependable and Secure Computing (TDSC), 2020

Abstract

Cloud computing systems fail in complex and unexpected ways, due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are yet unknown. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system, to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes with a low computational cost.

Run-Time Detection of Protocol Bugs in Storage I/O Device Drivers

D. Cotroneo, Luigi De Simone, R. Natella
IEEE Transactions on Reliability (TR), 2018

Abstract

Protocol violation bugs in storage device drivers are a critical threat to data integrity, since these bugs can silently corrupt the commands and data flowing between the OS and storage devices. Due to their nature, these bugs are notoriously difficult to find by traditional testing. In this paper, we propose a run-time monitoring approach for storage device drivers, in order to detect I/O protocol violations that would otherwise silently escalate into corruption of users' data. The monitoring approach detects violations of I/O protocols by automatically learning a reference model from failure-free execution traces. The approach focuses on selected portions of the storage controller interface, in order to achieve a good trade-off in terms of low performance overhead and high coverage and accuracy of failure detection. We assess these properties on three real-world storage device drivers from the Linux kernel, through fault injection and stress tests. Moreover, we show that the monitoring approach requires only a few minutes of training workload, and that it is robust to differences between the operational and the training workloads.

NFV-Bench: A Dependability Benchmark for Network Function Virtualization Systems

D. Cotroneo, Luigi De Simone, R. Natella
IEEE Transactions on Network and Service Management (TNSM), 2017

Abstract

Network Function Virtualization (NFV) envisions the use of cloud computing and virtualization technology to reduce costs and innovate network services. However, this paradigm shift poses the question whether NFV will be able to fulfill the strict performance and dependability objectives required by regulations and customers. Thus, we propose a dependability benchmark to support NFV providers in making informed decisions about which virtualization, management, and application-level solutions can achieve the best dependability. We define in detail the use cases, measures, and faults to be injected. Moreover, we present a benchmarking case study on two alternative, production-grade virtualization solutions, namely VMware ESXi/vSphere (hypervisor-based) and Linux/Docker (container-based), on which we deploy an NFV-oriented IMS system. Despite the promise of higher performance and manageability, our experiments suggest that the container-based configuration can be less dependable than the hypervisor-based one, and point out which faults NFV designers should address to improve dependability.

Conference papers

IRIS: a Record and Replay Framework to Enable Hardware-assisted Virtualization Fuzzing

C. Cesarano; M. Cinque; D. Cotroneo; Luigi De Simone; G. Farina
Proc. 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2023), Porto, Portugal, June 27-30, 2023 (Preprint)

Abstract

Nowadays, industries are looking into virtualization as an effective means to build safe applications, thanks to the isolation it can provide among virtual machines (VMs) running on the same hardware. In this context, a fundamental issue is understanding to what extent the isolation is guaranteed, despite possible (or induced) problems in the virtualization mechanisms. Uncovering such isolation issues is still an open challenge, especially for hardware-assisted virtualization, since the search space should include all the possible VM states (and the linked hypervisor state), which is prohibitive. In this paper, we propose IRIS, a framework to record (learn) sequences of inputs (i.e., VM seeds) from the real guest execution (e.g., OS boot), replay them as-is to reach valid and complex VM states, and finally use them as valid seeds to be mutated for enabling fuzzing solutions for hardware-assisted hypervisors. We demonstrate the accuracy and efficiency of IRIS in automatically reproducing valid VM behaviors, with no need to execute guest workloads. We also provide a proof-of-concept fuzzer, based on the proposed architecture, showing its potential on the Xen hypervisor.

On Temporal Isolation Assessment in Virtualized Railway Signaling as a Service Systems

Cotroneo D., Luigi De Simone, Natella R.
In 2022 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) (pp. 1-5)

Abstract

Railway signaling systems provide numerous critical functions at different safety levels, to correctly implement the entire transport ecosystem. Today, we are witnessing the increasing use of cloud and virtualization technologies in such mixed-criticality systems, with the main goal of reducing costs and improving reliability, while providing orchestration capabilities. Unfortunately, virtualization raises several issues for assessing temporal isolation, which is critical for safety-related standards like EN50128. In this short paper, we envision leveraging the real-time flavor of a general-purpose hypervisor, like Xen, to build the Railway Signaling as a Service (RSaaS) systems of the future. We provide a preliminary background, highlighting the need for a systematic evaluation of the temporal isolation to demonstrate the feasibility of using general-purpose hypervisors in the safety-critical context for certification purposes.

Introducing k4.0s: a Model for Mixed-Criticality Container Orchestration in Industry 4.0

Barletta, M., Cinque, M., Luigi De Simone, Della Corte, R.
In 2022 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) (pp. 1-6)

Abstract

Time-predictable edge cloud is seen as the answer to many arising needs in Industry 4.0 environments, since it is able to provide flexible, modular, and reconfigurable services with low latency and reduced costs. Orchestration systems are becoming the core component of clouds since they take decisions on the placement and lifecycle of software components. Current solutions are starting to introduce support for real-time containers for time predictability; however, these approaches lack determinism as well as support for workloads requiring multiple levels of assurance/criticality. In this paper, we present k4.0s, an orchestration model for real-time and mixed-criticality environments, including timeliness, criticality and network requirements. The model leverages new abstractions for nodes and jobs, e.g., node assurance, and requires novel monitoring strategies. We sketch an implementation of the proposal based on Kubernetes, and present an experiment motivating the need for node assurance levels and adequate monitoring.

Performability assessment of containerized multi-tenant IMS through multidimensional UGF

Luigi De Simone, M. Di Mauro, R. Natella, F. Postiglione
Proc. 18th IEEE International Conference on Network and Service Management (CNSM 2022), 31 October - 4 November 2022, Thessaloniki, Greece

Abstract

We advance a performability assessment of a multi-tenant containerized IP Multimedia Subsystem (cIMS), i.e., a single infrastructure shared among different providers (or tenants). Specifically, we: i) model each cIMS node (a.k.a. Containerized Network Function - CNF) through the Multi-State System (MSS) formalism to capture the dimensionality of the multi-tenant arrangement, and characterize each tenant through queueing theory attributes to capture latency-dependent performance aspects; ii) carry out an availability analysis of cIMS by means of an extended version of the Universal Generating Function (UGF) technique, dubbed Multidimensional UGF (MUGF); iii) solve an optimization problem to retrieve the cIMS deployment minimizing costs while guaranteeing high availability requirements. The whole assessment is supported by an experiment based on the containerized IMS platform Clearwater, which we deploy to derive some realistic system parameters by means of fault injection techniques.

Achieving Isolation in Mixed-criticality Industrial Edge Systems with Real-time Containers

M. Barletta, M. Cinque, Luigi De Simone, R. Della Corte
34th Euromicro Conference on Real-Time Systems (ECRTS 2022), Modena, Italy, July 5-8, 2022

Abstract

Real-time containers are a promising solution to reduce latencies in time-sensitive cloud systems. Recent efforts are emerging to extend their usage in industrial edge systems with mixed-criticality constraints. In these contexts, isolation becomes a major concern: a disturbance (such as timing faults or unexpected overloads) affecting a container must not impact the behavior of other containers deployed on the same hardware. In this paper, we propose a novel architectural solution to achieve isolation in real-time containers, based on real-time co-kernels, hierarchical scheduling, and time-division networking. The architecture has been implemented on Linux patched with the Xenomai co-kernel, extended with a new hierarchical scheduling policy, named SCHED_DS, and integrating the RTNet stack. Experimental results are promising in terms of overhead and latency compared to other Linux-based solutions. More importantly, the isolation of containers is guaranteed even in the presence of severe co-located disturbances, such as faulty tasks (running longer than declared) or high CPU, network, or I/O stress on the same machine.

Towards Runtime Verification via Event Stream Processing in Cloud Computing Infrastructures

Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella
Proceedings of the International Workshop on Artificial Intelligence for IT Operations (AIOPS), PrePrint, Virtual Conference, 14 Dec 2020

Abstract

Software bugs in cloud management systems often cause erratic behavior, hindering the detection of and recovery from failures. As a consequence, the failures are not timely detected and notified, and can silently propagate through the system. To address these issues, we propose a lightweight approach to runtime verification, for monitoring and failure detection of cloud computing systems. We performed a preliminary evaluation of the proposed approach in the OpenStack cloud management platform, an “off-the-shelf” distributed system, showing that the approach can be applied with high failure detection coverage.

ProFIPy: Programmable Software Fault Injection as-a-Service

Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella
Proceedings of the 50th International Conference on Dependable Systems and Networks (DSN), PrePrint, June 29 - July 2, 2020, Valencia, Spain
Acceptance rate: 16.5% (48/291)

Abstract

In this paper, we present a new fault injection tool (ProFIPy) for Python software. The tool is designed to be programmable, in order to enable users to specify their software fault model, using a domain-specific language (DSL) for fault injection. Moreover, to achieve better usability, ProFIPy is provided as software-as-a-service and supports the user through the configuration of the faultload and workload, failure data analysis, and full automation of the experiments using container-based virtualization and parallelization.

Enhancing Failure Propagation Analysis in Cloud Computing Systems

D. Cotroneo, Luigi De Simone, P. Liguori, R. Natella and Nematollah Bidokhti
Proceedings of the 30th International Symposium on Software Reliability Engineering (ISSRE), PrePrint, 28-31 Oct. 2019, Berlin, Germany

Abstract

In order to plan for failure recovery, the designers of cloud systems need to understand how their system can potentially fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming, due to the large volume of events, non-determinism, and reuse of third-party components. To address these issues, we propose a novel approach that joins fault injection with anomaly detection to identify the symptoms of failures. We evaluated the proposed approach in the context of the OpenStack cloud computing platform. We show that our model can significantly improve the accuracy of failure analysis in terms of false positives and negatives, with a low computational cost.

Analyzing the context of bug-fixing changes in the OpenStack cloud computing platform

D. Cotroneo, Luigi De Simone, A.K. Iannillo, R. Natella, S. Rosiello and Nematollah Bidokhti
Proceedings of the 30th International Symposium on Software Reliability Engineering (ISSRE), PrePrint, 28-31 Oct. 2019, Berlin, Germany

Abstract

Many research areas in software engineering, such as mutation testing, automatic repair, fault localization, and fault injection, rely on empirical knowledge about recurring bug-fixing code changes. Previous studies in this field focus on what has been changed due to bug-fixes, such as in terms of code edit actions. However, such studies did not consider where the bug-fix change was made (i.e., the context of the change), but knowing about the context can potentially narrow the search space for many software engineering techniques (e.g., by focusing mutation only on specific parts of the software). Furthermore, most previous work on bug-fixing changes focused on C and Java projects, but there is little empirical evidence about Python software. Therefore, in this paper we perform a thorough empirical analysis of bug-fixing changes in three OpenStack projects, focusing on both the what and the where of the changes. We observed that all the recurring change patterns are not oblivious with respect to the surrounding code, but tend to occur in specific code contexts.

How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform

D. Cotroneo, Luigi De Simone, P. Liguori, R. Natella and Nematollah Bidokhti
Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 26-30 Aug. 2019, Tallinn, Estonia

Abstract

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not timely detected and notified; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which calls for more thorough run-time checks and fault containment.

Enhancing the Analysis of Error Propagation and Failure Modes in Cloud Systems

D. Cotroneo, Luigi De Simone, A. Di Martino, P. Liguori, R. Natella
Proceedings of the 29th International Symposium on Software Reliability Engineering (ISSRE), 15-18 Oct. 2018, Memphis, TN, USA

Abstract

We argue for novel techniques to understand how cloud systems can fail, by enhancing fault injection with distributed tracing and anomaly detection techniques.

MoIO: Run-time monitoring for I/O protocol violations in storage device drivers

D. Cotroneo, Luigi De Simone, F. Fucci, R. Natella
Proceedings of the 26th International Symposium on Software Reliability Engineering (ISSRE), Pages 472 - 483, 2-5 Nov. 2015, Gaithersburg, MD, USA

Abstract

Bugs affecting storage device drivers include the so-called protocol violation bugs, which silently corrupt data and commands exchanged with I/O devices. Protocol violations are very difficult to prevent, since testing device drivers is notoriously difficult. To address them, we present a monitoring approach for device drivers (MoIO) to detect I/O protocol violations at run-time. The approach infers a model of the interactions between the storage device driver, the OS kernel, and the hardware (the device driver protocol) by analyzing execution traces. The model is then used as a reference for detecting violations in production. The approach has been designed to have a low overhead and to overcome the lack of source code and protocol documentation. We show that the approach is feasible and effective by applying it on the SATA/AHCI storage device driver of the Linux kernel, and by performing fault injection and long-running tests.

Dependability Evaluation and Benchmarking of Network Function Virtualization Infrastructures

D. Cotroneo, Luigi De Simone, A.K. Iannillo, A. Lanzaro, R. Natella
Conference PapersProceedings of the 2015 1st IEEE Conference on Network Softwarization (NetSoft), Pages 1 - 9, 13-17 Apr. 2015, London, UK
BEST PAPER AWARD

Abstract

Network Function Virtualization (NFV) is an emerging solution that aims at improving the flexibility, the efficiency and the manageability of networks, by leveraging virtualization and cloud computing technologies to run network appliances in software. However, the “softwarization” of network functions raises reliability concerns, as they will be exposed to faults in commodity hardware and software components. In this paper, we propose a methodology for the dependability evaluation and benchmarking of NFV Infrastructures (NFVIs), based on fault injection. We discuss the application of the methodology in the context of a virtualized IP Multimedia Subsystem (IMS), and the pitfalls in the design of a reliable NFVI.

Magazine Papers

Virtualization Over Multiprocessor Systems-on-Chip: An Enabling Paradigm for the Industrial Internet of Things

A. Cilardo, M. Cinque, Luigi De Simone, N. Mazzocca
Magazine PapersIEEE Computer, vol. 55, no. 10, pp. 35-47, 2022.

Abstract

The next-generation Industrial Internet of Things (IIoT) requires smart devices featuring rich connectivity, local intelligence, and autonomous behavior. We review representative solutions, highlighting aspects that are the most relevant for integration in IIoT solutions.

Short Papers (Workshop, tool paper, etc.)

RunPHI: Enabling Mixed-criticality Containers via Partitioning Hypervisors in Industry 4.0

Barletta, M., Cinque, M., Luigi De Simone, Della Corte, R., Farina, G., Ottaviano, D.
Short PapersIn 2022 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW) (pp. 134-135)

Abstract

Orchestration systems are becoming a key component to automatically manage distributed computing resources in many fields with criticality requirements like Industry 4.0 (I4.0). However, they are mainly linked to OS-level virtualization, which is known to suffer from reduced isolation. In this paper, we propose RunPHI with the aim of integrating partitioning hypervisors, as a solution for assuring strong isolation, with OS-level orchestration systems. The purpose is to enable container orchestration in mixed-criticality systems with isolation requirements through partitioned containers.

Towards Assessing Isolation Properties in Partitioning Hypervisors

Cesarano, C., Cotroneo, D., Luigi De Simone
Short PapersIn 2022 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW) (pp. 193-200)

Abstract

Partitioning hypervisor solutions are becoming increasingly popular, to ensure stringent security and safety requirements related to isolation between co-hosted applications and to make more efficient use of available hardware resources. However, assessment and certification of isolation requirements remain a challenge and it is not trivial to understand what and how to test to validate these properties. Although the high-level requirements to be verified are mentioned in the different security- and safety-related standards, there is a lack of precise guidelines for the evaluator. This guidance should be comprehensive, generalizable to different products that implement partitioning, and tied specifically to lower-level requirements. The goal of this work is to provide a systematic framework that addresses this need.

Certify the uncertified: Towards assessment of virtualization for mixed-criticality in the automotive domain

Cinque M., Luigi De Simone, Marchetta A.
Short PapersIn 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W) (pp. 8-11)

Abstract

Nowadays, a feature-rich automotive vehicle offers several technologies to assist the driver during his trip and guarantee an amusing infotainment system to the other passengers, too. Consolidating worlds at different criticalities is a welcomed challenge for car manufacturers that have recently tried to leverage virtualization technologies due to reduced maintenance, deployment, and shipping costs. For this reason, more and more mixed-criticality systems are emerging, trying to assure compliance with the ISO 26262 Road Vehicle Safety standard. In this short paper, we provide a preliminary investigation of the certification capabilities for Jailhouse, a popular open-source partitioning hypervisor. To this aim, we propose a testing methodology and showcase the results, pointing out when the software gets to a faulting state, deviating from its expected behavior. The ultimate goal is to picture the right direction for the hypervisor towards a potential certification process.

Isolating Real-Time Safety-Critical Embedded Systems via SGX-based Lightweight Virtualization

Luigi De Simone, G. Mazzeo
Short PapersProceedings of the 30th International Symposium on Software Reliability Engineering (ISSRE), PrePrint, 28-31 Oct. 2019, Berlin, Germany

Abstract

A promising approach for designing critical embedded systems is based on virtualization technologies and multi-core platforms. These enable the deployment of both real-time and general-purpose systems with different criticalities in a single host. Integrating virtualization while also meeting the real-time and isolation requirements is non-trivial, and poses significant challenges especially in terms of certification. In recent years, researchers proposed hardware-assisted solutions to face issues coming from virtualization, and recently the use of Operating System (OS) virtualization as a more lightweight approach. Industries are hampered in leveraging this latter type of virtualization despite the clear benefits it introduces, such as reduced overhead, higher scalability, and effortless certification, since there is still a lack of approaches to address its drawbacks. In this position paper, we propose the usage of Intel's CPU security extension, namely SGX, to enable the adoption of enclaves based on unikernels, a flavor of OS-level virtualization, in the context of real-time systems. We present the advantages of leveraging both the SGX isolation and the unikernel features in order to meet the requirements of safety-critical real-time systems and ease the certification process.

A Configurable Software Aging Detection and Rejuvenation Agent for Android

D. Cotroneo, Luigi De Simone, R. Natella, R. Pietrantuono and Stefano Russo
Short PapersProceedings of the 30th International Symposium on Software Reliability Engineering (ISSRE), PrePrint, 28-31 Oct. 2019, Berlin, Germany

Abstract

This paper presents the design of ADaRTA, an aging detection and rejuvenation tool for Android. The tool is a software agent which i) performs selective monitoring of system processes and of trends in system performance indicators; ii) detects the aging state and estimates the time-to-aging-failure, through heuristic rules; iii) schedules and applies rejuvenation, based on the estimated time-to-aging-failure. The agent rules and parameters have been defined for ease of configuration and tuning by device designers. A stress testing experiment is discussed, showing ADaRTA’s configurability for the device under test, and its ability to detect the aging state to prevent the device from entering a failure state.

FailViz: A Tool for Visualizing Fault Injection Experiments in Distributed Systems

D. Cotroneo, Luigi De Simone, P. Liguori, R. Natella and Nematollah Bidokhti
Short PapersProceedings of the 15th European Dependable Computing Conference, 17-20 September 2019, Naples, Italy

Abstract

The analysis of fault injection experiments can be a cumbersome task. These experiments can generate large volumes of data (e.g., message traces), which a human analyst needs to inspect to understand the behavior of the system under failure. This paper introduces the FailViz tool for visualizing fault injection experiments, which points out relevant events for interpreting the failures. We also present a motivating example in the context of OpenStack, and point out future research directions.

Dependability Certification Guidelines for NFVIs through Fault Injection

D. Cotroneo, Luigi De Simone, R. Natella
Short PapersProceedings of the 29th International Symposium on Software Reliability Engineering (ISSRE), 15-18 Oct. 2018, Memphis, TN, USA

Abstract

Network Function Virtualization (NFV) is an emerging networking paradigm that offers new ways of creating, deploying, and managing networking services, by turning physical network functions into virtualized ones. The NFV paradigm heavily relies on cloud computing and virtualization technologies to provide carrier-grade services. The certification process of NFV systems is an open and critical question to ensure that the delivered network service provides specific guarantees about performance and dependability. In this paper, we propose potential guidelines for evaluating the reliability of NFV Infrastructures (NFVIs), with the aim of verifying whether NFVIs satisfy their reliability and performance requirements even in the presence of faults. The guidelines are described as a set of key practices to be followed, in terms of inputs, activities, and outputs. These practices are intended to be conducted by companies that want to evaluate the reliability of their NFVI against quantitative performance, availability, and fault tolerance objectives, and to get precise feedback on how to improve its fault tolerance.

Towards Fault Propagation Analysis in Cloud Computing Ecosystems

Luigi De Simone
Short Papers2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Pages 156 - 161, 3-6 Nov. 2014, Naples, Italy
BEST PRESENTATION AWARD

Abstract

Nowadays, Cloud Computing is a fundamental paradigm that provides computational resources as a service, on which users heavily rely. Cloud computing infrastructures behave as an ecosystem, where several actors play a crucial role. Unfortunately, Cloud Computing Ecosystems (CCEs) are often affected by outages, such as those experienced by Amazon Web Services in recent years, that result from component faults that propagate through the whole CCE. Thus, there is still a need for approaches to improve CCEs' reliability. This paper discusses both existing approaches and open challenges for the dependability evaluation of CCEs, and the need for novel techniques and methodologies to prevent fault propagation within CCEs as a whole.

Network Function Virtualization: Challenges and Directions for Reliability Assurance

D. Cotroneo, Luigi De Simone, A.K. Iannillo, A. Lanzaro, R. Natella, Jiang Fan, Wang Ping
Short Papers2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Pages 37 - 42, 3-6 Nov. 2014, Naples, Italy

Abstract

Network Function Virtualization (NFV) is an emerging solution that aims at improving the flexibility, the efficiency and the manageability of networks, by leveraging virtualization and cloud computing technologies to run network appliances in software. Nevertheless, the "softwarization" of network functions imposes software reliability concerns on future networks, which will be exposed to software issues arising from virtualization technologies. In this paper, we discuss the challenges for reliability in NFVIs, and present an industrial research project on their reliability assurance, which aims at developing novel fault injection technologies and systematic guidelines for this purpose.

Improving Usability of Fault Injection

D. Cotroneo, Luigi De Simone, A.K. Iannillo, A. Lanzaro, R. Natella
Short Papers2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Pages 530 - 532, 3-6 Nov. 2014, Naples, Italy

Abstract

The lack of tools that can fit in existing development practices and processes hampers the adoption of Software Fault Injection (SFI) in real-world projects. This paper presents an ongoing work towards an SFI tool integrated in the Eclipse IDE, and designed for usability.

Patents

Metodo di Iniezione Guasti e Relativo Sistema

R. Natella, D. Cotroneo, Luigi De Simone, S. Rosiello, N. Bidokhti
Italian Patent for Industrial Invention
Patent number: 102019000019807
Owners: R. Natella, D. Cotroneo
Inventors: R. Natella, D. Cotroneo, L. De Simone, S. Rosiello, N. Bidokhti
Date of filing: October 25, 2019
Date of patent: October 27, 2021

Abstract

A novel method and related system for injecting faults into a software application using a Domain Specific Language (DSL).

PhD Thesis

Dependability Benchmarking of Network Function Virtualization

Luigi De Simone
PhD Thesis
DOI: 10.6093/UNINA/FEDOA/11855

Abstract

Network Function Virtualization (NFV) is an emerging networking paradigm that aims to reduce costs and time-to-market, improve manageability, and foster competition and innovative services. NFV exploits virtualization and cloud computing technologies to turn physical network functions into Virtualized Network Functions (VNFs), which will be implemented in software, and will run as Virtual Machines (VMs) on commodity hardware located in high-performance data centers, namely Network Function Virtualization Infrastructures (NFVIs). The NFV paradigm relies on cloud computing and virtualization technologies to provide carrier-grade services, i.e., the ability of a service to be highly reliable and available, with fast and automatic failure recovery mechanisms. The availability of many virtualization solutions for NFV poses the question of which virtualization technology should be adopted for NFV, in order to fulfill the requirements described above. Currently, there are limited solutions for analyzing, in quantitative terms, the performance and reliability tradeoffs, which are important concerns for the adoption of NFV. This thesis deals with the assessment of the reliability and of the performance of NFV systems. It proposes a methodology, which includes context, measures, and faultloads, to conduct dependability benchmarks in NFV, according to the general principles of dependability benchmarking. To this aim, a fault injection framework has been designed and implemented for the virtualization technologies being used as case studies in this thesis. This framework is successfully used to conduct an extensive experimental campaign, where we compare two candidate virtualization technologies for NFV adoption: the commercial, hypervisor-based virtualization platform VMware vSphere, and the open-source, container-based virtualization platform Docker.
These technologies are assessed in the context of a high-availability, NFV-oriented IP Multimedia Subsystem (IMS). The analysis of experimental results reveals that i) fault management mechanisms are crucial in NFV, in order to provide accurate failure detection and start the subsequent failover actions, and ii) fault injection proves to be a valuable way to introduce uncommon scenarios in the NFVI, which can be fundamental to provide a highly reliable service in production.