ThorFI: A Novel Approach for Network Fault Injection as a Service

D. Cotroneo, Luigi De Simone, R. Natella
Journal Papers: Elsevier Journal of Network and Computer Applications, 2022

Abstract

In this work, we present a novel fault injection solution (ThorFI) for virtual networks in cloud computing infrastructures. ThorFI is designed to provide non-intrusive fault injection capabilities for a cloud tenant, and to prevent injections from interfering with other tenants on the infrastructure. We present the solution in the context of the OpenStack cloud management platform, and release this implementation as open-source software. Finally, we present two relevant case studies of ThorFI, on an NFV IMS and on a high-availability cloud application, respectively. The case studies show that ThorFI can enhance functional tests with fault injection, as in 4%-34% of the test cases the IMS is unable to handle faults; and that despite redundancy in virtual networks, faults in one virtual network segment can propagate to other segments, and can affect the throughput and response time of the cloud application as a whole, by about 3 times in the worst case.

Virtualization over Multiprocessor System-on-Chip: an Enabling Paradigm for Industrial IoT

A. Cilardo, M. Cinque, Luigi De Simone, N. Mazzocca
Journal Papers: IEEE Computer, 2022

Abstract

The next-generation Industrial Internet of Things (IIoT) inherently requires smart devices featuring rich connectivity, local intelligence, and autonomous behavior. Emerging Multiprocessor System-on-Chip (MPSoC) platforms along with comprehensive support for virtualization will represent two key building blocks for smart devices in future IIoT edge infrastructures. We review representative existing solutions, highlighting the aspects that are most relevant for integration in IIoT solutions. From the analysis, we derive a reference architecture for a general virtualization-ready edge IIoT node. We then analyze the implications and benefits for a concrete use case scenario and identify the crucial research challenges to be faced to bridge the gap towards full support for virtualization-ready IIoT nodes.

Virtualizing Mixed-Criticality Systems: A Survey on Industrial Trends and Issues

M. Cinque, D. Cotroneo, Luigi De Simone, S. Rosiello
Journal Papers: Future Generation Computer Systems (FGCS), 2021

Abstract

Virtualization is gaining traction in the industry as it promises a flexible way to integrate, manage, and re-use heterogeneous software components with mixed-criticality levels, on a shared hardware platform, while obtaining isolation guarantees. This work surveys the state-of-the-practice of real-time virtualization technologies by discussing common issues in the industry. In particular, we analyze how different virtualization approaches and solutions can impact isolation guarantees and testing/certification activities, and how they deal with dependability challenges. The aim is to highlight current industry trends and to help industrial practitioners choose the most suitable solution according to their application domains.

Software micro-rejuvenation for Android mobile systems

D. Cotroneo, Luigi De Simone, R. Natella, R. Pietrantuono, S. Russo
Journal Papers: Journal of Systems and Software (JSS), 2021


Enhancing the Analysis of Software Failures in Cloud Computing Systems with Deep Learning

D. Cotroneo, Luigi De Simone, P. Liguori, R. Natella
Journal Papers: Journal of Systems and Software (JSS), 2021

Abstract

Identifying the failure modes of cloud computing systems is a difficult and time-consuming task, due to the growing complexity of such systems, and the large volume and noisiness of failure data. This paper presents a novel approach for analyzing failure data from cloud systems, in order to relieve human analysts from manually fine-tuning the data for feature engineering. The approach leverages Deep Embedded Clustering (DEC), a family of unsupervised clustering algorithms based on deep learning, which uses an autoencoder to optimize data dimensionality and inter-cluster variance. We applied the approach in the context of the OpenStack cloud computing platform, both on the raw failure data and in combination with an anomaly detection pre-processing algorithm. The results show that the performance of the proposed approach, in terms of purity of clusters, is comparable to, or in some cases even better than manually fine-tuned clustering, thus avoiding the need for deep domain knowledge and reducing the effort to perform the analysis. In all cases, the proposed approach provides better performance than unsupervised clustering when no feature engineering is applied to the data. Moreover, the distribution of failure modes from the proposed approach is closer to the actual frequency of the failure modes.
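
The abstract above evaluates clustering quality in terms of cluster purity. As a rough illustration of that metric (a minimal sketch using the standard definition of purity, not code from the paper; the cluster IDs and failure-mode labels below are made-up sample data), purity can be computed as the fraction of samples that fall under the majority label of their cluster:

```python
from collections import Counter


def cluster_purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority true label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("inputs must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")

    # Count how often each true label appears inside each cluster.
    per_cluster = {}
    for cid, label in zip(cluster_ids, true_labels):
        per_cluster.setdefault(cid, Counter())[label] += 1

    # Sum the size of the dominant label in every cluster.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical data: three clusters, three failure modes.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["timeout", "timeout", "crash", "crash", "crash", "hang", "hang", "timeout"]
print(cluster_purity(clusters, labels))  # 0.75
```

A purity of 1.0 means every cluster contains a single failure mode; lower values indicate clusters that mix several modes.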

Timing Covert Channel Analysis of the VxWorks MILS Embedded Hypervisor under the Common Criteria Security Certification

D. Cotroneo, Luigi De Simone, R. Natella
Journal Papers: Elsevier Computers & Security (COSE), 2021

Abstract

Virtualization technology is nowadays adopted in security-critical embedded systems to achieve higher performance and more design flexibility. However, it also comes with new security threats, where attackers leverage timing covert channels to exfiltrate sensitive information from a partition using a trojan. This paper presents a novel approach for the experimental assessment of timing covert channels in embedded hypervisors, with a case study on security assessment of a commercial hypervisor product (Wind River VxWorks MILS), in cooperation with a licensed laboratory for the Common Criteria security certification. Our experimental analysis shows that it is indeed possible to establish a timing covert channel, and that the approach is useful for system designers for assessing that their configuration is robust against this kind of information leakage.

Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems

D. Cotroneo, Luigi De Simone, P. Liguori, R. Natella
Journal Papers: IEEE Transactions on Dependable and Secure Computing (TDSC), 2020

Abstract

Cloud computing systems fail in complex and unexpected ways, due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are yet unknown. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system, to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes with a low computational cost.

Run-Time Detection of Protocol Bugs in Storage I/O Device Drivers

D. Cotroneo, Luigi De Simone, R. Natella
Journal Papers: IEEE Transactions on Reliability (TR), 2018

Abstract

Protocol violation bugs in storage device drivers are a critical threat to data integrity, since these bugs can silently corrupt the commands and data flowing between the OS and storage devices. Due to their nature, these bugs are notoriously difficult to find by traditional testing. In this paper, we propose a run-time monitoring approach for storage device drivers, in order to detect I/O protocol violations that would otherwise silently escalate into corruption of users' data. The monitoring approach detects violations of I/O protocols by automatically learning a reference model from failure-free execution traces. The approach focuses on selected portions of the storage controller interface, in order to achieve a good trade-off in terms of low performance overhead and high coverage and accuracy of failure detection. We assess these properties on three real-world storage device drivers from the Linux kernel, through fault injection and stress tests. Moreover, we show that the monitoring approach requires only a few minutes of training workload, and that it is robust to differences between the operational and the training workloads.
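
The idea of learning a reference model from failure-free traces and flagging deviations at run time can be sketched as follows. This is only a simplified illustration: the abstract does not describe the model used in the paper, so the sketch assumes the simplest possible one (the set of consecutive event pairs seen during training), and the event names are hypothetical.

```python
def learn_reference_model(traces):
    """Collect every (previous_event, next_event) pair seen in failure-free traces."""
    allowed = set()
    for trace in traces:
        prev = "START"  # synthetic marker for the beginning of a trace
        for event in trace:
            allowed.add((prev, event))
            prev = event
    return allowed


def find_violations(model, trace):
    """Return (index, previous_event, event) for each transition absent from the model."""
    violations = []
    prev = "START"
    for index, event in enumerate(trace):
        if (prev, event) not in model:
            violations.append((index, prev, event))
        prev = event
    return violations


# Hypothetical failure-free traces of driver/device interactions.
training = [
    ["issue_cmd", "dma_start", "irq", "complete"],
    ["issue_cmd", "dma_start", "irq", "complete",
     "issue_cmd", "dma_start", "irq", "complete"],
]
model = learn_reference_model(training)

# A monitored trace in which the DMA step is skipped.
suspect = ["issue_cmd", "irq", "complete"]
print(find_violations(model, suspect))  # [(1, 'issue_cmd', 'irq')]
```

A pair-based model like this one is cheap to check but cannot capture longer-range ordering constraints; a richer model would trade overhead for coverage.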

NFV-Bench: A Dependability Benchmark for Network Function Virtualization Systems

D. Cotroneo, Luigi De Simone, R. Natella
Journal Papers: IEEE Transactions on Network and Service Management (TNSM), 2017

Abstract

Network Function Virtualization (NFV) envisions the use of cloud computing and virtualization technology to reduce costs and innovate network services. However, this paradigm shift poses the question of whether NFV will be able to fulfill the strict performance and dependability objectives required by regulations and customers. Thus, we propose a dependability benchmark to support NFV providers in making informed decisions about which virtualization, management, and application-level solutions can achieve the best dependability. We define in detail the use cases, measures, and faults to be injected. Moreover, we present a benchmarking case study on two alternative, production-grade virtualization solutions, namely VMware ESXi/vSphere (hypervisor-based) and Linux/Docker (container-based), on which we deploy an NFV-oriented IMS system. Despite the promise of higher performance and manageability, our experiments suggest that the container-based configuration can be less dependable than the hypervisor-based one, and point out which faults NFV designers should address to improve dependability.

Towards Runtime Verification via Event Stream Processing in Cloud Computing Infrastructures

Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella
Conference Papers: Proceedings of the International Workshop on Artificial Intelligence for IT Operations (AIOPS), Virtual Conference, 14 Dec 2020

Abstract

Software bugs in cloud management systems often cause erratic behavior, hindering the detection of and recovery from failures. As a consequence, failures are not detected and reported in a timely manner, and can silently propagate through the system. To face these issues, we propose a lightweight approach to runtime verification, for monitoring and failure detection of cloud computing systems. We performed a preliminary evaluation of the proposed approach in the OpenStack cloud management platform, an “off-the-shelf” distributed system, showing that the approach can be applied with high failure detection coverage.

ProFIPy: Programmable Software Fault Injection as-a-Service

Domenico Cotroneo, Luigi De Simone, Pietro Liguori, Roberto Natella
Conference Papers: Proceedings of the 50th International Conference on Dependable Systems and Networks (DSN), June 29 - July 2, 2020, Valencia, Spain
Acceptance rate: 16.5% (48/291)

Abstract

In this paper, we present a new fault injection tool (ProFIPy) for Python software. The tool is designed to be programmable, in order to enable users to specify their software fault model, using a domain-specific language (DSL) for fault injection. Moreover, to achieve better usability, ProFIPy is provided as software-as-a-service and supports the user through the configuration of the faultload and workload, failure data analysis, and full automation of the experiments using container-based virtualization and parallelization.

Isolating Real-Time Safety-Critical Embedded Systems via SGX-based Lightweight Virtualization

Luigi De Simone, G. Mazzeo
Conference Papers: Proceedings of the 30th International Symposium on Software Reliability Engineering (ISSRE), 28-31 Oct. 2019, Berlin, Germany

Abstract

A promising approach for designing critical embedded systems is based on virtualization technologies and multi-core platforms. These enable the deployment of both real-time and general-purpose systems with different criticalities in a single host. Integrating virtualization while also meeting the real-time and isolation requirements is non-trivial, and poses significant challenges especially in terms of certification. In recent years, researchers proposed hardware-assisted solutions to face issues coming from virtualization, and recently the use of Operating System (OS) virtualization as a more lightweight approach. Industry is hampered in leveraging this latter type of virtualization, despite the clear benefits it introduces (such as reduced overhead, higher scalability, and effortless certification), since there is still a lack of approaches to address its drawbacks. In this position paper, we propose the usage of Intel's CPU security extension, namely SGX, to enable the adoption of enclaves based on unikernels, a flavor of OS-level virtualization, in the context of real-time systems. We present the advantages of leveraging both the SGX isolation and the unikernel features in order to meet the requirements of safety-critical real-time systems and ease the certification process.

A Configurable Software Aging Detection and Rejuvenation Agent for Android

D. Cotroneo, Luigi De Simone, R. Natella, R. Pietrantuono and Stefano Russo
Conference Papers: Proceedings of the 30th International Symposium on Software Reliability Engineering (ISSRE), 28-31 Oct. 2019, Berlin, Germany

Abstract

This paper presents the design of ADaRTA, an aging detection and rejuvenation tool for Android. The tool is a software agent which i) performs selective monitoring of system processes and of trends in system performance indicators; ii) detects the aging state and estimates the time-to-aging-failure, through heuristic rules; iii) schedules and applies rejuvenation, based on the estimated time-to-aging-failure. The agent rules and parameters have been defined for ease of configuration and tuning by device designers. A stress testing experiment is discussed, showing ADaRTA’s configurability for the device under test, and its ability to detect the aging state in order to prevent the device from entering a failure state.
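
The monitor / estimate / schedule loop described above can be illustrated with a small sketch. The abstract does not specify ADaRTA's heuristic rules, so this example assumes a deliberately simple one: fit a least-squares line to a monitored indicator and extrapolate to a failure threshold. The function names, the threshold, and the sample values are all hypothetical.

```python
def seconds_until_threshold(samples, threshold):
    """Extrapolate a linear trend over (time_s, value) samples.

    Returns the estimated seconds from the last sample until the value
    reaches `threshold`, or None if the trend is not increasing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")

    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # no aging trend detected
    intercept = mean_v - slope * mean_t

    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - samples[-1][0])


def should_rejuvenate(samples, threshold, lead_time_s):
    """Assumed rule: rejuvenate once the estimated failure is within lead_time_s."""
    remaining = seconds_until_threshold(samples, threshold)
    return remaining is not None and remaining <= lead_time_s


# Hypothetical memory usage (MB) of a system process, sampled once per minute.
memory_mb = [(0, 100), (60, 110), (120, 120), (180, 130)]
print(seconds_until_threshold(memory_mb, threshold=200))
print(should_rejuvenate(memory_mb, threshold=200, lead_time_s=600))
```

With these samples the indicator grows by 10 MB per minute, so the estimate is about 420 seconds and a 600-second lead time triggers rejuvenation.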

Enhancing Failure Propagation Analysis in Cloud Computing Systems

D. Cotroneo, Luigi De Simone, P. Liguori, R. Natella and Nematollah Bidokhti
Conference Papers: Proceedings of the 30th International Symposium on Software Reliability Engineering (ISSRE), 28-31 Oct. 2019, Berlin, Germany

Abstract

In order to plan for failure recovery, the designers of cloud systems need to understand how their system can potentially fail. Unfortunately, analyzing the failure behavior of such systems can be very difficult and time-consuming, due to the large volume of events, non-determinism, and reuse of third-party components. To address these issues, we propose a novel approach that joins fault injection with anomaly detection to identify the symptoms of failures. We evaluated the proposed approach in the context of the OpenStack cloud computing platform. We show that our model can significantly improve the accuracy of failure analysis in terms of false positives and negatives, with a low computational cost.

Analyzing the context of bug-fixing changes in the OpenStack cloud computing platform

D. Cotroneo, Luigi De Simone, A.K. Iannillo, R. Natella, S. Rosiello and Nematollah Bidokhti
Conference Papers: Proceedings of the 30th International Symposium on Software Reliability Engineering (ISSRE), 28-31 Oct. 2019, Berlin, Germany

Abstract

Many research areas in software engineering, such as mutation testing, automatic repair, fault localization, and fault injection, rely on empirical knowledge about recurring bug-fixing code changes. Previous studies in this field focus on what has been changed due to bug-fixes, such as in terms of code edit actions. However, such studies did not consider where the bug-fix change was made (i.e., the context of the change), but knowing about the context can potentially narrow the search space for many software engineering techniques (e.g., by focusing mutation only on specific parts of the software). Furthermore, most previous work on bug-fixing changes focused on C and Java projects, but there is little empirical evidence about Python software. Therefore, in this paper we perform a thorough empirical analysis of bug-fixing changes in three OpenStack projects, focusing on both the what and the where of the changes. We observed that all the recurring change patterns are not oblivious with respect to the surrounding code, but tend to occur in specific code contexts.

FailViz: A Tool for Visualizing Fault Injection Experiments in Distributed Systems

D. Cotroneo, Luigi De Simone, P. Liguori, R. Natella and Nematollah Bidokhti
Conference Papers: Proceedings of the 15th European Dependable Computing Conference, 17-20 September 2019, Naples, Italy

Abstract

The analysis of fault injection experiments can be a cumbersome task. These experiments can generate large volumes of data (e.g., message traces), which a human analyst needs to inspect to understand the behavior of the system under failure. This paper introduces the FailViz tool for visualizing fault injection experiments, which points out relevant events for interpreting the failures. We also present a motivating example in the context of OpenStack, and point out future research directions.

How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform

D. Cotroneo, Luigi De Simone, P. Liguori, R. Natella and Nematollah Bidokhti
Conference Papers: Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 26-30 Aug. 2019, Tallinn, Estonia

Abstract

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not detected and reported in a timely manner; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which calls for more thorough run-time checks and fault containment.

Dependability Certification Guidelines for NFVIs through Fault Injection

D. Cotroneo, Luigi De Simone, R. Natella
Conference Papers: Proceedings of the 29th International Symposium on Software Reliability Engineering (ISSRE), 15-18 Oct. 2018, Memphis, TN, USA

Abstract

Network Function Virtualization (NFV) is an emerging networking paradigm that offers new ways of creating, deploying, and managing networking services, by turning physical network functions into virtualized ones. The NFV paradigm heavily relies on cloud computing and virtualization technologies to provide carrier-grade services. The certification process of NFV systems is an open and critical question to ensure that the delivered network service provides specific guarantees about performance and dependability. In this paper, we propose potential guidelines for evaluating the reliability of NFV Infrastructures (NFVIs), with the aim of verifying whether NFVIs satisfy their reliability and performance requirements even in the presence of faults. The guidelines are described as a set of key practices to be followed, in terms of inputs, activities, and outputs. These practices are intended to be conducted by companies that want to evaluate the reliability of their NFVI against quantitative performance, availability, and fault tolerance objectives, and to get precise feedback on how to improve its fault tolerance.

Enhancing the Analysis of Error Propagation and Failure Modes in Cloud Systems

D. Cotroneo, Luigi De Simone, A. Di Martino, P. Liguori, R. Natella
Conference Papers: Proceedings of the 29th International Symposium on Software Reliability Engineering (ISSRE), 15-18 Oct. 2018, Memphis, TN, USA

Abstract

We argue for novel techniques to understand how cloud systems can fail, by enhancing fault injection with distributed tracing and anomaly detection techniques.

MoIO: Run-time monitoring for I/O protocol violations in storage device drivers

D. Cotroneo, Luigi De Simone, F. Fucci, R. Natella
Conference Papers: Proceedings of the 26th International Symposium on Software Reliability Engineering (ISSRE), Pages 472 - 483, 2-5 Nov. 2015, Gaithersburg, MD, USA

Abstract

Bugs affecting storage device drivers include the so-called protocol violation bugs, which silently corrupt data and commands exchanged with I/O devices. Protocol violations are very difficult to prevent, since testing device drivers is notoriously difficult. To address them, we present a monitoring approach for device drivers (MoIO) to detect I/O protocol violations at run-time. The approach infers a model of the interactions between the storage device driver, the OS kernel, and the hardware (the device driver protocol) by analyzing execution traces. The model is then used as a reference for detecting violations in production. The approach has been designed to have a low overhead and to overcome the lack of source code and protocol documentation. We show that the approach is feasible and effective by applying it on the SATA/AHCI storage device driver of the Linux kernel, and by performing fault injection and long-running tests.

Dependability Evaluation and Benchmarking of Network Function Virtualization Infrastructures

D. Cotroneo, Luigi De Simone, A.K. Iannillo, A. Lanzaro, R. Natella
Conference Papers: Proceedings of the 2015 1st IEEE Conference on Network Softwarization (NetSoft), Pages 1 - 9, 13-17 Apr. 2015, London, UK
BEST PAPER AWARD

Abstract

Network Function Virtualization (NFV) is an emerging solution that aims at improving the flexibility, the efficiency and the manageability of networks, by leveraging virtualization and cloud computing technologies to run network appliances in software. However, the “softwarization” of network functions raises reliability concerns, as they will be exposed to faults in commodity hardware and software components. In this paper, we propose a methodology for the dependability evaluation and benchmarking of NFV Infrastructures (NFVIs), based on fault injection. We discuss the application of the methodology in the context of a virtualized IP Multimedia Subsystem (IMS), and the pitfalls in the design of a reliable NFVI.

Towards Fault Propagation Analysis in Cloud Computing Ecosystems

Luigi De Simone
Conference Papers: 2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Pages 156 - 161, 3-6 Nov. 2014, Naples, Italy
BEST PRESENTATION AWARD

Abstract

Nowadays, Cloud Computing is a fundamental paradigm that provides computational resources as a service, on which users heavily rely. Cloud computing infrastructures behave as an ecosystem, where several actors play a crucial role. Unfortunately, Cloud Computing Ecosystems (CCEs) are often affected by outages, such as those experienced by Amazon Web Services in recent years, that result from component faults that propagate through the whole CCE. Thus, there is still a need for approaches to improve CCEs' reliability. This paper discusses both existing approaches and open challenges for the dependability evaluation of CCEs, and the need for novel techniques and methodologies to prevent fault propagation within CCEs as a whole.

Network Function Virtualization: Challenges and Directions for Reliability Assurance

D. Cotroneo, Luigi De Simone, A.K. Iannillo, A. Lanzaro, R. Natella, Jiang Fan, Wang Ping
Conference Papers: 2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Pages 37 - 42, 3-6 Nov. 2014, Naples, Italy

Abstract

Network Function Virtualization (NFV) is an emerging solution that aims at improving the flexibility, the efficiency and the manageability of networks, by leveraging virtualization and cloud computing technologies to run network appliances in software. Nevertheless, the "softwarization" of network functions imposes software reliability concerns on future networks, which will be exposed to software issues arising from virtualization technologies. In this paper, we discuss the challenges for reliability in NFVIs, and present an industrial research project on their reliability assurance, which aims at developing novel fault injection technologies and systematic guidelines for this purpose.

Improving Usability of Fault Injection

D. Cotroneo, Luigi De Simone, A.K. Iannillo, A. Lanzaro, R. Natella
Conference Papers: 2014 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), Pages 530 - 532, 3-6 Nov. 2014, Naples, Italy

Abstract

The lack of tools that can fit in existing development practices and processes hampers the adoption of Software Fault Injection (SFI) in real-world projects. This paper presents ongoing work towards an SFI tool integrated in the Eclipse IDE and designed for usability.