Home Page of Stefano Rosiello

DRACO: Distributed Resource-aware Admission Control for Large-Scale, Multi-Tier Systems

D. Cotroneo, R. Natella and S. Rosiello

Journal PapersJournal of Parallel and Distributed Computing, Volume 192, October 2024, issn 0743-7315

Abstract

Modern distributed systems are designed to manage overload conditions, by throttling the traffic in excess that cannot be served through overload control techniques. However, the adoption of large-scale NoSQL datastores make systems vulnerable to unbalanced overloads, where specific datastore nodes are overloaded because of hot-spot resources and hogs. In this paper, we propose DRACO, a novel overload control solution that is aware of data dependencies between the application and the datastore tiers. DRACO performs selective admission control of application requests, by only dropping the ones that map to resources on overloaded datastore nodes, while achieving high resource utilization on non-overloaded datastore nodes. We evaluate DRACO on two case studies with high availability and performance requirements, a virtualized IP Multimedia Subsystem and a distributed fileserver. Results show that the solution can achieve high performance and resource utilization even under extreme overload conditions, up to 100x the engineered capacity.

Workflow for the Assessment of ITER Plasma Control System Design

L. Pangione, T. Ravensbergen, L. Zabeo, P. de Vries, G. De Tommasi, M. Cinque, S. Rosiello

Journal PapersIEEE Transactions on Plasma Science, Apr 2024, issn 1939-9375

Abstract

The plasma control system (PCS) for the first nonactive ITER operation phase will require simultaneous active monitoring and control of many continuous and discrete quantities. Considering the unique challenges ITER will face, all the controllers will be integrated and deployed with very little experimental time dedicated to PCS tuning and development. In order to maximize the efficiency of the ITER PCS design, a formal system engineering approach has been adopted. In a simplified way, the design process starts with the definition of the requirements. Functionalities are then designed and developed in order to meet these requirements. As a last step in the design process, it is important to assess that all the designed functionalities meet the associated requirements and that all the requirements are covered. The many different control functions will be designed and implemented in ITER PCS simulation platform (PCSSP) by different designing teams, both internal and external to ITER Organization. Although each team will be responsible for the independent assessment of the modules they deliver, an extra step is, nevertheless, necessary to guarantee that all the modules still continue to work when connected together. Therefore, integrated assessments will be built from independent assessments and will prove the controllers continue to meet the requirements. For this reason, it is necessary to have a unified workflow for the assessments performed by all the different designing teams. In fact, in order to guarantee a smooth integration assessment, it is important that all the assessments follow the same rules, use the same tools, are provided with the correct information, and are performed on the same platform. In this article, we present the proposed assessment workflow for ITER PCS components and some early impressions gathered from assessments of first delivered modules.

Strategy to Systematically Design and Deploy the Iter Plasma Control System: A System Engineering and Model-Based Design Approach

P.C. de Vries, M. Cinque, G. De Tommasi, W. Treutterer, D. Humphreys, M. Walker, F. Felici, I. Gomez, L. Zabeo, T. Ravensbergen, L. Pangione, F. Rimini, S. Rosiello, Y. Gribov, M. Dubrov, A. Vu, I. Carvalho, W.R. Lee, T. Tak, A. Zagar, R. Gunion, R. Pitts, M. Mattei, A. Pironti, M. Ariola, F. Pesamosca, O. Kudlacek, G. Raupp, G. Pautasso, R. Nouailletas, Ph. Moreau, D. Weldon

Journal PapersFusion Engineering and Design, July 2024, Volume 204, issn 114464

Abstract

The paper details the process of developing the ITER Plasma Control System (PCS), that is, how to design and deploy it systematically, in the most efficient and effective manner. The integrated nature of the ITER PCS, with its multitude of coupled control functions, and its long-term development, calls for a different approach than the design and short-term deployment of individual controllers. It requires, in the first place, a flexible implementation strategy and system architecture that allows system re-configuration and optimization throughout its development. Secondly, a model-based system engineering approach is carried out, for the complete PCS development, i.e. both its design and deployment. It requires clear definitions for both the PCS role and its functionality, as well as definitions of the design and deployment process itself. The design and deployment process is shown to allow tracing the relationships of the many individual design and deployment aspects, such as system requirements, assumed operation use-cases and response models, and eventually verification and functional validation of the system design. The functional validation will make use of a dedicated PCS simulation platform that includes the description of the control function design as well as plant, actuator and sensor models that enable the simulation of these functions. By establishing a clear understanding of the interconnected steps involved in designing, implementing, commissioning, and operating the system, a more systematic approach is achieved. This ensures the completion of a comprehensive design that can be deployed efficiently.

Workflow for the assessment of ITER Plasma Control System design for PFPO-1 phase

L. Pangione, T. Ravensbergen, L. Zabeo, P. de Vries, G. De Tommasi, M. Cinque, S. Rosiello

Conference PapersProc. 30th IEEE Symposium on Fusion Engineering (SOFE 2023), Oxford, UK, July 2023.

Abstract

The Plasma Control System (PCS) for the ITER PFPO-1 phase will require simultaneous active monitoring and control of many continuous and discrete quantities. Considering the unique challenges ITER will face, all the controllers will be integrated and deployed with very little experimental time dedicated to PCS tuning and development. In order to maximize the efficiency of the ITER PCS design, a formal system engineering approach has been adopted. In a simplified way, the design process starts with the definition of the requirements. Functionalities are then designed and developed in order to meet these requirements. As a last step in the design process, it is important to assess that all the designed functionalities meet the associated requirements and that all the requirements are covered. The multiple controllers will be designed and implemented in ITER PCS Simulation Platform by different designing teams, both internal and external to ITER Organization. Although each team will be responsible for the independent assessment of the modules they deliver, an extra step is nevertheless necessary to guarantee that all the modules still continue to work when connected together. Therefore, integrated assessments will be built from independent assessments and will prove the controllers continue to meet the requirements. For this reason, it is necessary to have a unified workflow for the assessments performed by all the different teams. In fact, in order to guarantee a smooth integration assessment, it is important that all the assessments follow the same rules, use the same tools, are provided with the correct information, and are performed on the same platform. In this paper, we present the proposed assessment workflow for ITER PCS components and some early impressions gathered from assessments of first delivered modules.

System-Engineering approach for the ITER PCS design: The correction coils current controller case study

G. De Tommasi, M. Cinque, D. Ottaviano, A. Pironti, S. Rosiello , F. Villone

Journal PapersFusion Engineering and Design, Volume 185, Oct 2022, issn 0920-3796

Abstract

The Plasma Control System (PCS) is in charge of robustly controlling the evolution of plasma parameters against model uncertainties and disturbances, with the aim of achieving the envisaged goals and performance. The PCS design process follows a System-Engineering approach to support all the design phases, from control algorithms specifications to the verification and validation tests for various components. A case study concerning the design of the Correction Coils Current Controllers is presented in this paper. The aim is to show on a smaller scale the approach that is applied to the entire design of the PCS, by highlighting its effectiveness, from the refinement of the requirements, up to their validation in the simulation environment.

An unsupervised approach to discover filtering rules from diagnostic logs

M. Cinque, R. Della Corte, G. Farina and S. Rosiello

Conference PapersProc. 33rd IEEE International Symposium on Software Reliability Engineering, October 2022, Charlotte, North Carolina

Abstract

Diagnostic logs represent the main source of information about the system runtime. However, the presence of faults typically leads to multiple errors propagating within system components, which requires analysts to dig into cascading messages for root cause analysis. This is exacerbated in complex systems, such as railway systems, composed by several devices generating high amount of logs. Filtering allows dealing with large data volumes, leading practitioners to focus on interesting events, i.e., events that should be further investigated by analysts. This paper proposes an unsupervised approach to discover filtering rules from diagnostic logs. The approach automatically infers potential events correlations, representing them as fault-trees enriched with scores. Trees define filtering rules highlighting the interesting events, while scores allow prioritizing their analysis. The approach has been applied in a preliminary railway case study, which encompasses more than 710k events generated by on-board train equipment during operation.

AID4TRAIN: Artificial Intelligence-based Diagnostics for TRAins and INdustry 4.0

M. Cinque, R. Della Corte, G. Farina and S. Rosiello

Conference PapersDependable Computing – EDCC 2022 Workshops. EDCC 2022. Communications in Computer and Information Science, vol 1656. Springer, Cham, September 2022, Zaragoza, Spain

Abstract

Diagnostic data logs generated by systems components represent the main source of information about the system run-time behavior. However, as faults typically lead to multiple reported errors that propagate to other components, the analysts’ work is hardened by digging in cascading diagnostic messages. Root cause analysis can help to pinpoint faults from the failures occurred during system operation but it is unpractical for complex systems, especially in the context of Industry 4.0 and Railway domains, where smart control devices continuously generate high amount of logs. The AID4TRAIN project aims to improve root cause analysis in both Industry 4.0 and Railway domains leveraging AI techniques to automatically infer a fault model of the target system from historical diagnostic data, which can be integrated with the system experts knowledge. The resulting model is then leveraged to create log filtering rules to be applied on previously unseen diagnostic data to identify the root cause of the occurred problem. This paper introduces the AID4TRAIN framework and its implementation at the current project stage. Further, a preliminary case study in the railway domain is presented.

Virtualizing mixed-criticality systems: A survey on industrial trends and issues

M. Cinque, D. Cotroneo, L. De Simone, S. Rosiello

Journal PapersFuture Generation Computer Systems, Volume 129, April 2022, Pages 315-330

Abstract

Virtualization is gaining attraction in the industry as it promises a flexible way to integrate, manage, and re-use heterogeneous software components with mixed-criticality levels, on a shared hardware platform, while obtaining isolation guarantees. This work surveys the state-of-the-practice of real-time virtualization technologies by discussing common issues in the industry. In particular, we analyze how different virtualization approaches and solutions can impact isolation guarantees and testing/certification activities, and how they deal with dependability challenges. The aim is to highlight current industry trends and support industrial practitioners to choose the most suitable solution according to their application domains.

Dependability Evaluation of Middleware Technology for Large-scale Distributed Caching

D. Cotroneo, R. Natella, S. Rosiello

Conference PapersProc. 31th IEEE International Symposium on Software Reliability Engineering (ISSRE 2020) October 2020, Coimbra, Portugal

Abstract

Distributed caching systems (e.g., Memcached) are widely used by service providers to satisfy accesses by millions of concurrent clients. Given their large-scale, modern distributed systems rely on a middleware layer to manage caching nodes, to make applications easier to develop, and to apply load balancing and replication strategies. In this work, we performed a dependability evaluation of three popular middleware platforms, namely Twemproxy by Twitter, Mcrouter by Facebook, and Dynomite by Netflix, to assess availability and performance under faults, including failures of Memcached nodes and congestion due to unbalanced workloads and network link bandwidth bottlenecks. We point out the different availability and performance trade-offs achieved by the three platforms, and scenarios in which few faulty components cause cascading failures of the whole distributed system.

Dependability Assessment of the Android OS through Fault Injection

D. Cotroneo, A.K. Iannillo, R. Natella, S. Rosiello

Journal PapersIEEE Transactions on Reliability, vol. 70, no. 1, pp. 346-361

Abstract

The reliability of mobile devices is a challenge for vendors, since the mobile software stack has significantly grown in complexity. In this paper, we study how to assess the impact of faults on the quality of user experience in the Android mobile OS through fault injection. We first address the problem of identifying a realistic fault model for the Android OS, by providing to developers a set of lightweight and systematic guidelines for fault modeling. Then, we present an extensible fault injection tool (AndroFIT) to apply such fault model on actual, commercial Android devices. Finally, we present a large fault injection experimentation on three Android products from major vendors, and point out several reliability issues and opportunities for improving the Android OS.

Analyzing the Context of Bug-Fixing Changes in the OpenStack Cloud Computing Platform

D. Cotroneo, L. De Simone, A.K. Iannillo, R. Natella, S. Rosiello , N. Bidokhti

Conference PapersProc. 30th IEEE International Symposium on Software Reliability Engineering (ISSRE 2019) October 2019, Berlin, Germany

Abstract

Many research areas in software engineering, such as mutation testing, automatic repair, fault localization, and fault injection, rely on empirical knowledge about recurring bug-fixing code changes. Previous studies in this field focus on what has been changed due to bug-fixes, such as in terms of code edit actions. However, such studies did not consider where the bug-fix change was made (i.e., the context of the change), but knowing about the context can potentially narrow the search space for many software engineering techniques (e.g., by focusing mutation only on specific parts of the software). Furthermore, most previous work on bug-fixing changes focused on C and Java projects, but there is little empirical evidence about Python software. Therefore, in this paper we perform a thorough empirical analysis of bug-fixing changes in three OpenStack projects, focusing on both the what and the where of the changes. We observed that all the recurring change patterns are not oblivious with respect to the surrounding code, but tend to occur in specific code contexts.

Overload Control for Virtual Network Functions under CPU Contention

D. Cotroneo, R. Natella, S. Rosiello

Journal PapersElsevier Future Generation Computer Systems (Volume 99, October 2019, Pages 164-176)

Abstract

In this paper, we analyze the problem of overloads caused by physical CPU contention in cloud infrastructures, from the perspective of time-critical applications (such as Virtual Network Functions) running at guest level. We show that guest-level overload control solutions to counteract traffic spikes (e.g., traffic throttling) are counterproductive against overloads caused by CPU contention. We then propose a general guest-level solution to protect applications from overloads also in the case of CPU contention. We reproduced the phenomena on a IP Multimedia Subsystem (IMS) testbed based on OpenStack on top of KVM. The results show that the approach can dynamically adapt the service throughput to the actual system capacity in both cases of traffic spikes and CPU contention, by guaranteeing at the same time the IMS latency requirements.

A Fault Correlation Approach to Detect Performance Anomalies in Virtual Network Function Chains

D. Cotroneo, R. Natella, S. Rosiello

Conference PapersProc. 28th IEEE International Symposium on Software Reliability Engineering (ISSRE 2017) October 2017, Toulouse, France

Abstract

Network Function Virtualization is an emerging paradigm to allow the creation, at software level, of complex network services by composing simpler ones. However, this paradigm shift exposes network services to faults and bottlenecks in the complex software virtualization infrastructure they rely on. Thus, NFV services require effective anomaly detection systems to detect the occurrence of network problems. The paper proposes a novel approach to ease the adoption of anomaly detection in production NFV services, by avoiding the need to train a model or to calibrate a threshold. The approach infers the service health status by collecting metrics from multiple elements in the NFV service chain, and by analyzing their (lack of) correlation over the time. We validate this approach on an NFV-oriented Interactive Multimedia System, to detect problems affecting the quality of service, such as the overload, component crashes, avalanche restarts and physical resource contention.

NFV-Throttle: An Overload Control Framework for Network Function Virtualization

D. Cotroneo, R. Natella, S. Rosiello

Journal PapersIEEE Transactions on Network and Service Management ( Volume: 14, Issue: 4, Dec. 2017 )

Abstract

Network function virtualization (NFV) aims to provide high-performance network services through cloud computing and virtualization technologies. However, network overloads represent a major challenge. While elastic cloud computing can partially address overloads by scaling on-demand, this mechanism is not quick enough to meet the strict high-availability requirements of "carrier-grade" telecom services. Thus, in this paper we propose a novel overload control framework (NFV-Throttle) to protect NFV services from failures due to an excess of traffic in the short term, by filtering the incoming traffic toward virtual network functions (VNFs) to make the best use of the available capacity, and to preserve the QoS of traffic flows admitted in the network. Moreover, the framework has been designed to fit the service models of NFV, including VNFaaS and NFVIaaS. We present an extensive experimental evaluation on the NFV-oriented Clearwater IMS, showing that the solution is robust and able to sustain severe overload conditions with a very small performance overhead.

Combining Black Box Testing with White Box Code Analysis: A Heterogeneous Approach for Testing Enterprise SaaS Applications

S. Rosiello , A. Choudhary, A. Roy, R. Ganesan

Conference PapersProc. 25th IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW 2017) October 2014, Naples, Italy

Abstract

Faulty enterprise applications may be producing incorrect outputs or performing below service expectations due to code vulnerabilities that do not show up in standard code analyzers (e.g., CAST [1]). A tester can figure out such functionality or performance issues by black box functional testing and fault injection. Then based on the specific test scenario, targeted white box code analysis can be done to figure out the code errors causing the application functionality or performance issue. In this paper, we use such a heterogeneous testing approach that combines black box testing with white box code analysis for testing an enterprise licensing application (ELA). We describe experiments designed to uncover functionality and performance issues in ELA and then explore the corresponding code errors causing the issues. We find that our approach is effective in faster detection and fixing of application performance and functionality errors than simple white box code analysis.

Email	stefano (dot) rosiello (at) unina (dot) it
Address	DESSERT lab - DIETI - Via Claudio, 21 - 80125 - NAPOLI PALAZZINA 3/A - 4 piano - Stanza 4.07
Phone	+39.081.76.83820
Skype	sterosiello

Performance Failures

Anomaly Detection

Overload Control

Filter by type:

DRACO: Distributed Resource-aware Admission Control for Large-Scale, Multi-Tier Systems

Abstract

Workflow for the Assessment of ITER Plasma Control System Design

Abstract

Strategy to Systematically Design and Deploy the Iter Plasma Control System: A System Engineering and Model-Based Design Approach

Abstract

Workflow for the assessment of ITER Plasma Control System design for PFPO-1 phase

Abstract

System-Engineering approach for the ITER PCS design: The correction coils current controller case study

Abstract

An unsupervised approach to discover filtering rules from diagnostic logs

Abstract

AID4TRAIN: Artificial Intelligence-based Diagnostics for TRAins and INdustry 4.0

Abstract

Virtualizing mixed-criticality systems: A survey on industrial trends and issues

Abstract

Dependability Evaluation of Middleware Technology for Large-scale Distributed Caching

Abstract

Dependability Assessment of the Android OS through Fault Injection

Abstract

Analyzing the Context of Bug-Fixing Changes in the OpenStack Cloud Computing Platform

Abstract

Overload Control for Virtual Network Functions under CPU Contention

Abstract

A Fault Correlation Approach to Detect Performance Anomalies in Virtual Network Function Chains

Abstract

NFV-Throttle: An Overload Control Framework for Network Function Virtualization

Abstract

Combining Black Box Testing with White Box Code Analysis: A Heterogeneous Approach for Testing Enterprise SaaS Applications

Abstract

Teaching Assistant

Past Thesis Advisor

Find me on ...