Connecting safety-critical systems to the cloud
Rapid innovation and AI-driven functionality are just two examples of how high-tech systems can benefit from being connected to the cloud. The TRANSACT project investigates the transformation of safety-critical cyber-physical systems (CPS) from localised standalone systems into safe and secure distributed solutions that leverage edge and cloud computing to realise such benefits. However, this only works when the safety and performance of the system as a whole are ensured. Consider a hospital, in which adding a cloud connection to medical imaging equipment, e.g. for minimally invasive treatment of patients, has the potential to support the surgeon during treatment.
While safety-critical applications in such medical imaging (e.g. live X-ray imaging) will remain deployed on the device, mission-critical functionality, such as non-real-time image processing, could be deployed advantageously in the cloud, as shown in Figure 1. During treatment, surgeons may then request the cloud to run the latest innovation, such as a 3D image processing analysis, to provide additional insight for the medical procedure at hand.
Considering safety and performance aspects is essential when such complex (cloud-based) image processing comes together with the need for application responsiveness during live image-guided treatment of patients. The cloud processing must not keep the surgeon (or the patient) waiting. Furthermore, in case of, for example, an internet outage, the surgeon must always be able to revert to a well-defined way of working with degraded performance.
So, which concepts are needed to ensure safety (mission-criticality) and performance for such use cases? The TRANSACT project has undertaken an investigation of necessary concepts in such distributed solutions to ensure safety and performance from an end-user perspective.
13 safety and performance concepts to connect systems to the cloud
In its first year, the TRANSACT project performed a thorough selection and evaluation of end-to-end safety and performance concepts for distributed safety-critical CPS solutions. ESI (TNO) (www.esi.nl) coordinated this effort in the TRANSACT project.
This evaluation took into account relevant safety and performance concerns, the TRANSACT reference architecture, and needs stemming from the use cases. The result is a public deliverable (TRANSACT D3.1), reporting on the concepts needed to manage performance and safety when deploying and running distributed applications on safety-critical device-edge-cloud systems.
The selected concepts are clustered into three main categories: application concepts, cross-cutting concepts, and platform concepts (see Figure 2). For each of these categories, a number of concept classes have been identified. Within each concept class, the selected concepts and methods are then described.
On application level, the following concepts have been identified:
- Concepts for application service level agreements. Service level expectations need to be specified and managed. In particular when deploying AI services in the cloud, concepts need to be developed to fit with use in distributed safety-critical CPS solutions.
- Concepts for operational modes and change transition. As edge and cloud in general have no absolute availability guarantees, on-device fallback functionality and seamless change-over to that fallback functionality need to be managed, so that safety is guaranteed at all times. This requires applications to offer various operating modes, be scalable, and support seamless changes between operating modes.
- Concepts for scalable applications. Scalability of applications is one of the key enablers for operational mode management and change transition. A concept is identified to achieve this by trading off an application’s Quality-of-Service level against the resources available to the application.
- Concepts for ensuring data integrity. In distributed safety-critical CPS solutions, safety- and privacy-related data leaves the confines of the device, and conversely external data is imported for use in safety-critical or mission-critical functions. This requires data integrity to be established and safeguarded.
- Concepts for AI monitoring. The cloud is an excellent place to deploy advanced AI services. Yet, after deployment, AI model performance can degrade over time for various reasons, including data drift. Concepts (and challenges) are identified to monitor AI performance in service, to ensure fitness for use, or to initiate retraining.
- Concepts for safe and secure modular updates. To reap the benefits of the cloud, it must be possible to update functionality across the device-edge-cloud continuum incrementally in a safe, secure, and modular manner.
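As an illustration of the operational-mode concept above, the following is a minimal Python sketch of switching between a cloud-enhanced mode and an on-device fallback mode. It is not taken from the TRANSACT deliverable: the names `Mode`, `select_mode` and `ModeManager`, and the latency-budget policy, are hypothetical and only assume that cloud reachability and round-trip time can be measured.

```python
from enum import Enum


class Mode(Enum):
    CLOUD_ENHANCED = "cloud_enhanced"          # cloud processing is used
    ON_DEVICE_FALLBACK = "on_device_fallback"  # degraded, device-only operation


def select_mode(cloud_reachable: bool, round_trip_ms: float, budget_ms: float) -> Mode:
    """Hypothetical policy: use the cloud only if it is reachable and fast enough."""
    if cloud_reachable and round_trip_ms <= budget_ms:
        return Mode.CLOUD_ENHANCED
    return Mode.ON_DEVICE_FALLBACK


class ModeManager:
    """Tracks the current operating mode and records every mode change."""

    def __init__(self, budget_ms: float) -> None:
        self.budget_ms = budget_ms
        self.mode = Mode.ON_DEVICE_FALLBACK  # start in the device-only mode
        self.transitions = []                # list of (old_mode, new_mode)

    def update(self, cloud_reachable: bool, round_trip_ms: float) -> Mode:
        new_mode = select_mode(cloud_reachable, round_trip_ms, self.budget_ms)
        if new_mode is not self.mode:
            self.transitions.append((self.mode, new_mode))
            self.mode = new_mode
        return self.mode


manager = ModeManager(budget_ms=200.0)
manager.update(cloud_reachable=True, round_trip_ms=120.0)   # cloud is fast enough
manager.update(cloud_reachable=False, round_trip_ms=0.0)    # outage: fall back
print(manager.mode, len(manager.transitions))
```

Starting in the fallback mode is a deliberate choice in this sketch: the cloud is only relied upon after it has been observed to be available and responsive.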
To address cross-cutting concerns, i.e. to ensure that applications use the platform properly, the following concepts are identified:
- Concepts for predictable end-to-end performance. Ensuring end-to-end performance across the device-edge-cloud continuum requires performance analysis, monitoring, prediction and management across a distributed device-edge-cloud solution.
- Concepts for safety and risk analysis. Traditional methods such as FMEA typically consider the impact of (individual) component failures. However, in distributed solutions, undesired interactions between otherwise correctly functioning components can also cause safety issues. Novel concepts are identified which can identify and analyse such safety risks.
- Concepts for health and safety monitoring. When safety-critical or mission-critical functions are offloaded to the edge or cloud, monitoring at run-time is necessary to ensure that the total system is operating safely. Such monitoring can encompass the availability of the device, edge, and cloud, as well as the liveness, responsiveness and integrity of applications.
- Concepts for requalification and continuous safety assessment. A key aspect for achieving rapid innovation is that the necessary proof for safety assurance of incremental updates can be provided efficiently, i.e. in a way that is suitable for low-effort, reduced-cost, frequent requalification.
Finally, on platform level, a distributed cloud-edge-device platform needs to offer the necessary predictable and scalable configuration for the distributed system to be performant and to function safely. The following concepts are identified:
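The run-time health monitoring described above can be sketched with a simple heartbeat check. This is only an illustrative assumption of how liveness and responsiveness might be tracked for one remote service; the class `HeartbeatMonitor` and its thresholds are hypothetical, and integrity checking is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor for one remote (edge or cloud) service."""
    timeout_s: float       # maximum allowed silence before the service counts as dead
    max_latency_s: float   # maximum acceptable response latency
    last_seen_s: Optional[float] = None
    last_latency_s: Optional[float] = None

    def record(self, now_s: float, latency_s: float) -> None:
        # Called whenever a heartbeat reply arrives.
        self.last_seen_s = now_s
        self.last_latency_s = latency_s

    def is_alive(self, now_s: float) -> bool:
        # Liveness: a heartbeat was received recently enough.
        return self.last_seen_s is not None and (now_s - self.last_seen_s) <= self.timeout_s

    def is_responsive(self) -> bool:
        # Responsiveness: the most recent reply was fast enough.
        return self.last_latency_s is not None and self.last_latency_s <= self.max_latency_s

    def is_healthy(self, now_s: float) -> bool:
        return self.is_alive(now_s) and self.is_responsive()


monitor = HeartbeatMonitor(timeout_s=2.0, max_latency_s=0.5)
monitor.record(now_s=10.0, latency_s=0.1)
print(monitor.is_healthy(now_s=11.0))  # recent and fast heartbeat
print(monitor.is_healthy(now_s=13.0))  # no heartbeat for 3 s, beyond the timeout
```

In such a design, the result of `is_healthy` would typically feed the decision to switch to an on-device fallback mode.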
- Concepts for scalable platforms and run-time scaling strategies. Cloud platforms provide an efficient infrastructure to build high-performing and scalable distributed applications, as they provide scaling options to accommodate workflows with high or low demand. Concepts are identified to manage the dynamic demand of a varying number of devices by a (scalable) cloud platform.
- Concepts for platform service level specifications. Concepts are identified to ensure proper usage of a well-configured platform’s resources and assets, such that a distributed device-edge-cloud solution fulfils safety, performance, security, and privacy requirements.
- Concepts for safety-critical platforms. When safety-critical or mission-critical functionality is offloaded to the edge or cloud, then this requires special means and measures to make these parts of the distributed solution (and their interlinked communication and network) fit to support such safety-critical or mission-critical functionality.
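To make the idea of a run-time scaling strategy more concrete, the sketch below computes how many cloud workers would be needed for a varying number of connected devices. The function `required_workers`, its parameters and the simple capacity model are assumptions made for illustration and are not prescribed by TRANSACT.

```python
import math


def required_workers(
    active_devices: int,
    requests_per_device_per_s: float,
    worker_capacity_per_s: float,
    min_workers: int = 1,
    max_workers: int = 20,
) -> int:
    """Hypothetical scaling rule: enough workers to cover the aggregate request rate."""
    demand_per_s = active_devices * requests_per_device_per_s
    needed = math.ceil(demand_per_s / worker_capacity_per_s)
    # Keep at least min_workers warm and never exceed the configured ceiling.
    return max(min_workers, min(max_workers, needed))


for devices in (0, 10, 1000):
    print(devices, required_workers(devices, 0.5, 2.0))
```

Keeping a minimum number of workers available trades some cost for faster response when the first device connects, while the upper bound caps cost; when the bound is reached, demand exceeds capacity and devices would need to rely on a degraded mode.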
In the TRANSACT D3.1 deliverable, each concept is described using a homogeneous structure. First, an overview of the concept is presented, followed by how the concept fits in the TRANSACT reference architecture. Then an example of applying the concept is given (in the context of a particular use case), and lastly the challenges of applying the concept in TRANSACT device-edge-cloud continuum systems are listed.
These identified safety and performance concepts, together with their challenges, now form the basis for further work in years 2 and 3 of the TRANSACT project: both to extend the concepts where necessary (in task T3.3), and to turn them into (potentially domain-specific) solutions to be incorporated in selected TRANSACT use case demonstrators. These demonstrators will be used to validate the selected concepts and solutions.
For more information
The result of this investigation is available in the public TRANSACT D3.1 deliverable: “D8 (D3.1) Selection of concepts for end-to-end safety and performance for distributed CPS solutions”. A companion deliverable D3.2, highlighting complementary security and privacy concepts, and further public deliverables are available on the TRANSACT project resources page: transact-ecsel.eu/resources/