Design Principles for resilient cyber-physical Early Warning Systems - Challenges, Experiences, Design Patterns, and Best Practices

Tuesday, 16 December 2014
Joachim Wächter1, Stephan Gensch1 and Bettina Schnor2, (1)Helmholtz Centre Potsdam GFZ German Research Centre for Geosciences, Potsdam, Germany, (2)University Potsdam, Institute of Computer Science, Potsdam, Germany
Early warning systems (EWS) are safety-critical IT-infrastructures that serve the purpose of potentially saving lives or assets by observing real-world phenomena and issuing timely warning products to authorities and communities. An EWS consists of sensors, communication networks, data centers, simulation platforms, and dissemination channels. The components of this cyber-physical system may all be affected by both natural hazards and malfunctions of components alike. Resilience engineering so far has mostly been applied to safety-critical systems and processes in transportation (aviation, automobile), construction and medicine. Early warning systems need equivalent techniques to compensate for failures, and furthermore means to adapt to changing threats, emerging technology and research findings. We present threats and pitfalls from our experiences with the German and Indonesian tsunami early warning system, as well as architectural, technological and organizational concepts employed that can enhance an EWS’ resilience.

The current EWS is comprised of a multi-type sensor data upstream part, different processing and analysis engines, a decision support system, and various warning dissemination channels. Each subsystem requires a set of approaches towards ensuring stable functionality across system layer boundaries, including also institutional borders. Not only must services be available, but also produce correct results. Most sensors are distributed components with restricted resources, communication channels and power supply. An example for successful resilience engineering is the power capacity based functional management for buoy and tide gauge stations. We discuss various fault-models like cause and effect models on linear pathways, interaction of multiple events, complex and non-linear interaction of assumedly reliable subsystems and fault tolerance means implemented to tackle these threats.