Stochastic Methods for Analysis of Complex Hardware-Software Systems

In this paper we consider Markov analysis of models of complex software and hardware systems. A Markov analysis tool can be used during verification processes of models of avionics systems. In the introduction we enumerate main advantages and disadvantages of Markov analysis. For example, with Markov analysis, unlike other approaches, such as fault tree analysis and dependency diagram analysis, it is possible to analyze models of systems that are able to recovery. The main drawback of this approach is an exponential growth of models size with number of components in analyzed system. It makes Markov analysis barely used in practice. The other important problem is to develop a new algorithm for translating a model of a system to a model suitable for Markov analysis (Markov chain), since the existing solutions have significant limitations on the architecture of analyzed systems. Next we give a brief description of the context – AADL modeling language with Error Model Annex library, MASIW framework, and also give an explanation of Markov analysis method. In a main section we suggest an algorithm for translating a system model into a Markov chain, partially solving the problem of exponential growth of Markov chain. Then follows a description of further steps, and some heuristics that allow to extremely reduce running time of the algorithm. In this paper we also consider other Markov analysis tools and their features. As a result, we suggest a Markov analysis tool that can be effectively use in practice.


Introduction
In this paper we consider a task related to verification of models of software and hardware systems. Such systems can be, for example, control systems for airplanes, ships, medical equipment, etc. The price of error in these systems is very high, but they are too complicated for "manually" analysis. Therefore such systems are modeled before implementation. On the stages of design, development, and verification of the models, it is necessary to constantly investigate system safety. At present, three main methods of system safety assessment [1] are widely used: fault tree analysis, dependency diagram analysis, and Markov analysis. Each method has its own advantages and disadvantages. In this paper, Markov analysis is considered. Markov analysis works with a Markov chain [2] -a stochastic process, which can be represented as a directed graph with weighted edges. Vertices of Markov chain represent different states, and edges are labeled by probabilities of a transition between states. The main drawback of Markov analysis is a size of Markov chains, which increases exponentially with number of components in the system. In addition, it is necessary to develop an algorithm, that takes system model and translate it to the Markov chain. These problems make Markov analysis not so popular as the other methods, and number of tools that use Markov analysis for complex systems is relatively small. However, such approach has its advantages: Markov analysis allows to look at the entire system, to consider not only causes and probabilities of certain single failure, but also investigate how various failures affect the system in the aggregate. Also Markov analysis, unlike the other approaches, allows to analyze selfrecovering systems, since return to operational state is natural for Markov chains. Thus, the task of development the Markov analysis tool of complex hardwaresoftware systems is quite important and relevant.

AADL and Error Model Annex
Architecture Analysis & Design Language (AADL) [3] is a language, that widely used for describing models of real-time hardware and software systems. Its features include description of both hardware (so-called execution platform) and software components of an analyzed system, and various connections between them. The models, described in AADL, may be used for documentation, for various kinds of analysis and for code generation. Error Model Annex [4] is an extension of AADL, that allows to simulate appearance and propagation of errors in the system. For each component, a modeller can add a description of component's behavior states, for example, operational and failed. Transitions between system states are triggered by randomly occured error events and internal errors propagated from other components. An error propagation condition may depend on certain behavior state of the component, some error events, or error propagated from environment. Each propagated error has its own type, that allows to control what is exactly happened in the system. Also transitions between states can be defined implicitly -a state of some component may be a composite state of its subcomponents. AADL and Error Model Annex together describe not only an architecture, but also error behavior of systems. It becomes possible to evaluate such properties of models as safety, reliability, the availability of its various states and ability to recover from them.

MASIW
MASIW [5] is an open-source framework for designing and analyzing of integrated modular avionics systems, that use AADL as a modelling language. The project designed as plugins for Eclipse IDE, includes a variety of tools for working with AADL and Error Model Annex models. There is a big number of different analysis tools, for example, a fault tree analysis tool, but there is no Markov analysis tool.

Markov analysis
Any model subjected to Markov analysis must be represented as a Markov chain. A Markov chain can be represented in the form of a directed graph with vertices containing system states, and edges labeled with intensities of transitions between corresponding states. A Markov chain has the property of Markov process -a probability of a transition to any state depends only on a current state and a moment in time, and previous transitions are unimportant (can be characterized as memorylessness). Markov models can be divided into models with discrete and continuous time, as well as time-homogeneous (also called stationary) and time-inhomogeneous. In timehomogeneous Markov chains, the intensities of transitions are constant, while in timeinhomogeneous Markov chains they depend on time. In time-homogeneous Markov chains, transitions occur according to the binomial (or fixed) distribution for discretetime chains, and according to the Poisson distribution for continuous-time chains.
To determine the behavior of an analyzing system, it is necessary to specify a system of differential equations. The equations follow from the Markov chain. For all Markov processes (and a Markov chain, in particular) we have the Kolmogorov-Chapman equation [6]: And for discrete-time chains, a system of difference equations Denote by a certain initial state of the system, and consider equations (2)-(3) in case when = . Denote by ( )the function ( ) ( ⁄ ). Then, the previous equations takes the following form: In addition, initial conditions appear: Thus, we obtain the Cauchy problem [7]. The solution of this problem is a set of probabilistic functions of being a system in a definite state. This is the result of Markov analysis. In this paper, we consider only the analysis of time-homogeneous Markov chains and models, as the most common ones. However, all results can be applied to timeinhomogeneous models, with the only difference being that intensities of Markov chain transitions depend on time, and they need to be stored as formulas, not as numbers.

Problem
The goal of this work is a development and implementation of a Markov analysis tool for complex hardware-software systems models within the MASIW framework. The tool takes input of some system model and a set of time points. At the output, the analyzer provides the probabilities of being the system in each of its possible states at moments of time, defined by user. The main problem is to create a Markov chain on the basis of the original model. First, we need an algorithm that creates a correct Markov chain corresponding to the input data. Secondly, the result chain should be of acceptable size, so that the program can work for acceptable time in limited memory. After a construction of a Markov chain, further action reduces to solving a Cauchy problem with a system of linear differential equations. An analytical solution of the Cauchy problem is too complicated, resource-intensive, and result is difficult to comprehend, so we use numerical methods.

Markov chain
The primary task is to translate an AADL model into a Markov chain. In particular, it is necessary to find out what to regard as a chain node and what generates transitions between system states. Obviously, the node must contain the state of the system, which is a combination of the states of all system components (the states of the components are described in the model as behavior states). However, if we take as a node any of all possible combinations, then the number of nodes will be no less than 2 , where is the number of system components. Real systems often contain more than 20 components, that, on the one hand, are few, but on the other hand, results in size of such Markov chain outside available memory. We suggest the following solution of this problem. Let us exclude from the chain all unreachable states of the system, which, as practice shows, are the vast majority. First, some states of the system are unreachable by definition of ananalyzing model. For example, the state of some component may completely depend on the states of its subcomponents. Accordingly, the component can not be in a failed state, while all its subcomponents are in operational states. Second, the failure of some components entails an almost immediate failure of others -for example, a breakdown of a processor entails a failure of all processes running on it. Thus, the state in which the processor is broken, but the processes on it are still working, though reachable in theory, at the very moment of the failure, but instantly replaced by another state. Thus, we suggest the following approach. We assume that speed of error propagation between components is extremely small in comparison with time of system operation (which, in practice, is the case -for a unit of time measurement usually takes hour and even a day). Let us define a stable state of the sistem as a state, that does not change until new error events occur in the system and its components. We consider as nodes of designed Markov chain only the stable states of the system. The sets of arising events generate transitions between nodes of the chain. For the sake of saving memory, we insert only reachable states to the Markov chain, and build it dynamically, from the initial state of the system, which is a combination of the initial states of the components. In each new node it is necessary to analyze transitions from the current state of the system. The state can change for some event or combination of events. So, we perform complete search for all possible sets of events -either of them can initiate a new transition. The probability of occurrence of each set of events is easily calculated, since each event contains information about its probability distribution. This is a multiplication product of probabilities of occurrence or negation of occurrence of each of the events, since all events are independent. The total probability of all sets of events, according to the law of total probability, should be equal to 1.

196
The algorithm is completed when all nodes of Markov chain are analyzed, starting from the node corresponding to the initial state of the system. markovChain.addNode(initialStateNode) queue.add(initialStateNode) while not queue.isEmpty() do analyzeNode(queue.head()) queue.add(newNodes) end while The analysis of each node of Markov chain looks like this: all possible sets of error events are searched, for each of them we calculate a stable state of the system into which the given set leads, and then a new transition (and, if necessary, a new node) is added to the chain.
for each errorEventSet in possibleSets do state = currentNode.getState() repeat watchedStates.add(state) state = calculateState(state, errorEventSet) until watchedStates.contains(state) node = markovChain.addNode(state) markovChain.addTransition( currentNode, node, errorEventSet.getProbability()) watchedStates.clear() end for In the above algorithm, the state of the system is considered stable if we have already reached it before. This correctly handles the case when the state of the system has not changed -we have reached the same state as in the previous step. However, in theory, in a self-recovering systems, cycling may occur if an event with a failure and an event with component recovery occur simultaneously. With this condition, the loop stops, but this situation is not handled correctly. One of the main opportunities for further improvement of the algorithm is to improve the condition for achieving a stable state of the system.

Calculation of new states
In the previous paragraph, a general algorithm for constructing a chain was described, omitting the details of calculating new states of the system. To find out exactly how the system has changed, it is enough to go through all its components, and see what transitions between states are triggered for a given set of events and the current state of the system. The triggered transition is immediately applied to the system, and the algorithm step is completed. where variables corresponding to components behavior states, error events, and propagated errors, have value of true or false, depending on whether the system is in this state, whether an error event has occurred or whether an error of the specified type has propagated. As soon as some component of the system changes its state, it means that we obtain a new state of the system, and the step of the algorithm is completed. If none of the transitions is triggered, then the system state has not changed, which is noticed by the algorithm described in the previous section.

Construction and solution of the Cauchy problem
After construction of a Markov chain, the final stage of the Markov analysis of the system is to construct a system of equations and solve the Cauchy problem. As mentioned earlier, each node of the Markov chain generates a differential equation (4) (or similar difference equation (5)). To save memory, it is not necessary to store the system of equations -the equation for any node can be easily constructed dynamically, passing through all transitions entering into this node and outgoing from it. The resulting Cauchy problem can be solved by a numerical method from the Runge-Kutta [8] family of methods. In the analyzer, two methods are implemented: the Euler method, for discrete-time Markov chains, and the fourth-order Runge-Kutta method, for continuous-time Markov chains. The type of the chain is determined in advance, according to probability distributions of error events. An algorithm for calculating the variation of the function ( )on each time interval delta t, taking into account the dynamic construction of the equation (Euler's method): for each node in markovChain.getNodes() do i = indexOf(node) res = 0 for each transition in node.getInTransitions() do k = indexOf(transition.getNode()) res += transition.getProbability() * pPrev[k] end for for each transition in node.getOurTransitions() do res -= transition.getProbability() * pPrev[i] end for pCur[i] = pPrev[i] + delta t * res end for Also, the value of the vector of probability functions P (t) is saved at every time point defined by user. As soon as values at each necessary time point are calculated, the algorithm is completed.

Getting Analysis Results
Since number of system states in Markov chain can be very large, the result of analysis in the form of probabilities of being the system in each of them is practically impossible for reading. Considering that each system has its root component, we group all system states according to the states of the root component. In this case, all the probability functions within the same group are summed up: for each node in chainNodes do i = indexOf(node) state = node.getSystemState().get(rootComp) analysisResult.get(state) += p[i] end for After this, for each state of the root component, the probability of being the system in a this state at given time points is obtained. This is the desired result of the Markov analysis of the system.

Algorithm acceleration
Despite a partial solution of the problem of exponential growth of Markov chain size, the running time of full version of the algorithm still grows exponentially -due to a thorough search of all possible combinations of error events. Thus, we use some heuristics in the final program, accelerating the algorithm. First, we limit the search of combinations of events. Since the probability of occurrence of one event is usually extremely small, the situation in which several events occur simultaneously is practically impossible. Therefore, a very small numerical parameter, limiting the probability of the combination of events under consideration, was added to the program. If the probability of occurrence of the set of events is less than this parameter, then the effect of the set of events on the system is not considered. This solution significantly reduced the running time of the program, without much loss of accuracy of the result. The second solution relates to system's ability to self-recovery. In practice, there are few examples of self-recovering systems, and, in most cases, even a short-term failure of the system itself means fatal consequences. Accordingly, if the analyzed system has come to failed state, its further changes are not interesting to us -no matter what else can fail in the already failed system. Therefore, we introduce a set of states of the root component, that are considered as «absolutely» failed. If some node of Markov chain has failed state of the root component, then we do not analyze transitions from it. If analyzed system is not self-recovering, the result of the program remains the same, but is obtained in much shorter time. Both modifications of the program are optional, as they may change final result in some cases, but their application reduces the operating time by several orders of magnitude. For example, a complete analysis of a system containing 24 components revealed 919 states of the Markov chain and took 1 hour. Limiting the frequency of the events considered by the number 10 gave a significant gain -the same set of states of the Markov chain and the same result of the analysis were obtained in 7 minutes. Since the system under test was not self-recovering system, the analysis with the stop-on-failed option was correct, and got the same result in 10 seconds. Setting relevant parameters allows to significantly accelerate work of the analyzer. One of the further options for improving the tool can be automatic detection and selection of optimizing parameters.

Related works
Markov analysis of AADL and Error Model Annex models is usually applied to systems consisting of only one component. Such algorithms doesnt consider error propagation mechanism and composite states, and limited by root component. The tool from OSATE [9] framework, created for export AADL model into Markov chain model for PRISM [10] toolset, which provide further steps of Markov analysis, supports only the first nesting level of the component hierarchyand does not support different types of propagated errors. In addition, there were some problems associated with the syntactic correctness of the final PRISM model.

Conclusion
In this paper we present a new Markov analysis tool, and in particular, an algorithm for translating AADL and Error Model Annex models into Markov chains. In addition, there were added some improvement for accelerating the algorithm, which make it possible to effectively use the tool in practice. Karnov