Mining Hybrid UML Models from Event Logs of SOA Systems

In this paper we consider a method for mining so-called “hybrid” UML models, which belongs to the field of software process mining. Models are built from execution traces of information systems with service-oriented architecture (SOA), given in the form of event logs. While common reverse engineering techniques usually require the source code, which is often unavailable, our approach relies only on event logs, which many information systems produce, and on some heuristic parameters. Since an individual type of UML diagram shows only one perspective of a system’s model, we propose to mine a combination of several types of UML diagrams (namely, sequence and activity), which are considered together with communication diagrams. This increases the expressive power of each individual diagram. Each type of diagram corresponds to one of three levels of abstraction (workflow, interaction and operation), which are commonly used when considering web-service interaction. The proposed algorithm consists of four tasks: splitting an event log into several parts and building UML sequence, activity and communication diagrams. We also propose to encapsulate insignificant or low-level implementation details (such as internal service operations) into activity diagrams and to connect them with a more general sequence diagram by using interaction use semantics. To cope with the immense size of synthesized UML sequence diagrams, we propose an abstraction technique based on regular expressions. The approach is evaluated using a software tool developed as a Windows application in C#. It produces UML models in the form of XML files, which are compatible with the well-known Sparx Enterprise Architect and can be further visualized and used by that tool.


Introduction
Nowadays we use information systems everywhere. They are used not only at home to increase the comfort of our lives but also to support business processes. The complexity of these systems grows together with the complexity of processes and tasks. Moreover, many systems interact with each other. The chance of error increases as the complexity of a system increases. If the system detects these errors, they are written into so-called event logs together with other information about system execution. The logs accumulate a lot of information while the system works. On the one hand, manual processing of the logs is almost impossible because of their size and lack of structure. On the other hand, event logs are an invaluable source of knowledge about real-life system behavior. Tools that help to obtain this knowledge in a form suitable for analysis are extremely useful.
Different approaches, such as modeling, development within a standardized life cycle, testing, quality assurance (QA), verification, etc., are applied to improve system quality and to correct errors. Using combinations of these instruments (for example, testing and verification, or modeling and reverse engineering with continuous delivery) gives good results. New tools, modeling tools in particular, help to make the process more convenient and more effective. Models are built at different life cycle stages. In the classic approach, an architect models an information system based on the customer's requirements. However, the implemented system often differs from previously developed models because the system evolves faster than its models. Developers sometimes make mistakes and may need to spend additional time on critical situations and deadlines. As a result, the design and implementation of some components are not completed properly. When there is no complete model of a system, reverse engineering techniques can be applied to extract the necessary information from the system and build an appropriate model. This allows us to obtain models of a real-life system automatically or semi-automatically. These models correspond to the developed system rather than to the initial plan and initial models. Such models aim both to help understand the structure and behavior of a real system and to eliminate discrepancies between the real system and the initial model. This also makes it easier to fix errors in the system.
There are a number of approaches and tools designed for this purpose. Most of them require the source code of a system to perform analysis. This is not always possible for several reasons: the source code may not be available to analysts, it may be impossible to get the latest copy of the code, or the code may be lost. Moreover, different work groups can develop different system components, which complicates centralized collection of source code. Unlike existing reverse engineering approaches that use source code, we propose an approach that works with system execution traces, which can be extracted from event logs. Our approach can be considered a particular implementation of process mining [1], a discipline aimed at discovering, analyzing and improving business processes and their models. Our approach also includes features that are relevant to software engineering. Hence, we refer to it as software process mining [2].
Process mining usually uses process models such as Petri nets, BPMN, fuzzy maps, etc., which are produced by applying different algorithms such as the α-algorithm [1], [3], [4], the NLP-algorithm [5] or the fuzzy miner [6], respectively. However, these models are not perfectly suitable for software developers. In the software engineering area, more specific notations such as the Unified Modeling Language (UML) [7] are more common. The most common approaches deal with static class diagrams, statecharts, sequence and activity diagrams, considering them more descriptive than others. According to UML 2.5, there are two groups of diagrams: structural and behavioral. In this work we primarily focus on the behavioral group, in particular on sequence, activity and communication diagrams.
Modern approaches to the development of information systems single out small, reusable, well-defined pieces of code, which are commonly referred to as services. Systems that use services as a main component are based on service-oriented architecture (SOA) [8]. Services of heterogeneous SOA-systems are developed using different languages, environments and tools, but they work in a single information space. Mining unified models of such systems is a challenge. For example, none of the popular reverse engineering tools works with all languages used for web-service development [9]. As almost all systems produce event logs that contain information about the system components of interest, it is possible to build models that include all of these components. This simplifies reverse engineering and expands its application area.
In this paper, we consider event logs written by SOA-systems. Our goal is to expand the applicability of UML-based models for SOA-systems by developing new approaches and tools for mining such models from event logs. The UML standard describes different types of models that suit different modeling aspects of an information system. Nevertheless, there are situations when analysts would like to use the expressive capabilities of several diagram types at once. UML 2.5 does not describe such diagrams, but it does not forbid them either. In this paper, we propose a new approach to UML modeling, which includes mining a so-called hybrid diagram that comprises elements of UML sequence and UML activity diagrams. To illustrate the proposed approach, consider the following example.

Motivating example
We consider an event log (Table I) produced by an online banking information system with service-oriented architecture. The log contains a number of traces corresponding to individual instances of a business process maintained by the information system. Our goal is to obtain a UML model that represents some behavioral aspects of the system from different perspectives [9]. Each row of Table I represents a single event. By applying the method of [9] to the example log, we obtain the UML sequence diagram depicted in Figure 1, which represents the overall process. The diagram contains all possible details (excluding operation parameters) of the behavior of the system as it is represented in the event log. Along with regular messages, which connect two different lifelines (depicted as vertical dashed lines), the diagram also contains a number of self-calls represented as labeled loop arrows, e.g. GetCardInfo and GetCard. These self-calls are not important for studying the model from a more abstract perspective.
In contrast, they are important when modeling the process of an individual service or another SOA component. A distinctive feature of the SOA under consideration is that processes call other processes and services, while services do not call other participants. To demonstrate this feature, it is important to show the interaction between one selected service and the direct neighbor services it communicates with. A UML communication diagram suits this purpose. Example diagrams for the Card::Operations and Card::OperationData processes from the example event log are depicted in Figures 5 and 6, respectively. We can see that these processes are called by other processes and call both different services and themselves.
We developed a tool that automatically builds hybrid diagrams combining UML sequence and activity diagrams. Moreover, the tool is able to build a UML communication diagram for a selected SOA component.

Related work
Reverse engineering of behavioral UML diagrams is not a new area. There are a number of works [11], [12], [13], [14] on building UML diagrams based on static source code analysis. Besides, there are some CASE tools [15], [16], [17], [18] that can be used for reverse engineering of UML sequence and activity diagrams.
There is also a plug-in [19] for the NetBeans development environment that is able to build different types of behavioral models from Java source code. However, all of the methods and tools mentioned above rely on static program analysis (obtaining models from source code without executing it). As noted earlier, the source code and all of its versions are not always available for analysis. Hence, these tools and methods cannot be used in such cases. Furthermore, none of these tools is able to infer models from code written in all of the most popular languages used for developing SOA information systems. Moreover, SOA systems are often developed with various programming languages. For example, some modules can be written in C#, whereas others can be developed in Java; they can interact with a LAMP service, so a single CASE tool cannot produce models for such a system. Mining diagrams from event logs solves this problem. In [20], [21], [22], approaches to building models based on execution traces are proposed. One related work [20] analyzes a single trace using meta-models of an event log trace and a UML sequence diagram (UML SD). The trace includes information not only about method invocations but also about loops and conditions, which makes it easier to recognize fragments such as iterations, alternatives and options. However, logs of information systems do not usually include this information, so the source code has to be modified to apply this approach.
A method for mining UML sequence diagrams from several execution traces is described in [22]. The authors propose to use a labeled transition system (LTS) as an intermediate model representing one trace, together with an algorithm that merges the LTSs built from several traces. After that, the LTS is transformed into a UML sequence diagram. Moreover, an LTS can be used to build a Petri net, which can then be converted into a UML activity diagram [23]. This conversion makes it possible to apply different process mining algorithms to obtain a UML activity diagram. An approach to mining hierarchical UML sequence diagrams is proposed in [9] (see Section III-D).
In [24], the authors describe a framework that builds not only behavioral but also static UML diagrams. The framework itself generates execution traces from Java source code. After that, it is able to build UML activity diagrams from the traces, but it requires source code to work.
Process mining proposes three abstraction levels for mining models of web service interaction [25]: workflow, interaction and operation. At the operation level, only one service is considered in order to look at its internal behavior and functionality. At the interaction level, not only one selected service is considered but also its direct callers and callees. Finally, the overall interaction of services is covered at the workflow level. We apply all of these levels to service-oriented architecture in this paper. Furthermore, research on service mining is described in [26]. The author builds different Petri nets for different services (considered at the operation level) and then combines them by places. Thus, he builds a generalized model that corresponds to the workflow level.
The rest of the paper is organized as follows. Section II gives definitions. Section III introduces our approach to mining hybrid UML models. Section IV describes the tool implementation. Section V concludes the paper and gives directions for further research.

Preliminaries
$\mathcal{P}(X)$ is the powerset over some set $X$; $\Lambda$ is a set of all possible string labels.

Mining Hybrid UML Models
The authors of [25] define three levels of abstraction: operation, interaction and workflow. The levels are used when considering web service interaction. This motivated us to use different types of UML diagrams that demonstrate the features of these levels. In the following sections, we consider which UML diagrams suit each abstraction level and why.

Operation and workflow abstraction levels
The operation level of abstraction shows what is happening inside one isolated service. Activities outside the service are not considered at the operation level; the only process participants are services. Using a UML sequence diagram at this level leads to a large number of self-calls and "snowball models", which makes the diagram less readable and less understandable. A UML activity diagram suits this purpose better, since it allows us to show the complex relationships between operations inside a single participant. Figure 3 shows an example of a UML activity diagram for the service Card::OperationData.
A business process provided by services is represented at the workflow abstraction level. There are a lot of participants, so it is useful to use a UML sequence diagram for this level. The diagram is suitable for presenting not only the sequence of business process actions but also the participants of this process and their interaction. An example for event log L1 is depicted in Figure 1. Different abstraction levels need to be connected with each other. Our proposal is to use hybrid UML diagrams to represent and connect the operation and workflow abstraction levels together.

Interaction abstraction level
This level shows the interaction of one selected service or process with its nearest neighbors. For a given service, its nearest neighbors are its caller and callee services. Neither a UML sequence diagram nor an activity diagram fully suits this level. A UML sequence diagram has a time perspective onto which no relation can be mapped at this level, which leads to a "blind" diagram. An activity diagram does not support multiple participants, which is important for this abstraction level. We propose to use UML communication diagrams for depicting processes occurring in an SOA system at the interaction abstraction level. Examples of such diagrams for Card::Operations and Card::OperationData from the example event log are presented in Figures 5 and 6.
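As a minimal sketch of the interaction level, the following Python fragment selects the messages in which one chosen participant takes part and collects the lifelines they touch, i.e. the participant with its direct callers and callees. The tuple layout (sender, receiver, name), the function name, the participant Client and the message names GetOperations and GetData are assumptions made for illustration; they are not taken from the tool or from log L1.

```python
def communication_diagram(messages, selected):
    """Return (lifelines, messages) restricted to one selected participant.

    messages: iterable of (sender, receiver, name) tuples (assumed layout).
    """
    # Keep only messages that the selected participant sends or receives;
    # dict.fromkeys removes repeated messages while preserving their order.
    kept = list(dict.fromkeys(m for m in messages if selected in (m[0], m[1])))
    # Lifelines are the selected participant plus its direct callers and callees.
    lifelines = {selected}
    for sender, receiver, _ in kept:
        lifelines.update((sender, receiver))
    return lifelines, kept


# Hypothetical messages; only Card::Operations, Card::OperationData and
# GetCard appear in the example of the paper.
messages = [
    ("Client", "Card::Operations", "GetOperations"),
    ("Card::Operations", "Card::OperationData", "GetData"),
    ("Card::Operations", "Card::OperationData", "GetData"),
    ("Card::OperationData", "Card::OperationData", "GetCard"),
]
lifelines, kept = communication_diagram(messages, "Card::OperationData")
```

The message between Client and Card::Operations is dropped because it does not involve the selected participant, which is what distinguishes this view from the workflow-level sequence diagram.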


A UML sequence diagram is built from the workflow part of an event log using the method proposed in [9] (see Section III-D), extended by the ref fragments needed to connect it with the corresponding activity diagrams.
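One possible way to place such ref fragments can be sketched as follows: every maximal run of self-calls on one lifeline is replaced by a single ref entry that remembers the hidden operations, which an activity diagram would then model. This is only an illustration under assumed data structures (tagged tuples, a hypothetical Client participant and hypothetical message names GetOperations and GetData), not the implementation used in the tool.

```python
def hide_self_calls(messages):
    """Replace each maximal run of self-calls on a lifeline with one ref entry.

    messages: list of (sender, receiver, name) tuples (assumed layout).
    Returns tagged tuples: ("msg", sender, receiver, name) for regular
    messages and ("ref", lifeline, [hidden names]) for hidden self-calls.
    """
    result = []
    for sender, receiver, name in messages:
        if sender != receiver:
            result.append(("msg", sender, receiver, name))
        elif result and result[-1][0] == "ref" and result[-1][1] == sender:
            # Extend the ref fragment that is already open on this lifeline.
            result[-1][2].append(name)
        else:
            result.append(("ref", sender, [name]))
    return result


# Hypothetical message sequence; GetCardInfo and GetCard are the self-calls
# mentioned in the motivating example.
sequence = [
    ("Client", "Card::Operations", "GetOperations"),
    ("Card::Operations", "Card::Operations", "GetCardInfo"),
    ("Card::Operations", "Card::Operations", "GetCard"),
    ("Card::Operations", "Card::OperationData", "GetData"),
]
hybrid = hide_self_calls(sequence)
```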

Mining UML sequence diagrams
To mine a UML sequence diagram we use the method proposed in [9], which mines UML sequence diagrams at different levels of abstraction. It consists of three steps. The first step maps event log attributes onto UML sequence diagram components. There are two functions for mapping attributes onto lifelines and message parameters. The smaller the SOA element chosen for mapping onto lifelines, the lower the resulting abstraction level. The second step builds a smaller model by applying regular expressions to merge similar messages and lifelines on a diagram. For example, suppose we have two messages with the following parameters: GetPlaseAndDate, op=BP Billing Transfer and GetPlaseAndDate, op=Retail. They differ only in the op value; thus, these messages can be combined into one message with the following parameter: GetPlaseAndDate, op=.*. After the merging, the derived model becomes more generalized and its size decreases in width and height.
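The merging step can be illustrated with a short Python sketch that reproduces the GetPlaseAndDate example. The function name merge_messages and the choice to keep the first occurrence of each generalized label are assumptions of this sketch; the paper does not specify how the tool stores or orders messages.

```python
import re


def merge_messages(labels, pattern, replacement):
    """Generalize labels with a regular expression and drop duplicates.

    Every label has the matching part replaced; labels that become identical
    are merged, keeping the position of the first occurrence.
    """
    merged = []
    for label in labels:
        generalized = re.sub(pattern, replacement, label)
        if generalized not in merged:
            merged.append(generalized)
    return merged


labels = [
    "GetPlaseAndDate, op=BP Billing Transfer",
    "GetPlaseAndDate, op=Retail",
    "GetCardInfo",
]
# Replace the concrete op value with the wildcard used in the text.
merged = merge_messages(labels, r"op=.*$", "op=.*")
```

The two GetPlaseAndDate messages collapse into one, while GetCardInfo, which does not match the pattern, is left unchanged.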
To demonstrate the hierarchy of calls, which is important for SOA, a hierarchical diagram can be used. Thus, the third step of our approach presents a complex model by using hierarchical UML diagrams. The UML standard [7] allows us to divide the model into parts and connect them by means of interaction uses (ref fragments) and gates.

Tool Overview
This section presents a brief overview of the software tool implementing the proposed algorithm.

Event log
The tool requires an input event log in a specific format. We use simple CSV text files to represent event logs. An event log should contain a number of fields that are mapped onto mandatory attributes, namely CaseID, Timestamp and Activity.
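A minimal sketch of reading such a log in Python is given below. It assumes a header row with the three mandatory attribute names and ISO 8601 timestamps, which sort correctly as plain strings; the sample rows and the function name read_traces are hypothetical and serve only to show how events are grouped into traces.

```python
import csv
import io
from collections import defaultdict


def read_traces(csv_text):
    """Group events by CaseID and order each trace by Timestamp."""
    events_by_case = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        events_by_case[row["CaseID"]].append((row["Timestamp"], row["Activity"]))
    traces = {}
    for case_id, events in events_by_case.items():
        # sorted() is stable, so events with equal timestamps keep file order.
        ordered = sorted(events, key=lambda event: event[0])
        traces[case_id] = [activity for _, activity in ordered]
    return traces


# Hypothetical log fragment with only the mandatory attributes.
SAMPLE_LOG = """CaseID,Timestamp,Activity
1,2017-01-01T10:00:02,GetCardInfo
1,2017-01-01T10:00:01,GetCard
2,2017-01-01T10:00:03,GetCard
"""

traces = read_traces(SAMPLE_LOG)
```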

Tool implementation

Conclusion
This paper introduced a new concept of hybrid UML models and proposed a method of mining them from event logs of SOA information systems using a service mining approach. Our method can also be applied to other types of UML diagrams. The paper discussed approaches to mining diagrams at different abstraction levels.
Our method builds models using only event logs. This is an advantage over some reverse engineering techniques because the source code is not always available. The proposed method includes mining hybrid UML diagrams that represent the workflow abstraction level with UML sequence diagrams and the operation level with UML activity diagrams. Moreover, we proposed to build UML communication diagrams to show the interaction abstraction level with regard to the service mining approach. Generally, control structures in a system's behavior lead to a large number of nested combined fragments within a UML sequence diagram, which makes the diagram less readable and less understandable. Although UML activity diagrams, in contrast to sequence diagrams, have no time perspective, they show alternatives, loops and parallelism more clearly. Since there are also many event logs that are not produced by SOA systems, we are going to extend our approach to mining hybrid UML diagrams from event logs of broader types of software architecture in the future.

Fig. 1. Usual UML sequence diagram mined from event log L1.
Thus, we propose to hide the self-calls in the general model and to give a reference to another diagram instead. Note that the hidden calls are restricted to one lifeline only. So, using a UML sequence diagram here loses its meaning, since only one agent is involved. Therefore, it is convenient to model such behavior with UML activity diagrams, another type of UML diagram. Figures 2, 3 and 4 illustrate this idea and represent a hybrid UML diagram combining the best features of two different model types.

Fig. 2. UML sequence diagram with hidden self-calls. High-level diagram of a hybrid UML diagram.
Figures 2, 3 and 4 illustrate an example of a hybrid UML diagram. Figure 2 is a UML sequence diagram and represents a high-level diagram. It refers to UML activity diagrams (Figures 3 and 4) using ref fragments.
Definition 5. (UML Communication Diagram) A UML communication diagram is a tuple $U_C = (L, M)$, where:
- $L \subset \Lambda$ is a set of named lifelines which represent interaction participants.
- $M$ is a set of messages. $\forall m \in M : m = (l_1, l_2, n)$, where $l_1, l_2 \in L$, $n \in \Lambda$.
Figures 5 and 6 provide examples of UML communication diagrams for two different services. $\mathcal{U}_C$ is a set of all possible UML communication diagrams $U_C$.
Definition 6. (Hybrid UML Model) A hybrid UML model is a tuple $H_C = (U_H, CD)$, where:
- $U_H$ is a hybrid UML diagram.
- $CD \subset \mathcal{U}_C$.
Figures 2, 3, 4, 5 and 6 represent a hybrid UML model built for example event log L1.
In a hybrid diagram that brings the operation and workflow abstraction levels together, a UML sequence diagram represents a business process at the workflow abstraction level. The diagram contains special objects, ref fragments, which connect it to the corresponding UML activity diagrams. Every such activity diagram models the behavior of a single service. An example of such a hybrid diagram is presented in Figures 2, 3 and 4.
Algorithm 1. Building a hybrid UML model

Algorithm 2.
Davydova K.V., Shershakov S.A. Mining Hybrid UML Models from Event Logs of SOA Systems. Trudy ISP RAN/Proc. ISP RAS, vol. 29, issue 4, 2017, pp. 155-174.
Figure 7 represents a workflow diagram of the hybrid mining process. The scheme contains the following tasks (see Algorithm 1):
- An event log is split into several parts. The workflow part of the log refers to communication between services. Such communication is represented on a UML sequence diagram at the workflow level. The operation parts consist of events that refer to activity only inside a specific service.
- UML activity diagrams are built from the operation parts of the log independently using one of the process mining algorithms that produce a Petri net. For instance, the α-algorithm [4] or the inductive miner [27] can be considered here. Then, the Petri nets are converted into activity diagrams by a simple translation routine. This conversion is rather trivial since UML activity diagrams are initially based on Petri nets [7], [23].
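The splitting task can be sketched in Python under one simplifying assumption: every event records which participant issued the call and which participant received it, so that an event with identical caller and callee is treated as activity inside a single service. The field names caller, callee and activity, the function name split_log and the message name GetData are hypothetical; the actual attributes of log L1 may differ.

```python
from collections import defaultdict


def split_log(events):
    """Split events into a workflow part and per-participant operation parts."""
    workflow = []
    operation = defaultdict(list)
    for event in events:
        if event["caller"] == event["callee"]:
            # Activity inside one participant goes to its operation part.
            operation[event["callee"]].append(event["activity"])
        else:
            # Communication between participants belongs to the workflow part.
            workflow.append((event["caller"], event["callee"], event["activity"]))
    return workflow, dict(operation)


# Hypothetical events; the field names are assumptions of this sketch.
events = [
    {"caller": "Card::Operations", "callee": "Card::OperationData", "activity": "GetData"},
    {"caller": "Card::OperationData", "callee": "Card::OperationData", "activity": "GetCardInfo"},
    {"caller": "Card::OperationData", "callee": "Card::OperationData", "activity": "GetCard"},
]
workflow, operation = split_log(events)
```

Each list in the operation part could then be passed to a process discovery algorithm, while the workflow part feeds the sequence diagram.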

Fig. 7. The workflow diagram of a hybrid mining process.
The tool is implemented as a Windows application written in the C# programming language. The tool allows users to configure the main parameters, such as regular expressions, hierarchy and the type of output diagram (regular UML, hierarchical or hybrid). Regular expressions are applied for merging diagram components. This is implemented as shown in Figure 8. The GUI allows the user to set the type of diagram. The perspective of the diagram (a mapping of attributes onto diagram lifelines and messages) is set as described in [9]. The output of the tool is an XMI file containing a model and a description of diagrams. It can be visualized by Sparx Enterprise Architect [15].

Fig. 8. GUI to set a type of the diagram and regular expressions for merging its components.

Table 1. Log fragment L1. Banking SOA-system
Each row of the table represents a single event. Columns represent attributes of the log. Events are grouped into cases (by the CaseID attribute); cases are then represented in the log by traces. Events are ordered by the Timestamp attribute. Different components of SOA are represented by other attributes such as Domain, Service/Process and …