Automated Type Contracts Generation in Ruby

The elegant syntax of the Ruby language comes at a price when it comes to finding bugs in large codebases. Static analysis is hindered by specific capabilities of Ruby, such as defining methods dynamically and evaluating string expressions. Even in dynamically typed languages, type information is very useful: it improves type safety and makes it possible to check more reliably whether the called method is defined for the object and whether arguments of the correct types are passed to it. One may annotate the code with YARD (a Ruby documentation tool) to declare the input and output types of methods or even to declare methods that are added dynamically. These annotations improve the capabilities of tooling such as code completion. This paper reports a new approach to the generation of type annotations. We trace direct method calls while the program is running, evaluate the types of input and output variables, and use this information to derive implicit type contracts. Each method or function is associated with a finite-state automaton consisting of all variants of typed signatures for this method. An effective compression technique is applied to the automaton to reduce the cost of storage and to allow the collected information to be displayed in a human-readable form. The exhaustiveness of the contract defined by the generated automaton depends on the diversity of the traced method usages. Therefore, it is also important to be able to merge all the automata received from users into one, which is further covered in this paper.


Introduction
Developers suffer from time-consuming investigations when trying to understand why a particular piece of code does not work as expected. The dynamic nature of Ruby allows for great possibilities, but this has a drawback: the codebase as a whole becomes entangled, and investigations become more difficult than in statically typed languages like Java or C++ [1]. Another downside of its dynamic features is a drastic reduction in static analysis performance, because some symbols cannot be resolved reliably. Consider dynamic method creation, which is often done with a define_method call. Names and bodies of dynamically created methods may be calculated at runtime [2]. For example, such code can dynamically add active?, inactive? and pending? methods to the User class.
One possible workaround for obtaining type information about such difficult-to-analyze syntactic constructions is to use code documentation tools such as RDoc or YARD. The @!method annotation defines a method object with a given signature. The @param and @return annotations may help to define the actual types, but they have several drawbacks too:
• The type system used for documenting attributes, parameters and return values is fairly expressive; however, it is not clear how to define relations between types. For example, the []= operator for arrays accepts a value of any type as its second argument and usually returns a value of that same type, so in YARD this looks like @param value [Object], @return [Object]. This is not really helpful: all classes in Ruby inherit from Object, so such an annotation gives no additional information about the method.
• From a usability perspective, such documentation to some extent contradicts the aim of Ruby to be as short, natural and expressive as possible.
The proposed approach is inspired by the way people tackle this problem manually: one may run or debug the program to inspect the needed information about the code under investigation. This suggests that collecting the direct input and output types of all method dispatches during program execution, followed by postprocessing and structuring of this data, may be considered a way to automate manual investigations. The result is a set of implicit type annotations. As the process is automated, one can retrieve a lot of information about the executed code in the whole project. Since the quality of the result depends heavily on the code coverage of the programs run during data collection, it is important to be able to merge the annotations built for the same methods called from different places, projects and even users. These annotations could also be stored in a public database to be shared and reused by different users in order to maximize the coverage of the analyzed code and hence the quality of the generated contracts.
Two main contract generation stages can be distinguished:
• During the first stage, the information about called methods and their input and output types is collected throughout the script execution. It is very important to collect the necessary information as quickly as possible, so that users do not wait for script completion much longer than during regular execution. To achieve this, we implement a native extension which receives all the necessary information directly from the internal stack of the virtual machine instead of using the standard API provided by the language. This stage is described in Section 3.
• During the second stage, the data obtained in the first stage is structured, reduced to a finite-state automaton and prepared for further use in code insight. This storage scheme makes it possible to quickly obtain a regular expression that is easily understood by a human. This stage is described in Section 4.
The generated implicit annotations can be built into static analysis tools [3] to improve existing checks and code completion suggestions and to provide additional ones. This usage is described in Section 5.
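The dynamic method creation discussed in this introduction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' listing: the STATUSES constant and the status attribute are hypothetical, and the YARD @!method directive shows how one of the generated methods could be declared for documentation tools.

```ruby
class User
  STATUSES = %w[active inactive pending].freeze # hypothetical list of states

  attr_reader :status

  def initialize(status)
    @status = status
  end

  # @!method active?
  #   @return [Boolean] whether the user's status is "active"
  STATUSES.each do |name|
    # The method name is computed at run time, so it never appears
    # literally in the source code.
    define_method("#{name}?") { status == name }
  end
end

user = User.new("pending")
user.pending? # => true
user.active?  # => false
```

A static analyzer that does not evaluate the loop sees no definition of pending?, which is why either a manual annotation or run-time tracing is needed.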

Related works
In Static Analysis of Dynamic Languages [7], static analysis techniques for dynamically and statically typed languages are compared. The author notes that the attributes of dynamically typed languages, such as flexibility and expressiveness, limit the availability of tool support for those languages. The paper addresses the main problems of analyzing code written in a language with dynamic typing: in particular, the construction of developer tools is difficult due to the lack of static type systems, and therefore many bugs are not discovered until run-time. The use of static analysis, and in particular whole-program dataflow analysis, allows static reasoning about programs written in these languages without changing their nature or imposing unrealistic restrictions on the programmers. In addition, the article mentions a technique called Use Analysis: "Use Analysis: A heuristic for recovering missing dataflow facts, due to missing library code, by observing how applications objects are used in the application code." The approach described in this article is an example of such a heuristic.
N.Y. Viuginov, V.S. Fondaratov. Automated Type Contracts Generation in Ruby. Trudy ISP RAN/Proc. ISP RAS, vol. 29, issue 4, 2017, pp. 7-20.
For Ruby, as for most dynamically typed languages, there are tools for source code analysis, but they are not capable of statically identifying all errors associated with type mismatches. Here are some of them:
• Rubocop [4]: a Ruby static code analyzer based on the community-driven Ruby style guide; it does not allow actual error detection.
• Ruby-lint: a tool for detecting syntax errors, such as undeclared variables, an invalid argument set for calling a method, or unreachable sections of code.
• Diamondback Ruby [5]: an extension to Ruby that aims to bring the benefits of static typing to Ruby. However, at the moment it cannot analyze even the standard Ruby library.

Collecting information about method calls
3.1 Calls structure
TracePoint is an API that allows hooking into several Ruby VM events, such as method calls and returns, and obtaining any data through Binding, an object which encapsulates the execution context (variables, methods) and retains this context for future use. Consider a simple Ruby method declaration and handlers set for the :call and :return events.
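A minimal sketch of such handlers, using only the standard TracePoint and Binding API. The Greeter class, its greet method and the events array are hypothetical names introduced for illustration; the sketch records the classes of the arguments on :call and the class of the result on :return.

```ruby
# Hypothetical class under observation.
class Greeter
  def greet(name, times)
    "Hello, #{name}! " * times
  end
end

events = []

tracer = TracePoint.new(:call, :return) do |tp|
  next unless tp.method_id == :greet

  if tp.event == :call
    # Look up each declared parameter in the Binding and record its class.
    arg_types = tp.self.method(tp.method_id).parameters.map do |_kind, name|
      tp.binding.local_variable_get(name).class
    end
    events << [:call, tp.method_id, arg_types]
  else
    events << [:return, tp.method_id, tp.return_value.class]
  end
end

# The block form enables tracing only while the block runs.
tracer.enable { Greeter.new.greet("Ruby", 2) }

events.each { |event| p event }
```

Each traced call therefore yields a tuple of argument types plus a return type, which is the raw material for a contract.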

Unspecified arguments
Code analysis often handles direct method calls, so in order to calculate the return type it is important to distinguish which arguments were directly passed to the method by the user, and which were assigned the default values.
Suppose that the expression a, b, c = foo, foo('1'), foo(1) occurs during code analysis and that two contracts have been generated for foo. If the method cannot be statically analyzed, then we cannot select a contract to apply to the method call without arguments.
Note that default values are assigned to unspecified optional arguments before the :call event is triggered. Therefore, with the standard API, it is impossible to determine which arguments were passed to the method and which were not. This poses a problem because it makes detection of the default value types impossible and therefore prevents the calculation of the expected return type of calls with any optional parameters unspecified. However, one can build a native extension for the Ruby VM [2] and get this information from an internal stack. Consider a simple Ruby method with an optional parameter and the corresponding bytecode.
def foo(a, b=42, kw1: 1, kw2 ...
Instruction number 0020, which calls the method foo, carries information about the number of passed arguments and the list of passed named arguments. Now we need to find the bytecode instruction for the current method dispatch. To do so, it is necessary to find the caller's control frame and the last executed instruction in this frame. This instruction corresponds to the call of the method that we are interested in. The big disadvantage of the TracePoint and Binding approach is that calculating the full execution context is a time-consuming operation, whereas later we will need only a small part of it, namely the types of arguments and the types and names of method parameters. Creating a native extension for the Ruby VM, which receives information about the method name directly from the YARV instruction list (Fig. 1), will help us to receive information about argument types directly from the internal stack.
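The call-site information mentioned above can be inspected from Ruby itself. The sketch below assumes MRI (CRuby), where RubyVM::InstructionSequence is available, and a hypothetical call foo(1, kw1: 2). It compiles the call and keeps the instruction that dispatches foo, whose operand records the argument count and the passed keywords. The native extension described in the text reads the same data from the caller's control frame, which this sketch does not attempt.

```ruby
# RubyVM::InstructionSequence is specific to MRI (CRuby); the exact text of
# the listing differs between Ruby versions.
disasm = RubyVM::InstructionSequence.compile("foo(1, kw1: 2)").disasm

# Keep only the instruction(s) that dispatch the method foo.
send_lines = disasm.lines.grep(/mid:foo/)
puts send_lines
```

On the MRI versions we are aware of, the printed operand contains an argc field and a KWARG flag, which is enough to tell passed arguments from defaulted ones.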

Transforming raw call data into contracts
The large amount of raw data received from the Ruby process must be processed and structured so that it can be easily used and understood. In our approach, each traced method is associated with a finite-state automaton. This storage structure makes it possible to quickly add a raw type tuple obtained from the Ruby process. It can also be easily reduced to a human-readable regular expression. Each automaton has a single starting vertex, from which reading of a signature begins, and a single terminal vertex, which all edges corresponding to return types enter. Words obtained by concatenating argument type tuples with the corresponding output types are added to the automaton one after another.

Algorithm 1. Adding a tuple to the automaton
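A minimal sketch of how a tuple might be added, assuming a hypothetical representation in which every vertex is a Hash from a type symbol (a Ruby class) to the next vertex. The ContractAutomaton class and its method names are illustrative and are not taken from the authors' implementation.

```ruby
class ContractAutomaton
  attr_reader :start, :terminal

  def initialize
    @start = {}
    @terminal = {} # single terminal vertex; every return-type edge enters it
  end

  # Adds the word formed by the argument types followed by the return type.
  def add(arg_types, return_type)
    vertex = @start
    arg_types.each { |type| vertex = (vertex[type] ||= {}) }
    vertex[return_type] = @terminal
    self
  end

  # Checks whether the typed signature has already been recorded.
  def accepts?(arg_types, return_type)
    vertex = arg_types.reduce(@start) { |v, type| v && v[type] }
    !vertex.nil? && vertex[return_type].equal?(@terminal)
  end
end

automaton = ContractAutomaton.new
automaton.add([String, Integer], String)
automaton.add([String, Float], String)

automaton.accepts?([String, Float], String)    # => true
automaton.accepts?([Integer, Integer], String) # => false
```

Signatures that share a prefix share vertices, so adding a tuple costs time proportional to its length.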
Then the minimization algorithm [7] is applied to the automaton, slightly modified for automata of this type (Alg. 2). Note that all the tuples added to the automaton have the same length, so the resulting automaton has a layered structure based on the distance from the starting vertex: all the edges leaving the vertices of the i-th layer go to vertices of the (i+1)-th layer. Consequently, after a signature is added to a minimized automaton, each added vertex can be merged only with a vertex of its own layer (Fig. 3).
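The layered structure suggests a simple merging pass, sketched below. This is a batch variant written under our own assumptions (the Vertex class and the build, minimize and count_vertices helpers are hypothetical), not the incremental algorithm the paper refers to as Alg. 2. Two vertices of one layer are merged when their outgoing edges carry the same symbols to the same successors.

```ruby
# Hypothetical vertex: outgoing edges keyed by type symbol (a Ruby class).
class Vertex
  attr_reader :edges

  def initialize
    @edges = {}
  end
end

# Builds an unminimized, tree-shaped automaton from words of equal length.
def build(words)
  start = Vertex.new
  terminal = Vertex.new
  words.each do |word|
    *prefix, last = word
    vertex = prefix.reduce(start) { |v, symbol| v.edges[symbol] ||= Vertex.new }
    vertex.edges[last] = terminal
  end
  start
end

# Merges equivalent vertices layer by layer, from the terminal side backwards.
def minimize(start, height)
  # layers[i] holds the vertices at distance i from the starting vertex.
  layers = [[start]]
  height.times { layers << layers.last.flat_map { |v| v.edges.values }.uniq }

  (height - 1).downto(1) do |i|
    canonical = {}
    replacement = {}
    layers[i].each do |v|
      # Successors are already merged, so identity comparison is sufficient.
      signature = v.edges.transform_values(&:object_id)
      replacement[v] = (canonical[signature] ||= v)
    end
    layers[i - 1].each do |parent|
      parent.edges.transform_values! { |child| replacement[child] }
    end
  end
  start
end

def count_vertices(start)
  seen = {}.compare_by_identity
  stack = [start]
  until stack.empty?
    v = stack.pop
    next if seen[v]
    seen[v] = true
    stack.concat(v.edges.values)
  end
  seen.size
end

start = build([[String, Integer, String], [Symbol, Integer, String]])
before = count_vertices(start)              # => 6
after = count_vertices(minimize(start, 3))  # => 4
```

In the example, the two branches differ only in their first symbol, so after minimization both first-layer edges lead to the same vertex.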

Fig. 3. Joining vertices
Quite often the types of two or more arguments of a method always coincide, or the type of the result coincides with the type of one of the arguments. Consider the method equals as an example.
When adding the next transition from a vertex to the automaton, we compare the symbol of the transition to be added with all the previous symbols of the current tuple. If there is at least one match, an edge with a bit mask is added instead of a regular edge with a type symbol. The length of this mask equals the ordinal number of the current type within the tuple decreased by 1. The i-th bit is 1 if and only if the i-th type in the tuple equals the type being added (Fig. 4).

Fig. 4. Automaton with counted bit masks
When a signature is read, each successive type is compared with the preceding types of the signature, and if a nonzero mask is obtained, the transition labelled with that mask is followed.
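The mask computation can be sketched as follows. The encode helper is hypothetical and represents a mask as a string of 0 and 1 characters, one per earlier position in the tuple; the actual encoding used by the authors is not specified in the text.

```ruby
# Returns the edge labels for one signature: a type symbol, or a bit mask
# when the type at this position repeats an earlier type in the tuple.
def encode(types)
  types.each_with_index.map do |type, index|
    mask = types.first(index).map { |earlier| earlier == type ? "1" : "0" }.join
    mask.include?("1") ? mask : type
  end
end

encode([String, String, TrueClass]) # => [String, "1", TrueClass]
encode([Integer, String, Integer])  # => [Integer, String, "10"]
```

With this encoding, a call such as equals(a, b) with two arguments of the same type takes the same mask edge "1" at the second position, whatever that type is.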
Algorithm 1'. Adding a tuple to the automaton with masks
Automata received from different users need to be merged. The following algorithm is used for this:

Algorithm 3. Merging automata
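One simple way to merge two automata for the same method is a recursive union of their edges, sketched below. It assumes a hypothetical tree-shaped representation (nested Hashes keyed by type symbol, with the Symbol :terminal marking the terminal vertex) and signatures of equal length. It is an illustration of the idea rather than the authors' Algorithm 3, and the result would still need to be minimized afterwards.

```ruby
# Recursively unites the edges of two vertices; shared symbols are merged
# further down, and symbols present in only one automaton are kept as they are.
def merge(left, right)
  return :terminal if left == :terminal # equal lengths: right is terminal too
  left.merge(right) { |_symbol, l, r| merge(l, r) }
end

# Lists every signature stored in the automaton.
def words(vertex, prefix = [])
  return [prefix] if vertex == :terminal
  vertex.flat_map { |symbol, child| words(child, prefix + [symbol]) }
end

first_user = { String => { Integer => { String => :terminal } } }
second_user = {
  String => { Float => { String => :terminal } },
  Symbol => { Integer => { Symbol => :terminal } }
}

merged = merge(first_user, second_user)
words(merged).each { |word| p word }
```

The merged automaton accepts every signature observed by either user, which is what makes sharing contracts between users worthwhile.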
In Ruby, duck typing [8] is used quite heavily. As a consequence, variables of various types that implement a given set of methods can be passed as arguments to a method. Hence, many parallel edges corresponding to these classes appear in the automaton. These parallel edges can be replaced by one edge containing information about the interface that all these classes satisfy. Then, to follow this edge, the next type from the signature must implement this interface. If this common interface is empty, it is enough to label the edge with the type Object, since it is the parent class for all objects.
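As a rough sketch, a common interface can be approximated by intersecting the public instance methods of the classes seen on the parallel edges. The common_interface helper is hypothetical, and it assumes the classes are loaded in the analyzing process, which is not guaranteed for traced data.

```ruby
# Methods shared by all given classes, ignoring those every Object has.
def common_interface(classes)
  classes
    .map { |klass| klass.public_instance_methods - Object.public_instance_methods }
    .reduce(:&)
end

common_interface([String, Array, Hash]).include?(:size) # => true
common_interface([String, Array]).include?(:each)       # => false
```

Subtracting the methods of Object keeps the result empty for unrelated classes, which corresponds to labelling the edge with Object.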

Using contracts in static analysis algorithms
The contract is used to calculate the type returned when the method is called with a certain set of arguments. It is worth noting that the types of arguments are not always uniquely defined: sometimes there is a set of types to which a variable may belong. To calculate the type returned by the method, it is necessary to traverse the edges of the automaton successively, calculating the set of vertices reachable by reading some sequence of types. The types of unspecified optional arguments are represented by a special non-alphabetic character, so that the length of a tuple is lower than the automaton height by 1.

Algorithm 4. Output type calculation
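The traversal can be sketched as follows, assuming a hypothetical contract stored as nested Hashes (type symbol to next vertex, with the last layer mapping return types to :terminal). Every argument is given as a set of candidate types, and the function tracks the set of reachable vertices. The special symbol for unspecified optional arguments and the bit masks are not modelled here.

```ruby
# Hypothetical contract for a two-argument method.
contract = {
  String  => { Integer => { String => :terminal } },
  Integer => { Integer => { Integer => :terminal },
               Float   => { Float => :terminal } }
}

# Follows every edge whose symbol is among the candidate types of the current
# argument, then collects the return types leaving the reachable vertices.
def return_types(contract, argument_type_sets)
  vertices = argument_type_sets.reduce([contract]) do |reachable, candidates|
    reachable.flat_map { |vertex| vertex.values_at(*candidates).compact }
  end
  vertices.flat_map(&:keys).uniq
end

return_types(contract, [[Integer], [Integer, Float]]) # => [Integer, Float]
return_types(contract, [[String, Integer], [Integer]]) # => [String, Integer]
return_types(contract, [[Symbol], [Integer]])          # => []
```

An empty result means the contract has no signature matching the call, so the analyzer has to fall back to its other sources of type information.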
The generated contracts complement the type inference system because they make it possible to calculate the types returned from methods which were not successfully analyzed using standard tools. This expands the class of variables for which a type can be computed statically. The information collected for the methods makes it possible to significantly accelerate the existing control flow analysis, because methods for which a sufficiently representative contract has been generated do not require additional analysis. Contracts also extend the applicability of some of the features that are supported in most modern IDEs. The features considered here apply to method calls for which it was possible to determine the class of the receiver and for which this class has a contract corresponding to the method with that name and parameter configuration. Features in which contracts are applied:
• Go To Declaration/Find Usages. Information about method declarations is collected at execution time. This information can be used for navigation from a method call to its declaration and vice versa.
• Autocompletion. The list of methods implemented for an object can be supplemented with methods for which a contract was found.
• 'Incorrect method arguments' inspection. Information about the method parameters can be used to detect incorrect calls.

Conclusion
The paper describes an approach to the generation of implicit type contracts. This approach provides type signatures of methods that cannot be obtained by static analysis of the source code, provided that it is possible to determine in which library the method was declared and to resolve the method receiver. The approach is useful for analyzing programs which heavily utilize dynamic features such as dynamic method creation, or which contain complex syntactic constructions in method implementations. In addition, this approach can be applied to other languages with dynamic typing, such as Python or JavaScript. Several problems remain unsolved, such as duck typing and handling an ambiguously resolved argument type in static analysis. The problem with duck typing is that, during the execution of the program, it is impossible to save all the methods implemented for an object. Therefore, it is difficult to find the largest common interface for a group of classes.
The problem with arguments whose types are ambiguous according to static analysis is that they cannot be read by the automaton.