SDG: a (Model-Based) Tool for Generating Representative and Valid Synthetic Test Data

Goal

Many testing activities, such as usage-based statistical testing (also known as reliability testing), require synthetic test data that can be used to build confidence in the reliability of the system under test. In particular, such data has to be structurally and logically well-formed, so that it is not discarded by the early sanity checks of the system under test. Further, the data should reflect, as closely as possible, the actual or anticipated system usage, to help mimic how the system would behave under realistic circumstances. Generating such data is not a trivial task, as the underlying data schemas are usually large, complex, and subject to numerous domain-related logical constraints. The ultimate goal of the SDG (Synthetic Data Generator) tool is to generate such synthetic data automatically.


How to use SDG?

Using SDG involves the four steps shown in the figure above.

  • In Step 1, Define data schema, one defines the schema of the data to generate using a UML Class Diagram (CD). A CD provides a precise representation of the classes, features, and relationships among classes in a given domain. This diagram is the basis for: (a) capturing the desired statistical characteristics of the data (Step 2), and (b) generating synthetic data (Step 4).

  • In Step 2, Define statistical characteristics, one has to enrich the CD from Step 1 with probabilistic annotations to express the representativeness requirements that should be met during data generation in Step 4. Further information about this step can be found at [link].

  • In Step 3, Define data validity constraints, one has to express via the Object Constraint Language (OCL) the logical constraints that the generated data must satisfy. Further information about the OCL language is available at [link].

  • In Step 4, SDG, one invokes SDG to generate a data sample (test suite). SDG tries to meet both the statistical representativeness requirements specified in Step 2 and the logical validity requirements specified in Step 3. The output of SDG is a collection of instance models, i.e., instantiations of the underlying data schema. Each instance model characterizes one test case for statistical testing.
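
To make the inputs of Steps 1 to 3 and the output of Step 4 more concrete, the following Python sketch mimics them for a toy schema. It is not SDG's input format: SDG takes a UML class diagram, probabilistic annotations, and OCL constraints. All names (`TaxPayer`, `RETIRED_DISTRIBUTION`, `is_valid`) and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


# Step 1 (schema): one class with two attributes, standing in for a UML class diagram.
@dataclass(frozen=True)
class TaxPayer:
    age: int
    is_retired: bool


# Step 2 (statistical characteristics): desired relative frequencies for `is_retired`.
# Hypothetical values; in SDG these would be probabilistic annotations on the diagram.
RETIRED_DISTRIBUTION = {True: 0.2, False: 0.8}


# Step 3 (validity constraint): stands in for an illustrative OCL invariant such as
#   context TaxPayer inv: self.is_retired implies self.age >= 60
def is_valid(tax_payer: TaxPayer) -> bool:
    return (not tax_payer.is_retired) or tax_payer.age >= 60


# Step 4 (output): a data sample is a collection of instance models,
# each of which plays the role of one test case.
sample = [TaxPayer(age=67, is_retired=True), TaxPayer(age=35, is_retired=False)]
all_valid = all(is_valid(tax_payer) for tax_payer in sample)
```

In this picture, representativeness is a property of the whole sample (how often `is_retired` is true across it), whereas validity is checked per instance model.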

How does SDG work?


As shown in the figure above, SDG first creates a potentially invalid collection of instance models using our previous data generation approach (more information about this heuristic data generator is available at [link]). We call this initial collection the seed sample. SDG then transforms the seed sample into a collection of valid instance models. This is achieved using a customized OCL constraint solver (the baseline version of the solver is presented in [link]).
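
The notion of a seed sample that is statistically plausible but potentially invalid can be illustrated with a short Python sketch. It assumes, purely for illustration, that each attribute is drawn independently from its annotated distribution without consulting the validity constraints; the actual heuristic generator is described in the linked work and may differ. The distributions, attribute names, and the constraint are hypothetical.

```python
import random

# Hypothetical probabilistic annotations for two attributes.
AGE_BAND = {"under_60": 0.8, "60_plus": 0.2}
RETIRED = {True: 0.2, False: 0.8}


def draw(distribution, rng):
    """Draw one value according to the given relative frequencies."""
    values = list(distribution)
    weights = [distribution[value] for value in values]
    return rng.choices(values, weights=weights, k=1)[0]


def is_valid(instance):
    """Hypothetical validity constraint: retired implies age band 60_plus."""
    return (not instance["retired"]) or instance["age_band"] == "60_plus"


def seed_sample(size, rng):
    """Sample each attribute independently, ignoring validity constraints."""
    return [
        {"age_band": draw(AGE_BAND, rng), "retired": draw(RETIRED, rng)}
        for _ in range(size)
    ]


rng = random.Random(0)
seed = seed_sample(200, rng)
invalid = [instance for instance in seed if not is_valid(instance)]
print(f"{len(invalid)} of {len(seed)} seed instance models violate the constraint")
```

Because the attributes are drawn independently, some instance models combine `retired = True` with `under_60`; these are the ones a solver would subsequently have to repair.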

The solver attempts to repair the invalid instance models in the seed sample. To do so, the solver considers the user-defined OCL constraints alongside the multiplicity constraints of the underlying data schema and the constraints implied by the probabilistic annotations. The rationale for feeding the solver with instance models from the seed sample, rather than having the solver build instance models from scratch, rests on two intuitions: (1) by starting from the seed sample, the solver is more likely to reach valid instance models, and (2) the valid sample built by the solver will not end up too far from being representative, which in turn makes it easier to fix deviations from representativeness, as discussed next.
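
The idea of repairing a seed instance model, rather than building one from scratch, can be sketched as follows. This is a toy stand-in and not SDG's OCL solver: it enumerates a tiny, hypothetical attribute domain and returns the valid instance that differs from the seed in the fewest attributes. It covers only a single user-defined constraint and ignores multiplicities and the constraints implied by probabilistic annotations.

```python
from itertools import product

# Hypothetical finite domains for the attributes of one instance model.
DOMAINS = {"age_band": ["under_60", "60_plus"], "retired": [False, True]}


def is_valid(instance):
    """Hypothetical validity constraint: retired implies age band 60_plus."""
    return (not instance["retired"]) or instance["age_band"] == "60_plus"


def changed_attributes(candidate, original):
    """Number of attributes on which two instance models differ."""
    return sum(candidate[key] != original[key] for key in original)


def repair(instance):
    """Return a valid instance model as close as possible to `instance`."""
    if is_valid(instance):
        return dict(instance)
    keys = list(DOMAINS)
    candidates = (
        dict(zip(keys, values))
        for values in product(*(DOMAINS[key] for key in keys))
    )
    valid_candidates = [c for c in candidates if is_valid(c)]
    # Ties between equally close candidates are broken by enumeration order.
    return min(valid_candidates, key=lambda c: changed_attributes(c, instance))


broken = {"age_band": "under_60", "retired": True}
fixed = repair(broken)
```

Staying close to the seed is what keeps the repaired sample near the statistical profile that the seed was drawn from; exhaustive enumeration is only workable here because the toy domain has four combinations.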

Then, SDG attempts to realign the sample with the desired statistical characteristics. This is done through an iterative process, delineated in the figure with a dashed boundary. Briefly, the process goes sequentially through the instance models in the valid sample and subjects them to additional constraints that are generated on the fly. These additional constraints, which we call corrective constraints, provide cues to the solver as to how it should tweak an instance model so that the statistical representativeness of the whole data sample is improved. Concrete examples of corrective constraints (written in OCL) can be found at [link].

If the solver fails to come up with a tweaked instance model that satisfies both the corrective constraint and all the validity constraints at the same time, the original instance model is retained in the sample. Otherwise, SDG decides whether it is advantageous, from a representativeness viewpoint, to replace the original instance model with the tweaked one. Such a decision is needed because SDG cannot readily tell whether the tweaked instance model is a better fit for representativeness.
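
The realignment loop described above can be sketched in Python under several stated assumptions. The corrective constraint is modeled as "set `retired` to the most under-represented value", the solver is replaced by a trivial `tweak` function, and the replacement decision compares a simple total-deviation measure before and after the tweak. None of these choices is taken from SDG itself; the names, the target distribution, and the distance measure are hypothetical.

```python
from collections import Counter

# Hypothetical desired relative frequencies for the `retired` attribute.
DESIRED = {True: 0.2, False: 0.8}


def is_valid(instance):
    """Hypothetical validity constraint: retired implies age >= 60."""
    return (not instance["retired"]) or instance["age"] >= 60


def distance(sample):
    """Total absolute deviation between observed and desired frequencies."""
    counts = Counter(instance["retired"] for instance in sample)
    return sum(abs(counts[v] / len(sample) - p) for v, p in DESIRED.items())


def tweak(instance, target):
    """Toy stand-in for solving the validity constraints together with the
    corrective constraint `retired = target`; returns None on failure."""
    candidate = dict(instance, retired=target)
    return candidate if is_valid(candidate) else None


def realign(valid_sample):
    sample = [dict(instance) for instance in valid_sample]
    for index in range(len(sample)):
        counts = Counter(instance["retired"] for instance in sample)
        # Corrective constraint: aim for the most under-represented value.
        target = max(DESIRED, key=lambda v: DESIRED[v] - counts[v] / len(sample))
        if sample[index]["retired"] == target:
            continue
        candidate = tweak(sample[index], target)
        if candidate is None:
            continue  # no valid tweak exists: retain the original instance model
        trial = sample[:index] + [candidate] + sample[index + 1:]
        if distance(trial) < distance(sample):
            sample = trial  # replace only if representativeness improves
    return sample


start = [{"age": 30, "retired": False}] * 5 + [{"age": 65, "retired": False}] * 5
result = realign(start)
```

In this example the five younger instance models cannot be tweaked without violating validity and are retained, while tweaks to the older ones are accepted only as long as they bring the sample closer to the target frequencies.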


System Requirements

  • Eclipse IDE (Mars or higher) [link].

  • Java Development Kit (JDK) 1.8.0 (or higher) [link].

  • Note that all the other required third-party libraries are included in the installation package.

  • We also recommend using the Papyrus modeling environment for building and managing models [link].

Demonstration Material

  • Profile for expressing the statistical characteristics of the test data [link].

  • Example of a domain model annotated with statistical information (TaxCard) [link].

  • OCL constraints expressing the logical validity of the data [link].

  • Example of a valid and representative test data sample generated using SDG [link].


Installation Material for SDG

  • The SDG tool can be found [here].

  • Installation and usage instructions can be found [here].


Relevant Publications

  • G. Soltana, M. Sabetzadeh, and L. C. Briand, "Synthetic Data Generation for Statistical Testing", in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2017), Illinois, USA, October 30 - November 3, 2017.


Contact Information

Ghanem Soltana
Interdisciplinary Centre for Security, Reliability and Trust
29, Avenue John Fitzgerald Kennedy
L-1855, Luxembourg
E-mail: ghanem(dot)soltana(at)uni(dot)lu