Best paper award for simplifying data transformation
PhD student Jie Song received a Best Paper Award from the 2021 International Conference on Scientific and Statistical Database Management for her paper “SDTA: An Algebra for Statistical Data Transformation,” co-authored with Profs. George Alter (with the U-M Inter-university Consortium for Political and Social Research) and H. V. Jagadish.
Statistical data manipulation is a crucial part of many data science analytic pipelines, and is usually accomplished by writing transformation scripts in languages like SAS, R, or Python. The many different data models, language representations, and transformation operations supported by these tools make it hard for users to understand and document the data transformations being performed, and hard for developers to rewrite transformation code in different languages.
In their paper, the researchers proposed a set of formal tools to make statistical data transformation easier to document and understand. The set consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformation frameworks; an algebra, Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL). In experiments with real statistical transformations on socio-economic data, the group showed that SDTL can successfully represent 86.1% of 4,185 SAS commands and 91.6% of 9,087 SPSS commands obtained from a repository.
The team illustrated how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect that’s often neglected in the metadata of datasets. They proposed a system called C2Metadata that automatically captures data transformation and provenance information in SDTL as part of the metadata. They additionally demonstrated how transformation programs can be converted to other languages, and the possibility of using SDTA to optimize SDTL transformations.
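To make the idea of a language-neutral transformation description more concrete, here is a minimal Python sketch. It is not the real SDTL schema or any part of C2Metadata: the `RenameVariable` record, its field names, and the rendering functions are all hypothetical and chosen only for illustration. The sketch shows how a single recorded step could, in principle, yield both a readable provenance note and equivalent statements in several languages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenameVariable:
    """Hypothetical language-neutral record of one transformation step.

    The field names are illustrative assumptions, not the SDTL schema.
    """
    old: str  # name of the variable before the step
    new: str  # name of the variable after the step


def describe(step: RenameVariable) -> str:
    """Return a human-readable note that could be stored with dataset metadata."""
    return f"Variable '{step.old}' was renamed to '{step.new}'."


def to_pandas(step: RenameVariable, frame: str = "df") -> str:
    """Render the step as a pandas statement (as source text, not executed)."""
    return f'{frame} = {frame}.rename(columns={{"{step.old}": "{step.new}"}})'


def to_r(step: RenameVariable, frame: str = "df") -> str:
    """Render the step as a base R statement (as source text, not executed)."""
    return f'names({frame})[names({frame}) == "{step.old}"] <- "{step.new}"'


def to_spss(step: RenameVariable) -> str:
    """Render the step as an SPSS command (as source text, not executed)."""
    return f"RENAME VARIABLES ({step.old} = {step.new})."


if __name__ == "__main__":
    step = RenameVariable(old="inc", new="income")
    print(describe(step))
    print(to_pandas(step))
    print(to_r(step))
    print(to_spss(step))
```

Keeping the record separate from its renderings is the design point: the same stored step serves as documentation and as the source for each target language, which loosely mirrors the role the article attributes to SDTL.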