Database Group Research

PUG (Provenance Unification through Graphs)

Cartoon pug dog

In collaboration with UIUC, we have developed a unified approach for efficiently computing provenance (why) and missing answers (why-not). This approach is based on the observation that, in a provenance model for queries with unsafe negation, why-not questions can be translated into why questions and vice versa. Furthermore, typically only a part of the provenance, which we call an explanation, is actually relevant for answering the user’s question about the existence or absence of a result. We have developed an approach that tries to restrict provenance capture to what is relevant for explaining the outcome of interest specified by the user.
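To make the why/why-not symmetry concrete, here is a minimal Python sketch. It is not the PUG implementation. It assumes a hypothetical toy rule `two_hop(X, Z) :- edge(X, Y), edge(Y, Z)` over a made-up `EDGES` relation, and the names `derivations` and `explain` are illustrative only. Both question types are answered from the same enumeration of rule derivations: a why question keeps the successful derivations, while a why-not question reports, for each failed derivation, the goals that do not hold.

```python
# Hypothetical toy instance: a directed edge relation (not from the source).
EDGES = {("a", "b"), ("b", "c"), ("c", "d")}
# Active domain: every constant that appears in the instance.
DOMAIN = sorted({v for e in EDGES for v in e})


def derivations(x, z):
    """Enumerate every binding of Y for the assumed rule
    two_hop(X, Z) :- edge(X, Y), edge(Y, Z),
    recording for each goal whether it holds in EDGES."""
    for y in DOMAIN:
        goals = {
            ("edge", x, y): (x, y) in EDGES,
            ("edge", y, z): (y, z) in EDGES,
        }
        yield y, goals


def explain(x, z):
    """Return ("why", successful derivations) if two_hop(x, z) holds,
    otherwise ("why-not", failed goals of each attempted derivation)."""
    all_derivs = list(derivations(x, z))
    successful = [g for _, g in all_derivs if all(g.values())]
    if successful:
        # Why: the goals used by each successful derivation.
        return "why", [sorted(g) for g in successful]
    # Why-not: for every binding of Y, the goals that are missing.
    return "why-not", [
        sorted(goal for goal, holds in g.items() if not holds)
        for _, g in all_derivs
    ]


if __name__ == "__main__":
    print(explain("a", "c"))  # existing answer: asks "why"
    print(explain("a", "d"))  # missing answer: asks "why-not"
```

In this sketch the why-not answer grows with the size of the active domain even though the instance has only three edges, which hints at why restricting capture to the relevant part of the provenance matters.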

ApProveS (Approximate Provenance Summary)

Diagram of each phase of Approximate Provenance Summary

Provenance, even when restricted to explanations as in PUG, may still overwhelm users with too much information and waste computational resources. This holds in particular for why-not provenance, which explains all failed ways in which a result could have been derived using the rules of a query and may be too large to compute even for small datasets. We address the computational and usability challenges of large provenance by creating summaries based on structural commonalities that exist in the data. Importantly, our approach computes summaries based on only a sample of the provenance and thus avoids the computationally infeasible step of generating the full why-not provenance for a user question.

Why-Not Provenance for Nested Data

Brackets with a series of circles depicting nested information

The need to explain missing answers is prevalent in many applications, including debugging complex analytical queries. Such complex analytics are often implemented using data-intensive scalable computing (DISC) systems such as Apache Spark, which employ nested data models. Hence, there is a need for missing-answer techniques for nested data models and for implementations of such techniques in DISC systems. In this project, we focus on query-based explanations, which assume that the input data is sufficient for producing the missing result and consequently identify which parts of the input query caused the result to be missing. This project is collaborative work with the University of Stuttgart.

Database Security

Illustration of a lock and key

Recently, the demand for open data has increased in the interest of transparency. Privacy concerns arise when data is exchanged over a network as well as when it is stored and processed. In this project, we investigate how to efficiently evaluate the risk of data sharing before opening the data. We study several theoretical models of disclosure risk and information loss to quantify the risk of data linkage. Furthermore, we develop a practical solution for risk evaluation that uses samples and summaries of the data. This is collaborative work with the University of Notre Dame.
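As a rough illustration of what a disclosure-risk measure can look like, the sketch below computes a simple uniqueness-based re-identification risk over quasi-identifier columns. This is a generic textbook-style measure chosen for illustration, not necessarily one of the models studied in the project. The function name `reidentification_risk`, the column names, and the sample records are all hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Estimate linkage risk from equivalence classes on quasi-identifiers.

    Each record's risk is taken to be 1 / (size of the group of records
    sharing its quasi-identifier values). This is an assumed, simplified model.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    per_record = [1.0 / sizes[k] for k in keys]
    return {
        "max_risk": max(per_record),
        "avg_risk": sum(per_record) / len(per_record),
        # Fraction of records that are unique on the quasi-identifiers.
        "unique_fraction": sum(1 for k in keys if sizes[k] == 1) / len(keys),
    }


if __name__ == "__main__":
    # Made-up records; "zip" and "age" play the role of quasi-identifiers.
    sample = [
        {"zip": "60616", "age": 30, "diagnosis": "A"},
        {"zip": "60616", "age": 30, "diagnosis": "B"},
        {"zip": "60616", "age": 41, "diagnosis": "A"},
        {"zip": "60637", "age": 30, "diagnosis": "C"},
    ]
    print(reidentification_risk(sample, ["zip", "age"]))
```

Running the same function on a sample of the data gives only an estimate of the risk in the full dataset, because a record that is unique in the sample need not be unique in the population.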