Research Projects

The following are example projects for prospective undergraduate and master's students. Concrete projects are well-defined and a good place to start, while open-ended projects are good candidates for research theses. Alternatively, prospective students are welcome to read some recent papers from the lab and suggest extensions or applications as projects.

Open-ended projects are trickier and meant as research projects. Research projects demand the ability to look for answers to your own questions. So, before contacting the professor, read the papers and formulate a high-level plan for how to tackle the problem.

If interested, contact the professor with a description of the project you are interested in, your understanding of the papers/relevant material, and your availability – hours per week, and intended semester(s) of commitment.

SQL Tutor

SQL Tutor is a web application that visualizes how a SQL query is executed step by step in the query plan. Under the covers it uses data provenance to draw the row-level dependencies in each query operator step. It currently only works for two canned queries, and we would like it to work for a broader range of SQL queries.
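To make "row-level dependencies in each query operator step" concrete, here is a minimal sketch in Python. It is not the SQL Tutor or Databass implementation; the table, the operator functions, and the lineage representation (a list of input row ids per output row) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input table: (row id, department, salary).
rows = [
    (0, "eng", 100),
    (1, "eng", 120),
    (2, "ops", 90),
    (3, "ops", 40),
]

def filter_op(inputs, predicate):
    """Keep rows whose salary satisfies predicate.

    Each output row depends on exactly one input row.
    """
    out, lineage = [], []
    for rid, dept, salary in inputs:
        if predicate(salary):
            out.append((dept, salary))
            lineage.append([rid])
    return out, lineage

def group_sum_op(inputs, input_lineage):
    """Sum salary per department.

    Each output row depends on every input row in its group.
    """
    sums = defaultdict(int)
    deps = defaultdict(list)
    for (dept, salary), src in zip(inputs, input_lineage):
        sums[dept] += salary
        deps[dept].extend(src)
    keys = sorted(sums)
    return [(k, sums[k]) for k in keys], [deps[k] for k in keys]

# Roughly: SELECT dept, SUM(salary) FROM rows WHERE salary >= 50 GROUP BY dept
filtered, filter_lineage = filter_op(rows, lambda s: s >= 50)
result, result_lineage = group_sum_op(filtered, filter_lineage)

for out_row, src in zip(result, result_lineage):
    print(out_row, "<- input rows", src)
```

A visualization like SQL Tutor could draw one edge per entry in these lineage lists, once for each operator in the plan.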

Concrete Projects

Databass is a slow engine used for educational purposes. PhD students in the lab have been extending the DuckDB query engine to track lineage. An advanced project is to work with the PhD students in the lab to extend their system to support SQL Tutor.

Before contacting the professor, look at the demo and the Databass engine, and be able to explain how the query compiler works.

Automatic Interface Generation

Precision Interfaces is a project to automatically generate interactive visualization interfaces from SQL queries and natural language.

Concrete Projects

Before contacting the professor, you should have experience with Svelte. Look at the front-end interface framework, and be able to explain the spec and how the backend, views, and widgets communicate with each other.

Open ended projects

Efficient Provenance Tracking

Our group works on efficient systems for provenance tracking in high-performance data analysis engines. We are developing program analysis techniques to recommend how to instrument a piece of data analytics code in order to efficiently capture and query its data provenance.

We are in the middle of writing up the techniques. In the meantime, you should read the precursor to the work to learn about data provenance and instrumentation and then talk to Professor Wu.

Before contacting the professor, you should read the papers, have experience reading and modifying system software (e.g., networking, databases, OS, etc.), and have working knowledge of C++ or Rust.

Open ended projects

Super Fast Query Explanations

Query explanations generate predicates over an input table that explain why the results of a query look wrong. Generating these explanations is very useful (there’s a billion-dollar company trying to do this) but super slow.

We have ongoing work that uses provenance, parallelization, and vectorization to brute-force evaluate millions of explanations a second. Help us build this library out and release an open source package that anyone can use.
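As a rough illustration of what brute-force evaluation of explanations means, the sketch below enumerates single-column equality predicates and scores each one by how much an average changes when the matching rows are removed. This is a simplified deletion-based scoring chosen for illustration, not the lab's algorithm; the table, column names, and the `score` function are assumptions.

```python
# Hypothetical table of sensor readings: (sensor, room, temperature).
readings = [
    ("s1", "a", 20.0),
    ("s1", "b", 21.0),
    ("s2", "a", 22.0),
    ("s2", "b", 23.0),
    ("s3", "a", 80.0),
    ("s3", "b", 84.0),
]
COLUMNS = {"sensor": 0, "room": 1}

def average(rows):
    """The query under inspection: AVG(temperature)."""
    return sum(r[2] for r in rows) / len(rows)

def candidate_predicates(rows):
    """Enumerate every `column = value` predicate that appears in the data."""
    return sorted({(col, r[idx]) for r in rows for col, idx in COLUMNS.items()})

def score(rows, predicate):
    """How far the average drops when rows matching the predicate are removed."""
    col, value = predicate
    kept = [r for r in rows if r[COLUMNS[col]] != value]
    if not kept:
        return 0.0
    return average(rows) - average(kept)

# Brute force: score every candidate and rank them, largest drop first.
ranked = sorted(
    candidate_predicates(readings),
    key=lambda p: score(readings, p),
    reverse=True,
)
best = ranked[0]
print("best explanation:", best)
```

Even in this toy setting the number of candidates grows quickly once predicates combine several columns, which is why parallelization and vectorization matter.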

Before contacting the professor, you should read the paper, have some systems background (e.g., an OS or computer architecture class), and be motivated to learn C++, GPU acceleration, and parallelization.

VCA: View Composition Algebra

View composition algebra is a new formalism to support comparison interactions in data visualization interfaces.

Concrete projects

Improve the library

Make the library useful

Before contacting the professor, look through the VCA library code and be able to explain how a View is modeled and how statistical composition works.

Open ended projects

Extend the formalism and library to support

Explore integrating design guidelines into the formalism

Explore applying techniques from data integration (matching, type inference, etc.) to improve comparison of complex data types or complex data flows

Could This Be Bad?

This is an open-ended research project, inspired by

Can we develop a system that warns users if an action in the database may have bad ramifications in the future? One way to think about it:

  P(bad things | actions in the past) =
  P(bad things | actions)P(actions | query log)

What we care about is most likely:

  P(bad things | actions)P(actions | query log) - P(bad things)

Here the query log gives the likelihood of access/query operations. The same score could also suggest which bad things are “probably OK”.
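A tiny worked example of the score above, assuming made-up probabilities: the action names, the numbers, and the `risk_lift` helper are hypothetical and only show how the terms combine.

```python
# Hypothetical probabilities, invented purely for illustration.
p_bad_given_action = {"drop_column": 0.6, "add_index": 0.05}  # P(bad things | action)
p_action_given_log = {"drop_column": 0.1, "add_index": 0.9}   # P(action | query log)
p_bad_baseline = 0.04                                         # P(bad things)

def risk_lift(action):
    """P(bad things | action) * P(action | query log) - P(bad things)."""
    return p_bad_given_action[action] * p_action_given_log[action] - p_bad_baseline

lifts = {action: risk_lift(action) for action in p_bad_given_action}

# Rank actions by how far they push the risk above the baseline.
for action, lift in sorted(lifts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{action}: lift = {lift:+.3f}")
```

With these numbers, a rare but dangerous action (`drop_column`) scores higher than a frequent but mostly harmless one (`add_index`); actions with a small or negative lift would be candidates for "probably OK".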

We can categorize/model “bad things”, and assess them based on whether they make a set of “Tasks” “worse”.

Worse could mean many things!

Not runnable:

Incorrect data:

Slower queries (handled by PDD and estimators, but maybe not exposed in a good way)