The following are example projects for prospective undergraduate and master's students. Concrete projects are well-defined and a good place to start, while open-ended projects are good candidates for research theses. Alternatively, prospective students are welcome to read some recent papers from the lab and suggest extensions or applications as projects.
Open-ended projects are trickier and meant as research projects. Research projects demand the ability to look for answers to your own questions. So, before contacting the professor, read the papers and formulate a high-level plan for how to tackle the problem.
If interested, contact the professor with a description of the project you are interested in, your understanding of the papers/relevant material, and your availability – hours per week, and intended semester(s) of commitment.
SQL Tutor is a web application that visualizes how a SQL query is executed step by step in the query plan. Under the covers it uses data provenance to draw the row-level dependencies in each query operator step. It currently only works for two canned queries, and we would like it to work for a broader range of SQL queries.
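To give a feel for what "row-level dependencies in each query operator step" means, here is a minimal Python sketch. It is not the SQL Tutor or Databass implementation; the function names (`filter_with_lineage`, `groupby_sum_with_lineage`, `compose`) and the toy `sales` table are hypothetical, chosen only for illustration. Each operator returns its output rows together with, for every output row, the indexes of the input rows it was derived from.

```python
from collections import defaultdict

def filter_with_lineage(rows, predicate):
    """Filter operator: lineage[i] lists the input row indexes behind output row i."""
    out, lineage = [], []
    for i, row in enumerate(rows):
        if predicate(row):
            out.append(row)
            lineage.append([i])
    return out, lineage

def groupby_sum_with_lineage(rows, key, value):
    """Group-by SUM operator: each output group depends on all rows in that group."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[key]].append(i)
    out, lineage = [], []
    for k, idxs in groups.items():
        out.append({key: k, "total": sum(rows[i][value] for i in idxs)})
        lineage.append(idxs)
    return out, lineage

def compose(outer, inner):
    """Chain two lineage maps so the final output points back at the base table."""
    return [sorted({src for j in out_idxs for src in inner[j]}) for out_idxs in outer]

# Hypothetical input table.
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "west", "amount": 1},
]

# Roughly: SELECT region, SUM(amount) FROM sales WHERE amount > 2 GROUP BY region
filtered, filter_lineage = filter_with_lineage(sales, lambda r: r["amount"] > 2)
grouped, group_lineage = groupby_sum_with_lineage(filtered, "region", "amount")
end_to_end = compose(group_lineage, filter_lineage)

for row, sources in zip(grouped, end_to_end):
    print(row, "<- input rows", sources)
```

A visualization can draw one edge per (output row, input row) pair at each step. Supporting a broader range of queries amounts to capturing such mappings for more operators (joins, subqueries, and so on).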
Databass is a slow engine used for educational purposes. PhD students in the lab have been extending the DuckDB query engine to track lineage. An advanced project is to work with the PhD students in the lab to extend their system to support SQL Tutor.
Before contacting the professor, look at the demo and the Databass engine, and be able to explain how the query compiler works.
Precision Interfaces is a project to automatically generate interactive visualization interfaces from SQL queries and natural language.
Before contacting the professor, you should have experience with Svelte. Look at the front-end interface framework, and be able to explain the spec and how the backend, views, and widgets communicate with each other.
Our group works on efficient systems for provenance tracking in high-performance data analysis engines. We are developing program analysis techniques to recommend how to instrument a piece of data analytics code in order to efficiently capture and query its data provenance.
We are in the middle of writing up the techniques. In the meantime, you should read the precursor to the work to learn about data provenance and instrumentation and then talk to Professor Wu.
Before contacting the professor, you should read the papers, have experience reading and modifying system software (e.g., networking, databases, OS, etc.), and have working knowledge of C++ or Rust.
Query explanations generate predicates over an input table that explain why the results of a query look wrong. Generating these explanations is very useful (there’s a billion-dollar company trying to do this) but super slow.
We have ongoing work that uses provenance, parallelization, and vectorization to brute-force evaluate millions of explanations a second. Help us build this library out and release an open source package that anyone can use.
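As a rough illustration of the brute-force idea, the Python sketch below enumerates every single-attribute equality predicate over a toy table and scores it by how much the aggregate changes when the matching rows are removed. This is a simplified stand-in, not the lab's algorithm or library: the scoring rule, the `rank_explanations` helper, and the `readings` data are all assumptions made for illustration, and it assumes the user thinks AVG(temp) is too high.

```python
def avg(values):
    """Average of a list, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def rank_explanations(rows, measure, attributes):
    """Score each predicate `attr = val` by how far AVG(measure) drops
    when the rows matching the predicate are deleted from the input."""
    baseline = avg([r[measure] for r in rows])
    scored = []
    for attr in attributes:
        for val in sorted({r[attr] for r in rows}):
            remaining = [r[measure] for r in rows if r[attr] != val]
            influence = baseline - avg(remaining)
            scored.append((influence, f"{attr} = {val!r}"))
    # Highest influence first: removing these rows lowers the average the most.
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored

# Hypothetical sensor readings; one reading is suspiciously high.
readings = [
    {"sensor": "a", "room": "lab", "temp": 20.0},
    {"sensor": "b", "room": "lab", "temp": 21.0},
    {"sensor": "c", "room": "hall", "temp": 90.0},
    {"sensor": "a", "room": "hall", "temp": 19.0},
]

ranked = rank_explanations(readings, "temp", ["sensor", "room"])
for influence, predicate in ranked:
    print(f"{predicate:15s} influence = {influence:+.1f}")
```

Even this toy version re-scans the table once per candidate predicate, and the number of candidates grows quickly once predicates combine several attributes. That cost is what provenance, parallelization, and vectorization are meant to attack.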
Before contacting the professor, you should read the paper, have some systems background (e.g., an OS or computer architecture class), and be motivated to learn C++, GPU acceleration, and parallelization.
View Composition Algebra is a new type of visualization interaction that allows users to drag and compare data in charts easily. We have developed a library for composing SQL queries and visualizations.
Integrate VCA into the Rilldata visualization system.
Improve the library
Use transformers to translate natural language comparison statements into VCA statements (NL -> VCA), so that users can compare data visualized on the screen using natural language. Each statement would be translated into selections/views and VCA operations.
Before contacting the professor, look through the VCA library code and be able to explain how a View is modeled and how statistical composition works.
Extend the formalism and library to support
Explore integrating design guidelines into the formalism
Explore applying techniques from data integration (matching, type inference, etc.) to improve comparison of complex data types or complex data flows