The following are example projects for prospective undergraduate and master's students. Concrete projects are well-defined and a good place to start, while open-ended projects are good candidates for research theses. Alternatively, prospective students are welcome to read some recent papers from the lab and suggest extensions or applications as projects.
Open-ended projects are trickier and meant as research projects. Research projects demand the ability to look for answers to your own questions. So, before contacting the professor, read the papers and formulate a high-level plan for how to tackle the problem.
If interested, contact the professor with a description of the project you are interested in, your understanding of the papers/relevant material, and your availability – hours per week, and intended semester(s) of commitment.
SQL Tutor is a web application that visualizes how a SQL query is executed step by step in the query plan. Under the covers it uses data provenance to draw the row-level dependencies in each query operator step. It currently only works for two canned queries, and we would like it to work for a broader range of SQL queries.
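To give a feel for what "row-level dependencies in each query operator step" means, here is a minimal sketch in plain Python. It is not SQL Tutor's or Databass's actual code: the table, the function names, and the lineage representation (a list of input row ids per output row) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input table; a row's id is its position in the list.
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "west", "amount": 1},
]

def filter_with_lineage(rows, predicate):
    """Filter rows; lineage[i] lists the input row ids behind output row i."""
    out, lineage = [], []
    for rid, row in enumerate(rows):
        if predicate(row):
            out.append(row)
            lineage.append([rid])
    return out, lineage

def groupby_sum_with_lineage(rows, key, value):
    """Group and sum; each output row depends on every input row in its group."""
    groups = defaultdict(lambda: {"total": 0, "rids": []})
    for rid, row in enumerate(rows):
        group = groups[row[key]]
        group["total"] += row[value]
        group["rids"].append(rid)
    out = [{key: k, "sum": g["total"]} for k, g in groups.items()]
    lineage = [g["rids"] for g in groups.values()]
    return out, lineage

def compose(outer, inner):
    """Chain two operator steps: outer's ids refer to inner's output rows."""
    return [sorted(r for mid in rids for r in inner[mid]) for rids in outer]

# SELECT region, SUM(amount) FROM sales WHERE amount > 2 GROUP BY region
filtered, filter_lineage = filter_with_lineage(sales, lambda r: r["amount"] > 2)
result, group_lineage = groupby_sum_with_lineage(filtered, "region", "amount")
end_to_end = compose(group_lineage, filter_lineage)
```

Each operator returns its output together with its lineage, so a visualization can draw edges between input and output rows one step at a time, and `compose` traces a final result row back to the base table.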
Databass is a slow engine used for educational purposes. PhD students in the lab have been extending the DuckDB query engine to track lineage. An advanced project is to work with the PhD students in the lab to extend their system to support SQL Tutor.
Before contacting the professor, look at the demo and the Databass engine, and be able to explain how the query compiler works.
Precision Interfaces is a project to automatically generate interactive visualization interfaces from SQL queries and natural language.
Before contacting the professor, you should have experience with Svelte. Look at the front-end interface framework, and be able to explain the spec and how the backend, views, and widgets communicate with each other.
Our group works on efficient systems for provenance tracking in high-performance data analysis engines. We are developing program analysis techniques to recommend how to instrument a piece of data analytics code in order to efficiently capture and query its data provenance.
We are in the middle of writing up the techniques. In the meantime, you should read the precursor to the work to learn about data provenance and instrumentation and then talk to Professor Wu.
Before contacting the professor, you should read the papers, have experience reading and modifying system software (e.g., networking, databases, OS, etc.), and have working knowledge of C++ or Rust.
Query explanations generate predicates over an input table that explain why the results of a query look wrong. Generating these explanations is very useful (there’s a billion-dollar company trying to do this) but super slow.
We have ongoing work that uses provenance, parallelization, and vectorization to brute-force evaluate millions of explanations a second. Help us build this library out and release an open source package that anyone can use.
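As a rough illustration of the problem (not the lab's algorithm), the sketch below brute-forces single-attribute equality predicates over a tiny table and ranks them by how close the query result gets to an expected value once the matching rows are removed. The table, the deletion-based scoring rule, and all function names are assumptions; the actual work relies on provenance, parallelization, and vectorization to make this kind of search fast.

```python
# Hypothetical table: the overall average temperature looks too high.
readings = [
    {"sensor": "a", "hour": 1, "temp": 20.0},
    {"sensor": "a", "hour": 2, "temp": 21.0},
    {"sensor": "b", "hour": 1, "temp": 20.0},
    {"sensor": "b", "hour": 2, "temp": 95.0},
    {"sensor": "c", "hour": 1, "temp": 19.0},
    {"sensor": "c", "hour": 2, "temp": 20.0},
]

def avg_temp(rows):
    """The query whose result looks wrong: SELECT AVG(temp) FROM readings."""
    return sum(r["temp"] for r in rows) / len(rows)

def candidate_predicates(rows, attrs):
    """Enumerate every (attribute, value) equality predicate present in the data."""
    return sorted({(a, r[a]) for r in rows for a in attrs})

def score(rows, pred, expected):
    """Distance between the expected result and the result without matching rows."""
    attr, val = pred
    kept = [r for r in rows if r[attr] != val]
    if not kept:
        return float("inf")  # deleting everything explains nothing
    return abs(avg_temp(kept) - expected)

def rank_explanations(rows, attrs, expected):
    """Naive brute force: re-run the query once per candidate predicate."""
    preds = candidate_predicates(rows, attrs)
    return sorted(preds, key=lambda p: score(rows, p, expected))

# The user expected an average of about 20 degrees.
ranked = rank_explanations(readings, ["sensor", "hour"], expected=20.0)
best = ranked[0]
```

Here the top-ranked predicate is `sensor = 'b'`, since removing that sensor's rows brings the average back to the expected value. Re-running the query for every candidate is exactly the cost that makes the naive approach slow on real tables.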
Before contacting the professor, you should read the paper, have some systems background (e.g., an OS or computer architecture class), and be motivated to learn C++, GPU acceleration, and parallelization.
View composition algebra is a new formalism to support comparison interactions in data visualization interfaces.
Improve the library
Make the library useful
Before contacting the professor, look through the VCA library code and be able to explain how a View is modeled and how statistical composition works.
Extend the formalism and library to support
Explore integrating design guidelines into the formalism
Explore applying techniques from data integration (matching, type inference, etc) to improve comparison of complex data types or complex data flows
This is an open-ended research project, inspired by https://twitter.com/planetscaledata/status/1551607869585235968
Can we develop a system that warns users if an action in the database may have bad ramifications in the future? One way to think about it:
P(bad things | actions in the past) = P(bad things | actions)P(actions | query log)
What we care about is most likely:
P(bad things | actions)P(actions | query log) - P(bad things)
Here the query log gives the likelihood of each access/query operation. The same quantity could also suggest which bad things are “probably OK”.
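A toy calculation may make the quantity above more concrete. Everything in this sketch is assumed for illustration: the query log, the per-action probabilities of something bad happening, the baseline P(bad things), and the choice to estimate P(actions | query log) as a simple relative frequency.

```python
from collections import Counter

def action_likelihoods(query_log):
    """Estimate P(action | query log) as each action's relative frequency."""
    counts = Counter(query_log)
    total = len(query_log)
    return {action: n / total for action, n in counts.items()}

def excess_risk(p_bad_given_action, p_action, p_bad_baseline):
    """P(bad things | action) * P(action | query log) - P(bad things)."""
    return p_bad_given_action * p_action - p_bad_baseline

# Hypothetical query log and hypothetical risk estimates.
query_log = ["select"] * 6 + ["update"] * 3 + ["drop_index"] * 1
p_bad_given_action = {"select": 0.01, "update": 0.05, "drop_index": 0.9}
p_bad_baseline = 0.02

p_action = action_likelihoods(query_log)
scores = {
    action: excess_risk(p_bad_given_action[action], p_action[action], p_bad_baseline)
    for action in p_action
}

# Warn about actions whose score is above the baseline; the rest are "probably OK".
flagged = sorted(a for a, s in scores.items() if s > 0)
probably_ok = sorted(a for a, s in scores.items() if s <= 0)
```

With these made-up numbers only `drop_index` is flagged: it is rare in the log but risky enough that its score stays above the baseline, while `select` and `update` fall below it.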
We can categorize/model “bad things”, and assess them based on whether they make a set of “Tasks” “worse”.
Worse could mean many things!
Slower queries (handled by PDD and estimators, but maybe not exposed in a good way)