
Research Projects

The following are example projects for prospective undergraduate and masters students. Concrete projects are well-defined and a good place to start, while open-ended projects are good candidates for research theses. Alternatively, prospective students are welcome to read recent papers from the lab and propose extensions or applications as projects.

Open-ended projects are trickier and meant as research projects. Research projects demand the ability to look for answers to your own questions. So, before contacting the professor, read the papers and formulate a high-level plan for how to tackle the problem.

If interested, contact the professor with a description of the project you are interested in, your understanding of the papers/relevant material, and your availability – hours per week, and intended semester(s) of commitment.

SQL Tutor

SQL Tutor is a web application that visualizes how a SQL query is executed step by step in the query plan. Under the covers it uses data provenance to draw the row-level dependencies in each query operator step. It currently only works for two canned queries, and we would like it to work for a broader range of SQL queries.
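To make "row-level dependencies in each query operator step" concrete, here is a minimal sketch of two operators that return their output rows together with lineage, i.e. the indexes of the input rows each output row depends on. This is not the databass or SQL Tutor code; the function names and the list-of-dicts row format are assumptions made for illustration.

```python
from collections import defaultdict


def filter_with_lineage(rows, predicate):
    """Return (output_rows, lineage).

    lineage[i] lists the indexes of the input rows that output row i depends on.
    """
    out, lineage = [], []
    for idx, row in enumerate(rows):
        if predicate(row):
            out.append(row)
            lineage.append([idx])  # a filtered row depends on exactly one input row
    return out, lineage


def groupby_sum_with_lineage(rows, key, value):
    """Group rows by `key`, sum `value`, and record each group's input row indexes."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[row[key]].append(idx)
    out, lineage = [], []
    for group_key, idxs in groups.items():
        out.append({key: group_key, "sum": sum(rows[i][value] for i in idxs)})
        lineage.append(idxs)  # an aggregate row depends on every row in its group
    return out, lineage


# Hypothetical input table for: SELECT region, SUM(amount) ... WHERE amount > 0 GROUP BY region
sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "west", "amount": 0},
]
filtered, filter_lineage = filter_with_lineage(sales, lambda r: r["amount"] > 0)
totals, agg_lineage = groupby_sum_with_lineage(filtered, "region", "amount")
```

Each operator's lineage refers to row positions in its own input, so a visualization could draw one set of dependency edges per step of the plan, which is roughly what a step-by-step view needs.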

Concrete Projects

Extend the Python databass engine to emit the appropriate provenance and intermediate results so the query plan can be visualized in SQL Tutor
Make everything run in the browser using pyodide
Help make SQL Tutor more user friendly

Databass is a slow engine used for educational purposes. PhD students in the lab have been extending the DuckDB query engine to track lineage. An advanced project is to work with the PhD students in the lab to extend their system to support SQL Tutor.

Before contacting the professor, look at the demo and the databass engine, and be able to explain how the query compiler works.

Automatic Interface Generation

Precision Interfaces is a project to automatically generate interactive visualization interfaces from SQL queries and natural language.

Concrete Projects

We recently redesigned the front-end interface framework (https://github.com/cudbg/pi-svelte). Help us port the backend interface generation system to generate specifications for the new front-end framework.
Help implement visualization and interaction components for our front-end framework in TypeScript + Svelte.
Extend the system to generate interfaces from dbt projects as input. The user optionally chooses the dbt models they are interested in, and the system synthesizes an interactive interface for them.

Before contacting the professor, you should have experience with Svelte. Look at the front-end interface framework, and be able to explain the spec and how the backend, views, and widgets communicate with each other.

Open-ended projects

Extend interface generation to support interactions designed for accessibility (e.g., interactions for vision-impaired users or for speech-only modalities).

Efficient Provenance Tracking

Our group works on efficient systems for provenance tracking in high-performance data analysis engines. We are developing program analysis techniques that recommend how to instrument a piece of data analytics code in order to efficiently capture and query its data provenance.

We are in the middle of writing up the techniques. In the meantime, read the precursor to this work to learn about data provenance and instrumentation, and then talk to Professor Wu.

Before contacting the professor, you should read the papers, have experience reading and modifying system software (e.g., networking, databases, OS), and have working knowledge of C++ or Rust.

Open-ended projects

Help us implement and harden the program analysis techniques.
Apply the program analysis techniques to another vectorized columnar data system, such as DataFusion or MonetDB.
Explore the trade-offs and synergies between logical query rewrites for provenance and physical instrumentation in streaming dataflow systems like Differential Dataflow/Materialize. You can get pretty far by simply performing logical rewrites of the dataflow operators without modifying the implementation of the dataflow system.
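As a rough illustration of what a logical rewrite for provenance can look like, the sketch below rewrites an aggregation query so that each output row is joined back to the input rows in its group. It uses plain SQL over Python's built-in sqlite3 module as a stand-in; it is not a streaming dataflow system, and the table, column names, and rewrite are assumptions chosen for illustration rather than the lab's technique.

```python
import sqlite3

# In-memory database with a small, made-up input table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount INTEGER);
    INSERT INTO sales VALUES (1, 'east', 10), (2, 'west', 5), (3, 'east', 7);
""")

# The original query, which carries no provenance.
base_query = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"

# Logical rewrite: join every output row back to the inputs that share its
# group key. The engine itself is untouched; only the query changes.
lineage_query = f"""
    SELECT q.region, q.total, s.id AS input_id
    FROM ({base_query}) AS q
    JOIN sales AS s ON s.region = q.region
    ORDER BY q.region, s.id
"""

lineage_rows = conn.execute(lineage_query).fetchall()
for region, total, input_id in lineage_rows:
    print(f"output ({region}, {total}) depends on input row {input_id}")
```

The rewrite needs no changes to the engine, but it re-joins against the input and so can be costly, which is one side of the trade-off with physical instrumentation that the project asks about.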

Super Fast Query Explanations

Query explanations generate predicates over an input table that explain why the results of a query look wrong. Generating these explanations is very useful (there’s a billion-dollar company trying to do this) but super slow.

We have ongoing work that uses provenance, parallelization, and vectorization to brute-force evaluate millions of explanations a second. Help us build this library out and release an open source package that anyone can use.
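To give a feel for what brute-force evaluation of explanations means, here is a naive sketch. It enumerates single-attribute equality predicates, deletes the matching rows, and scores each predicate by how much the aggregate drops. This is not the Scorpion algorithm or the lab's library: the scoring rule, the function names, and the assumption that the suspicious result is an average that is too high are all simplifications for illustration.

```python
def average(rows, value):
    """Average of `value` over rows; 0.0 for an empty input."""
    return sum(r[value] for r in rows) / len(rows) if rows else 0.0


def rank_explanations(rows, attributes, value):
    """Score every `attr = v` predicate by how much removing its rows lowers the average."""
    baseline = average(rows, value)
    scored = []
    for attr in attributes:
        for v in sorted({r[attr] for r in rows}):
            remaining = [r for r in rows if r[attr] != v]
            influence = baseline - average(remaining, value)
            scored.append((influence, f"{attr} = {v!r}"))
    scored.sort(reverse=True)  # largest drop in the average first
    return scored


# Hypothetical sensor readings where the overall average temperature looks too high.
readings = [
    {"sensor": "a", "room": "lab", "temp": 20.0},
    {"sensor": "b", "room": "lab", "temp": 21.0},
    {"sensor": "c", "room": "office", "temp": 90.0},
    {"sensor": "d", "room": "office", "temp": 22.0},
]
ranked = rank_explanations(readings, ["sensor", "room"], "temp")
for influence, predicate in ranked[:2]:
    print(f"{predicate}: removing these rows lowers the average by {influence:.2f}")
```

On this toy data the two top-ranked predicates are room = 'office' and sensor = 'c', both of which cover the 90.0 reading. Even this tiny search recomputes the aggregate once per candidate predicate, and the number of candidates grows quickly once predicates combine attributes, which is why vectorization and parallelization matter here.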

Scorpion Paper

Before contacting the professor, you should read the paper, have some systems background (e.g., an OS or computer architecture class), and be motivated to learn C++, GPU acceleration, and parallelization.

VCA: View Composition Algebra

View Composition Algebra is a new type of visualization interaction that allows users to drag and compare data in charts easily. We have developed a library for composing SQL queries and visualizations.

Concrete projects

Integrate VCA into the Rilldata visualization system.

The goal is to be able to compare any piece of data visualized in any chart in Rilldata, with any other.
Students should be comfortable hacking on Rilldata, which is in Svelte, and manipulating SQL queries.

Improve the library

Help refactor the VCA library and convert it to TypeScript.
Transition from our custom knex.js-based query library to something more widely used (Ibis, polysql, etc.).

Use transformers to translate natural language comparison statements into VCA statements, so that users can compare the data visualized on screen using natural language. A statement would be translated into selections/views and the VCA operations to apply to them.

Before contacting the professor, look through the VCA library code and be able to explain how a View is modeled and how statistical composition works.

Open-ended projects

Extend the formalism and library to support:

hierarchical data
graph/network data
scientific data
general SQL queries/data transform graphs.

Explore integrating design guidelines into the formalism

Explore applying techniques from data integration (matching, type inference, etc) to improve comparison of complex data types or complex data flows