The Wu Lab
Accelerating the Democratization of Data

Our lab focuses on building tools and algorithms to facilitate the democratization of data and enable everyone to easily interact with, explore, analyze and understand their data. To achieve this goal we focus on three core problems: interactive data cleaning and preparation tools so that the data is clean enough for analysis; data exploration and explanation tools so users can identify patterns in the data and understand why those patterns exist; and interactive visualization systems to bridge the gap between database systems, which are currently tailored for data processing, and visualization systems, which are tailored to visualization.


The Group

Summer 2016

Reading Group

Our paper reading group meets Wednesdays from 3:30 to 5 PM in 417 MUDD.
We spend part of the time discussing the technical ideas in the paper, and part of the time dissecting how the introduction is written. Everyone is welcome to attend!

Click here to see our reading list and notes

Current Projects

Data Visualization Management Systems

A Data Visualization Management System (DVMS) integrates visualizations and databases by compiling a declarative visualization language into an end-to-end relational operator pipeline that renders the visualization and is amenable to database-style optimizations. The DVMS can thus be both expressive, via the visualization language, and performant, by leveraging traditional and visualization-specific optimizations to scale interactive visualizations to massive datasets.
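To make the idea concrete, here is a toy sketch of compiling a declarative chart spec into a pipeline of relational-style operators. It is only an illustration: the spec format and the names `compile_spec`, `group_sum`, `scale_linear`, and `render` are hypothetical and are not taken from our actual system.

```python
# Toy illustration only: the spec format and all function names below are
# hypothetical assumptions, not the API of a real DVMS.
from collections import defaultdict


def group_sum(rows, key, value):
    """Relational GROUP BY key, SUM(value)."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return ({key: k, value: v} for k, v in sorted(totals.items()))


def scale_linear(rows, field, out, lo, hi):
    """Map a data column onto a pixel range, expressed as a projection."""
    rows = list(rows)
    vmax = max(r[field] for r in rows) or 1.0
    return ({**r, out: lo + (hi - lo) * r[field] / vmax} for r in rows)


def compile_spec(spec):
    """Turn a declarative bar-chart spec into an ordered list of operators."""
    x, y = spec["x"], spec["y"]
    return [
        lambda rows: group_sum(rows, x, y),
        lambda rows: scale_linear(rows, y, "height_px", 0, spec["height"]),
    ]


def render(spec, table):
    """Run the compiled pipeline and return one record per bar."""
    rows = iter(table)
    for op in compile_spec(spec):
        rows = op(rows)
    return list(rows)


sales = [
    {"region": "east", "revenue": 10.0},
    {"region": "west", "revenue": 30.0},
    {"region": "east", "revenue": 5.0},
]
spec = {"mark": "bar", "x": "region", "y": "revenue", "height": 200}
marks = render(spec, sales)
```

Because every step, including the pixel scaling, is an ordinary operator over rows, an optimizer could in principle reorder or push down steps the same way it would for a SQL query.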

Query Explanation

Instead of explaining and fixing data using data, which is a bit circuitous, we seek to both explain and repair incorrect data values by using the actual queries that modified the database.

Data Exploration and Explanation

Visualizations are excellent for exposing surprising patterns and outliers in data; however, existing tools have no way to help explain those patterns and outliers. We are exploring systems to generate sensible explanations for outliers in analytics visualizations.

Data Cleaning for Data Science

Analysts report spending upwards of 80% of their time on problems in data cleaning including extraction, formatting, handling missing values, and entity resolution. How can knowing the application you want to actually run help speed up the cleaning process?


Latex Snapshots

Point latexsnapshots to your git repo, and it will go through the commits and identify those that change your tex files in a significant way, and regenerate the pdf files. It will also take thumbnails of those pdfs, and show them in a web UI.
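The commit-selection step can be sketched roughly as follows. This is not the tool's actual code: the function names are hypothetical, and treating "significant" as "at least `min_lines` added or deleted lines in .tex files" is an assumption made for illustration.

```python
# Rough sketch, not the latexsnapshots implementation. The 20-line threshold
# and all function names are assumptions made for illustration.
import subprocess


def read_git_log(repo_path):
    """Return `git log --numstat` output with one 'commit <hash>' line per commit."""
    return subprocess.run(
        ["git", "-C", repo_path, "log", "--numstat", "--format=commit %H"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_numstat(log_text):
    """Map each commit hash to the number of .tex lines it added or deleted."""
    changes = {}
    current = None
    for line in log_text.splitlines():
        if line.startswith("commit "):
            current = line.split()[1]
            changes[current] = 0
        elif current and line.count("\t") >= 2:
            added, deleted, path = line.split("\t", 2)
            # Binary files report "-" instead of counts; skip them.
            if path.endswith(".tex") and added.isdigit() and deleted.isdigit():
                changes[current] += int(added) + int(deleted)
    return changes


def significant_commits(log_text, min_lines=20):
    """Keep commits whose .tex changes meet the (assumed) size threshold."""
    return [c for c, n in parse_numstat(log_text).items() if n >= min_lines]


sample_log = (
    "commit aaa\n\n30\t5\tpaper.tex\n1\t1\tREADME.md\n"
    "commit bbb\n\n2\t1\tpaper.tex\n"
)
picked = significant_commits(sample_log)
```

On a real repository, `significant_commits(read_git_log("path/to/repo"))` would give the commits worth rebuilding a PDF for.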

VLDB conference trends

A history of databases through keyword trends in VLDB publication titles

Python ggplot2 syntax

A Python wrapper around ggplot2 that provides nearly the same syntax, but in Python.

  from pygg import *

  # Example using diamonds dataset (comes with ggplot2)
  p = ggplot('diamonds', aes('carat', y='price'))
  g = geom_point() + facet_wrap(None, "color")
  ggsave("test1.pdf", p+g, data=None)


A GUI to normalize and clean up entries in bibtex files, which shortens the references section and makes the bibtex more manageable.


I am lucky to be working, and have worked, with many remarkable students.




Older Projects

Data Import

A tool to import your data into whatever data store you want, as painlessly as possible.

See article for motivation

Qurk and Crowd-sourcing

A look at optimizing human computation through a database lens. Qurk is a database prototype that enables users to write queries that compute results from both machines and humans. With Adam Marcus.


A look into the properties of structured data at internet scale. With Michael Cafarella, Yang Zhang, Nodira K., Daisy Wang, and Alon Halevy.


A system for declaratively filtering and correlating streams of events from sensor and RFID devices. Extends YFilter's core query processing engine. With Yanlei Diao and Daniel Gyllstrom.


A Cascading Stream Architecture for Large-Scale Receptor-Based Networks. With the Berkeley database group, notably Shawn Jeffrey and Shariq Rizvi.


An experimental course scheduling system. Tries to make the user experience not suck by using JS. This was around the time Google Calendar came out. With Sukhchander Khanna.

Relational Grammar of Graphics

Exploration of translation from grammar of graphics to a relational query plan with simple provenance support integrated out-of-the-box.

source on github


Big data course@MIT

I co-developed the "big data" course at MIT. The class surveys techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Topics will include data cleaning, data integration, scalable systems (relational databases, NoSQL, Hadoop, etc.), analytics (data cubes, scalable statistics and machine learning), and scalable visualization of large data sets.

The goal is for students to gain working experience of the topics and systems that are covered.

Introduction to Data Literacy

I co-taught a heavily lab-based IAP class called Introduction to Data Literacy that introduces students to many basic data cleaning, analysis, and visualization techniques. The course was added to OCW. With my buddy Adam Marcus.

MEET -- Middle Eastern Education through Technology

MEET strives to bridge the gap between future Israeli and Palestinian leaders by immersing them together for 3 full years of fun and education. MIT business and technical instructors work in the Middle East for a month-long intensive session during the summer. I was one of four Year 3 technical instructors in 2010, and helped head the curriculum team for the past 3 years.