The Wu Lab
Accelerating the Democratization of Data
Our lab focuses on building tools and algorithms that accelerate the democratization of data, enabling everyone to easily interact with, explore, analyze, and understand their data. To achieve this goal, we focus on three core problems: interactive data cleaning and preparation tools, so that the data is clean enough for analysis; data exploration and explanation tools, so users can identify patterns in the data and understand why those patterns exist; and interactive visualization systems that bridge the gap between database systems, which are tailored for data processing, and visualization systems, which are tailored for rendering.
Our paper reading group regularly meets at 3:30-5PM Wednesdays in 417 MUDD.
We spend part of the time discussing the technical ideas in the paper, and part of the time dissecting how the introduction is written. Everyone is welcome to attend!
A Data Visualization Management System (DVMS) integrates visualizations and databases, by compiling a declarative visualization language into an end-to-end relational operator pipeline that renders the visualization and is amenable to database-style optimizations. Thus the DVMS can be both expressive via the visualization language, and performant by leveraging traditional and visualization-specific optimizations to scale interactive visualizations to massive datasets.
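One way to picture the compilation step: a small declarative spec is lowered into a single relational query whose result rows map one-to-one onto marks, so the whole pipeline is open to database-style optimization. The sketch below is a toy illustration of that idea; `compile_to_sql` and the spec format are invented for this example, not the actual DVMS interface.

```python
def compile_to_sql(spec):
    """Lower a tiny visualization spec into a single SQL query.

    spec: {"table": ..., "x": ..., "y": ..., "agg": ...}
    An aggregated bar chart becomes a GROUP BY; the renderer then
    draws one mark per result row.
    """
    if spec.get("agg"):
        return (
            f"SELECT {spec['x']} AS x, {spec['agg']}({spec['y']}) AS y "
            f"FROM {spec['table']} GROUP BY {spec['x']}"
        )
    # No aggregate: a plain scatterplot is just a projection.
    return f"SELECT {spec['x']} AS x, {spec['y']} AS y FROM {spec['table']}"

sql = compile_to_sql({"table": "sales", "x": "region", "y": "amount", "agg": "SUM"})
print(sql)
# SELECT region AS x, SUM(amount) AS y FROM sales GROUP BY region
```

Because the visualization is now an ordinary query, optimizations like predicate pushdown or pre-aggregation apply for free.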
Instead of explaining and fixing data using data, which is a bit circuitous, we seek to both explain and repair incorrect data values by using the actual queries that modified the database.
Visualizations are excellent for exposing surprising patterns and outliers in data; however, existing tools offer no way to help explain those patterns and outliers. We are exploring systems that generate sensible explanations for outliers in analytics visualizations.
Analysts report spending upwards of 80% of their time on problems in data cleaning including extraction, formatting, handling missing values, and entity resolution. How can knowing the application you want to actually run help speed up the cleaning process?
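One way knowing the application can pay off: if the analysis is a single filtered aggregate, only the rows and columns that query actually reads need cleaning. The sketch below is a hypothetical illustration of that idea; `rows_needing_cleaning` is an invented helper, not a tool from the lab.

```python
def rows_needing_cleaning(rows, query_cols, predicate):
    """Return indexes of rows the query actually reads that have
    missing values in the columns it uses."""
    dirty = []
    for i, row in enumerate(rows):
        if not predicate(row):
            continue  # the query filters this row out; no need to clean it
        if any(row.get(c) is None for c in query_cols):
            dirty.append(i)
    return dirty

rows = [{"state": "NY", "sales": None},
        {"state": "CA", "sales": None},
        {"state": "NY", "sales": 7}]
# Query: SUM(sales) WHERE state = 'NY' -> only row 0 needs attention,
# even though row 1 is also dirty.
print(rows_needing_cleaning(rows, ["sales"], lambda r: r["state"] == "NY"))
# [0]
```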
Point latexsnapshots at your git repo, and it will walk the commits, identify those that change your tex files in a significant way, and regenerate the pdf files. It will also take thumbnails of those pdfs and show them in a web UI.
A history of databases through keyword trends in VLDB publication titles
A Python wrapper around ggplot2 that provides nearly the same syntax.
from pygg import *

# Example using diamonds dataset (comes with ggplot2)
p = ggplot('diamonds', aes('carat', y='price'))
g = geom_point() + facet_wrap(None, "color")
ggsave("test1.pdf", p+g, data=None)
A GUI to normalize and clean up entries in bibtex files, shrinking the references section and making the bibtex more manageable.
I am lucky to work, and to have worked, with many remarkable students.
- Fotis Psallidas
- Xiaolan Wang (UMass Amherst, advised by Alexandra Meliou)
- QFix: explaining database errors using query histories
- Yifan Wu (UC Berkeley, Joe Hellerstein)
- Consistency in Declarative Visual Interactive Languages (DeVIL)
- Sanjay Krishnan (UC Berkeley, advised by Michael Franklin, Ken Goldberg)
- Data cleaning and machine learning
- Daniel Haas (UC Berkeley, advised by Michael Franklin)
- Making crowds fast
- Laura Rettig (University of Fribourg, advised by Philippe Cudre-Mauroux)
- Lilong Jiang (Ohio State, advised by Arnab Nandi)
- Human graphical perception
- Daniel Alabi (starting PhD at Harvard)
- Using human perceptual models to make visualizations faster
- Zhengjie Miao
- Predicting user interactions to make visualizations faster and better
- Sharan Suryanarayanan
- Hamed Nilforoshan
- Rahul Khanna
- James Sands
- HaoCi Zhang (Tsinghua University)
- Kevin Lin
A tool to import your data into whatever data store you want, as painlessly as possible.
Qurk and Crowd-sourcing
A look at optimizing human computation through a database lens. Qurk is a database prototype that enables users to write queries that compute results from both machines and humans. With Adam Marcus.
An experimental course scheduling system that tries to make the user experience not suck by using JS. This was around the time Google Calendar came out. With Sukhchander Khanna.
Relational Grammar of Graphics
Exploration of translation from grammar of graphics to a relational query plan with simple provenance support integrated out-of-the-box.
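As a toy illustration of what out-of-the-box provenance can mean here: an aggregation step in the query plan can record, for each output row, the ids of the input rows that produced it, so every rendered mark is traceable back to its data. The function below is a hypothetical sketch, not the project's actual implementation.

```python
from collections import defaultdict

def group_sum_with_provenance(rows, key, val):
    """A group-by/sum 'stat' step that keeps simple provenance:
    each output row carries the input row ids that fed into it."""
    groups = defaultdict(lambda: {"y": 0, "prov": []})
    for rid, row in enumerate(rows):
        g = groups[row[key]]
        g["y"] += row[val]
        g["prov"].append(rid)  # remember which input row contributed
    return [{"x": k, "y": v["y"], "prov": v["prov"]} for k, v in groups.items()]

rows = [{"color": "D", "price": 10},
        {"color": "E", "price": 7},
        {"color": "D", "price": 5}]
print(group_sum_with_provenance(rows, "color", "price"))
# [{'x': 'D', 'y': 15, 'prov': [0, 2]}, {'x': 'E', 'y': 7, 'prov': [1]}]
```

Clicking a bar for color "D" can then highlight exactly rows 0 and 2 in the source table, without any extra bookkeeping by the chart author.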
I co-developed the "big data" course at MIT. The class surveys techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Topics include data cleaning, data integration, scalable systems (relational databases, NoSQL, Hadoop, etc.), analytics (data cubes, scalable statistics and machine learning), and scalable visualization of large data sets.
The goal is for students to gain working experience of the topics and systems that are covered.
I co-taught a heavily lab-based IAP class called Introduction to Data Literacy that introduces students to many basic data cleaning, analysis, and visualization techniques. The course was added to OCW. With my buddy Adam Marcus.
MEET -- Middle Eastern Education through Technology
MEET strives to bridge the gap between future Israeli and Palestinian leaders by immersing them together for 3 full years of fun and education. MIT business and technical instructors work in the Middle East for a month-long intensive session during the summer. I was one of four Year 3 technical instructors in 2010, and have helped head the curriculum team for the past 3 years.