This past week, Columbia University’s Data Management Group had a strong presence at SIGMOD 2025 in Berlin, the top data management conference. Our lab presented four papers, two demos, a keynote, and a tutorial, launched a new workshop, and organized a panel. We’re especially proud to share the highlights below.
John Paparrizos and Luis Gravano received the 10-year Test-of-Time Award for their 2015 paper k-shape: Efficient and Accurate Clustering of Time Series, which remains a cornerstone in time series analysis!
Eugene Wu, Lampros Flokas, and collaborators at Celonis received the best paper award at the Workshop on Data Management for End-to-End Machine Learning (DEEM) for Towards a Framework for Hierarchical Text Segmentation using Large Language Models.
Eugene Wu, Lampros Flokas, and collaborators at Celonis also received the best SIGMOD Industry Demo Award for Prompt Editor: A Taxonomy-driven System for Guided LLM Prompt Development in Enterprise Settings.
Physical Visualization Design: Decoupling Interface and System Design. Yiru Chen et al. Building interactive data visualizations that are both scalable and responsive is challenging, especially as datasets grow. Interface designers often struggle to ensure that their designs meet strict latency expectations and resource limits, which leads to complex, manual system optimizations. Ideally, a tool would assess whether an interface design’s latency expectations can be met, and then either propose the additional resources needed or deploy a client-server infrastructure that guarantees those expectations. Physical Visualization Design (PVD) is the first work to propose such a tool: it decouples interface design from technical system optimization, enabling Design Independence.
Eugene co-organized the first workshop at the intersection of Data Management, Process Mining, and AI. Process mining studies the steps that organizations take to complete tasks such as onboarding, invoice processing, and managing accounts, which are the lifeblood of an organization. By leveraging event logs from data systems, process mining has emerged as a powerful approach for visualizing workflows, identifying anomalies, and optimizing processes. However, traditional methods such as surveys and interviews remain costly, error-prone, and disconnected from real operations. The workshop bridges this gap by examining how AI and data management can push this domain forward.
We presented three pieces of work at the ProvWeek workshop, which focuses on data provenance (lineage) theory and practice.
Compression for High-Performance Lineage at the ProvWeek Workshop. Xiaoyu Han et al. Capturing row-level data lineage for every query in data systems incurs considerable storage overheads, particularly for interactive data exploration (IDE) applications that can generate dozens of similar queries per second. We study intra-query and inter-query lineage compression techniques, and show the potential for almost an order of magnitude reduction in storage costs.
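To give a feel for why row-level lineage compresses well, here is a minimal Python sketch. It is not the technique from the paper: it only assumes that each output row's lineage is a list of input row ids, and it stores consecutive ids as inclusive ranges. The names `compress_rids`, `decompress_rids`, and the sample `lineage` mapping are hypothetical and used purely for illustration.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (start, end) run of row ids


def compress_rids(rids: List[int]) -> List[Range]:
    """Encode row ids as sorted, inclusive ranges of consecutive ids."""
    ranges: List[Range] = []
    for rid in sorted(rids):
        if ranges and rid == ranges[-1][1] + 1:
            # Extend the current run by one id.
            ranges[-1] = (ranges[-1][0], rid)
        elif ranges and rid == ranges[-1][1]:
            # Duplicate id: already covered by the current run.
            continue
        else:
            # Start a new run.
            ranges.append((rid, rid))
    return ranges


def decompress_rids(ranges: List[Range]) -> List[int]:
    """Expand ranges back into the sorted list of distinct row ids."""
    return [rid for start, end in ranges for rid in range(start, end + 1)]


# Hypothetical lineage of one query: output row -> contributing input row ids.
lineage: Dict[str, List[int]] = {
    "group_a": [0, 1, 2, 3, 7, 8],
    "group_b": [4, 5, 6, 10],
}
compressed = {out: compress_rids(rids) for out, rids in lineage.items()}
```

In this toy example, ten row ids are stored as four ranges. How much real lineage shrinks depends on how clustered the contributing rows are.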
Lineage Capture Trade-offs: A Case Study in DuckDB. Haneen Mohammed and Eugene Wu. While there are many proposals for capturing lineage, they are all implemented in different data systems that are incomparable to each other. This paper provides an apples-to-apples comparison of three lineage capture approaches (Query-Level rewrites, Operator-Level rewrites, and Function-Level instrumentation) within the DuckDB analytical engine. The study evaluates their trade-offs in terms of engineering effort and performance, and finds that a combination of operator- and function-level methods offers the best balance between programmability and performance.
DEMO: Fast Hypothetical Updates Evaluation. Haneen Mohammed et al. An exciting use case for row-level lineage is to support what-if analytics that require fast hypothetical deletions and scaling updates. This demonstration showcases FaDE, a DuckDB extension that achieves a throughput of over 1M what-if questions per second, enabling interactive what-if analysis and powering applications like cross-filter visualizations and interactive explanation engines with low latency.
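The idea of answering what-if questions from lineage can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a single SUM ... GROUP BY query over an in-memory list of rows), not FaDE's implementation or API. All function names and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Row = Tuple[str, int]  # (region, amount)


def run_query_with_lineage(rows: List[Row]):
    """Compute SUM(amount) GROUP BY region and record, for each
    output group, the ids of the input rows that contributed to it."""
    totals: Dict[str, int] = defaultdict(int)
    lineage: Dict[str, List[int]] = defaultdict(list)
    for rid, (region, amount) in enumerate(rows):
        totals[region] += amount
        lineage[region].append(rid)
    return dict(totals), dict(lineage)


def what_if_delete(rows: List[Row], lineage: Dict[str, List[int]],
                   deleted: Set[int]) -> Dict[str, int]:
    """Re-evaluate each group's sum as if the rows in `deleted` were gone,
    walking the lineage instead of re-running the query."""
    return {
        group: sum(rows[rid][1] for rid in rids if rid not in deleted)
        for group, rids in lineage.items()
    }


def what_if_scale(rows: List[Row], lineage: Dict[str, List[int]],
                  scaled: Set[int], factor: int) -> Dict[str, int]:
    """Re-evaluate each group's sum as if `amount` were multiplied by
    `factor` for the rows in `scaled`."""
    return {
        group: sum(rows[rid][1] * (factor if rid in scaled else 1)
                   for rid in rids)
        for group, rids in lineage.items()
    }


rows = [("east", 10), ("west", 5), ("east", 7), ("west", 3)]
totals, lineage = run_query_with_lineage(rows)
without_row_2 = what_if_delete(rows, lineage, deleted={2})
doubled_row_1 = what_if_scale(rows, lineage, scaled={1}, factor=2)
```

Each hypothetical is answered from the captured lineage without re-parsing or re-planning the original query, which is what makes evaluating many what-if questions in a batch attractive.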
We presented one paper at the Workshop on Data Management for End-to-End Machine Learning (DEEM):
Towards a Framework for Hierarchical Text Segmentation using Large Language Models. Lampros Flokas et al. Hierarchical text segmentation lets models break long documents down into an organized hierarchy of segments, each annotated with a label from a taxonomy of classes. This is crucial for tasks such as learning templates from corpora of prompts. The paper formulates segmentation as a join between the document and the taxonomy and introduces two configurable algorithms inspired by index join and merge-sort join. It demonstrates that tailoring the algorithm to the specific use case is crucial for balancing LLM resource usage and segmentation accuracy, especially with large documents and taxonomies.
We also presented one demo at the main SIGMOD conference:
Prompt Editor: A Taxonomy-driven System for Guided LLM Prompt Development in Enterprise Settings. Jeffery Cao et al. This system assists business users in crafting high-quality LLM prompts by leveraging an organization’s existing prompt corpus and a learned taxonomy of prompt segment types.
SIGMOD Panel: “Where Does Academic Database Research Go From Here?” Eugene Wu co-organized a well-attended panel discussion that explored the evolving role and future direction of academic database research in light of technological and budgetary shifts, particularly the rise of AI. The discussion emphasized the unique comparative advantage of the academic database community and brainstormed actionable strategies for its future.
Keynote at the Human-in-the-loop Data Analytics (HILDA) Workshop: Eugene Wu delivered a keynote titled “A Decade of Systems for Human Data Interaction.” His talk reflected on the HILDA community’s comparative advantage in guaranteeing user experience rather than query workload, and how this fundamentally reshapes system design. He discussed how interface designs influence system optimizations and architectures, and how innovations in data management can in turn inform new visualization designs, interaction paradigms, and even visualization theory.
Tutorial on Privacy and Security in Distributed Data Markets. Eugene Wu, along with Daniel Alabi (Columbia Postdoc, now faculty at UIUC), Sainyam Galhotra, Shagufta Mehnaz, and Zeyu Song, presented an overview of modern data markets and data search systems, discussed various privacy attacks and security vulnerabilities within these markets, and explored the crucial role of legal systems in addressing these challenges.