When: 2:30 – 3:30 p.m., September 29, 2025
Where: MATH 402
Title: ATLAS: A System for Querying Multi-modal Data Lakes
Abstract:
Many organizations in fields such as medicine, astronomy, and finance produce data in a variety of modalities, including documents, spreadsheets, and log files. They typically use multi-modal data lakes as intermediate repositories for this data, which does not fit naturally into relational tables and thus is not directly queryable with standard declarative languages such as SQL. Yet high-value information lies deeply buried in these varied formats. To serve patients, accelerate scientific discovery, or find investment opportunities, users must go through a tedious and time-consuming data preparation process to transform, clean, and combine these data sets. Despite years of research on individual subcomponents of this complex data preparation problem, data scientists report spending 80% or more of their time on it.
Recently, data management researchers have shown the promise of Large Language Models (LLMs) for data preparation problems such as data extraction, data cleaning, and entity resolution, thanks to their reasoning and semantic-understanding abilities. However, making these LLM-based approaches practical in real-world applications is challenging because of the prohibitive inference cost and the generalist nature of LLM services, while solving these problems requires data management and AI expertise that many organizations do not possess. We therefore propose an LLM-powered system, called ATLAS, which combines LLMs with custom system optimizations to offer user-friendly, effective, and efficient querying of multi-modal data lakes. ATLAS allows users to analyze multi-modal data directly with SQL queries. It achieves this by producing optimized query execution plans that leverage LLMs to transform raw data in diverse formats into queryable relational form, while transparently resolving the efficiency and cost issues of LLMs. Rather than offering a set of pre-built operators and picking one to solve a data preparation problem, e.g., data extraction, ATLAS uses LLMs as a compiler to produce, on the fly, a domain-specific operator tailored to user requirements. This addresses the fact that applications in different domains have diverse needs, with no one-size-fits-all solution even for a single data preparation task. ATLAS also avoids expensive LLM invocation on every data record, which would suffer from both efficiency and monetary-cost issues. Instead, it uses lightweight methods, such as code synthesized by LLMs or small models trained on LLM-generated pseudo labels, to process most records, and selectively invokes LLMs only on challenging cases that require reasoning and semantic understanding.
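The selective-invocation idea in the abstract can be illustrated with a minimal sketch. This is not ATLAS's actual API; the extractor, the record format, and the `cascade` routine below are hypothetical, with a simple regex standing in for LLM-synthesized code and a placeholder standing in for the costly LLM call:

```python
import re

def cheap_extract(record: str):
    """Lightweight extractor, standing in for LLM-synthesized code.

    Returns (value, confident); confident=False routes the record
    to the expensive path.
    """
    m = re.search(r"amount:\s*\$?(\d+(?:\.\d+)?)", record)
    if m:
        return float(m.group(1)), True
    return None, False

def llm_extract(record: str):
    """Placeholder for a (costly) LLM call on hard records."""
    # A real system would invoke an LLM here; we just flag the record.
    return ("needs-llm", record)

def cascade(records):
    """Process most records cheaply; escalate only unparseable ones."""
    results, llm_calls = [], 0
    for r in records:
        value, confident = cheap_extract(r)
        if confident:
            results.append(value)
        else:
            llm_calls += 1
            results.append(llm_extract(r))
    return results, llm_calls

records = [
    "amount: $12.50 paid on 2025-01-02",
    "amount: 7 units shipped",
    "twelve dollars and fifty cents",  # free text -> escalated
]
values, calls = cascade(records)
```

On this toy input the cheap path handles two of the three records, so only one would incur an LLM invocation, which is the cost profile the abstract describes.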