Event Talks
About the speaker
Ihab Ilyas is a professor of Computer Science at the University of Waterloo and co-founder of Tamr | https://cs.uwaterloo.ca/~ilyas/
About the talk
"Data scientists spend a big chunk of their time preparing, cleaning, and transforming raw data before getting the chance to feed this data to their well-crafted models.
Despite the efforts to build robust prediction and classification models, data errors are still the main reason for low-quality results. These massive, labor-intensive exercises to clean data remain the main impediment to an automatic end-to-end AI pipeline for Data Science.
In this talk, I focus on data prep and cleaning as an inference problem, which can be automated by leveraging modern abstractions in ML.
I will describe the HoloClean framework, a scalable prediction engine for structured data. The framework has multiple successful proofs of concept with cleaning census data, market research data, and insurance records. The pilots with multiple commercial enterprises showed a significant boost to the quality of source (training) data before feeding them to downstream analytics.
HoloClean builds two main probabilistic models: a data generation model (describing how data was intended to look); and a realization model (describing how errors might be introduced to the intended clean data). The framework uses few-shot learning, data augmentation, and self-supervision to learn the parameters of these models, and uses them to predict both errors and their possible repairs."
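
To make the two-model idea in the abstract concrete, here is a minimal toy sketch in Python. It is not HoloClean's code or API. Every name in it (`realization_score`, `repair_cell`, the `zip`/`city` columns, the sample rows) is hypothetical, and the parameters `alpha` and `error_rate` are fixed by hand rather than learned as the abstract describes. The generation model is approximated by smoothed co-occurrence counts over the other rows, the realization model by a string-similarity score, and the predicted repair is the candidate that maximizes their product.

```python
from collections import Counter
from difflib import SequenceMatcher


def realization_score(observed, clean, error_rate=0.1):
    """Toy realization model: unnormalized score for seeing `observed`
    when the intended clean value is `clean`."""
    if observed == clean:
        return 1.0 - error_rate
    # Assumption: errors tend to resemble the clean value (typo-like noise).
    return error_rate * SequenceMatcher(None, observed, clean).ratio()


def repair_cell(rows, index, context, target, alpha=0.1, error_rate=0.1):
    """Predict the most likely intended value of rows[index][target]."""
    row = rows[index]
    # Hold out the cell under inspection so it cannot vouch for itself.
    others = rows[:index] + rows[index + 1:]

    # Toy generation model: P(target | context) from smoothed counts.
    counts = Counter(r[target] for r in others if r[context] == row[context])
    candidates = {r[target] for r in others} | {row[target]}
    total = sum(counts.values())

    def generation_prob(value):
        return (counts[value] + alpha) / (total + alpha * len(candidates))

    # Score each candidate by generation prior times realization score.
    scores = {
        value: generation_prob(value)
        * realization_score(row[target], value, error_rate)
        for value in candidates
    }
    return max(scores, key=scores.get)


# Hypothetical sample data; row 3 contains a misspelled city.
rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
]

repairs = [repair_cell(rows, i, "zip", "city") for i in range(len(rows))]
print(repairs)
```

On this sample, the misspelled cell in row 3 is repaired to "Chicago", while the other cells keep their observed values. A cell counts as a predicted error whenever the best-scoring candidate differs from what was observed, so the same scoring covers both error detection and repair.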
