CS598: Human-in-the-loop Data Management

The course explores two complementary roles for humans in interactive data analytics. In the first, humans are the analysts performing or supervising the analysis; here, the emphasis is on building usable tools for these analysts. In the second, humans are the crowdsourced workers assisting with the computation and analysis; here, the emphasis is on having humans process as little data as possible while gaining maximum benefit.

Students will read a number of papers, both landmark papers and cutting-edge ones, act as a discussant for a paper at least once, and complete a semester-long implementation project. Familiarity with basic databases, machine learning, and algorithms is expected.

News

September 3: The final list of papers and schedule is up. Following the instructions given below, send us the list of 5 papers that you'd like to present by midnight on September 6. Use the following link.
August 25: There will be no classes on 8/28 and 8/30 as Aditya will be attending the VLDB conference.

Course Significance

Crowd-Powered Analytics: An IBM study estimated that 80% of the data recorded every day is unstructured: i.e., it consists of images, videos and text. Fully automated processing of unstructured data is not yet a solved problem. Humans, on the other hand, are very good at understanding, interpreting, and processing unstructured data. How do we use humans to effectively process large volumes of unstructured data?

Interactive Analytics: A McKinsey Big Data Study estimated that tens of millions of new data analysts will be needed by 2017. With so many novice data analysts interacting with data, how do we enable them to quickly get valuable insights? Quickly could mean generating the same results faster, but approximately; it could mean showing them visualizations instead of raw data; it could mean helping the users to "guess" the query or insight in mind.

Instructions for Submitting List of Papers to Present

You must use the following link to submit your list of top-5 papers: Link.

The papers you provide can be from the list given below. You are also free to list a paper of your choice, but it must match the themes of the class. This list must be submitted by midnight on September 6.

Instructions for Submitting Class Reviews

You must use the following link to submit class reviews: Link.

Remember to cover the 5 key questions, all within 500 words: what is the problem, why is it important, what sets it apart from previous work, what are the key technical ideas, and what are the key flaws and open issues.

The class reviews must be submitted by midnight the day before class.

Grading Policy

Class Reviews: 20%

Due day before class at midnight.
Lightly graded; allowed to miss 3 (three) in total. Late submissions count towards the 3
After 3 misses, you lose 1% of grade per missed review

Class Participation: 15%
Paper Presentation: 15%

Send us top 5 papers you’d like to present by Sep 6
The presentation slides should be mailed to Tarique 48 hrs in advance of your presentation

Research Project: 50%

Proposal (27th Sep midnight) + Midterm report (30th Oct midnight)
Final report + presentation (report due: 13th Dec midnight; presentation slides due: 11th or 13th December prior to class)
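
The weights and the missed-review rule above can be sketched as a small calculation. This is only an illustration, not an official grade calculator: the names `WEIGHTS`, `review_penalty`, and `course_grade` are hypothetical, and it assumes that "1% of grade" means one percentage point of the overall course grade, which the policy does not spell out.

```python
# Component weights from the grading policy (percent of the course grade).
WEIGHTS = {
    "reviews": 20.0,
    "participation": 15.0,
    "presentation": 15.0,
    "project": 50.0,
}

FREE_MISSES = 3         # reviews that may be missed (or late) without penalty
PENALTY_PER_MISS = 1.0  # ASSUMPTION: percentage points of the overall grade


def review_penalty(missed: int) -> float:
    """Points lost for missed or late reviews beyond the free allowance."""
    if missed < 0:
        raise ValueError("missed must be non-negative")
    return max(0, missed - FREE_MISSES) * PENALTY_PER_MISS


def course_grade(scores: dict[str, float], missed_reviews: int) -> float:
    """Weighted total on a 0-100 scale, minus the review penalty.

    `scores` maps each component name to a fraction in [0, 1].
    ASSUMPTION: the penalty is subtracted from the overall total.
    """
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return max(0.0, total - review_penalty(missed_reviews))
```

Under these assumptions, missing up to three reviews costs nothing, and a student with otherwise full marks who misses five reviews would lose two points. How the penalty interacts with the 20% review component is not specified in the policy, so check with the instructor for the actual rule.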

Schedule

Date	Paper	Presenter	Notes
8/28/2017	VLDB Conference--No Lecture
8/30/2017	VLDB Conference--No Lecture
9/4/2017	Holiday--Labor Day
9/6/2017	Introduction to course content	Aditya	Send list of paper preferences by midnight September 6th.
9/11/2017	CrowdScreen: Algorithms for Filtering Data Using Humans	Aditya
9/13/2017	Human-Powered Sorts and Joins	Aditya	First class review due; send it by midnight the day before.
9/18/2017	So Who Won: Dynamic Max Discovery with the Crowd	Assma	Student Presentations Start
9/20/2017	CrowdDB: Answering Queries Using Crowdsourcing	Fareedah
9/25/2017	Deco: Declarative Crowdsourcing	Litian
9/27/2017	Dremel: Interactive Analysis Of Web-Scale Datasets	Dipannita
10/2/2017	Spark SQL: Relational Data Processing in Spark	Junting Lou
10/4/2017	BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data	Subham De
10/9/2017	Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee	Silu Huang, Ph.D. Student, DAIS
10/11/2017	Trust Me, I’m Partially Right: Incremental Visualization Lets Analysts Explore Large Datasets Faster	Aditya
10/16/2017	Incvisage: I’ve Seen “Enough”: Incrementally Improving Visualizations to Support Rapid Decision Making	Sajjadur, Ph.D. Student, DAIS
10/18/2017	ImMens: Real-time Visual Querying of Big Data	Tao Mo
10/23/2017	Polaris: A System for Query, Analysis, and Visualization of Multidimensional Relational Databases	Chi-Hsien Yen
10/25/2017	Effortless Data Exploration with zenvisage: An Expressive and Interactive Visual Analytics System	Edward Xue
10/30/2017	dbTouch: Analytics at your Fingertips	Peter
11/1/2017	Gestural Query Specification	Saar Kuzi	Midterm project report due on the 3rd at midnight
11/6/2017	DataPlay: Interactive Tweaking and Example-driven Correction of Graphical Database Queries	Doris
11/8/2017	DataSpread: Unifying Databases and Spreadsheets	Mangesh, Ph.D. Student, DAIS
11/13/2017	Making Database Systems Usable	Assma
11/15/2017	MLbase: A Distributed Machine-learning System	Jialin
11/20/2017	Thanksgiving
11/22/2017	Thanksgiving
11/27/2017	Guest Lecture: Leveraging data and people to accelerate data science	Laura Haas, IBM Research	No class
11/29/2017	MAD Skills: New Analysis Practices for Big Data	Yue
12/4/2017	GraphLab: A New Framework For Parallel Machine Learning	Siyu
12/6/2017	OrpheusDB: Bolt-on Versioning for Relational Databases	Liqi, Ph.D. Student, DAIS
12/11/2017	Project Presentations		Presentation due prior to class
12/13/2017	Project Presentations		Presentation due prior to class; report due midnight

Tentative List of Papers

Theme 1: Dealing with Unstructured and Noisy Data

Crowd-Powered Systems

Crowd-Powered Algorithms

Data Cleaning Systems

Theme 2: Dealing with Huge Data

Scalable Data Processing Systems

Approximate Analytics

Scalable Visualizations

Theme 3: Dealing with Novice Analysts

Visual Analytics Systems

New Interfaces and Usability

Autocompletion and Query Refinement

Theme 4: Dealing with New Data Analytics Scenarios

Machine Learning and Graph Processing

Collaborative Query Processing

Implementation Project

As part of this course, you need to complete a semester-long project. See the instructor for ideas. Alternatively, you are free to look for ideas in your domain of expertise: for instance, if you work in computational journalism, building a new way to browse and manage large collections of textual archives could be a perfectly reasonable project. Either way, you must speak to the instructor to verify that the project is indeed "challenging" enough.