SEA Data

Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores, 2nd edition

Co-located with VLDB 2021 (20 August 2021, Copenhagen, Denmark)

The SEA Data workshop will provide a forum for researchers and practitioners to exchange ideas, results, and visions on challenges in data management, information extraction, exploration, and analysis of heterogeneous data spanning multiple data models.

Companies, governments, and organizations are now producing and collecting data from multiple heterogeneous sources, such as transactional data, internet traffic, logs, IoT applications, knowledge bases, and much more. The unprecedented pace at which data is produced and consumed calls for methods that organize, retrieve, and analyze such data appropriately. While data was traditionally organized into homogeneous datastores and formats, collecting data from many different sources makes such datastores impractical. Even within the same organization, data dwells in independent silos, each with a distinct data model and serving a specific application, which keeps relevant portions of the data separate from each other.

As a consequence, we have witnessed an increasing interest in systems and methods that try to handle and analyze multiple data sources and formats holistically. Data lakes and polystores are the most prominent examples of such heterogeneous datastores. Moreover, graphs and learned databases have recently attracted the attention of the community for their flexibility in modeling, managing, and organizing heterogeneous data. Due to the fast pace of data collection and evolution, consolidating all the sources into a single data format and loading them into a single store is usually impractical.

Hence, the first challenge that these systems face is to provide flexible storage and retrieval methods that can adapt to multiple models and domains. On the other hand, from the user perspective, when such diverse data is collected, the tasks of data discovery, exploration, and analysis become even more challenging. For heterogeneous datastores, these tasks remain largely uncharted because established methods for effective multi-model data retrieval and exploration are lacking. Data analytics should also accommodate issues due to the lack of shared dimensions, ambiguous semantics, and the need to ensure the quality and lineage of the analytical result.

Workshop Chairs

Davide Mottin, Aarhus University
Matteo Lissandrini, Aalborg University
Senjuti Basu Roy, New Jersey Institute of Technology
Yannis Velegrakis, University of Trento & Utrecht University

Important Dates

Workshop: 8:30am-6:00pm — 20.08.2021

Workshop Program

VLDB Conference program

Workshop proceedings

Topics

SEA Data aims at gathering researchers and practitioners from various communities related to databases. We gladly accept submissions that present initial ideas and visions, just as much as reports on early results, or reflections on completed projects. The workshop will focus on discussion and interaction, rather than static presentations of what is in the paper. A list of relevant topics is presented below.

Search, Exploration, and Analysis for heterogeneous unstructured and semi-structured data (e.g., knowledge graphs, web documents, semantic web);
Multi-model data exploration and analysis;
Querying and analyzing data lakes and polystores;
Cross-platform query processing and analytics;
Theory of heterogeneous data management;
Machine-learning methods for multi-model data exploration and analysis;
Novel user interfaces and query paradigms for searching heterogeneous data;
Exploration of large datasets including multiple sources;
Data visualization of heterogeneous data;
Example-based search and discovery for multi-model and heterogeneous datastores;
User-driven approaches to data management for complex datasets;
Novel analyses involving multiple data sources;
Federated search, exploration, and analysis;
Information integration and entity resolution across heterogeneous knowledge bases and multi-model databases;
Approximate, anytime, and fast algorithms for extracting information from heterogeneous datastores;
Learnable structures for multi-model datasets;
Workload- and domain-agnostic self-assembling data management systems.

We also welcome submissions on thought-provoking applications and emerging uses of data management technology in heterogeneous datastores or multi-model databases. The workshop also welcomes papers on negative results.

You can also see the details from the previous edition at EDBT 2020.

Program Committee

Manos Athanassoulis (Boston University)
Nikolaus Augsten (University of Salzburg)
Hamdi Ben Hamadou (Aalborg University)
Sonia Bergamaschi (University of Modena and Reggio Emilia)
Nikos Bikakis (University of Ioannina)
Angela Bonifati (University of Lille 1 & Inria)
Anastasia Dimou (Ghent University)
Laura Di Rocco (Northeastern University)
George Fletcher (Eindhoven University of Technology)
Daniele Foroni (Huawei)
Johan-Christoph Freytag (Humboldt University Berlin)
Paul Groth (University of Amsterdam)
Francesco Guerra (University of Modena and Reggio Emilia)
Olaf Hartig (Linköping University)
Panos Karras (Aarhus University)
Xiangyu Ke (Nanyang Technological University)
Haridimos Kondylakis (Foundation of Research & Technology-Hellas)
Georgia Koutrika (Athena Research Center)
Ioana Manolescu (INRIA)
Renée Miller (Northeastern University)
Gabriela Montoya (Aalborg University)
Themis Palpanas (Paris Descartes University)
Paolo Papotti (EURECOM)
Giulia Preti (ISI Fondazione)
Arkaprava Saha (Nanyang Technological University)
Petra Selmer (Neo4j)
Gianmaria Silvello (University of Padua)
Alkis Simitsis (Athena Research Center)
Giovanni Simonini (University of Modena and Reggio Emilia)
Paolo Sottovia (Huawei)
Daniel Ting (Tableau)
Riccardo Torlone (University Roma Tre)
Aikaterini Tzompanaki (University of Cergy-Pontoise)
Kostas Zoumpatianos (Snowflake)

Workshop Program

Program Schedule
9:00	Welcome & Intro
9:05	Keynote: LIquid: Scaling the system that builds and serves the LinkedIn Economic Graph (Bogdan Arsintescu)
9:50	Paper Presentation: Towards A Unified Knowledge Graph Data Management System (Baozhu Liu, Xin Wang, Pengkai Liu, Sizhuo Li)
	Paper Presentation: Extreme-Scale Interactive Cross-Platform Streaming Analytics -- The INFORE Approach (Antonios Deligiannakis, Nikos Giatrakos, Yannis Kotidis, Vasilis Samoladas, Alkis Simitsis)
10:20	Q&A and Further Discussions: Part 1
10:30	Coffee Break
10:45	Paper Presentation: CovidGraph - A Knowledge Graph on COVID-19 (Martin Preusse, Alexander Jarasch, Tim Bleimehl, Sebastian Müller, Jamie Munro, Lea Gütebier, Ron Henkel, Dagmar Waltemath)
	Paper Presentation: Declarative Querying of Heterogeneous NoSQL Stores (Nikolaos Koutroumanis, Nikolaos Kousathanas, Christos Doulkeridis, Akrivi Vlachou)
	Paper Presentation: Multi-model Query Processing Meets Category Theory and Functional Programming (Valter Uotila, Jiaheng Lu, Dieter Gawlick, Zhen Hua Liu, Souripriya Das, Gregory Pogossiants)
	Paper Presentation: Let the Database Talk Back: Natural Language Explanations for SQL (Stavroula Eleftherakis, Orest Gkini, Georgia Koutrika)
11:50	Q&A and Further Discussions: Part 2
12:00	Lunch Break
13:00	Paper Presentation: Know your experiments: interpreting categories of experimental data and their coverage (Edoardo Ramalli, Barbara Pernici)
	Paper Presentation: New Workflows in NoSQL Schema Management (Michael Fruth, Kai Dauberschmidt, Stefanie Scherzinger)
	Paper Presentation: The Secret Life of Wikipedia Tables (Tobias Bleifuß, Leon Bornemann, Dmitri V. Kalashnikov, Felix Naumann, Divesh Srivastava)
13:45	Q&A and Further Discussions: Part 3
13:55	Coffee Break
14:10	Paper Presentation: Finding NeMo: Fishing in banking networks using network motifs (Xavier Fontes, David Aparício, Maria Inês Silva, Beatriz Malveiro, João Tiago Ascensão, Pedro Bizarro)
	Paper Presentation: A Data Discovery Platform Empowered by Knowledge Graph Technologies: Challenges and Opportunities (Essam Mansour)
	Paper Presentation: Auctus: A Search Engine for Data Discovery and Augmentation (Sonia Castelo, Rémi Rampin, Aécio Santos, Aline Bessa, Fernando Chirigati, Juliana Freire)
14:55	Q&A and Further Discussions: Part 4
15:05	Keynote: Systems for Human Data Interaction (Eugene Wu)
15:50	Q&A and Further Discussions: Part 5

Keynotes:

Systems for Human Data Interaction
by Eugene Wu

Abstract:

The rapid democratization of data has placed its access and analysis in the hands of the entire population. While the advances in rapid and large-scale data processing continue to reduce runtimes and costs, the interfaces and tools for end-users to interact and work with data are still lacking.

It is still too difficult to translate a user's data needs into the appropriate interfaces, too difficult to develop data interfaces that are responsive end-to-end and scalable, and too difficult for users to understand and interpret the data they see. In this talk, I will provide an overview of our lab's recent work on systems for human data interaction that go towards addressing these challenges.

Speaker Bio:

Eugene Wu is an Associate Professor of Computer Science at Columbia University. He received a Ph.D. in EECS from MIT and a B.S. from UC Berkeley. He is broadly interested in technologies for human data interaction, and in how users can effectively and quickly make sense of their data. Eugene is interested in solutions that ultimately improve the interface between users and data. He combines ideas from database management, visualization, and HCI. Eugene has received the VLDB 2018 test-of-time award, the coveted CIDR gong show award, NSF CAREER, and the Google and Amazon faculty awards.

LIquid: Scaling the system that builds and serves the LinkedIn Economic Graph
by Bogdan Arsintescu & Scott Meyer

Abstract:

Liquid is a distributed graph database service that scales to serve the LinkedIn Economic Graph: 200B edges, 1B vertices, 1.2M QPS with very low latency and 99.99% availability. (Check your LinkedIn profile now, make our day!). This presentation describes some of the fundamental building blocks that allow us, on one side, to build and nimbly evolve a graph of this caliber from multiple data sources and, on the other side, to enable the entire company to compose ever-more-complex queries while increasing launch velocity and eventually develop one-query applications.

The graph log structure dramatically shrinks the cost and complexity of graph construction and curation: the sequential nature of the log enables us to harness the curation effort of arbitrarily many people. Similarly, it allows data to come from different sources. Consider that the curation of code bases now works this way, whereas the curation of data has remained unchanged for decades.

A declarative language is required to free the application developer from the optimization complexity of the graph structure and nature. While many new languages promise simplicity, we have chosen Datalog for expressivity and modularity. Datalog can model subgraphs and constraints in the graph expressions in a better way than 'query by example' languages; also, it is a complete implementation of the relational model. Moreover, Datalog rules allow scalable modularity, reuse, and evolution of the queries: an entire application can be built as a single query and hierarchically composed. We are using Datalog both in the ETL to construct the graph and to query it.

We will exemplify these traits by building a sample graph from open-source datasets using declarative ETLs, curation, and queries. We will discuss how such a graph can evolve by adding new datasets and how we scale the serving system in production.

Speaker Bio:

Bogdan Arsintescu is Director of Engineering at LinkedIn, leading the Graph team, which is responsible for building a state-of-the-art distributed graph database from the ground up and operating the LinkedIn Economic Graph, a low-latency online service for a 200B-edge graph with peak traffic exceeding 1M QPS. Prior to that, Bogdan worked at Google, with primary contributions on interval temporal logic for location data and Pregel-based query execution for the Google Knowledge Graph. Earlier still, he worked at Cadence Design Systems on semantic data for design automation.

Venue

SEA Data will be co-located with VLDB 2021, to be hosted in Copenhagen, Denmark.

Tivoli Hotel & Congress Center
Arni Magnussons Gade 2
1577
Copenhagen, Denmark.

Submission Guidelines

All submissions will be electronic via the Easychair submission system.

Regular research papers as well as system papers have a page limit of 6 pages (references included).

Experiments and Analysis papers also have a page limit of 6 pages (references included).

Vision papers, work-in-progress papers, and experience papers have a page limit of 4 pages (references included).

The SEA Data 2021 workshop is single-blind, and thus authors must include their names and affiliations in submissions.

Formatting

Submitted papers must follow the VLDB Proceedings Format (available here) and must be submitted as PDF files.

The font size, margins, inter-column spacing, and line spacing in the templates must be kept unchanged.

Any submitted paper violating the length, file type, or formatting requirements will be rejected without review.

Formatting guidelines for camera-ready papers will follow.
