The field of data analytics comprises techniques, algorithms and tools for inspecting data collections in order to extract patterns, generalizations and other useful information. The success and effectiveness of such analysis depend on numerous challenges related to the data itself, the nature of the analytics tasks, and the computing environment over which the analysis is performed. These issues have given rise to many diverse programming models, execution engines and data stores that enable large-scale data management. While all these systems have had great success, each demonstrates its advantages only on a limited class of applications and data types: graph-processing engines, for instance, restrict the computation allowed at each node (or part of a graph) and fail to fully exploit possible parallelism.
In addition, modern analytics workflows are tremendously complex: data sources are heterogeneous and distributed; tasks may be long- or short-running and entail different execution details depending on the user's role and expertise; such tasks may range from simple or complex data operations and queries to algorithmic processing such as data mining, text retrieval and data annotation; finally, the analysis may require multiple query engines.
To harvest the benefits of this plethora of data and compute engines, as well as the programming models, libraries and tools available, we need coordinated, adaptive and integrative efforts that collectively tap their potential. This central goal is the focus of this workshop. These efforts include the definition of versatile programming models, engine performance modeling and monitoring, extended planning and optimization algorithms, deployment and execution on multiple engines, as well as workflow management and visualization techniques for complex analytics queries over large, heterogeneous, irregular or unstructured data across diverse compute environments.
Workshop focus and related topics
The goal of the 1st International Workshop on Multi-Engine Data Analytics (MEDAL) is to bring together researchers and practitioners from academia and industry to explore, discuss and possibly redefine the state of the art in big data analytics. Topics of interest span modeling, methods and tools applicable to any part of the analytics algorithms and computing infrastructure, as well as use cases and applications of big data analytics over multi-engine environments. Concretely, the workshop is expected to provide insight into:
- Modeling of analytics processes: new models and languages to program, represent and execute complex tasks
- Execution of analytics processes: planning, optimizing and executing complex or multiple workflows especially on dynamic multi-engine and elastic environments
- Tools for advanced analytics tasks: theoretical and practical development of analytics tasks and operators for regular and irregular computations
- Visualization of analytics tasks: adaptive, user-friendly and diverse representation of real-time and batch tasks executing over single or multiple runtimes
- Applications of big data analytics: case-studies, exhibition of application-specific challenges
This workshop will solicit original research work on fundamental aspects of big data analytics as well as the design, implementation and evaluation of novel tools, methods and applications for optimizing big data workflows (in parts or as a whole). We note here that contributions may span a wide range of topics, including (but not limited to):
- New or advanced models for data analytics, especially those unifying multiple data and execution engines
- Languages for data analytics
- Data analytics on multi-engine environments, adaptive execution
- Execution semantics of data analytics
- Analytics engine runtime monitoring
- Unified cost models towards multi-engine analytics optimization
- Multi-workflow optimization
- Visualization tools or UI architectures that integrate multiple analytics inputs
- Scheduling algorithms and tools for analytics execution
- Applications and use cases of data analytics over diverse platforms
- Visionary ideas on data analytics
Workshop Program
09:00 - 10:30 Session 1
- Towards an Analytics Query Engine, Nantia Makrynioti and Vasilis Vassalos
- Polystore Query Rewriting: The Challenges of Variety, Yannis Papakonstantinou
- The Data Management Entity: A Simple Abstraction to Facilitate Big Data Systems Interoperability, Damianos Chatziantoniou and Florents Tselai
10:30 - 11:00 Coffee Break
11:00 - 12:30 Session 2 - Keynote Talks
11:00 - 11:55 Keynote 1: Big Data Management and Scalable Data Science: Key Challenges and (Some) Solutions, Prof. Dr. Volker Markl, TU Berlin
11:55 - 12:30 Keynote 2: Enabling Cross-Platform Applications with Rheem, Dr. Jorge Quiané-Ruiz, QCRI
12:30 - 14:00 Lunch Break
14:00 - 15:30 Session 3
- MELOGRAPH: Multi-Engine WorkfLOw Graph Processing, Camelia Elena Ciolac
- A Relational Approach to Complex Dataflows, Yannis Chronis, Yannis Foufoulas, Vaggelis Nikolopoulos, Alexandros Papadopoulos, Lefteris Stamatogiannakis, Christoforos Svingos and Yannis Ioannidis
- Optimizing, Planning and Executing Analytics Workflows over Multiple Engines, Katerina Doka, Maxim Filatov, Victor Giannakouris, Verena Kantere, Nectarios Koziris, Christos Mantas, Nikolaos Papailiou, Vasilios Papaioannou and Dimitrios Tsoumakos
15:30 - 16:00 Coffee Break
16:00 - 17:30 Session 4
- Weighted Sum Model for Multi-Objective Query Optimization for Mobile-Cloud Database Environments, Florian Helff, Le Gruenwald and Laurent d'Orazio
- Operator and Workflow Optimization for High-Performance Analytics, Hans Vandierendonck, Karen Murphy, Mahwish Arif, Jiawen Sun and Dimitrios Nikolopoulos
- Large Scale Sentiment Analysis on Twitter with Spark, Nikolaos Nodarakis, Spyros Sioutas, Athanasios Tsakalidis and Giannis Tzimas
Keynote 1
Big Data Management and Scalable Data Science: Key Challenges and (Some) Solutions
Prof. Dr. Volker Markl
The shortage of qualified data scientists is effectively limiting Big Data from fully realizing its potential to deliver insight and provide value for scientists, business analysts, and society as a whole. Data science draws on a broad number of advanced concepts from the mathematical, statistical, and computer sciences, in addition to requiring knowledge in an application domain. Solely teaching these diverse skills will not enable us to exploit, on a broad scale, the power of predictive and prescriptive models for huge, heterogeneous, and high-velocity data. Instead, we will have to simplify the tasks a data scientist needs to perform, bringing technology to the rescue: for example, by developing novel ways for the specification, automatic parallelization, optimization, and efficient execution of deep data analysis workflows. This will require us to integrate concepts from data management systems, scalable processing, and machine learning, in order to build widely usable and scalable data analysis systems. In this talk, I will present some of our research results towards this goal, including the Apache Flink open-source big data analytics system, concepts for the scalable processing of iterative data analysis programs, and ideas on enabling optimistic fault tolerance.
Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin (TU Berlin) and also holds a position as an adjunct full professor at the University of Toronto. He is director of the research group “Intelligent Analysis of Mass Data” at DFKI, the German Research Center for Artificial Intelligence, and director of the Berlin Big Data Center, a collaborative research center bringing together research groups in the areas of distributed systems, scalable data processing, text mining, networking, machine learning, and applications in several areas, such as healthcare, logistics, Industrie 4.0, and information marketplaces.
Earlier in his career, Dr. Markl led a research group at FORWISS, the Bavarian Research Center for Knowledge-based Systems in Munich, Germany, and was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA. His research interests include: new hardware architectures for information management, scalable processing and optimization of declarative data analysis programs, and scalable data science, including graph and text mining, and scalable machine learning.
Volker Markl has presented over 200 invited talks in numerous industrial settings and at major conferences and research institutions worldwide. He has authored and published more than 100 research papers at world-class scientific venues. He has been speaker and principal investigator of the Stratosphere collaborative research unit funded by the German National Science Foundation (DFG), which resulted in numerous top-tier publications as well as the "Apache Flink" big data analytics system. Dr. Markl currently serves as the secretary of the VLDB Endowment and was elected as one of Germany's leading "digital minds" (Digitale Köpfe) by the German Informatics Society (GI).
Keynote 2
Enabling Cross-Platform Applications with Rheem
Dr. Jorge Quiané-Ruiz
The world is fast moving towards a data-driven society where data is the most valuable asset. Organizations need to perform very diverse analytic tasks using various data processing platforms. In doing so, they face many challenges; mainly, platform dependence, poor interoperability, and poor performance when using multiple platforms. In this talk, I will present Rheem, our vision for big data analytics over diverse data processing platforms. Rheem provides a three-layer data processing and storage abstraction to achieve both platform independence and interoperability across multiple platforms. I will discuss how Rheem allows for cross-platform applications. In particular, I will present a machine learning and a data cleaning application and show how applications can leverage the platform-independence and cross-platform execution features of Rheem to boost performance. I will conclude with a discussion on the multiple research challenges that we need to address to achieve our vision.
Jorge-Arnulfo Quiané-Ruiz has been a Scientist at the Qatar Computing Research Institute (QCRI) since October 2012. His research interests include cross-platform data management, big data analytics, and big data profiling. Before joining QCRI, Jorge was a research associate at Saarland University for three years. He did his Ph.D. in Computer Science at INRIA and the University of Nantes, France, and obtained his degree in September 2008. He received an M.Sc. in Computer Science with a specialty in Networks and Distributed Systems from Joseph Fourier University, Grenoble, France, in July 2004. He obtained, with highest honors, an M.Sc. in Computer Science from the National Polytechnic Institute, Mexico, in August 2003.
Workshop organizers:
- Verena Kantere, University of Geneva, Switzerland (Verena.Kantere@unige.ch)
- Dimitrios Tsoumakos, Ionian University, Greece (email@example.com)
Technical program committee (tentative):
- Alex Delis, University of Athens, Greece
- Bipin C. Desai, Concordia University, Canada
- Victor Chang, University of Southampton, UK
- Katerina Doka, National Technical University of Athens, Greece
- Thomas Heinis, Imperial College, UK
- Asterios Katsifodimos, TU Berlin, Germany
- Tasos Kementsietsidis, Google, USA
- Manolis Koubarakis, University of Athens, Greece
- Vera Moffitt, Drexel University, USA
- Laurent d'Orazio, University of Clermont-Ferrand, France
- George Pallis, University of Cyprus, Cyprus
- Polyvios Pratikakis, Foundation for Research and Technology – Hellas (FORTH), Greece
- Philippe Rigaux, Internet Memory Research, France
- Senjuti Basu Roy, University of Washington, USA
- Arno Scharl, webLyzard technology, Austria
- Peter Triantafillou, University of Glasgow, UK
- Hans Vandierendonck, Queen's University Belfast, UK
Important dates:
- Paper submission deadline: EXTENDED TO December 31, 2015
- Notification to authors: EXTENDED TO January 25, 2016
- Camera-ready deadline: February 5, 2016
MEDAL will be a full-day event, organized and themed around the ASAP FP7 EU-funded project (http://www.asap-fp7.eu/), which tackles the problem of complex analytical tasks over multi-engine environments that require integrated profiling, modeling, planning and scheduling functions.
Papers must be submitted as PDF files, using the ACM SIG Proceedings double-column template (http://www.acm.org/sigs/publications/proceedings-templates), with the following page limits:
- 8 pages for full submissions
- 4 pages for short/visionary paper submissions
- 2 pages for demo/tutorial submissions
All submissions will be handled electronically via EasyChair. The link for the submission page is: https://easychair.org/conferences/?conf=medal2016.