Progressive Data Analysis
Date
2024-11-11
Publisher
The Eurographics Association
Abstract
We live in an era in which data is abundant and growing rapidly. Big data databases sprawl past memory and computation limits and across distributed systems. To sustain this growth, engineers are designing new hardware and software systems with new storage management and capabilities for predictive computation. Yet, as datasets grow and computations become more complex, response times suffer. These infrastructures, while well suited to data at scale, do not support exploratory data analysis (EDA) effectively. EDA allows analysts to make sense of data with little or no known model and is essential in many application domains, from network security and fraud detection to epidemiology and preventive medicine. Data exploration is conducted through an iterative loop in which analysts interact with data through computations that return results, usually displayed with visualizations, which the analyst can then interact with again. EDA demands low system response times: at 500 ms, users typically change their querying behavior; after five or ten seconds, they abandon tasks or lose attention. To address this problem, a new computational paradigm has emerged in the last decade, which goes under several names. In the database community, it is called online aggregation, while among visualization researchers, it has been called progressive, incremental, or iterative visualization. In this book, we will refer to it as Progressive Data Analysis.
This paradigm consists of splitting long computations into a series of approximate results that improve over time. In this process, partial or approximate results are rapidly returned to the user and can be interacted with in a fluent and iterative fashion. As data continues to grow, progressive data analysis is poised to become one of the leading paradigms for data exploration systems, but it will also require major changes to algorithms, data structures, and visualization tools.
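The core idea of emitting a stream of improving approximate results can be illustrated with a minimal sketch (not taken from the book): a mean computed over chunks of a data stream, where each chunk yields a refined estimate the user could visualize immediately.

```python
def progressive_mean(stream, chunk_size=1000):
    """Yield a running estimate of the mean after each chunk of samples.

    Each yielded value is an approximate result that improves as more
    of the stream is consumed -- the essence of a progressive computation.
    """
    total, count = 0.0, 0
    chunk_count = 0
    for x in stream:
        total += x
        count += 1
        chunk_count += 1
        if chunk_count == chunk_size:
            chunk_count = 0
            yield total / count  # partial result, refined over time
    if chunk_count:  # emit a final estimate for a trailing partial chunk
        yield total / count

# Small deterministic example: estimates converge to the exact mean.
print(list(progressive_mean(range(10), chunk_size=3)))  # → [1.0, 2.5, 4.0, 4.5]
```

A progressive system would render each yielded estimate as it arrives, so the analyst sees a usable answer long before the stream is exhausted.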
By solving the latency issue, progressive data analysis opens up new perspectives but also presents new challenges. Problems involving complex analyses can now be addressed interactively with little or no preparation time, but how long should the analyst wait before obtaining results that are good enough to make decisions? Can progressive systems provide effective quality measures to avoid either waiting too long or deciding too early? While the progressive process is being computed, the iterative visualization of partial results should remain stable enough to be monitored. Can we stabilize these partial results without hurting their quality or speed? And, more fundamentally, can we transform all data analysis operations so that they become progressive? These questions, among others, are at the core of this book.
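One conceivable answer to the quality-measure question — offered here as an illustration, not as the book's prescription — is to report a running confidence interval alongside each partial estimate, so the analyst can stop as soon as the interval is tight enough. A minimal sketch using Welford's online variance algorithm:

```python
import math

def progressive_mean_ci(stream, chunk_size=500, z=1.96):
    """Yield (estimate, ci_half_width) after each chunk of samples.

    Welford's online algorithm maintains the running variance in one pass,
    so an approximate 95% confidence half-width on the mean can accompany
    every partial result without rescanning the data.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % chunk_size == 0 and n > 1:
            half = z * math.sqrt(m2 / (n - 1) / n)  # z * standard error
            yield mean, half
```

As more data is consumed, the half-width shrinks, giving the analyst a principled signal for when a partial result is good enough to act on.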
This book is an introduction to the new paradigm of progressive data analysis and visualization. It explores the major scientific and technical benefits of performing complex data analysis progressively on big data. It also examines the challenges that must be addressed for the paradigm to become fully usable. These important issues involve research fields that are traditionally viewed as separate areas in computer science: databases, scientific computing, machine learning, visualization, statistics, and human-computer interaction; these fields will need to work together on end-to-end solutions that address these challenges and enable the emergence of practical progressive systems. This book closes with a research agenda to help researchers converge on key questions.