XLDB - Extremely Large Databases

XLDB-2012 Demonstrations


Tuesday, September 11, 2012, 3:30-4:10pm session
A Array Databases in Action: Rasdaman in the Earth Sciences Peter Baumann / Jacobs University
B EXTASCID: An Extensible System for the Analysis of Scientific Data Yu Cheng and Florin Rusu / University of California at Merced
C Scientific Data Processing Using SciQL Ying Zhang and Martin Kersten / CWI
Wednesday, September 12, 2012, 3:45-4:25pm session
D Evolutionary Big Data Management System Built from Experience and Pain Iran Hutchinson / InterSystems and GlobalsDB
E Traffic Data Analytics - Big Data Technologies Open Up New Paths Aparajeeta Das / Third Eye Consulting Services & Solutions,
Mike Long and Adrian Wilson / Microsoft

Array Databases in Action: Rasdaman in the Earth Sciences
Peter Baumann / Jacobs University

A large class of scientific "Big Data" is naturally represented as arrays of some dimension, extent, and cell ("pixel", "voxel", etc.) type. While traditionally neglected by the database community, arrays have recently been gaining attention, and conceptual array models, query languages, and, finally, system implementations are emerging.

The rasdaman ("raster data manager") array analytics database is a fully-fledged tool that augments standard relational databases with massive array storage and query capabilities. It is in use at a number of superscale data centers. In the transatlantic EarthServer initiative, NASA and ESA are joining forces to establish ad-hoc processing and filtering capabilities on Petascale land, ocean, air, and planetary archives based on rasdaman.

We will briefly introduce the rasdaman array model and query language to establish the basics for live demo applications in several Earth sciences, with 1-D through 5-D examples drawn from sensor data, remote sensing, and climate modeling. The demo shows spatiotemporal subsetting, aggregation, and massive ad-hoc data restructuring, including array joins. Rasdaman has strongly influenced geo service standardization, so demonstrating these standards in action on top of a database stack and on real-life data center products is an integral part.
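The spatiotemporal subsetting and aggregation described above can be sketched with NumPy on a small made-up datacube; in rasdaman itself, such operations are expressed in its rasql query language and evaluated server-side, so the array names and data below are purely illustrative:

```python
import numpy as np

# Hypothetical 3-D datacube with axes (time, lat, lon),
# e.g. twelve monthly grids of 10x10 cells.
cube = np.arange(12 * 10 * 10, dtype=float).reshape(12, 10, 10)

# Spatiotemporal subsetting: the first six months, a 4x4 spatial window
# (roughly what a rasql trim expression like c[0:5, 2:5, 2:5] selects,
# with rasql bounds being inclusive).
subset = cube[0:6, 2:6, 2:6]

# Aggregation over the subset, e.g. a mean (avg_cells in rasql).
mean_val = subset.mean()

# Ad-hoc restructuring: collapse the time axis into a per-cell average map.
avg_map = cube.mean(axis=0)

print(subset.shape, avg_map.shape, mean_val)
```

The point of an array DBMS is that such expressions run next to the stored arrays instead of requiring the full cube to be downloaded first.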

Finally, we will give an outlook on the next steps.


EXTASCID: An Extensible System for the Analysis of Scientific Data
Yu Cheng and Florin Rusu / University of California at Merced

Data generated through scientific experiments and measurements has unique characteristics that require special processing techniques. The size of the data is extremely large, with terabytes of new data generated daily. The structure of the data is diverse: it is not only relational, but often consists of multi-dimensional arrays with varying degrees of sparsity. Analyzing scientific data requires complex processing that can seldom be expressed in a declarative language such as SQL.

In this demonstration, we introduce EXTASCID, a parallel system targeted at efficient analysis of large-scale scientific data. EXTASCID provides a flexible storage manager that natively supports both relational data and high-dimensional arrays. Any processing task is specified using a generic interface that extends the standard UDA mechanism. Given a task and the corresponding input data, EXTASCID manages the execution in a highly parallel environment, with data partitioned across multiple processing nodes completely transparently to the user. We evaluate the expressiveness of the task specification interface and the performance of our system by implementing SS-DB, a standard benchmark for scientific data analysis. In terms of expressiveness, EXTASCID lets the scientific user focus on the core logic of the processing, which is executed entirely by the system, near the data, without the need to handle complicated operations at the application level. Our experimental results on the SS-DB benchmark confirm the strong performance of EXTASCID when handling large amounts of data.
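The UDA (user-defined aggregate) pattern that EXTASCID's task interface extends can be sketched as an init / accumulate / merge / terminate protocol. The class below is an illustrative stand-in, not EXTASCID's actual API; it only shows how partial states computed on different nodes are merged:

```python
class AvgUDA:
    """Minimal user-defined aggregate following the classic UDA protocol."""

    def init(self):
        self.total, self.count = 0.0, 0

    def accumulate(self, value):
        # Called once per tuple (or array cell) on each processing node.
        self.total += value
        self.count += 1

    def merge(self, other):
        # Combines partial states computed on different nodes.
        self.total += other.total
        self.count += other.count

    def terminate(self):
        return self.total / self.count if self.count else None


# Simulate two nodes processing a partitioned dataset, then merging.
left, right = AvgUDA(), AvgUDA()
left.init(); right.init()
for v in [1.0, 2.0, 3.0]:
    left.accumulate(v)
for v in [4.0, 5.0]:
    right.accumulate(v)
left.merge(right)
print(left.terminate())  # → 3.0
```

Because merge is associative, the system is free to partition the data and parallelize the aggregation transparently to the user.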

During the demo, we present the main characteristics of the system by showing how the SS-DB benchmark is implemented in EXTASCID. We use two configurations: small (100GB) and normal (1TB). The small configuration is demonstrated on a single computing node (our laptop), while for the normal configuration we connect remotely to our computing cluster at UC Merced. XLDB participants attending our demo will experience how complex analysis tasks on massive datasets (raw and derived) can be executed in a matter of seconds. Attendees will be able to run any query in the benchmark, including cooking and grouping, both with pre-defined parameters and with parameters of their choice.


Scientific Data Processing Using SciQL
Ying Zhang and Martin Kersten / CWI

Scientific discoveries increasingly rely on the ability to efficiently grind through massive amounts of experimental data using database technologies. To bridge the gap between the needs of data-intensive research fields and current DBMS technologies, we are developing SciQL (pronounced 'cycle'), an SQL-based query language for scientific applications with both tables and arrays as first-class citizens. It provides a seamless symbiosis of array, set, and sequence interpretations. A key innovation is the extension of the value-based grouping of SQL:2003 with structural grouping, i.e., fixed-sized and unbounded groups based on explicit relationships between element positions. This leads to a generalisation of window-based query processing with wide applicability in science domains. In this demo, we show the main features of SciQL using remote sensing image processing use cases.
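The contrast between value-based and structural grouping can be illustrated in plain Python: SQL's GROUP BY collects elements that share a value, while structural groups are defined purely by element positions. The data and helper names below are made up for illustration; in SciQL such groups are expressed declaratively over array dimensions:

```python
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]

def tiled_groups(seq, size):
    """Fixed-size structural groups: non-overlapping tiles of `size` elements."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def sliding_windows(seq, size):
    """Overlapping structural groups, i.e. classic window-based processing."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]

# Aggregate each structural group, e.g. a per-tile average
# (useful for downsampling a sensor series or an image row).
tile_avgs = [sum(g) / len(g) for g in tiled_groups(data, 4)]
print(tile_avgs)  # → [2.25, 5.5, 5.25]
```

Generalizing this positional notion of a group to n-dimensional arrays is what gives window-style queries their wide applicability in image and sensor processing.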


Evolutionary Big Data Management System Built from Experience and Pain
Iran Hutchinson / InterSystems and GlobalsDB

I propose to discuss the drivers for a project called Cetacean and demonstrate some of its capabilities. Cetacean's core concept is a Big Data solution that is not limited to a NoSQL (Not only SQL) approach. If I take the broader view of NoSQL, I feel it is about choice. I don't want to be restricted to accessing my data via any one paradigm (relational, document, key/value, object, etc.). I don't want to be limited to one distributed consistency model (real-time consistency, eventual consistency, etc.). I will add my own perspective to what I consider the continuing evolution of data management systems: I want the option to choose whether ACID (or parts of it) applies to my use case. I also want optional transactions in a data management system. I want what were previously considered mandates of RDBMSs, data caching systems, object databases, distributed data stores, etc. to be configurable where needed or appropriate. Allow me to extend the usage of my data and the longevity of my data management system.

Notice I said data management system (DMS). The reason is that the term database does not traditionally encompass the scope of multi-paradigm data caching, hybrid network/application layered transactions, and on-disk data management, to name a few.

We will discuss the core low-level concepts of applying model-view-controller to a data system, and a configurable transactional model that can participate in both always-consistent and eventually-consistent paradigms. For those interested, there will be lots of code to review after we cover the concepts and see the system run.

This project is ongoing. The current goal is to implement a robust data infrastructure that can support massive CRUD (create, read, update, delete) use cases. Once that is in place, the focus will shift to analytics, including real-time situational awareness.


Traffic Data Analytics - Big Data Technologies Open Up New Paths
Aparajeeta Das / Third Eye Consulting Services & Solutions
Mike Long and Adrian Wilson / Microsoft

Governments are getting savvier about the way they spend taxpayers' money. For every infrastructure-related problem, they now look at many alternatives before going straight to developing new infrastructure. Big Data technologies have given them the power to seek such alternatives by mining the tons of data they routinely collect. Traffic data is one such Big Data set that has been collected and assimilated over the years.

This demo will walk through how we harnessed Hadoop technologies to crunch through traffic data and develop both real-time and what-if scenarios of traffic conditions for any selected date and time.
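At its core, such a Hadoop job is a map/reduce aggregation over raw traffic records. A minimal Python simulation of the pattern follows; the record format (hour, sensor id, vehicle count) and the sensor names are hypothetical, not the demo's actual schema:

```python
from collections import defaultdict

# Hypothetical raw traffic records: (hour, sensor_id, vehicle_count).
records = [
    ("08", "I5-N-12", 120), ("08", "I5-N-12", 95),
    ("09", "I5-N-12", 200), ("08", "SR520-7", 80),
]

# Map phase: emit (key, value) pairs keyed by (hour, sensor).
mapped = [((hour, sensor), count) for hour, sensor, count in records]

# Shuffle + reduce phase: sum the counts per key, as a Hadoop reducer would.
totals = defaultdict(int)
for key, count in mapped:
    totals[key] += count

print(dict(totals))
```

On a cluster, the same map and reduce functions run in parallel over partitions of years of traffic data, which is what makes per-hour, per-sensor rollups for arbitrary dates feasible.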
