XLDB-2012 Demonstrations
Array Databases in Action: Rasdaman in the Earth Sciences
Peter Baumann / Jacobs University
A large class of scientific "Big Data" is naturally represented as
arrays of some dimension, extent, and cell ("pixel", "voxel", etc.)
type. While traditionally neglected by the database community, arrays
have recently been gaining attention, and conceptual array models,
query languages, and system implementations are appearing.
The rasdaman ("raster data manager") array analytics database is a
fully-fledged tool augmenting standard relational databases with massive
array storage and query capabilities. It is in use at a number of
superscale data centers. In the transatlantic EarthServer initiative,
NASA and ESA are joining forces to establish ad-hoc processing and filtering
capabilities on Petascale land, ocean, air, and planetary archives based
on rasdaman.
We plan to briefly introduce the rasdaman array model and query
language to establish the basics for live demo applications in several
Earth sciences, including 1-D through 5-D examples taken from the fields
of sensor data, remote sensing, and climate modeling. The demo shows
spatiotemporal subsetting, aggregation, and massive ad-hoc data
restructuring, including array joins. Rasdaman has strongly influenced
geo service standardization; hence, demoing these standards in
action, on top of a database stack and on real-life data center products,
is an integral part of the presentation.
Finally, we will give an outlook on next steps to come.
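To give a flavor of the array operations mentioned above, the following pure-Python sketch mimics spatiotemporal subsetting, spatial aggregation, and a cell-wise array join on a small in-memory datacube. This is an illustration only, not rasql: the cube dimensions, the window bounds, and the NDVI-style band ratio are all assumptions made for the example.

```python
import random

random.seed(0)
T, H, W = 12, 18, 36  # assumed toy dimensions: time x lat x lon
cube = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(T)]

# Spatiotemporal subsetting: time slices 3..5 over a lat/lon window,
# analogous in spirit to a rasql trim expression such as cube[3:5, 4:7, 10:19].
subset = [[row[10:20] for row in cube[t][4:8]] for t in range(3, 6)]

# Aggregation: mean over the spatial axes, one value per time slice
# (in rasql this role is played by a condenser such as avg_cells).
monthly_mean = [
    sum(v for row in slice2d for v in row) / (len(slice2d) * len(slice2d[0]))
    for slice2d in subset
]

# Cell-wise "array join" of two co-registered 2-D grids, e.g. an NDVI-style
# band ratio (band names purely illustrative).
nir = [[random.random() for _ in range(W)] for _ in range(H)]
red = [[random.random() for _ in range(W)] for _ in range(H)]
ndvi = [[(n - r) / (n + r + 1e-9) for n, r in zip(nr, rr)]
        for nr, rr in zip(nir, red)]
```

In a real array DBMS these operations run server-side over tiled, disk-resident arrays rather than nested Python lists; the sketch only shows the shape of the computation.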
EXTASCID: An Extensible System for the Analysis of Scientific Data
Yu Cheng and Florin Rusu / University of California at Merced
Data generated through scientific experiments and measurements has
unique characteristics that require special processing techniques. The
size of the data is extremely large, with terabytes of new data
generated daily. The structure of the data is diverse: it is often not
relational, but rather consists of multi-dimensional arrays with varying
degrees of sparsity. Analyzing scientific data requires complex processing
that can seldom be expressed in a declarative language such as SQL.
In this demonstration, we introduce EXTASCID, a parallel system targeted
at efficient analysis of large-scale scientific data. EXTASCID provides
a flexible storage manager that supports natively both relational data
as well as high-dimensional arrays. Any processing task is specified
using a generic interface that extends the standard UDA mechanism. Given
a task and the corresponding input data, EXTASCID manages the execution
in a highly parallel environment with data partitioned across multiple
processing nodes, in a manner completely transparent to the user. We evaluate the
expressiveness of the task specification interface and the performance
of our system by implementing the SS-DB benchmark, a standard benchmark
for scientific data analysis. In terms of expressiveness, EXTASCID
allows the scientific user to focus on the core logic of the processing,
executed entirely by the system, near the data, without the need to
handle complicated operations at the application level. Our experimental
results on the SS-DB benchmark confirm the remarkable performance of
EXTASCID when handling large amounts of data.
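The standard UDA (user-defined aggregate) mechanism that EXTASCID's task interface extends follows an initialize/accumulate/merge/terminate pattern. The sketch below shows that pattern in plain Python; the class and method names are assumptions for illustration, not the actual EXTASCID API.

```python
class AverageUDA:
    """Running mean over a partitioned input, mergeable across nodes."""

    def __init__(self):
        # Initialize: per-partition partial state.
        self.total = 0.0
        self.count = 0

    def accumulate(self, value):
        # Called once per input item on each processing node.
        self.total += value
        self.count += 1

    def merge(self, other):
        # Combine partial states computed on different partitions.
        self.total += other.total
        self.count += other.count

    def terminate(self):
        # Produce the final result after all partitions are merged.
        return self.total / self.count if self.count else None


# Usage: simulate two data partitions processed independently, then merged.
left, right = AverageUDA(), AverageUDA()
for v in [1.0, 2.0, 3.0]:
    left.accumulate(v)
for v in [4.0, 5.0]:
    right.accumulate(v)
left.merge(right)
result = left.terminate()  # mean of 1..5, i.e. 3.0
```

Because merge is associative, a system can evaluate such an aggregate near the data on each node and combine only the small partial states, which is what makes the pattern attractive for parallel scientific processing.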
In this demonstration, we present the main characteristics of the system
by showing how the SS-DB benchmark is implemented in EXTASCID. We use
two configurations: small (100GB) and normal (1TB). The small
configuration is demonstrated on a single computing node (our laptop),
while for the normal configuration we connect remotely to our computing
cluster at UC Merced. An XLDB participant attending our demo will
experience how complex analysis tasks on massive datasets (raw and
derived) can be executed in a matter of seconds. The attendees will be
able to run any query in the benchmark, including cooking and grouping,
with both pre-defined parameters and parameters of their own choosing.
Scientific Data Processing Using SciQL
Ying Zhang and Martin Kersten / CWI
Scientific discoveries increasingly rely on the ability to
efficiently grind massive amounts of experimental data using database
technologies. To bridge the gap between the needs of the Data-Intensive
Research fields and the current DBMS technologies, we are developing
SciQL (pronounced as ‘cycle’), an SQL-based query language for
scientific applications with both tables and arrays as first class
citizens. It provides a seamless symbiosis of array-, set- and sequence-
interpretations. A key innovation is the extension of value-based
grouping of SQL:2003 with structural grouping, i.e., fixed-sized and
unbounded groups based on explicit relationships between element
positions. This leads to a generalisation of window-based query
processing with wide applicability in science domains. In this demo, we
show the main features of SciQL through use cases in remote sensing image
processing.
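The idea of structural grouping, grouping array cells by their positions rather than their values, can be sketched as a fixed-size sliding window over a 1-D array. This pure-Python stand-in is not SciQL syntax; the function name and window width are assumptions for the example.

```python
def sliding_mean(cells, width):
    """Mean over each fixed-size group of neighbouring positions.

    Each group is defined structurally (by index adjacency), not by
    the cell values themselves, mirroring fixed-sized structural
    grouping as a generalisation of SQL window functions.
    """
    return [
        sum(cells[i:i + width]) / width
        for i in range(len(cells) - width + 1)
    ]


signal = [2.0, 4.0, 6.0, 8.0, 10.0]
smoothed = sliding_mean(signal, 3)  # [4.0, 6.0, 8.0]
```

In image processing the same idea extends to 2-D neighbourhoods, e.g. smoothing a remote sensing image by averaging each pixel's fixed-size surrounding block.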
Evolutionary Big Data Management System Built from Experience and Pain
Iran Hutchinson / InterSystems and GlobalsDB
I propose to discuss the drivers for a project called Cetacean and
demonstrate some of its capabilities. Cetacean's core concept is a
Big Data solution that is not limited to any single NoSQL (Not Only
SQL) approach. If I look at the broader view of NoSQL, I feel it is
about choice. I don’t want to be restricted to accessing my data via any
one paradigm (Relational, Document, Key/Value, Object, etc.). I don’t
want to be limited to one distributed consistency model (Real-time
consistency, Eventual, etc.). I will add my own perspective to what I
consider the continuing evolution of data management systems; I want the
option to choose if ACID (or parts of it) applies to my use case. I want
optional transactions as well in a data management system. I want what
were previously assumed to be mandates of RDBMSs, data caching systems,
object databases, distributed data stores, etc. to be configurable where
critical, needed, or appropriate. Allow me to extend the usage
of my data and the longevity of my data management system.
Notice I said data management system (DMS). The reason is that the term
database does not traditionally encompass the scope of multi-paradigm
data caching, hybrid network / application layered transactions, and
on-disk data management to name a few.
We will discuss the core low-level concepts of applying
model-view-controller to a data system, and a configurable transactional
model that can participate in both always-consistent and
eventually-consistent paradigms. For those interested, there will be
lots of code to review after we cover the concepts and see the system run.
This project is ongoing. The current goal is to implement a robust data
infrastructure that can support massive CRUD (create, read, update,
delete) use cases. Once that is in place, analytics, including real-time
situational awareness, will be the focus.
Traffic Data Analytics - Big Data Technologies Open Up New Paths
Aparajeeta Das / Third Eye Consulting Services & Solutions
Mike Long and Adrian Wilson / Microsoft
Governments are getting savvier about the way they spend taxpayers'
money. For every infrastructure-related problem, they now look at
alternatives rather than going straight for building new
infrastructure. Big Data technologies have given them the power to seek
these alternatives by mining the tons of data they routinely
collect. Traffic data is one such Big Data set that has
been collected and assimilated over the years.
This demo will walk through how we harnessed Hadoop technologies to
crunch through traffic data and develop both real-time and what-if
scenarios of traffic conditions for any selected date and time.
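The core of such Hadoop-style crunching is a map/shuffle/reduce aggregation over traffic records. The toy sketch below shows that shape in plain Python; the record format, field names, and hour-level grouping key are illustrative assumptions, not the actual pipeline.

```python
from collections import defaultdict

# Assumed record format: (ISO timestamp, sensor id, vehicle count).
records = [
    ("2012-09-10T08:00", "sensor-A", 120),
    ("2012-09-10T08:00", "sensor-B", 80),
    ("2012-09-10T09:00", "sensor-A", 95),
]

# Map phase: emit (hour, count) key-value pairs; ts[:13] keeps "YYYY-MM-DDTHH".
mapped = [(ts[:13], count) for ts, _sensor, count in records]

# Shuffle/reduce phase: group by key and sum counts per hour.
totals = defaultdict(int)
for hour, count in mapped:
    totals[hour] += count
# totals now holds vehicle counts per hour across all sensors.
```

On a cluster, Hadoop runs the map phase in parallel over data blocks and performs the shuffle and reduce across nodes; the per-key aggregation logic stays this simple.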