Bridging the gap between Big Genome Data Analysis and Database Management Systems

Cijvat, Robin

The bioinformatics field has encountered a data deluge in recent years, due to the increasing speed and decreasing cost of DNA sequencing technology. Today, sequencing the DNA of a single genome takes only about a week, and it can result in up to a terabyte of data. The sequencing data are usually stored in files, and specialized tools have been designed to analyze and manage them. Despite these tools, bioinformaticians still face many data management hurdles when analyzing these files, which often leads to excessively time-consuming tasks. In this thesis, we accurately map the needs of bioinformaticians by defining a set of use cases that reflect the everyday analyses applied to genetic data. We propose an approach based on a modern DBMS to analyze and manage genetic data file repositories. We identify the pros and cons of this method compared to the traditional file-based approach. Additionally, we experimented with a novel in-situ approach, where the DBMS applies Just-In-Time ETL (Extract-Transform-Load) on the original files instead of loading all data from these files up front. A major advantage of this approach is that it greatly reduces the data-to-query time, since not all data are loaded into the DBMS during initialization. Other advantages include lower storage requirements and reduced data duplication. With this project, we have taken the first step towards the adaptation of state-of-the-art database technology to accelerate genetic data analytics. The preliminary results presented in this thesis are highly promising, and they open up a plethora of new research opportunities.

Additional Metadata
Keywords	MonetDB, Database, Genome data analytics, DNA
Theme	Information (theme 2), Life Sciences (theme 5)
Publisher	Utrecht University - Department of Information and Computing Sciences
Thesis Advisor	Y. Zhang (Ying)
Series	UU-CS
Organisation	Database Architectures
Citation	Cijvat, R. (2014, February). Bridging the gap between Big Genome Data Analysis and Database Management Systems. UU-CS. Utrecht University - Department of Information and Computing Sciences.

Free Full Text (Final Version, 3 MB)
