Development of Best Practices for Large-scale Data Management Infrastructure

Stadtmüller, S.; Mühleisen, Hannes; Bizer, C.; Kersten, Martin; de Rijke, Arjen; Groffen, Fabian; Zhang, Ying; Ladwig, G.; Harth, A.; Trampus, M

S. Stadtmüller, H.F. Mühleisen (Hannes), C. Bizer, M.L. Kersten (Martin), J.A. de Rijke (Arjen), F.E. Groffen (Fabian), Y. Zhang (Ying), G. Ladwig, A. Harth and M. Trampus

2012-09-01

The amount of data available for processing is constantly increasing and becoming more diverse. We collect our experiences in deploying large-scale data management tools on local-area clusters or cloud infrastructures, and provide guidance on using these computing and storage infrastructures. In particular, we describe Apache Hadoop, one of the most widely used software libraries for performing large-scale data analysis tasks in parallel on clusters of computers, and provide guidance on how to achieve optimal execution time when analysing large-scale data. Furthermore, we report on our experiences with projects that provide valuable insights into the deployment and use of large-scale data management tools: the Web Data Commons project, for which we extracted all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus currently available to the public; SciLens, a local-area cluster machine built to facilitate research on data management issues for data-intensive scientific applications; and CumulusRDF, an RDF store on cloud-based architectures, for which we investigate the feasibility of using a distributed nested key/value store as the underlying storage component of a Linked Data server that serves large quantities of Linked Data. Finally, we describe the News Aggregator Pipeline, a piece of software that acquires high-volume textual streams, processes them into a form suitable for further analysis, and distributes the data.

Additional Metadata
Keywords	Hadoop, MapReduce, large-scale data analysis, Web Data Commons, data extraction, SciLens, Scientific DBMS, MonetDB, local-area cluster, RDF, triple store, nested key/value store, news aggregation, high volume data streams
THEME	Information (theme 2)
Series	PlanetData Deliverables
Project	PlanetData
Organisation	Database Architectures
Citation	Stadtmüller, S., Mühleisen, H., Bizer, C., Kersten, M., de Rijke, A., Groffen, F., … Trampus, M. (2012). Development of Best Practices for Large-scale Data Management Infrastructure. PlanetData Deliverables.
