Multi-Hypothesis Parsing of Tabular Data in Comma-Separated Values (CSV) Files

Doehmen, Till; Hoekstra, Rinke

Tabular data on the web comes in various formats and shapes. Preparing such data for analysis and integration requires manual steps that go beyond simply parsing the data. This preparation includes configuring the parser correctly, removing meaningless rows, casting data types, and reshaping the table structure. The goal of this thesis is to develop a robust and modular system that automatically transforms messy CSV data sources into a tidy tabular data structure. The highly diverse corpus of CSV files from the UK open data hub serves as the basis for evaluating the system.
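The preparation steps named in the abstract (parser configuration, row removal, type casting) can be illustrated with a minimal Python sketch. It is not the system developed in the thesis: the function names (parse_with, score, best_delimiter, cast, tidy), the candidate delimiter list, and the row-width consistency score are all assumptions chosen for illustration. The sketch parses the input once per delimiter hypothesis, keeps the hypothesis whose rows agree most on width, drops rows that do not fit that width, and casts numeric cells.

```python
import csv
import io
from collections import Counter


def parse_with(text, delimiter):
    """Parse the text under one delimiter hypothesis and return its rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def score(rows):
    """Share of non-empty rows that have the most common width.

    Hypothetical scoring rule: a single-column result scores 0, because it
    usually means the delimiter never occurred in the input.
    """
    widths = [len(row) for row in rows if row]
    if not widths:
        return 0.0
    width, count = Counter(widths).most_common(1)[0]
    return count / len(widths) if width > 1 else 0.0


def best_delimiter(text, candidates=(",", ";", "\t", "|")):
    """Return the candidate delimiter whose parse scores highest."""
    return max(candidates, key=lambda d: score(parse_with(text, d)))


def cast(cell):
    """Cast a cell to int or float where possible, else keep the string."""
    for convert in (int, float):
        try:
            return convert(cell)
        except ValueError:
            pass
    return cell


def tidy(text):
    """Pick a delimiter, drop rows of unusual width, and cast cell types."""
    rows = [row for row in parse_with(text, best_delimiter(text)) if row]
    width = Counter(len(row) for row in rows).most_common(1)[0][0]
    table = [row for row in rows if len(row) == width]
    # Assumption: the first row of the expected width is the header.
    header, body = table[0], table[1:]
    return header, [[cast(cell) for cell in row] for row in body]


sample = "Report generated 2016\nname;age\nalice;30\nbob;25\n"
header, body = tidy(sample)
```

For the sample input, the leading title line is discarded because it does not match the dominant row width under the semicolon hypothesis. Real files need considerably richer hypotheses (quoting, encodings, multi-row headers), which this sketch does not attempt.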

Additional Metadata
THEME	Information (theme 2)
Thesis Advisor	P.A. Boncz (Peter), H.F. Mühleisen (Hannes)
Organisation	Database Architectures
Citation	Doehmen, T., & Hoekstra, R. (2016, August). Multi-Hypothesis Parsing of Tabular Data in Comma-Separated Values (CSV) Files.

Free Full Text (Final Version, 3 MB)
