On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts

Debowski, Lukasz Jerzy

The article presents a new interpretation for Zipf's law in natural language which relies on two areas of information theory. We reformulate the problem of grammar-based compression and investigate properties of strongly nonergodic stationary processes. The motivation for the joint discussion is to prove a proposition with a simple informal statement: If an $n$-letter long text describes $n^\beta$ independent facts in a random but consistent way, then the text contains at least $n^\beta/\log n$ different words. In the formal statement, two specific postulates are adopted. Firstly, the words are understood as the nonterminal symbols of the shortest grammar-based encoding of the text. Secondly, the texts are assumed to be emitted by a nonergodic source, with the described facts being binary IID variables that are asymptotically predictable in a shift-invariant way. The proof of the formal proposition applies several new tools. These are: a construction of universal grammar-based codes for which the differences of code lengths can be bounded easily, ergodic decomposition theorems for mutual information between the past and future of a stationary process, and a lemma that bounds differences of a sublinear function. The linguistic relevance of the presented modeling assumptions, theorems, definitions, and examples is discussed in parallel. While searching for concrete processes to which our proposition can be applied, we introduce several instances of strongly nonergodic processes. In particular, we define the subclass of accessible description processes, which formalizes the notion of texts that describe facts in a self-contained way.
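To make the first postulate concrete, the following is a minimal sketch of counting "words" as the nonterminal symbols of a grammar that encodes a text. It is an illustration only: the abstract refers to the shortest grammar-based encoding, whereas this sketch swaps in a greedy pair-replacement heuristic (in the style of Re-Pair), which generally does not yield the shortest grammar. The function names `repair_grammar` and `expand` are hypothetical and do not come from the article.

```python
from collections import Counter


def repair_grammar(text):
    """Build a grammar by greedy pair replacement (a heuristic, not the shortest grammar).

    Returns the start sequence and a dict mapping each nonterminal
    to the pair of symbols it rewrites to.
    """
    seq = list(text)   # terminals are single characters
    rules = {}         # nonterminal -> (left symbol, right symbol)
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, so stop
        nonterminal = ("N", next_id)  # tuples cannot collide with characters
        next_id += 1
        rules[nonterminal] = pair
        # Replace non-overlapping occurrences of the pair, left to right.
        out = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Recursively expand a symbol back into the string it derives."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


if __name__ == "__main__":
    text = "abcabcabc"
    start, rules = repair_grammar(text)
    restored = "".join(expand(s, rules) for s in start)
    print("round trip ok:", restored == text)
    print("vocabulary size (nonterminals):", len(rules))
```

Under this reading, the vocabulary size of a text is `len(rules)`; the proposition in the abstract concerns how that quantity grows with the text length $n$ when the grammar is the shortest one, which this heuristic only approximates.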

Additional Metadata
Keywords	Zipf's law, universal source coding, grammar-based codes, smallest grammar problem, ergodic decomposition, excess entropy, nonergodic processes, language models, sublinear functions, asymptotically mean stationary processes, variable-length coding
MSC	Source coding (msc 94A29)
Project	Learning when all models are wrong
Note	Submitted to IEEE Transactions on Information Theory. In open review.
Organisation	Quantum Computing and Advanced System Research
Citation	Debowski, L. J. (2008). On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts.

Additional Files
13406B.pdf (Author Manuscript, 330kb)