Hungarian Generative Diachronic Syntax


Language Records

This page contains all codices, minor texts, and Bible translations from the Old and Middle Hungarian periods that were digitized and annotated within the projects. A list of the texts in the corpus, with the abbreviations used in the corpus query tool, is available here.

To display all special characters, the Junicode font package must be installed.

For each text, we provide a short description of its source, the processing steps applied to it, and its number of tokens. Remarks on spelling, locus markers, bracketing, punctuation, etc. are also given where a text deviates from the general rules in the corpus description.

For each text, the original orthographic form is provided in plain text and in PDF format. If the original text material is already available on the web, we link to it instead of creating a PDF version. If a normalized version of the text exists, it is also provided here in plain text format. A TSV file containing every text processing level and the metadata is also available for each text. In it, blank lines mark sentence boundaries, and the tab-separated columns contain the following pieces of information:

The original orthographic form of each text is available; however, only a smaller subcorpus has also been normalized and morphologically annotated.

Normalized versions of the following texts are available:

Where morphological annotation exists, it follows by default the rules given in the corpus description and in the list of morphological codes. The morphologically annotated versions of some codices are also available in the CoNLL-U format used within the Universal Dependencies and Morphology framework, in which each word line contains the annotation of one word in 10 fields separated by single tab characters, and blank lines mark sentence boundaries.

Word lines contain the following fields:

  1. ID: Word index, integer starting at 1 for each new sentence.
  2. FORM: Word form or punctuation symbol in its original orthographic form.
  3. LEMMA: Lemma or stem of word form in its normalized form.
  4. UPOSTAG: Universal part-of-speech tag following the Universal Dependencies and Morphology annotation scheme.
  5. XPOSTAG: The original morphological analysis.
  6. FEATS: List of morphological features from the Universal Dependencies and Morphology feature inventory.
  7. HEAD: Head of the current token following the Universal Dependencies and Morphology annotation scheme; currently empty.
  8. DEPREL: Dependency relation to the HEAD, following the Universal Dependencies and Morphology annotation scheme; currently empty.
  9. DEPS: List of secondary dependencies, following the Universal Dependencies and Morphology annotation scheme; currently empty.
  10. MISC: Any other annotation; currently empty.
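
The layout described above can be read with a few lines of code. The following is a minimal Python sketch, not an official tool of the project: the function name `read_sentences` and the `SAMPLE` data are hypothetical, the sample values are placeholders rather than real corpus tokens, and the use of `_` for empty fields is an assumption taken from the general CoNLL-U convention. Lines starting with `#` are skipped because the general CoNLL-U format allows comment lines; the files here may not contain any.

```python
from typing import Dict, Iterable, Iterator, List

# Field names in column order, as in the list above.
FIELDS = ["ID", "FORM", "LEMMA", "UPOSTAG", "XPOSTAG",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of {field name: value} dicts."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line marks a sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment line (assumption: general CoNLL-U convention).
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected 10 fields, got {len(values)}: {line!r}")
        sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        # Last sentence when the input does not end with a blank line.
        yield sentence


# Hypothetical placeholder data; "_" stands for an empty field (assumption).
SAMPLE = (
    "1\tform1\tlemma1\tNOUN\txpos1\t_\t_\t_\t_\t_\n"
    "2\t.\t.\tPUNCT\txpos2\t_\t_\t_\t_\t_\n"
    "\n"
    "1\tform2\tlemma2\tVERB\txpos3\t_\t_\t_\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE.splitlines()):
        print([(word["FORM"], word["UPOSTAG"]) for word in sent])
```

To process a downloaded file, an open file object can be passed in place of `SAMPLE.splitlines()`, for example `read_sentences(open(path, encoding="utf-8"))`, where `path` is whatever local name the file was saved under.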

Morphologically annotated versions of the following texts are also available:

The original orthographic version contains 3,224,515 tokens with punctuation marks and 2,751,869 without. The normalized subcorpus contains 1,305,687 tokens with punctuation marks and 1,049,019 without. The morphologically annotated subcorpus contains 285,070 tokens with punctuation marks and 228,851 without.