Hungarian Generative Diachronic Syntax

About the corpus

  1. Aims
  2. The structure of the corpus
  3. The original orthographic form
  4. Normalization
  5. Morphological analysis
  6. Metadata

1. Aims

Our aim was to construct an annotated corpus comprising all extant texts from the Old Hungarian period (896–1526), which could provide answers to linguistically relevant problems.

The corpus includes only documents containing coherent texts in Hungarian. We did not include so-called sporadic records, documents containing isolated occurrences of Hungarian words or names. Thus the corpus contains 47 codices, 24 minor texts and 244 letters, totalling more than 2.2 million tokens.

2. The structure of the corpus

The structure of the corpus, that is, the annotation levels assigned to each token, evolved in parallel with the text processing steps:

(1) scanned codex
      ↓ automatic OCR
(2) raw OCR output
      ↓ manual correction
(3) original orthographic form
      ↓ manual normalization
(4) normalized form
      ↓ automatic morphological analysis
(5) lemmatized and morphologically analyzed form
      ↓ semi-automatic disambiguation
(6) disambiguated form

In order to have a linguistic database which can provide a fertile ground for theoretical investigations, the relevant information needs to be specified in a computationally interpretable and retrievable way. Sophisticated, linguistically relevant queries often target information from various levels of the grammar. For all these levels to be available, the data obtained from each level of text processing are stored in parallel in the database. That is, for each token, the corpus provides the following pieces of linguistic information:

  • original orthographic form = (3)
  • normalized form = (4)
  • lemma – based on disambiguated morphological analysis (6)
  • analysis – based on disambiguated morphological analysis (6)

The corpus contains the original orthographic form of every Old Hungarian text. Some of these have also been normalized, and some of the normalized texts have been morphologically analyzed and morphosyntactically disambiguated.
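
The per-token layout described above can be pictured with a small record type. This is only a minimal sketch: the class name, the field names and the `deepest_level` helper are hypothetical and do not reflect the actual database schema, and the lemma value is an assumed one. The forms are taken from an example later in this description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    """One corpus token with its parallel annotation levels (hypothetical schema)."""

    original: str                     # (3) original orthographic form, always present
    normalized: Optional[str] = None  # (4) normalized form, if the text was normalized
    lemma: Optional[str] = None       # from (6), if the text was analyzed
    analysis: Optional[str] = None    # from (6), if the text was analyzed

    def deepest_level(self) -> int:
        """Return the deepest processing level stored for this token."""
        if self.lemma is not None and self.analysis is not None:
            return 6
        if self.normalized is not None:
            return 4
        return 3


# The lemma below is an assumed value used only for illustration.
raw_only = Token(original="haborosage")
normalized = Token(original="haborosage", normalized="háborúság-e")
analyzed = Token("haborosage", "háborúság-e", lemma="háborúság", analysis="N.QPtl")
```

Keeping every level optional except the original form mirrors the fact that only some texts have been normalized, and only some of those analyzed.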

3. The original orthographic form

Codices from the Old Hungarian period are hand-written texts, which already have transcribed editions. We opted for using editions as the basis of corpus compilation. In preparing texts in their original orthographic form we followed conventions set by the editor of the published version used by us, and not those in the hand-written codex. In other words, we did not aim to provide paleographically fully accurate documents. If the publisher did not distinguish, for instance, between the characters ſ and s, we accepted the linguist's decision and followed his or her practice. Cases when we did deviate from the publisher's version are marked accordingly.

In producing the texts in their original orthographic form we preserved the original punctuation, tokenization, and capitalization in proper names or sentence-initially. We have not reproduced the colouring, bolding or emphases from the original codices.

We do not use single (|) or double (||) vertical lines which were used to mark pagebreaks (in print editions), since we need not resort to such ad hoc measures, typical of typography.

3.1. Character encoding

On account of the advantages afforded by standardization the entire corpus is stored and represented with UTF-8 encoded standard Unicode characters. One of the advantages of Unicode is that accented forms can be dynamically composed, since basic characters and combining diacritical marks are represented by their own codes and can be freely combined. Combining diacritical marks can also be accumulated, so that most of the special Old Hungarian characters can be represented by standard Unicode characters. If a text element is encoded as a static precomposed form in the Unicode code charts, we use that form. In the absence of a precomposed form, we use a dynamically composed sequence of characters. For instance, we use the precomposed form é (U+00E9 latin small letter e with acute), instead of the dynamic composition of U+0065 latin small letter e and U+0301 combining acute accent.
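
The preference for precomposed forms corresponds closely to Unicode Normalization Form C (NFC). The sketch below assumes NFC as a stand-in for that policy; the corpus description does not say which tool or normalization form was actually used, and the function name is hypothetical.

```python
import unicodedata


def to_precomposed(text: str) -> str:
    """Compose base letters and combining marks wherever Unicode
    defines a precomposed character (NFC); leave other sequences as they are."""
    return unicodedata.normalize("NFC", text)


# e + combining acute accent collapses into the single code point U+00E9.
composed = to_precomposed("e\u0301")

# o + combining acute accent below has no precomposed form,
# so it remains a two-code-point sequence.
still_dynamic = to_precomposed("o\u0317")
```

With this approach a query for é matches regardless of how the character was typed during transcription.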

On the recommendation of the Mediaeval Unicode Font Initiative (MUFI) certain mediaeval characters have been accepted by the Unicode Consortium; from Version 5.1 onwards these can be found in the official code charts of the Unicode Standard. The Junicode font package is recommended for the proper displaying of these characters in various applications.

3.2. Regularization of spelling

To make queries over the entire corpus possible, consistency is indispensable. One of the great advantages of corpora is that they do not merely provide isolated examples, but list all instances of the searched term, so analyses based on frequency become available. This important property of corpora can be ensured only if one follows the principle of consistency, and always uses the same appropriate character for representing the same letter and different characters for representing different letters. There is still an Old Hungarian character, however, which is not present in Unicode charts: this is the so-called Hussite [tʃ], which resembles a small capital l, and which (following the practice of György Volf) is regularly replaced with č (U+010D latin small letter c with caron).

Modern editions of codices were published at different times, so their typographic techniques and technology also show variation. Therefore, for typographic reasons, they also vary in the displaying of one and the same character. In the corpus such contingencies are overcome with a uniform phonological representation: characters with the same semantics are represented with the same standard Unicode character in each codex.

3.2.1. Apostrophes

Typically, apostrophes had two uses in Old Hungarian codices. First, they were used to mark palatalization. We marked this use with U+02BC modifier letter apostrophe, or, in the case of certain characters, with an acute accent:

dʼ [ɟ/dʲ] U+0064 latin small letter d + U+02BC modifier letter apostrophe
lʼ [lʲ] U+006C latin small letter l + U+02BC modifier letter apostrophe
tʼ [c͡ç/tʲ] U+0074 latin small letter t + U+02BC modifier letter apostrophe
ǵ [ɟ] U+01F5 latin small letter g with acute
ń [ɲ] U+0144 latin small letter n with acute

The apostrophe occasionally appearing right next to the letter č is also rendered as U+02BC modifier letter apostrophe, because this use also marks some kind of palatalization: čʼ. Second, apostrophes were used to mark missing letters; for this use we employ U+0027 apostrophe, as in nap'a (nap-ja ‘day-poss.3sg’). Additionally, apostrophes were used in imperatives, as in akar’ (akar-j ‘want-imp.2sg’), even when the imperative suffix j was assimilated, as in ÿrgalmaz’ (irgalmaz-z ‘have.mercy-imp.2sg’).

Certain scribes double the apostrophe when marking palatalization or missing letters. In these cases we too double the apostrophe or use double accenting, e.g. boczanattʼʼa (bocsánat-ja ‘mercy-poss.3sg’), nag̋ (nagy ‘great’).

If the scribe adds an h after the letter to be palatalized, and attaches the apostrophe to the h, we also use U+02BC modifier letter apostrophe, as in dy̋chʼeseghes (dicsőséges ‘glorious’), kerezthʼenekkel (keresztyén-ek-kel ‘Christian-pl-ins’).

3.2.2. Punctuation

When having to choose from similar punctuation characters we always use those found in the Basic Latin code chart. For instance, the characters U+2013 en dash and U+2014 em dash are converted to U+002D hyphen-minus. (The latter is the variant found on English or Hungarian keyboards.)

Multiple punctuation signs are handled as one group, without spacing, and are separated from the words preceding and following them, e.g. fel fualkodnak ;¶ de (felfuvalkodnak, de ‘they puff up, but’). Multiple hyphens or tildes, which had an ornamental use, are replaced with a single hyphen or tilde, respectively.

Signs resembling the tilde are always rendered with U+007E tilde. Variants of the cross are uniformly rendered with U+002B plus sign. Decorated full stops are rendered with U+00A7 section sign. Every sign resembling the question mark was rendered with U+003F question mark. The signs = and : marking separation have been standardized as today's simple hyphen.
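
Two of the punctuation rules above are mechanical enough to sketch in code: mapping dashes to U+002D hyphen-minus and collapsing ornamental runs of hyphens or tildes. The function name is hypothetical and this is not the project's actual conversion script; rules that depend on visual resemblance (crosses, decorated full stops) still need a human decision and are left out.

```python
import re

# En dash and em dash are both mapped to the Basic Latin hyphen-minus.
DASH_TABLE = str.maketrans({"\u2013": "-", "\u2014": "-"})


def regularize_punctuation(text: str) -> str:
    """Apply the dash mapping, then collapse runs of hyphens or tildes."""
    text = text.translate(DASH_TABLE)
    text = re.sub(r"-{2,}", "-", text)   # ornamental multiple hyphens
    text = re.sub(r"~{2,}", "~", text)   # ornamental multiple tildes
    return text
```

Because the dash mapping runs first, a sequence of mixed dashes also ends up as a single hyphen.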

3.2.3. Special characters

The character o with an accent below, which was used to represent the vowel [ø], is always rendered as o̗ (U+006F latin small letter o + U+0317 combining acute accent below), irrespective of the orientation of the accent in the print edition (it could be either vertical or slanting to the left).

The sign ⁊ used in the Vienna Codex and other codices is the so-called Tironian et, which has a Unicode representation, and we render it accordingly. It is also used in the abbreviation ‘etc.’: ⁊c.

The most widely used abbreviation form is the macron above one or more characters, which is generally used for marking missing m or n letters. These are uniformly rendered with U+0304 combining macron, e.g. mōnon (monnon ‘both of them’), Am̄ (ámen ‘Amen’).

Dotless i-s are never used; we use dotted i-s throughout the corpus.

Ligatures are always taken apart, e.g. ĳ → ij.
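
The last two rules can be expressed as a simple character table. The sketch assumes that the ligature in question is the single code point U+0133 (latin small ligature ij) and its capital U+0132; the table and function names are hypothetical, and other ligatures would need their own entries.

```python
# Dotless i becomes dotted i; the ij ligature is split into two letters.
SPECIAL_TABLE = str.maketrans({
    "\u0131": "i",    # latin small letter dotless i
    "\u0133": "ij",   # latin small ligature ij (assumed to be the ligature meant)
    "\u0132": "IJ",   # latin capital ligature ij
})


def expand_special(text: str) -> str:
    """Replace dotless i and ij ligatures with plain letter sequences."""
    return text.translate(SPECIAL_TABLE)
```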

If the publisher of a codex distinguished z and ʒ, we did so too. For this we used the character U+0292 latin small letter ezh. (Its capitalized variant is U+01B7 latin capital letter ezh.) The abbreviation at the end of the word mindꝫ (minden ‘every’) is rendered with U+A76B latin small letter et. This character was a typical abbreviation in mediaeval texts, and its meaning varied from context to context; e.g. it stood for et in videlicet, in nam or omnem it stood for an m.

The abbreviation in syllables containing r is always rendered with U+0309 combining hook above, e.g. akảvan (akar-ván ‘want-part’), bảrabast (Barabás-t ‘Barabbas-acc’).

Scribes often used Latin abbreviations typical of their era; these are displayed in the following manner:

  • word-initial con-/com-: Ꝯ (U+A76E latin capital letter con) or ꝯ (U+A76F latin small letter con);
  • word-final -us: ꝰ (U+A770 modifier letter us);
  • the prefix pro-: Ꝓ (U+A752 latin capital letter p with flourish) or ꝓ (U+A753 latin small letter p with flourish);
  • the prefix per-/par-: Ꝑ (U+A750 latin capital letter p with stroke through descender) or ꝑ (U+A751 latin small letter p with stroke through descender);
  • the prefix pre-/pri-: P̄ (U+0050 latin capital letter p + U+0304 combining macron) or p̄ (U+0070 latin small letter p + U+0304 combining macron).

(The Junicode font package is required to properly display these characters.)

A rounded r resembling the number 2 is the so-called r rotunda, whose small and capital variants are already present in Unicode 5.1: Ꝛ (U+A75A latin capital letter r rotunda) and ꝛ (U+A75B latin small letter r rotunda).

The word-final abbreviation usually rendered as φ or 9 by publishers is displayed as ꝭ in the corpus (U+A76D latin small letter is). This sign could stand for any number of letters at the end of the word[1]. Scribes used it in Hungarian words as well, e.g. harꝭ (három ‘three’).

Scribes who used the so-called diacritic orthographic system (especially the second scribe of the Debrecen Codex and the third scribe of the Lobkowicz Codex) usually rendered the sounds [y] and [y:] with v and w, sometimes with an accent below. Various publishers have displayed these accents in various ways; we display them in a uniform manner, by means of U+0317 combining acute accent below, e.g. v̗uo̗lt (üvölt ‘scream’), keserw̗segghel (keserűség-gel ‘bitterness-ins’).
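
Since many of these characters look alike or do not display without a suitable font, it can help to inspect a form code point by code point. The helper below is an illustrative sketch (the function name is hypothetical); it relies only on Python's standard `unicodedata` module.

```python
import unicodedata
from typing import List, Tuple


def describe(text: str) -> List[Tuple[str, str]]:
    """List every code point in a string with its official Unicode name."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The word-final abbreviation sign and a letter with an accent below.
abbreviation = describe("\ua76d")
accented_w = describe("w\u0317")
```

Running `describe` on a token copied from a query result shows at once whether a letter is precomposed or dynamically composed.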

3.3. Bracketing

We always use the brackets from the basic Latin code chart:

(  U+0028 left parenthesis
)  U+0029 right parenthesis
<  U+003C less-than sign
>  U+003E greater-than sign
[  U+005B left square bracket
]  U+005D right square bracket
{  U+007B left curly bracket
}  U+007D right curly bracket

Blank pages in the codices are indicated with double square brackets: [[]].

4. Normalization

Since there was no uniform, conventionalized orthography during the Old Hungarian period, a normalization step had to be performed, in which the original form of a word is converted into a form that conforms to Modern Hungarian spelling. This makes texts easier to search and to read; normalized forms also serve as input for morphological analysis.

It happens sometimes that the normalized version cannot be reconstructed. In such cases the slot for the normalized version remains blank, and the code NOIDEA is entered in the Remark field.

4.1. Fundamental principles of normalization

During normalization we were guided by two principles. First, we preserved all words, affixes or morphological constructions that are extinct in Modern Hungarian. That is, we neither added nor omitted any morphemes. According to the second principle, we eliminated all forms that were accidental from a phonological or orthographic point of view. That is, we strove for a uniform orthography, mimicking Modern Hungarian orthography as closely as possible. This meant that a given word was always spelled in the same way.

Sometimes it was not easy to decide what counts as accidental, and what is a construction worthy of preservation. In these cases we relied on entries from the Historical-Etymological Dictionary of Hungarian[2] (HEDH). Forms that had their own entry (or sub-entry) in the HEDH counted as separate words, that is, they were not lumped together with Modern Hungarian words. For instance, the conjunctions kedig (the extinct form of ‘but’) and pedig (‘but’) count as two words, whereas mikoron and mikort are variants of mikor (‘when’), and not separate words.

4.2. Polysemy

Normalized forms consist of one word, and they represent the final result of a deliberation process. This form is not always unambiguous, however; there are several cases of irresolvable ambiguity:

  • Inflected verbs: Sometimes it cannot be decided whether they agree with the definite object or not (whether they are in the so-called definite conjugation or not), and even context is of no help. In such cases we use the indefinite version, which usually has a short vowel, and follow that vowel with U+00B4 acute accent. At the level of morphological analysis we mark the undecidedness of definiteness, e.g.
    kic                  hallottac
    ki-k           nem   hall-ott-a´k
    N:Pro:Rel.Pl   Adv   V.Past.P3.Def?
    who-pl         not   hear-prf-3pl
    ‘who did not hear (it)’
  • The suffix -i: In many cases it is difficult or impossible to decide whether it is the 3rd person singular form of the possessive suffix, or whether it marks the plurality of the possessum. For instance the form ÿgeretÿth can be normalized either as ígéret-é-t (‘promise-poss.3sg-acc’), or as ígéret-e-i-t (‘promise-poss.3sg-pl-acc’). If neither context nor agreement was of help, we selected the singular form and noted in the coding of the morphological analysis that the form contains -i, e.g.
    vrunknak            edes    ÿgeretÿth
    Ur-unk-nak          édes    ígéret-é-t
    N:P.PxP1.Dat_gen    Adj     N.PxS3=i.Acc
    Lord-poss.1pl-dat   sweet   promise-poss.3sg-acc
    ‘sweet promise of our Lord’
  • When the scribe used inessive and illative suffixes interchangeably: In many cases the scribe wrote -bA (‘ill’), when -bAn (‘ine’) would have been more appropriate for Modern Hungarian intuitions (or conversely). In such cases the normalized form and morphological analysis encode what the scribe originally wrote; in the Remark field we entered a code which refers to the Modern Hungarian form, e.g.
    Mi   ägänk             ki    wäg       mēnëgbe
    mi   atyá-nk           ki    vagy      menny-ek-be
                                           MORFO{INE}
    we   Father-poss.1pl   who   be.p2sg   heaven-pl-ill
    ‘our Father in heaven’

4.3. Tokenization

Since our aim was to obtain a transcript conforming to modern orthography, we separated words that appear to be incorrectly spelled as one word, and marked that the original consisted of one word: the end of the first word and the beginning of the second word are each marked with two equals signs, e.g.

de    säbädicz==    ==mk   mikët     a     gonostwl
de    szabadít-s    meg    mink-et   a     gonosz-tól
but   deliver-imp   prt    we-acc    the   evil-abl
‘but deliver us from evil’
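
The split markers make it possible to recover the word as the scribe wrote it. The sketch below assumes that the example above is stored as separate tokens of the form `säbädicz==` and `==mk`; this storage format is our reading of the example, and the function name is hypothetical.

```python
from typing import List

SPLIT_MARK = "=="


def rejoin_original(tokens: List[str]) -> List[str]:
    """Merge tokens that were one written word in the original,
    i.e. a token ending in '==' followed by a token starting with '=='."""
    words: List[str] = []
    for token in tokens:
        if token.startswith(SPLIT_MARK) and words and words[-1].endswith(SPLIT_MARK):
            # Drop both markers and glue the halves together.
            words[-1] = words[-1][: -len(SPLIT_MARK)] + token[len(SPLIT_MARK):]
        else:
            words.append(token)
    return words


written_words = rejoin_original(["de", "säbädicz==", "==mk", "mikët", "a", "gonostwl"])
```

Chains of more than two split tokens are handled as well, because the merged word keeps its trailing marker until the last part has been attached.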

Conversely, if two words are spelled as one word in Modern Hungarian, and they were written separately in the codices, the normalized form was spelled as one word, and the original variant contained a space; e.g.

harmalnapon    halottay bool      felthamata
harmadnap-on   halott-a-i-ból     fel-támad-a
thirdday-sup   dead-poss-pl-ela   up-rise-pst.3sg
‘on the third day he is risen from the dead’

Original line breaks were also marked, with a double @. If the line-ending word was followed by a hyphen, it was also preserved, e.g.

egmen-@@denic         atʼtʼafiat                   zorongatʼtʼa
egymindenik     ő     atyjafiá-t             nem   szorongat-ja
either          he    brother.poss.3sg-acc   not   thrust-def.3sg
‘neither shall one thrust another’
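
Because line breaks are encoded inline, both the unbroken word and the original line division can be derived from the same string. The following sketch uses hypothetical helper names and the form `egmen-@@denic` from the example above.

```python
import re
from typing import List

LINE_BREAK = "@@"


def join_line_break(form: str) -> str:
    """Remove line-break markers together with a preceding line-final hyphen."""
    return re.sub(r"-?" + re.escape(LINE_BREAK), "", form)


def split_at_line_break(form: str) -> List[str]:
    """Return the pieces of the form as they stood on separate lines."""
    return form.split(LINE_BREAK)


unbroken = join_line_break("egmen-@@denic")
pieces = split_at_line_break("egmen-@@denic")
```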

If a line break separated a word form which in Modern Hungarian would be spelled as two words anyway, we write the two words separately, but we also use a hyphen at the end of the first word, e.g.

wrthol-   angyal
Úr-tól    angyal
God-abl   angel

Only the formatting from the original codices is preserved. That is, if the publisher of a printed edition did not preserve original line breaks but made new ones which were born of typographic constraints, we did not preserve them.

4.4. Sentence segmentation and punctuation

Texts are broken down into sentences during the normalization process. In dubious cases we do not mark sentence boundaries; that is, sentences are preferably longer rather than shorter. The basic rule of sentence segmentation is a one-to-one correspondence between finite verbs and sentences. Embedded subordinate clauses are separated from the matrix with a comma, so that these too appear as separate sentential units.

Original orthographic forms have preserved the original punctuation of their codices (if these contained punctuation signs at all). Normalized versions use punctuation similar to Modern Hungarian. If the original text lacked punctuation signs, and, according to modern intuitions, these were necessary, the normalized form included punctuation signs according to current orthographic rules. The signs we used are .,?!:

Punctuation signs are separated from the word preceding them, i.e. they count as separate tokens.
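
Treating punctuation signs as separate tokens can be sketched with a single regular expression over the five signs listed above. The function name and the sample sentence (put together from words used in this description) are illustrative assumptions, not project code or a corpus citation.

```python
import re
from typing import List

# Either one of the five punctuation signs, or a run of anything else
# that is neither whitespace nor one of those signs.
TOKEN = re.compile(r"[.,?!:]|[^\s.,?!:]+")


def tokenize_normalized(sentence: str) -> List[str]:
    """Split a normalized sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


tokens = tokenize_normalized("üdvöz légy, szentséges Mária!")
```

Morpheme-boundary hyphens are not in the punctuation set, so a form such as szabadít-s stays a single token.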

4.5. Capitalization, proper names

In the normalized form sentence-initial words are not capitalized; only proper names begin with an uppercase letter. Names like Atya (‘Father’), Úr (‘the Lord’), Isten (‘God’), Fiú (‘the Son’), Szentlélek (‘the Holy Ghost’), and so on begin with an uppercase letter, except when they are not used as proper names, e.g.

wagy     embernek    fýa
vagy     ember-nek   fia
be.2sg   man-dat     son.poss.3sg
‘you are man’s son’

For the sake of uniformity, proper names mentioned in biblical translations or excerpts are also normalized: variants of one and the same proper name are rendered uniformly. For this we have relied on the modern Bible translation published by the Szent István Társulat (St. Stephen Association): every proper name was rendered in the form used by this edition.

The names of prayers are also proper names, e.g.

Keeth   Idwez leegý   maríath    mongý
két     Üdvözlégy     Máriá-t    mond-j
two     Hail          Mary-acc   say-imp
‘say two Hail Marys’

However, when the same expression is used in the prayer itself it is not analyzed and normalized as a proper name, e.g.

IDwez     leegy        zenthseeges   maría
üdvöz     légy         szentséges    Mária
welcome   be.imp.2sg   Holy          Mary
‘hail, Holy Mary’

Attributes accompanying proper names (e.g. the Virgin Mary) are analyzed as belonging to the name; accordingly, they begin with an uppercase letter and are coded as proper names.

Multi-word proper names are decomposed, and every component is handled as a separate token. In the case of suffixed proper names only the last component receives the morphological code of the suffix; those preceding it are coded as having nominative form; e.g.

hogh   meltok      legýenk       yesus   christosnak    ýgeͤrethýre
hogy   méltó-k     legy-ünk      Jézus   Krisztus-nak   ígéret-é-re
C      Adj.Pl      V.Subj.P1     N:P     N:P.Dat_gen    N.PxS3=i.Sub
that   worthy-pl   be.sbjv-1pl   Jesus   Christ-dat     promise-poss-sub
‘that we may be worthy of the promise of Jesus Christ’

5. Morphological analysis

Given that the central aim of the project was research on Old Hungarian syntax, we have not attempted a complete morpho-phonological analysis in the corpus: coding does not reflect the full composition of words. On the other hand, we have introduced a new category that in fact transcends morphology, and can be of help in syntactic analysis. This is the use of nominative- and dative-marked possessor expressions, e.g.

ew                 felthamadasaanak            dýchewseegeͤn
ő                  feltámadás-á-nak            dicsőség-é-n
N:Pro.S3.Nom_gen   N.PxS3.Dat_gen              N.PxS3.Sup
he                 resurrection-poss.3sg-dat   honour-poss.3sg-sup
‘on the honour of his resurrection’

In these cases the task of a purely morphological analysis is to mark surface nominative and dative case; for some syntactic analyses however genitive coding is indispensable.

In other cases however we refrained from encoding syntactic phenomena. For instance, unmarked direct objects are coded as bearing nominative and not accusative case, e.g.

feýe            le haythwaan
fej-e           lehajt-ván
N.PxS3          VPfx.V.PartAdv=vÁn
head-poss.3sg   bow-part
‘bowing his head’

We handled the possessive pronoun ő in a similar manner, in those cases where it corresponds to a plurality of possessors, and on the surface it is a singular form. For instance,

ew                 zawok
ő                  szav-uk
N:Pro.S3.Nom_gen   N.PxP3
he                 word-poss.3pl
‘their word’

We also encode semantic information, in that we treat colour terms, names of nations, mass nouns and proper names as separate subclasses.

Polysemy posed a problem during the normalization process already, in that not even context can help in disambiguating certain Old Hungarian words. In these cases morphological analysis has preserved underspecification.

If the verbal prefix immediately precedes the verb its code is attached to that of the verb, e.g.

ees   ky tízthwlok    nagý   wetheesbewl
és    ki-tisztul-ok   nagy   vétés-ből
C     VPfx.V.S1       Adj    N.Ela
and   out-purge-1sg   big    sin-ela
‘and I am purged from big sin’

Similarly, the code of the question particle -e is attached to the analysis of the word to which it is cliticised, e.g.

haborosage
háborúság-e
N.QPtl
conflict-q
‘conflict?’
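
Since attached prefixes and clitics are part of the analysis code itself, a query can detect them by inspecting the code. The sketch below assumes that labels within a code are separated by full stops, as the examples in this section suggest; the function name and the result keys are hypothetical, and the full code inventory should be consulted for real use.

```python
from typing import Dict, List, Union


def parse_analysis(code: str) -> Dict[str, Union[List[str], bool]]:
    """Split an analysis code into labels and flag a few properties of interest."""
    tags = code.split(".")
    return {
        "tags": tags,
        "prefixed": tags[0] == "VPfx",                           # verbal prefix attached to the verb
        "question_particle": tags[-1] == "QPtl",                 # cliticised -e
        "underspecified": any(tag.endswith("?") for tag in tags),  # e.g. Def?
    }


prefixed_verb = parse_analysis("VPfx.V.S1")
question = parse_analysis("N.QPtl")
undecided = parse_analysis("V.Past.P3.Def?")
```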

In several cases it could not be decided whether a word was an adverb or a verbal prefix; we analyzed such words as adverbs, e.g.

mÿnt   eggÿo̗th    elnÿ       es    egÿembeh   nÿaÿaskodnÿ
mint   együtt     él-ni      és    egyembe    nyájaskod-ni
C      Adv        V.Inf      C     Adv        V.Inf
than   together   live-inf   and   together   be.gentle-inf
‘than living together and being gentle together’

Similarly, if it could not be decided whether a word was a pronoun or a verbal prefix, it was coded as a pronoun, e.g.

Es    ra             tegyetek        a     kenyert     es    az    bort
és                   tegyétek        a     kenyer-et   és    a     bor-t
C     N:Pro.Sub.S3   V.Subj.P2.Def   Det   N.Acc       C     Det   N.Acc
and   onto           put.imp.2pl     the   bread-acc   and   the   wine-acc
‘and put the bread and the wine onto it’

Prefixes detached from the verb are marked in a separate field adjacent to the verb. With this function it is possible to retrieve prefixed verb forms where prefix and verb have become separated in the sentence, e.g.

de    säbädicz      mk     mikët     a     gonostwl
de    szabadít-s    meg    mink-et   a     gonosz-tól
      meg
but   deliver-imp   prt    we-acc    the   evil-abl
‘but deliver us from evil’
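
The separate prefix field means that a single query can collect prefixed verbs whether the prefix is attached or detached. A minimal sketch follows; the record type, its field names and the helper functions are hypothetical, and the analysis code given for szabadít-s is a placeholder, since its full code is not shown above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VerbEntry:
    normalized: str
    analysis: str
    detached_prefix: Optional[str] = None  # the separate field adjacent to the verb


def prefix_status(entry: VerbEntry) -> Optional[str]:
    """Return 'detached', 'attached', or None for verbs without a prefix."""
    if entry.detached_prefix is not None:
        return "detached"
    if entry.analysis.split(".")[0] == "VPfx":
        return "attached"
    return None


def prefixed_verbs(entries: List[VerbEntry]) -> List[VerbEntry]:
    """Select every verb that has a prefix, wherever the prefix stands."""
    return [entry for entry in entries if prefix_status(entry) is not None]


entries = [
    VerbEntry("ki-tisztul-ok", "VPfx.V.S1"),
    VerbEntry("szabadít-s", "V", detached_prefix="meg"),  # "V" is a placeholder code
    VerbEntry("csodálkoz-ik", "V.S3"),
]
hits = prefixed_verbs(entries)
```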

Analytic (multi-word) verb forms are decomposed, so that each component is analyzed as a separate token. Auxiliaries are analyzed as containing no person features, e.g.

zenth   ÿanos   ewangelista   ćzodalkozÿk    vala
Szent   János   evangélista   csodálkoz-ik   vala
N:P     N:P     N             V.S3           V.Ipf
Saint   John    evangelist    wonder-3sg     be.pst
‘John the Apostle was wondering’

Double accusatives (e.g. őtet ‘he.acc.acc’) are not marked as such in morphological analysis; such forms receive the same code as their unary variants (őt ‘he.acc’).

Words belonging to several categories are always labelled according to their primary category, even when affixed, e.g.

Syketeknek     hallamast
süket-ek-nek   hallomás-t
Adj.Pl.Dat     N.Acc
deaf-pl-dat    hearing-acc
‘hearing to the deaf’

The full description of the code inventory used in morphological analysis is available here.

6. Metadata

The corpus contains several kinds of metadata beside the texts processed at various levels.

6.1. Locus markers

Primary metadata are markers encoding loci, i.e. data used in locating the currently queried token. Locus markers change from text to text; what they have in common is that they refer to the original codex and not to the print edition. With codices containing translations of the Bible we also provide information on the book, chapter and verse, provided that this information was contained in the print edition as well. The names of books are abbreviated, according to the Szent István Társulat (St. Stephen Association) edition (e.g. Ter: Teremtés könyve ‘Genesis’, Én: Énekek éneke ‘Song of Solomon’).

6.2. The Remark field

The Remark field can contain spontaneous notes and comments; it can also contain metadata, in the form of various codes. The corpus contains the following kinds of metadata:

  • If the title is part of the text it is coded as such, and the Remark field contains the code TITLE. If the title is not part of the text it functions as a locus marker.
  • Foreign words within the text are included in the corpus and labelled as LANG{language}. This label also signals that such a word has no normalized form or morphological analysis. If a foreign word appears with Hungarian affixes it is handled as a Hungarian word, viz. it undergoes normalization and morphological analysis. Extensive passages in a foreign language are omitted. If such a passage is in fact included for some reason, this is marked separately in the descriptions of the texts themselves.
  • Original orthographic versions contain the scribe's corrections. These are marked in the following manner:
    • ulterior addition by the scribe: ADD;
    • passage struck out in the original text: STRIKE;
    • mis-spelled, uncorrected word: FAIL;
    • word fragment: FRAG;
    • mis-spelled, uncorrected word, which is certainly incorrect: ERROR.
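
Because the Remark field mixes free notes with codes, a query tool has to pick the codes out of the text. The sketch below recognizes the codes named in this description (including NOIDEA and MORFO from the sections on normalization). It assumes that codes are written in capitals exactly as shown and that an argument, if any, follows in curly brackets; the function name and the language value `lat` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

CODES = ("TITLE", "LANG", "ADD", "STRIKE", "FAIL", "FRAG", "ERROR", "NOIDEA", "MORFO")

# A known code as a whole word, optionally followed by {argument}.
REMARK_CODE = re.compile(r"\b(" + "|".join(CODES) + r")\b(?:\{([^}]*)\})?")


def parse_remark(remark: str) -> List[Tuple[str, Optional[str]]]:
    """Return (code, argument) pairs found in a Remark field; argument may be None."""
    return [(m.group(1), m.group(2)) for m in REMARK_CODE.finditer(remark)]


morpho = parse_remark("MORFO{INE}")
mixed = parse_remark("LANG{lat} ADD")  # "lat" is an invented language value
```

A free note that happens to contain one of the codes as a capitalized word would also match, so real use would need a stricter convention for separating notes from codes.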

6.3. The Interpretation field

The corpus contains an Interpretation field, which may contain the “translation” of the normalized form into Modern Hungarian. For instance, Old Hungarian jonh corresponds to Modern Hungarian szív ‘heart’. Having a separate Interpretation field does not mean that normalization is carried out without interpretation. Interpretation is a precondition for normalization. The Interpretation field is optional; its use depends solely on the individual decisions of the coder.


[1] Adriano Cappelli: The elements of abbreviation in medieval Latin paleography, p. 2.

[2] Benkő, Loránd (ed.): A magyar nyelv történeti-etimológiai szótára [Historical-Etymological Dictionary of Hungarian]. Budapest: Akadémiai Kiadó, 1967.