Our corpus

Parts of speech

Our corpus consists of 681 non-English and 3405 English words, meaning 4086 words in total. Here are some of the distributions as they were classified by the machine:

Part of speech    Total    Non-English words    English words
Proper nouns        989                  420              569
Nouns               606                  168              438
Verbs               408                   25              383
Adjectives          184                   26              158
Adverbs             105                   10               95
Conjunctions         79                    1               78
Determiners         258                    3              255

The table shows, for each part-of-speech category, the total number of words in the corpus and how many of them are non-English and English. It suggests that the machine usually classifies a word it does not know as a proper noun, or less often as a noun, but rarely as a conjunction, determiner, adjective or adverb.
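To make the comparison between categories easier, the counts can be turned into proportions. The following is a minimal sketch that uses only the figures from the table; the names `counts`, `non_english_share` and `share_of_all_non_english` are made up for this illustration and are not part of ANNIS or any other tool used for the corpus.

```python
# Counts copied from the table: (total, non-English, English).
counts = {
    "Proper nouns": (989, 420, 569),
    "Nouns": (606, 168, 438),
    "Verbs": (408, 25, 383),
    "Adjectives": (184, 26, 158),
    "Adverbs": (105, 10, 95),
    "Conjunctions": (79, 1, 78),
    "Determiners": (258, 3, 255),
}

# Number of non-English words in the whole corpus, as stated in the post.
NON_ENGLISH_TOTAL = 681


def non_english_share(tag):
    """Fraction of the words tagged `tag` that are non-English."""
    total, non_english, _ = counts[tag]
    return non_english / total


def share_of_all_non_english(tag):
    """Fraction of all non-English words in the corpus that were tagged `tag`."""
    return counts[tag][1] / NON_ENGLISH_TOTAL


for tag, (total, non_english, english) in counts.items():
    # Sanity check: the two subgroups must add up to the category total.
    assert total == non_english + english
    print(
        f"{tag:<14}"
        f"{non_english_share(tag):>8.1%}"
        f"{share_of_all_non_english(tag):>8.1%}"
    )
```

With these figures, roughly 62% of all non-English words end up tagged as proper nouns, while only about 1% of the words tagged as conjunctions are non-English. The table lists only some of the categories, so the second column of the output does not add up to 100%.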

English determiners, conjunctions, adjectives, verbs and nouns were usually categorized correctly, while English adverbs and non-English determiners, conjunctions, adverbs, adjectives, verbs and nouns were usually categorized incorrectly. In other words, English words, except for the English adverbs, are usually classified correctly, while non-English words are usually classified incorrectly.

Dependency relations

For the dependency relations, a similar pattern emerges in how correctly the machine worked: English words, except for the English adverbs, were usually assigned a correct dependency relation, while non-English words and English adverbs were usually assigned an incorrect one. It is also worth mentioning that some non-English words were not assigned a dependency relation at all.


One response to Our corpus

  1. anneschu says:

    Dear Michelle, I really like that you have illustrated our corpus with a table, so that you have a visual for your claims. I also found the distribution of parts of speech really interesting, and it surely underlines the assumption made in class at the beginning of term that ANNIS classifies most non-English words as proper nouns. Your finding that some non-English words were not assigned a dependency relation at all is really interesting as well. I did not have a closer look at that aspect, and I am fascinated, but not shocked, by it.
