To me, using ANNIS to study and interpret the annotations of our collective corpus was fun, but only after figuring out the limits and possibilities of the different queries did I focus on one particular area. There, it was possible to discover distinct patterns in the mistakes that occur when tagging non-English words. I specifically enjoyed analysing the POS-tagging of words that resembled English words but actually weren't.
How Non-English Words Are Tagged
Across languages there is an overlap in form: even though meanings can vary drastically, words can look alike. I addressed the English bias found in the tagging of the multilingual passages in my first blog entry. But until using ANNIS to analyse the whole corpus, I could not estimate how often these mistakes occur, nor which languages are affected. As our corpus shows, this mistake is not uncommon; it occurs across languages and across parts of speech. Using the query pos="…" & isForeign="True", I could extract the following cases of words that seem to have been tagged according to English grammar (a sketch of the full queries follows the table).
| POS (lemma) | # isForeign="True" but tagged according to English grammar |
| --- | --- |
| PRON (me) | 2 |
| ADJ (mere) | 2 |
| VERB (do) | 2 |
| NOUN (soy) | 1 |
| ADV (so) | 1 |
| DET (a) | 1 |
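For anyone who wants to reproduce these counts, here is a minimal sketch of the queries I ran, one per POS tag. I am assuming that pos and isForeign are annotated on the same tokens; the _=_ operator tells ANNIS that both annotations must cover exactly the same span:

```
pos="PRON" & isForeign="True" & #1 _=_ #2
pos="ADJ" & isForeign="True" & #1 _=_ #2
pos="VERB" & isForeign="True" & #1 _=_ #2
```

Depending on the ANNIS version and the corpus configuration, the shorter form without the alignment join (as quoted above) may already work; the exact annotation names depend on how the corpus was imported.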
Even though the tags seemed to consistently adhere to English grammar whenever a word "looked" English, there was one exception. The word "won" was tagged as a noun and not as the verb the software might have been expected to see in it. Instead, it seems that in this case, even without a determiner in front of it, the word was related to the currency "won".
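To inspect such cases in context, the same pattern can be narrowed down to a single word. This is a sketch, assuming our corpus also carries a lemma annotation, as the table above suggests:

```
lemma="won" & isForeign="True" & #1 _=_ #2
```

ANNIS then displays the matches in its keyword-in-context view, which is what makes readings like the currency "won" visible in the first place.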
Implications for Further Research
In order to successfully interpret the use of multilingual passages in literature, a correct tagging of words is necessary. It matters, for instance, when we try to determine which parts of speech non-English words most frequently belong to. Questions arise as to whether authors use these overlaps between languages as a stylistic device, as a challenge to the reader's bias, or simply by happenstance. Furthermore, the English dataset does not suffice to annotate these multilingual texts. Moreover, these overlaps between languages might affect the machine translation of the given passages and further highlight insufficiencies or biases within AI and translation software.
Dear Elisa, I think the query you built shows very interesting results, especially the annotation of “won” as a noun. I’ve also looked at the “NOUN” category in general and noticed that many words originated from different languages but are now quite established in present-day English, so maybe “won” as a currency is also included in some English dictionaries. However, it is strange that the program would skip the “default” meaning of the word as a verb. Maybe the labeling as a noun occurred due to other factors like sentence structure? Other than that, I think it would be interesting to know what kind of language corpus the program uses to determine the POS of words, especially in other cases where a word could belong to different categories like “shelter” or “dance”.