Using ANNIS as a Tool in Analysing Multilingual Sentences

To me, using ANNIS to study and interpret the annotations of our collective corpus was fun, but only after I had figured out its limits and possibilities and focused on one particular area. It was possible to discover different patterns in the mistakes that occur when tagging non-English words. I specifically enjoyed analysing the POS tagging of words that resemble English words but actually aren’t.

How Non-English Words Are Tagged

Languages overlap in form: words can look alike even though their meanings vary drastically. In my first blog entry I addressed the English bias that can be found in the tagging of multilingual passages. But until I used ANNIS to analyse the whole corpus, I couldn’t estimate how often these mistakes occur or which languages are affected. As our corpus shows, this mistake is not uncommon and occurs across languages and across parts of speech. Using the query pos="…" & isForeign="True", I could extract the following cases of words that are marked as foreign but seem to be tagged according to English grammar.

POS (lemma)	Tokens marked “isForeign” but tagged according to English grammar
PRON (me)	2
ADJ (mere)	2
VERB (do)	2
NOUN (soy)	1
ADV (so)	1
DET (a)	1
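
As a rough illustration of how such a tally could be reproduced outside ANNIS, here is a minimal Python sketch. It assumes the tokens are available as simple (lemma, POS, isForeign) records; that record layout, the function name count_foreign and the extra English token "the" are my own assumptions for illustration, not an ANNIS export format. The sample records only mirror the counts in the table above.

```python
from collections import Counter

# Hypothetical token records: (lemma, pos, is_foreign).
# This layout is an assumption for illustration, not an ANNIS export format.
tokens = [
    ("me", "PRON", True), ("me", "PRON", True),
    ("mere", "ADJ", True), ("mere", "ADJ", True),
    ("do", "VERB", True), ("do", "VERB", True),
    ("soy", "NOUN", True),
    ("so", "ADV", True),
    ("a", "DET", True),
    ("the", "DET", False),  # illustrative English token, filtered out below
]


def count_foreign(records):
    """Count (pos, lemma) pairs among the tokens flagged as foreign."""
    return Counter(
        (pos, lemma) for lemma, pos, is_foreign in records if is_foreign
    )


counts = count_foreign(tokens)

# Print the tally in the same "POS (lemma)<TAB>count" shape as the table.
for (pos, lemma), n in counts.most_common():
    print(f"{pos} ({lemma})\t{n}")
```

The filter on the foreign flag plays the role of isForeign="True" in the query, and grouping by POS and lemma gives one row per table entry.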

Even though the tags seemed to adhere consistently to English grammar whenever a word “looked” English, there was one exception. The word “won” was tagged as a noun and not as a verb, which is how the software might have been expected to read it. Instead, it seems that in this case, even without a DET in front of it, the word was related to the currency “won”.

Implications for Further Research

In order to interpret the use of multilingual passages in literature successfully, correct tagging of words is necessary. It matters, for example, when we try to determine which POS non-English words most frequently belong to. The question arises whether authors use these overlaps between languages as a stylistic device or as a challenge to the reader’s bias, or whether they simply occur by happenstance. Furthermore, the English dataset alone doesn’t suffice to annotate these multilingual texts. Moreover, these overlaps between languages might affect the machine translation of such passages and further highlight insufficiencies or biases within AI and translation software.

One response to Using ANNIS as a Tool in Analysing Multilingual Sentences

juliasel says:

25 July 2024 at 17:08

Dear Elisa, I think the query you built shows very interesting results, especially the annotation of “won” as a noun. I’ve also looked at the “NOUN” category in general and noticed that many words originated from different languages but are now quite established in present-day English, so maybe “won” as a currency is also included in some English dictionaries. However, it is strange that the program would skip the “default” meaning of the word as a verb. Maybe the labeling as a noun occurred due to other factors like sentence structure? Other than that, I think it would be interesting to know what kind of language corpus the program uses to determine the POS of words, especially in other cases where a word could belong to different categories like “shelter” or “dance”.
