For me, working with ANNIS was much more fun than annotating sentences in Google Colab. I liked actually being able to get some quantitative results out of the sentences we annotated. Although the corpus we uploaded to ANNIS ended up not being very extensive, it was still very interesting to see the possibilities the program offers.
During the course of this semester, we had already noticed an extremely high error rate in spaCy's POS classification and dependency tagging. Now that ANNIS made this kind of analysis possible, I was interested to see how high that rate actually is in numbers. Because it was the end of the semester and my time and capacities were limited, I restricted my research to the four POS categories I found most interesting.
With the help of the ANNIS Query Builder, I looked for all proper nouns (PROPN), nouns (NOUN), verbs (VERB) and adjectives (ADJ) that had been manually annotated as foreign. I then checked how many of the POS tags were actually correct. Because my language skills are limited and it wasn't always clear which language I was analysing, there were some words (or tokens) I couldn't identify even through research. These are my findings:
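For reference, a query of the kind the Query Builder generates might look roughly like this in AQL (the ANNIS Query Language). The annotation names `pos` and `foreign="yes"` are assumptions based on how I described our annotation scheme; the actual layer and value names in our corpus may differ:

```
pos="PROPN" _=_ foreign="yes"
```

Here `_=_` asks for a PROPN tag and a manual "foreign" annotation that cover exactly the same token, which is what the searches above amount to.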
| POS | Foreign words with that tag | Classification considered correct | Unsure about classification |
|-------|-----|----|----|
| PROPN | 420 | 55 | – |
| ADJ | 26 | 4 | 2 |
| NOUN | 168 | 93 | 23 |
| VERB | 25 | 8 | 5 |
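To put these counts in relation, the share of correct tags can be computed directly from the table above (a small sketch; the numbers are copied from my table, and tokens I was unsure about are simply not counted as correct):

```python
# Share of spaCy POS tags judged correct among the manually
# annotated "foreign" tokens, using the counts from the table above.
counts = {
    "PROPN": (420, 55),
    "ADJ":   (26, 4),
    "NOUN":  (168, 93),
    "VERB":  (25, 8),
}

for pos, (total, correct) in counts.items():
    print(f"{pos}: {correct}/{total} = {correct / total:.1%} considered correct")
```

This makes the gap visible at a glance: NOUN comes out best at roughly 55%, while PROPN, despite being by far the most frequent tag, is correct only about 13% of the time.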
As you can see, the most frequently assigned tag is PROPN, which matches our impression from previous sessions. It seems that labelling a token as PROPN is an easy fallback for words the model can't recognise. It is also the category with the most mistakes.
The POS with the fewest mistakes is NOUN. This might be because a noun is relatively easy to identify, especially if it is the only foreign word in an English sentence and is perhaps even accompanied by a determiner. This is probably also why authors of multilingual literary texts use relatively many foreign nouns: they are easy for English readers to discern.
For ADJ and VERB, the number of findings is relatively low, so I'm not comfortable drawing any conclusions from my "findings". I suspect that verbs and adjectives are much harder to identify than nouns and proper nouns, partly because they usually change through inflection and conjugation (at least in the languages I know).
I know that our corpus is not yet big enough for definitive conclusions. It might also be biased, because we deliberately chose a wide variety of sentences from every book. Still, it was quite interesting to see (if only on a small scale) what ANNIS can do.
Dear Iska,

As you already know, I also found ANNIS easier and more fun to work with. In addition, I believe that the ANNIS Query Builder is an extremely helpful tool, something that Google Colab did not provide. I enjoyed searching for and comparing the different POS tags of "foreign" words through the queries. As you also mention, ANNIS made it easier to observe specific words and their POS tags by searching for them manually.

It is impressive to see that in your observation only 55 of the 420 PROPN tags are correct. This is a common observation with both ANNIS and Google Colab: the over-use of the PROPN tag for many "foreign" words. I also agree that nouns are definitely easier to discern for English readers; however, even for an English reader, the constant tagging of these words as proper nouns does not make much sense. I hope that in the near future, tools like these will be better equipped for multilingual contexts, which would allow them to better identify foreign words and their parts of speech.