Working with ANNIS and the corpus made up of our novels’ example sentences confirmed one of the predictions we made at the beginning of term: machine annotation is indeed faulty, and it tends to classify non-English words as proper nouns. We had already seen this tendency while working with Google Colab, and ANNIS now shows that non-English words are indeed often analysed this way.
To reach this conclusion, I simply ran a query for all foreign words with isForeign="True". There were 681 matches, which is unsurprisingly a lot, since this was our goal. Instead of looking through all the matches, I decided to focus on one sentence, which starts at the fourth result: “Que siempre la lengua …”.
As can be seen, 13 out of 15 non-English words are annotated and analysed as proper nouns, which obviously does not make sense, as a sentence cannot consist solely of proper nouns. This alone already confirms the predicted tendency to categorise non-English words as proper nouns.
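A tally like this does not have to be done by hand. Below is a minimal Python sketch, assuming the annotations are available as (token, tag) pairs and that the tags follow Universal POS labels such as PROPN, VERB and DET. The function names are hypothetical, and the sample contains only the four words quoted in this post with the tags described here, not the full sentence.

```python
from collections import Counter


def pos_distribution(tagged_tokens):
    """Return a Counter mapping each POS tag to how often it occurs."""
    return Counter(tag for _token, tag in tagged_tokens)


def share_of_tag(tagged_tokens, tag):
    """Return the fraction of tokens carrying the given tag (0.0 for empty input)."""
    if not tagged_tokens:
        return 0.0
    return pos_distribution(tagged_tokens)[tag] / len(tagged_tokens)


# Illustrative sample: only the four tokens named in the post, with the
# machine-assigned tags as described there (tag labels are an assumption).
sample = [
    ("Que", "PROPN"),
    ("siempre", "VERB"),
    ("la", "DET"),
    ("lengua", "PROPN"),
]

print(pos_distribution(sample))
print(share_of_tag(sample, "PROPN"))
```

Applied to the full sentence, the same calculation would give the share reported above, 13 out of 15, or roughly 87 % proper nouns.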
However, there are two exceptions: ‘siempre’ and ‘la’. ‘Siempre’ is categorised as a verb, ‘la’ as a determiner. This is not completely correct either. ‘La’ is indeed a determiner, as it is an article, but ‘siempre’ is, if my broken Spanish is anything to go by, an adverb, as it means ‘always’. I do not really have an answer as to why these two words are exceptions, and especially not why one is still analysed incorrectly while the other is analysed correctly.
Hence, most of our initial predictions did come true while working with ANNIS. This is obviously not a great result in terms of correctly annotating and tokenising sentences. In terms of highlighting the problem of Anglocentrism in software development, however, it is a very telling result.
Hi Anne, I also noticed this sentence, especially because it is so conspicuous that it is supposed to consist almost entirely of proper nouns, and I agree with you that this doesn't make any sense. I also agree that Anglocentrism plays a huge part in creating this problem, which shows not only in the issue with proper nouns that you mentioned, but also in other part-of-speech classes. Michelle