Initially, I was very unsure about this task. As someone who has focused on literature during their studies (for good reason), I am not particularly good at linguistics, programming or coding. The task seemed intriguing, but I also felt some apprehension when I looked at the Google Colab file for the first time. This was quickly overcome once we went through the steps one by one. I was curious to see how the program would deal with multilingual sentences, given that it is based purely on English.
I chose to examine A Concise Chinese-English Dictionary for Lovers. This book not only contains multilingual sentences, that is, a mixture of English and Chinese, but also grammatically incorrect English, as it is written from the point of view of the protagonist, who is learning English as the storyline progresses.
Because of this, I naturally ran into some issues. One of the sentences I chose to look at was this one: "Chinese we say shi yue huai tai (十月怀胎). It means giving the birth after ten months pregnant."
When it came to the POS tagging, spaCy not only categorised all of the pinyin Chinese words in the sentence as proper nouns, it also treated the Hànzì as a single word rather than a whole phrase.
In addition to this, the word "Chinese" was categorised as an adjective. Because the sentence is not grammatically correct, spaCy could not recognise that it is meant as a noun here.
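The tokenisation part of this can be checked separately from the POS tagging. The following is only a minimal sketch, assuming spaCy is installed. It uses a blank English pipeline (tokeniser only) rather than the trained model from the Colab file, so it produces no POS tags, and the helper name `token_texts` is purely illustrative. It compares the original sentence with a variant in which spaces are placed between the characters.

```python
import spacy

# Assumption: a blank English pipeline, i.e. only the rule-based tokeniser.
# This is NOT the trained model from the Colab file, so there are no POS tags.
nlp = spacy.blank("en")


def token_texts(text):
    """Return the text of every token the English tokeniser produces."""
    return [token.text for token in nlp(text)]


# Original sentence from the book: the four characters are written together.
joined = token_texts("Chinese we say shi yue huai tai (十月怀胎).")

# Variant with spaces between the characters (not correct from the Chinese
# perspective, only used to see how the tokeniser reacts).
spaced = token_texts("Chinese we say shi yue huai tai (十 月 怀 胎).")

print(joined)
print(spaced)
```

Since the English tokeniser splits mainly on whitespace and punctuation, I would expect the characters to stay together as one token in the first case and to come out as separate tokens in the second. To see the POS tags as well, the blank pipeline would have to be swapped for a trained English model.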
It was definitely interesting to see what the program made of the example sentences from A Concise Chinese-English Dictionary for Lovers. Even though I am still getting the hang of tokenisation and dependencies, I am interested to see what we do next.
I find it really interesting to see how the programme reacts to grammatically incorrect English, as my examples did not include any. I bet spaCy made a lot of mistakes there. At first I was also a bit unsure, especially about the linguistic aspect of this class, as my last linguistics class was a few years back, but the step-by-step walkthrough helped me a lot too. It is also quite interesting that, no matter which non-English language is involved, the programme likes to interpret the words as proper nouns and only very rarely as something else. What differs from my findings is that you also have a different writing system, the Hànzì, and I am curious about how spaCy reacts to this. Did you try out whether it also categorises the characters as one word when you put spaces in between (even though that would then not be correct from the Chinese perspective)?