Converting and Annotating Multilingual Sentences & Quotes – My Experience

Initially, I was very unsure about this task. As someone who focused on literature during their studies (for good reason), I am not particularly good at linguistics, programming or coding. The task seemed intriguing, but I also felt some apprehension when I looked at the Google Colab file for the first time. This was quickly overcome once we went through the steps one by one. I was curious to see how the program would deal with multilingual sentences, given that it is based purely on English.

I chose to examine A Concise Chinese-English Dictionary for Lovers. This book not only contains multilingual sentences, in the sense of a mixture of English and Chinese, it also contains grammatically incorrect English, because it is written from the point of view of the protagonist, who is learning English as the storyline progresses.

Because of this, I naturally ran into some issues. One of the sentences I chose to look at was this one: “Chinese we say shi yue huai tai (十月怀胎). It means giving the birth after ten months pregnant.”
When it came to the POS tagging, the program not only categorised all of the pinyin words in the sentence as proper nouns, it also treated the Hànzì as a single word rather than as a phrase made up of several words.
In addition, the word “Chinese” was categorised as an adjective: because the sentence is not grammatically correct, spaCy did not recognise that it is meant as a noun.
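
For anyone who wants to try this themselves, the check described above could be sketched roughly as follows. This is not the code from our Colab file, only a minimal sketch under a few assumptions: that spaCy is installed, and that the small English model `en_core_web_sm` is the one being used (I am assuming the model name here). The helper names `contains_hanzi` and `tag_sentence` are made up for this illustration. The exact tags may differ depending on the model and version, so no output is shown.

```python
import unicodedata


def contains_hanzi(text):
    """Return True if any character in the text is a CJK ideograph (Hànzì)."""
    return any("CJK UNIFIED IDEOGRAPH" in unicodedata.name(ch, "") for ch in text)


def tag_sentence(sentence, model_name="en_core_web_sm"):
    """Return (token, POS tag, contains Hànzì?) for every token in the sentence.

    model_name is an assumption: any installed English spaCy pipeline should work.
    """
    import spacy  # imported here so the helper above also works without spaCy

    nlp = spacy.load(model_name)
    doc = nlp(sentence)
    return [(token.text, token.pos_, contains_hanzi(token.text)) for token in doc]


if __name__ == "__main__":
    sentence = (
        "Chinese we say shi yue huai tai (十月怀胎). "
        "It means giving the birth after ten months pregnant."
    )
    try:
        for text, pos, has_hanzi in tag_sentence(sentence):
            note = "  <- contains Hànzì" if has_hanzi else ""
            print(f"{text:10} {pos}{note}")
    except (ImportError, OSError) as err:
        # spaCy raises OSError when the requested model is not installed
        print(f"spaCy or the model is not available: {err}")
```

Printing one token per line makes it easy to see which tag each pinyin word receives and whether the Hànzì end up in a single token.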

It was definitely interesting to see what the program made of the example sentences from A Concise Chinese-English Dictionary for Lovers. Even though I am still getting the hang of tokenisation and dependencies, I am interested to see what we do next.

