Researchers from the University of Waikato in New Zealand have used a machine learning model to narrow a massive 8 million tweets down to a more manageable 1.2 million, in order to look at how te reo Māori is being used on Twitter.
According to a recent press release, the team focused on 77 Māori loanwords, or te reo Māori words used in an English context, and used them as training data for their machine learning model.
Machine learning allows data scientists to provide a computer with a large data set, and teach it to make predictions based on that data.
The initial 8 million tweets contained a fair bit of distracting data ‘noise’: irrelevant tweets in which the words were not used in a New Zealand English context, or which were otherwise unrelated.
At first, the team manually coded about four thousand tweets, then trained a machine learning model to weed out the irrelevant ones.
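To make the filtering step concrete, here is a minimal sketch of training a classifier on hand-labelled tweets and using it to flag irrelevant ones. It is an illustration only: the press release does not say which model the team used, so a simple naive Bayes classifier stands in for it, and the example tweets, the labels and the function names (`tokenize`, `train`, `predict`) are all invented for this sketch.

```python
import math
from collections import Counter


def tokenize(text):
    # Very rough tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train(labelled):
    """Count words per label from (text, label) pairs coded by hand."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in labelled:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocab.add(token)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the most probable label under naive Bayes with add-one smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented stand-ins for the manually coded tweets.
coded = [
    ("sharing kai with the whānau tonight", "relevant"),
    ("so much aroha for my whānau in aotearoa", "relevant"),
    ("great kai at the hui today", "relevant"),
    ("kai scored again for chelsea", "irrelevant"),
    ("watching kai play football tonight", "irrelevant"),
    ("kai and his band play berlin", "irrelevant"),
]

model = train(coded)
print(predict(model, "kai with whānau at the hui"))
print(predict(model, "kai scored for chelsea"))
```

On this toy data the first tweet is labelled relevant and the second irrelevant. A real pipeline would need far more labelled examples and a held-out set to check accuracy.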
After that, they used a machine learning technique, invented by a popular multinational search engine company, to automatically extract the meaning of words according to their context.
There is a plan to grow this project into a dissertation, wherein the team will be asking some questions regarding the data they have gathered.
The team is interested to know whether the people who tweeted are mainly te reo speakers and, if not, what the reason is behind their use of the loanwords.
Their analysis involves locating the other words that are associated with the Māori loanwords because it will give them a different kind of idea about how the words are being used.
A dictionary tends to give what a word means abstractly, out of context, or with a synonym or two.
But in this case, they have more of a network of related words, which may not have the same meaning but seem to occur in the same contexts.
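The idea of a network of related words can be sketched with simple co-occurrence counts. This is not the team's technique, which learns word meanings from context with a trained model. It is a count-based stand-in that shows the same principle: words that appear in similar contexts end up with similar vectors. The sentences and the function names (`cooccurrence_vectors`, `cosine`, `nearest`) are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(vectors, target, k=3):
    """Return the k words whose contexts look most like the target's."""
    target_vec = vectors.get(target, Counter())
    scored = [
        (cosine(target_vec, vec), word)
        for word, vec in vectors.items()
        if word != target
    ]
    return [word for _, word in sorted(scored, reverse=True)[:k]]


# Invented sentences standing in for the tweet corpus.
sentences = [
    "we shared kai with the whānau",
    "we shared food with the family",
    "the whānau came over for dinner",
    "the family came over for dinner",
    "the bus was late again",
]

vectors = cooccurrence_vectors(sentences)
print(nearest(vectors, "whānau", k=1))
print(nearest(vectors, "kai", k=1))
```

In this toy corpus the nearest neighbour of "whānau" is "family" and that of "kai" is "food", because each pair occurs in the same surrounding words. On real tweets the neighbours would include words that are not synonyms but turn up in the same contexts.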
One of the researchers, Dr Calude, has been involved in research on Māori loanwords in newspapers as part of a Marsden-funded project.
The Marsden Fund is the primary mechanism in New Zealand for funding pure research, which is undertaken solely to increase knowledge.
During the manual coding phase, she has already noticed a difference in how the words are used in tweets. By comparison with newspapers, the words are more integrated.
There is more language mixing, meaning full sections of Māori and full phrases in English appear together. In that respect it is similar to code-switching, which is what bilinguals do.
The theory behind the project has been around for quite a while, but combining it with machine learning means they have created a remarkably vast and accurate corpus of words to analyse.
The researchers want to make it possible for others to do the same, so they are providing the knowledge on an open-source platform.
They are adding to the website as they go along, so it is a growing resource.