I ran 1000 iterations for each of the cross-language part-of-speech tagging experiments discussed in this paper. For each iteration, the training data was a different subsample of all the available training data.
In all cases, the test dataset remained static. This standard test dataset was the same as the one used in the paper linked above. After my presentation at COLING, one of the questions I was asked was whether the conclusions I drew (that the more distantly related language, Tiriki, was more effective at cross-language tagging of Wanga data than Bukusu) were robust or whether they could be due to noise in the data.
The best way to evaluate whether this was due to noise or to an actual trend was to rerun these experiments with many more samplings from the available data. This took about 8 days to finish, but the deed is now, mostly, done.
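Roughly, the procedure looked like this. The sketch below is only illustrative (in Python); the training and evaluation functions, `train_fn` and `eval_fn`, are placeholders rather than the actual experiment code.

```python
import random

# A minimal sketch of the resampling procedure described above. The training and
# evaluation functions are passed in as placeholders; this is not the real code.
def resample_accuracies(train_sentences, test_sentences, train_fn, eval_fn,
                        n_iterations=1000, sample_fraction=0.5, seed=0):
    """Train on a fresh random subsample each iteration; the test set stays fixed."""
    rng = random.Random(seed)
    sample_size = int(len(train_sentences) * sample_fraction)
    accuracies = []
    for _ in range(n_iterations):
        subsample = rng.sample(train_sentences, sample_size)  # new subsample of the training data
        tagger = train_fn(subsample)                           # train a tagger on the subsample
        accuracies.append(eval_fn(tagger, test_sentences))     # score it on the fixed test set
    return accuracies
```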
My initial worry was that the data would turn out to not even be normally distributed, which would require more statistics than I am comfortable with at the moment. However, the results turned out to be roughly normally distributed.
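For a quick sanity check of that, something like a Shapiro-Wilk test works; the `accuracies` argument here is just assumed to be the list of per-iteration scores, not code from the actual experiments.

```python
from scipy import stats

# A rough normality check on the per-iteration accuracies (the `accuracies`
# argument is assumed to be a list of scores, e.g. from the sketch above).
def check_normality(accuracies, alpha=0.05):
    """Shapiro-Wilk test: a large p-value means no evidence against normality."""
    statistic, p_value = stats.shapiro(accuracies)
    print(f"Shapiro-Wilk W = {statistic:.4f}, p = {p_value:.4f}")
    return p_value > alpha  # True -> consistent with a normal distribution at this alpha
```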
The next step moving forward is to run the Swahili tests at the 0.5 training dataset size. These iterations of the machine learning algorithm were taking a rather long time, and I had to do other things on my servers, so I aborted this portion before it could finish. I will be resuming it shortly.
After that, I will begin by looking at descriptive statistics like the median and standard deviation across different experimental settings and different parts of speech. I am interested in seeing whether there are trends that could be exceptions to my generalization. For example, if Bukusu performs slightly worse but has much less variance in performance, then this may not be a bad thing. There may not be a clear choice of a single source language for the cross-language tagging experiments. In that case, it will still be somewhat surprising that the closer language does not overwhelmingly fare better.
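Concretely, the kind of summary I have in mind could be computed with pandas along these lines; the file name and column names are assumptions about how the results table would be organized, not the real layout.

```python
import pandas as pd

# Assumes the per-iteration results are collected in a long-format table; the
# file name and column names here are placeholders.
results = pd.read_csv("tagging_results.csv")  # columns: source_language, pos_tag, accuracy

# Median and standard deviation per source language and part of speech.
summary = (results
           .groupby(["source_language", "pos_tag"])["accuracy"]
           .agg(["median", "std"]))
print(summary)
```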
Then, I will run some statistical significance tests to determine whether the source language has an effect on the eventual performance of the classifier.
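For two source languages, that comparison could look roughly like the sketch below; the variable names `tiriki_acc` and `bukusu_acc` are hypothetical lists of per-iteration accuracies, one per source language.

```python
from scipy import stats

# A sketch of the comparison, assuming two lists of per-iteration accuracies
# (hypothetical variable names, not the actual experiment data).
def compare_source_languages(tiriki_acc, bukusu_acc):
    # Welch's t-test: compares means without assuming equal variances.
    t_stat, t_p = stats.ttest_ind(tiriki_acc, bukusu_acc, equal_var=False)
    # Mann-Whitney U: a non-parametric fallback if normality looks doubtful.
    u_stat, u_p = stats.mannwhitneyu(tiriki_acc, bukusu_acc, alternative="two-sided")
    print(f"Welch's t-test: t = {t_stat:.3f}, p = {t_p:.4f}")
    print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {u_p:.4f}")
```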
The results of these 1000 iterations are summarized in this R Shiny Application.