Getting labeled data isn't easy.

Having reliable labeled data is crucial for any supervised learning task. But for large datasets, manually assigning labels can be intractable. Services like Mechanical Turk can be particularly useful for crowdsourcing labels, but it is difficult to guarantee that workers will reliably provide correct ones. Each worker also brings their own subjectivity to labeling, and it is important for the model to capture this bias somehow.

The task becomes even more difficult as the number of classes grows. In the case of labeling flowers, for example, an amateur labeler faced with over a hundred options may not be well versed enough to differentiate between families of orchids. There may also be ambiguity between classes. Take the classic example of lions, tigers, and ligers: to the layperson, an image of a liger may fall closer to a tiger on the lion-tiger spectrum, and thus be hard to classify.

So how can we ask the question?

We compare two ways to ask a worker to classify an item.

MULTICLASS (MC) PARADIGM

This is the paradigm commonly used in labeling tasks. A worker is presented with an item and has to choose one of the K classes.

1 question

Worker predicts class k

YES-NO (YN) PARADIGM

In this case, when a worker is presented with an item, he or she is asked one question per class and answers yes or no to each.

K questions

Worker predicts yes or no on class k

First, we have to generate the data.

We began with a confusion matrix Θ(j) per worker j, where the rows represent the true class labels and the columns the predicted labels. Each element of this K × K matrix is the probability that the worker predicts the class indicated by the column when the true class is the one indicated by the row. To generate each Θ(j), we took an identity matrix of size K, added a bias term over the entire matrix, and then normalized each row. This puts the most weight along the diagonal, which is intuitive because we would expect a worker to predict the correct class most often. We also generated true classes using a categorical distribution: the true label y_n on item n is drawn as class k with probability λ_k.
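
To make this concrete, here is a minimal sketch of the generation step in NumPy. The sizes, the bias value, and the Dirichlet draw for λ are illustrative assumptions, not the exact settings we used:

```python
import numpy as np

rng = np.random.default_rng(0)

K, J, N = 4, 5, 200              # classes, workers, items (example sizes)

def generate_confusion_matrix(K, bias):
    """Identity plus a uniform bias over the whole matrix, rows normalized to sum to 1."""
    theta = np.eye(K) + bias
    return theta / theta.sum(axis=1, keepdims=True)

thetas = np.stack([generate_confusion_matrix(K, bias=0.3) for _ in range(J)])
lam = rng.dirichlet(np.ones(K))  # class distribution lambda (one way to pick it)
y = rng.choice(K, size=N, p=lam) # true label y_n drawn as class k with probability lambda_k
```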

MULTICLASS PARADIGM

From the confusion matrices and true labels, we generated the data as a matrix R of size N × J × K, where r_njk = 1 when worker j predicts class k on item n, and r_njk = 0 otherwise.
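
A sketch of the MC response generation (self-contained, so `thetas` and `y` are regenerated here with hypothetical settings rather than carried over):

```python
import numpy as np

rng = np.random.default_rng(0)
K, J, N = 4, 5, 200
theta = np.eye(K) + 0.3
thetas = np.stack([theta / theta.sum(axis=1, keepdims=True)] * J)
y = rng.choice(K, size=N)

# R[n, j, k] = 1 iff worker j predicted class k on item n.
R = np.zeros((N, J, K), dtype=int)
for n in range(N):
    for j in range(J):
        k_pred = rng.choice(K, p=thetas[j, y[n]])  # sample from row y_n of worker j's confusion matrix
        R[n, j, k_pred] = 1
```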

YES-NO PARADIGM

In the Yes-No case, our data matrix grows to size N × J × K × 2. Now, for each worker j on item n, r_njk0 = 1 and r_njk1 = 0 if they answered no to the question about class k, and r_njk0 = 0 and r_njk1 = 1 if they answered yes. We keep two binary values per question rather than a single one (where 0 is no and 1 is yes) so that when a worker does not answer a question on an item, both r_njk0 and r_njk1 stay 0 and the question simply drops out of the likelihood.
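
A corresponding sketch for the YN responses. One simplifying assumption here: the probability of a yes on the class-k question is taken to be the (y_n, k) entry of the worker's confusion matrix; the actual YN likelihood in the paper may parameterize worker behavior differently.

```python
import numpy as np

rng = np.random.default_rng(0)
K, J, N = 4, 5, 200
theta = np.eye(K) + 0.3
thetas = np.stack([theta / theta.sum(axis=1, keepdims=True)] * J)
y = rng.choice(K, size=N)

# R[n, j, k, 0] = 1 for a "no", R[n, j, k, 1] = 1 for a "yes".
# An unanswered question leaves both entries at 0, so it drops out of the likelihood.
R = np.zeros((N, J, K, 2), dtype=int)
for n in range(N):
    for j in range(J):
        for k in range(K):
            says_yes = rng.random() < thetas[j, y[n], k]  # assumed yes-probability
            R[n, j, k, int(says_yes)] = 1
```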

Recovering true labels from the data

Once we had the data, we performed Expectation Maximization and Simulated Annealing on both MC and YN to see if we could recover the confusion matrices, class distribution, and true labels. This part gets a bit hairy, so take a look at the paper for all the technical details. But here are the basic ideas:

Expectation Maximization (EM)

The EM procedure is the same for both MC and YN, but they differ in their likelihoods (derived in the paper). In the E-step, we calculate the posterior probability of each class for each item. Then, in the M-step, we compute the MLEs of the class distribution and the confusion matrices. We repeat the E and M steps until the estimates stop changing between iterations. Our predicted label for each item is then the class with the highest posterior probability.
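
As a rough illustration, here is what the EM loop might look like for the MC paradigm, in the spirit of the classic Dawid-Skene model; the exact likelihoods we used are derived in the paper and may differ in detail:

```python
import numpy as np

def em_mc(R, n_iter=50, tol=1e-6):
    """EM for the MC paradigm. R has shape (N, J, K) with one-hot worker votes."""
    N, J, K = R.shape
    # Initialize the posterior q[n, k] from simple vote counts.
    q = R.sum(axis=1) + 1e-9
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # M-step: MLEs of the class distribution and per-worker confusion matrices.
        lam = q.mean(axis=0)                                  # lambda_k
        theta = np.einsum('nk,njl->jkl', q, R) + 1e-9         # weighted counts per worker
        theta /= theta.sum(axis=2, keepdims=True)             # normalize rows
        # E-step: posterior over true labels given the current parameters.
        log_q = np.log(lam) + np.einsum('njl,jkl->nk', R, np.log(theta))
        new_q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        new_q /= new_q.sum(axis=1, keepdims=True)
        if np.abs(new_q - q).max() < tol:                     # stop when estimates settle
            q = new_q
            break
        q = new_q
    return q.argmax(axis=1), lam, theta  # predicted labels, class dist, confusion matrices
```

Running `em_mc(R)` on the MC data from the earlier sketch returns the predicted labels along with the estimated class distribution and per-worker confusion matrices.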

Simulated Annealing (SA)

For SA, our states are the current estimates of the labels. We initially assign a random label to each item, without worrying about the class distribution. Using these random assignments and the data, we compute the initial confusion matrices. In a Simulated Annealing step, we assign a random new label to a random item and use this new set of assignments to recompute the confusion matrices, which gives us everything needed to compute the new likelihood. Using the standard SA procedure, we can then tune parameters like the re-annealing schedule and temperature to converge to the right solution. Since we are effectively optimizing over the assignments as well as the confusion matrices, simulated annealing has a tough time finding the global optimum. To mitigate this, the initial assignments can instead be taken as the majority-vote (mode) class over all workers. Again, this procedure applies to both the MC and YN paradigms.
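
Here is a minimal sketch of that SA loop for the MC paradigm, with majority-vote initialization and a simple geometric cooling schedule standing in for the tuned re-annealing scheme described above (step counts and temperatures are placeholder values):

```python
import numpy as np

def mc_log_likelihood(R, labels, K):
    """Log likelihood of the MC responses given a label assignment.
    Recomputes the MLE confusion matrices each call; the class-prior term
    is omitted for brevity. Fine for a sketch, slow in practice."""
    onehot = np.eye(K)[labels]                        # (N, K)
    theta = np.einsum('nk,njl->jkl', onehot, R) + 1e-9
    theta /= theta.sum(axis=2, keepdims=True)
    return np.einsum('njl,jkl,nk->', R, np.log(theta), onehot)

def anneal_mc(R, K, n_steps=20000, t0=1.0, cooling=0.999, seed=0):
    rng = np.random.default_rng(seed)
    N = R.shape[0]
    labels = R.sum(axis=1).argmax(axis=1)             # majority-vote initialization
    ll = mc_log_likelihood(R, labels, K)
    T = t0
    for _ in range(n_steps):
        n, k = rng.integers(N), rng.integers(K)       # random item, random new label
        proposal = labels.copy()
        proposal[n] = k
        new_ll = mc_log_likelihood(R, proposal, K)
        # Metropolis criterion: always accept improvements, sometimes accept worse moves.
        if new_ll >= ll or rng.random() < np.exp((new_ll - ll) / T):
            labels, ll = proposal, new_ll
        T *= cooling                                  # geometric cooling schedule
    return labels
```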

Results

We used two data sets: an easy data set and a hard data set. Both had 5 workers and 4 classes, but were generated with different confusion matrices. The easy data were created with confusion matrices close to an identity matrix, representing workers that are very accurate. In contrast, the confusion matrices used to generate the hard data set were much more muddled and varied vastly between workers, representing workers with varying accuracy. We ran the data through SA and EM for both MC and YN, and compared overall classification accuracy (the percentage of predicted labels that match the true labels) for 40, 80, 100, 200, and 500 data points, on both the easy and hard data.
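
The accuracy metric itself is just an element-wise comparison; a sketch, where `y_true` and `y_pred` are arrays of true and predicted labels:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Percentage of predicted labels that match the true labels."""
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))
```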

[Plot: classification accuracy on the easy data set]

[Plot: classification accuracy on the hard data set]

What does this all mean?

As we can observe in the plot for easy data, both paradigms are able to recover the true labels well, but YN has an edge over MC in almost every instance.

The differences in accuracy between YN and MC are much more evident on the hard data, where our workers are less precise. Not only can YN consistently recover the true labels with almost 100% accuracy, but with EM it can also capture the confusion matrices for each worker. The implication is that we can pick out individual workers who are less reliable than others. As for differences in methods, EM is much faster than SA; it is able to converge in two iterations (a tenth of a second on average), versus the ten minutes it takes for SA to reach a solution. SA also requires tweaking of parameters, whereas EM reaches its solution directly from the update formulas.

Of course these are preliminary results, but they are certainly promising. In future work, we expect to consider things like interdependency between questions, the ordering of questions, and other questioning paradigms.

Conclusion

From these results, it would seem that asking K yes-no questions is almost universally better than asking a single multiclass question if you want to recover true class labels.

The team

We're master's students studying Computational Science and Engineering at Harvard. This was our final project for the AM207 course.

Neil Chainani

Christian Junge

Abhishek Malali

Want to see more?

Check out the code.