Multiview Representation Learning from Crowdsourced Triplet Comparisons

Crowdsourcing has been used to collect data at scale in numerous fields. Triplet similarity comparison is a type of crowdsourcing task in which crowd workers are asked the question "among three given objects, which two are more similar?", which is relatively easy for humans to answer. However, the comparison can sometimes be based on multiple views, i.e., different independent attributes such as color and shape. Each view may lead to different results for the same three objects. Although an algorithm was proposed in prior work to produce multiview embeddings, it has at least two problems: (1) it cannot independently predict multiview embeddings for a new sample, and (2) it does not account for the fact that different people may prefer different views. In this study, we propose an end-to-end inductive deep learning framework to solve the multiview representation learning problem. The results show that our proposed method can obtain multiview embeddings of any object, in which each view corresponds to an independent attribute of the object. We collected two datasets from a crowdsourcing platform to experimentally investigate the performance of our proposed approach compared to conventional baseline methods.


INTRODUCTION
In recent years, deep learning methods have been widely adopted in various fields and have exhibited remarkable performance [20,22]. However, these applications largely depend on the collection of sufficient amounts of appropriate training data. Crowdsourcing is an efficient and economical approach in which various data are collected by applying human intelligence [16]; asking crowd workers to annotate labels for specified objects in digital images by choosing one of several categories is among the most popular tasks [10]. However, collecting labels by making choices can become difficult in some cases: the set of all possible categories may not be available in advance, objects may be difficult to recognize, and crowd workers may not be able to provide accurate answers to some questions without expert or professional knowledge.
To solve this problem, we consider tasks that ask about similarity rather than requesting workers to perform categorization. For example, it is difficult for people with only general knowledge to identify all dog breeds, so the images cannot be directly labeled, but similarity data can be collected by comparisons as shown in Figure 1. Similarity comparison data can be used to train representation learning models and provide multi-dimensional vector embeddings of objects. Usually, we expect the similarity (which decreases as distance increases) between embeddings of similar objects to be larger, and that between embeddings of dissimilar objects to be smaller.
Pairwise and triplet similarity comparisons are two representative types of similarity comparison tasks [7,12,13,23]. Pairwise similarity comparison, also known as absolute similarity comparison, asks crowd workers to answer the following question: "Are objects A and B similar?".
However, applying pairwise similarity comparisons in crowdsourcing involves at least one key issue: making absolute decisions is generally challenging for humans. This problem is especially noticeable for subjective judgments. Different crowd workers may have different thresholds of similarity or dissimilarity, so their judgments often conflict with each other.
Triplet similarity comparison, also known as relative similarity comparison, can be used to solve this problem with pairwise similarity comparison in crowdsourcing. Triplet similarity comparison involves the selection of the two relatively similar objects among three that are provided. Here, crowd workers are asked to answer the following question: "Which two objects among A, B, and C are more similar?".
There are three possible answers: (1) A and B are more similar, (2) A and C are more similar, and (3) B and C are more similar. Triplet similarity comparisons are more accurate than pairwise similarity comparisons because relative comparisons are often easier for humans in common cases [13].
However, triplet similarity comparisons can be more ambiguous in some cases because an object may have multiple attributes. Humans would thus naturally compare objects differently in terms of different attributes. In this study, we refer to these attributes as "views" for simplicity. At least two problems arise in considering multiple views: (1) First, different crowd workers consider different views for the same task. For example, the triplet similarity comparison task shown in Figure 1 involves two possible views: color of dogs and face shape of dogs. The leftmost dog and the rightmost dog are clearly dissimilar. Crowd workers who focus more on color will consider that the leftmost and the middle dogs are more similar. However, crowd workers who focus more on face shape will consider that the rightmost and the middle dogs are more similar. Neither choice is wrong, and we describe this difference as being caused by the different views of the workers in considering different attributes of objects.
(2) Second, the same crowd worker might choose different views in different tasks or at different times. A worker might select different views in different situations, e.g., sometimes focusing on color and considering shape in other instances. Moreover, workers may select a view that simplifies their decision-making process for a given task.
In this study, we propose a novel end-to-end representation learning framework to learn from multiview triplet data by adding multiple branches to the existing network structure. Our proposed method allows different workers to choose different views for different tasks. We recruited crowd workers and conducted simulation experiments to evaluate the performance of our proposed method.
The contributions of this study are summarized as follows.
• In contrast to previous work [1], our proposed method performs inductive learning and can provide multiview embeddings for an arbitrary new sample.
• To address the problem of different crowd workers responding with different views, we added worker models to reflect the preference of workers for different views.
• To address the problem that a given crowd worker might choose different views, the proposed approach adopts triplet entropy to measure the difficulty of deciding on a view.
• We used multiple evaluation metrics to evaluate the performance of our proposed approach compared to some existing baseline methods. Additionally, we confirmed the semantic meaning of multiview embeddings using visualization techniques.

RELATED WORK
We briefly review some relevant studies and describe their differences from our proposed method, which are summarized in Table 1.
Multiview Triplet Embedding: Amid et al. [1] proposed an algorithm that produces multiview triplet embeddings (MVTE), in which each view corresponds to a hidden attribute. The input of this algorithm is only the triplet comparison data, which does not contain information about the original object, such as raw images, and the output is the multiview embeddings of each sample.
The problem can be summarized as follows.
Given a set of triplets $\mathcal{T}$, where each element of $\mathcal{T}$ is an ordered triplet $(i, j, k)$ in which object $i$ is more similar to object $j$ than to object $k$, find $V$-view embeddings of $N$ objects $\mathcal{Y} = \{y_1, y_2, \ldots, y_N\}$, where each $y_n$ consists of one embedding per view.
Triplet Network: Triplet Network [8] is a deep metric learning approach for classification problems. The Triplet Network is inspired by the Siamese Network [4]; it takes three samples as input and provides three embeddings corresponding to the three samples. Two of the three samples are from the same class, while the other is not. The training process for the Triplet Network minimizes the embedding distance between samples of the same class and maximizes it between samples of different classes. In this study, we adopt a similar approach, which takes three samples as the input for one triplet similarity comparison.
Crowd Worker Modeling: Some workers will recognize instances of some classes as belonging to other classes in label-annotated crowdsourcing tasks. As an alternative to combining labels using methods such as averaging or majority voting, a crowd layer [19] is added to automatically correct the worker bias. The crowd layer constructs confusion matrices modeling each worker, and uses an end-to-end deep learning framework to obtain their values. Our model uses a similar idea, constructing a model for each crowd worker to represent the preference of that worker. Other models that extend the crowd layer model have also been developed recently, such as SpeeLFC [3], CoNAL [5], and LFC-x [14].
Multiview Clustering: Multiview clustering methods [26,27,29] typically use multiview features of a given object to perform clustering tasks. Generally, these techniques do not require raw images or texts, but instead only use the extracted features as input. In the field of image processing, features are usually obtained using techniques such as histogram of oriented gradients (HoG), local binary pattern (LBP), and scale-invariant feature transform (SIFT). Our proposed approach differs from these methods in that we use a raw image as input and the output is a multiview embedding; the output can then be processed with existing clustering methods.
Multi-label Learning: Multi-label learning [15,18,21] is used for datasets in which a single object has multiple attributes that represent distinctive features, and each attribute corresponds to a specified label. In this study, we adopt a similar approach, which uses a shared layer structure in the neural network.

PROPOSED MULTIVIEW LEARNING FRAMEWORK
In this section, we define the problem of training an end-to-end neural network that can provide multiview embeddings of given inputs and finding the view preferences of workers. 1

Problem Settings
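Given the $N$-sample dataset $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$, we denote by $\mathcal{T} = \{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_M\}$ the triplets annotated by $M$ crowd workers with multiple views, where each element in $\mathcal{T}_m$ corresponds to a double-sided ordered triplet $(i, j, k)$, which implies that worker $m$ considers objects $i$ and $j$ among $i$, $j$, and $k$ to be more similar and can be represented by
$$s_m(x_i, x_j) > s_m(x_i, x_k) \quad \text{and} \quad s_m(x_i, x_j) > s_m(x_j, x_k),$$
where $s_m : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the similarity function of worker $m$. Our goal is to estimate $s_m$ statistically.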
In some other studies [1,8,25], triplet similarity comparison was also defined as asking which of two objects $j$ and $k$ is more similar to a given object $i$. This question has only two answers; either object $j$ or $k$ must be selected. To distinguish between the two definitions, we refer to triplet comparison with two and three answers as one-sided and double-sided triplet comparison, respectively.
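The relation between the two definitions can be sketched in a few lines of Python. The function name `to_one_sided` and the tuple layout `(anchor, closer, farther)` are hypothetical and chosen only for illustration; the sketch relies solely on the two inequalities that a double-sided answer implies.

```python
def to_one_sided(i, j, k):
    """Expand a double-sided triplet (i, j, k), meaning that (i, j) is the
    most similar pair among i, j, and k, into the one-sided triplets it
    implies. Each result is (anchor, closer, farther); this layout is an
    illustrative assumption, not notation from the paper."""
    return [
        (i, j, k),  # s(i, j) > s(i, k): j is closer to anchor i than k is
        (j, i, k),  # s(i, j) > s(j, k): i is closer to anchor j than k is
    ]


print(to_one_sided(0, 1, 2))
```

One double-sided answer therefore carries the information of two one-sided answers that share the same "farther" object.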

Network Architecture
Our proposed framework is shown in Figure 2. The neural network consists of two parts: shared layers $f : \mathcal{X} \to \mathcal{Z}$ and the layers of views $h^{[1]}, h^{[2]}, \ldots, h^{[V]} : \mathcal{Z} \to \mathbb{R}^d$, where $\mathcal{Z}$ is the domain of the hidden embeddings output by the shared layers and $d$ is the number of dimensions of the embeddings. The shared layers extract common representation embeddings of objects. Subsequent layers of views take hidden embeddings obtained from the shared layers as inputs and extract embeddings corresponding to different views to obtain multiview embeddings. The calculation process of the neural network can be written as
$$y_n = \left( h^{[1]}(f(x_n)),\ h^{[2]}(f(x_n)),\ \ldots,\ h^{[V]}(f(x_n)) \right).$$
Our network is based on the ResNet18 architecture [6]. The shared layers are the layers before and including the third residual block in ResNet18, and the layers of views are multiple copies of the fourth residual block and the fully connected layer.
1 Our implementation is available at https://github.com/17bit/multiview_crowdsourcing.

View Selection
Next, we consider how to choose views in a task. Our network outputs multiview embeddings. However, the importance of each view differs in each task. We use weights to measure their importance. If the weight of a view is large, it plays a key role in the given task. When the triplet task is relatively simple, a view can be selected easily. For example, given three images showing "Red O", "Red X", and "Blue P", a worker is more likely to note that "Red O" and "Red X" are similar because their color is the same and all three images have different shapes. However, in more ambiguous tasks, such as the task in Figure 1, choosing either color or dog face shape is acceptable, and depends on the preference of the workers. Therefore, we consider that the weight can be divided into two parts: an inherent weight and a worker preference.
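Inherent Weight: The inherent weight depends only on the three objects, and we use triplet entropy to measure the inherent weight in a triplet task. If the inherent weight of a view is larger, all workers are more likely to select it in a task. Next, we provide an algorithm to calculate the triplet entropy and the inherent weight. Given a triplet $(i, j, k)$, the similarity function $s^{[v]} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ between two objects in view $v$ is computed from the view-$v$ embeddings of the two objects. Next, the probabilities that the similarity of one pair is the largest among the three pairs can be defined as
$$p^{[v]}_{ij} = \frac{\exp\left(s^{[v]}(x_i, x_j)\right)}{\exp\left(s^{[v]}(x_i, x_j)\right) + \exp\left(s^{[v]}(x_i, x_k)\right) + \exp\left(s^{[v]}(x_j, x_k)\right)},$$
and analogously for $p^{[v]}_{ik}$ and $p^{[v]}_{jk}$. Note that $p^{[v]}_{ij} + p^{[v]}_{ik} + p^{[v]}_{jk} \equiv 1$. The triplet entropy of view $v$ can be defined as
$$H^{[v]}_{ijk} = -\left( p^{[v]}_{ij} \log p^{[v]}_{ij} + p^{[v]}_{ik} \log p^{[v]}_{ik} + p^{[v]}_{jk} \log p^{[v]}_{jk} \right).$$
The maximum triplet entropy is $\log 3$ when $p^{[v]}_{ij} = p^{[v]}_{ik} = p^{[v]}_{jk} = \frac{1}{3}$ and $s^{[v]}(x_i, x_j) = s^{[v]}(x_i, x_k) = s^{[v]}(x_j, x_k)$, which implies that the three pairs are equally similar in this view and choosing between them is difficult. In contrast, the lower bound of the triplet entropy is 0, which implies that the similarity of one pair is much higher than that of the other two, and making a choice is easy.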
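The triplet entropy described above can be illustrated with a minimal Python sketch. It assumes that the three pair probabilities are obtained by a softmax over the three pairwise similarities of one view, which is one form consistent with the stated properties (the probabilities sum to 1, and the entropy ranges between 0 and log 3). The function names are hypothetical.

```python
import math


def pair_probabilities(s_ij, s_ik, s_jk):
    """Softmax over the three pairwise similarities of one view.
    ASSUMPTION: the softmax form is chosen for illustration."""
    sims = (s_ij, s_ik, s_jk)
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def triplet_entropy(s_ij, s_ik, s_jk):
    """Shannon entropy (natural log) of the three pair probabilities."""
    probs = pair_probabilities(s_ij, s_ik, s_jk)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Equal similarities: the hardest case, entropy equals log 3.
print(triplet_entropy(1.0, 1.0, 1.0), math.log(3))
# One dominant pair: the easy case, entropy is close to 0.
print(triplet_entropy(50.0, 0.0, 0.0))
```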
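Finally, the inherent weight $w^{\mathrm{inh},[v]}_{ijk}$ of view $v$ of triplet task $(i, j, k)$ is defined as the inverse of the triplet entropy, so that a view with lower triplet entropy receives a larger inherent weight, where $0 \le w^{\mathrm{inh},[v]}_{ijk} < 1$.
Worker Preference: We denote learnable parameters by $W = \{w_1, w_2, \ldots, w_M\}$, where each $w_m = (w^{[1]}_m, \ldots, w^{[V]}_m)$ holds the weights of the $V$ views for one of the $M$ workers. A larger value of $w^{[v]}_m$ implies that worker $m$ prefers view $v$.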
Combining Weights: Next, we consider combining the inherent weights and worker preferences. Ideally, if a triplet task has the same inherent weights for different views, then the final weights are expected to depend on the preferences of the workers. In contrast, if a worker has no preference, then the answer is expected to depend on the inherent weights. To reflect this hypothesis, we add the two weights as
$$w^{[v]}_{m,ijk} = w^{\mathrm{inh},[v]}_{ijk} + w^{[v]}_m.$$
We note that the initialization of $w^{[v]}_m$ follows a uniform distribution between 0 and 1 such that the scales of the two weights are the same. The weights after softmax normalization can be written as
$$\hat{w}^{[v]}_{m,ijk} = \frac{\exp\left(w^{[v]}_{m,ijk}\right)}{\sum_{u=1}^{V} \exp\left(w^{[u]}_{m,ijk}\right)}.$$
Obviously, $0 < \hat{w}^{[v]}_{m,ijk} < 1$ and $\sum_{v=1}^{V} \hat{w}^{[v]}_{m,ijk} = 1$.
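The weight combination can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the exact form of the inherent weight is not recoverable from the text, so the sketch assumes 1 - H / log 3, which decreases with the triplet entropy H and stays within the stated range. All function names are hypothetical.

```python
import math


def inherent_weight(entropy):
    """ASSUMPTION: inherent weight = 1 - H / log 3, one form that is larger
    for lower entropy and lies in [0, 1) for 0 < H <= log 3."""
    return 1.0 - entropy / math.log(3)


def combined_view_weights(entropies, preferences):
    """Add the inherent weight and the worker preference of each view, then
    apply softmax normalization across views.

    entropies:   triplet entropy of each view for one triplet task
    preferences: learnable preference of one worker for each view
    """
    raw = [inherent_weight(h) + w for h, w in zip(entropies, preferences)]
    top = max(raw)  # subtract the maximum for numerical stability
    exps = [math.exp(r - top) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


# Same preference for both views: the view with lower entropy dominates.
print(combined_view_weights([0.1, 1.0], [0.5, 0.5]))
# Same entropy for both views: the preferred view dominates.
print(combined_view_weights([0.5, 0.5], [0.2, 0.8]))
```

The two calls mirror the two limiting cases discussed in the text: with no worker preference the inherent weights decide, and with equal inherent weights the worker preference decides.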

Loss Function
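The similarity between two objects of triplet $(i, j, k)$ given by worker $m$ can be written as the weighted sum of the similarities in each view,
$$s_m(x_i, x_j) = \sum_{v=1}^{V} \hat{w}^{[v]}_{m,ijk}\, s^{[v]}(x_i, x_j).$$
The probability of choosing object $i$ and object $j$ as more similar by worker $m$ can be defined as
$$p_m(i, j \mid i, j, k) = \frac{\exp\left(s_m(x_i, x_j)\right)}{\exp\left(s_m(x_i, x_j)\right) + \exp\left(s_m(x_i, x_k)\right) + \exp\left(s_m(x_j, x_k)\right)}.$$
The loss function can be derived as the following negative log likelihood function of all triplet tasks of all workers:
$$\mathcal{L} = -\sum_{m=1}^{M} \sum_{(i, j, k) \in \mathcal{T}_m} \log p_m(i, j \mid i, j, k).$$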
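A minimal numerical sketch of this loss is given below. It assumes that the worker-level similarity is the weighted sum of the per-view similarities and that the choice probability is a softmax over the three pairs; both are assumptions consistent with the surrounding description, and the function names and input layout are hypothetical.

```python
import math


def pair_choice_probability(view_sims, weights):
    """Probability that the worker selects pair (i, j).

    view_sims: one (s_ij, s_ik, s_jk) tuple per view
    weights:   normalized view weights for this worker and triplet
    ASSUMPTION: weighted sum of view similarities followed by a softmax.
    """
    combined = [
        sum(w * sims[pair] for w, sims in zip(weights, view_sims))
        for pair in range(3)
    ]
    top = max(combined)  # subtract the maximum for numerical stability
    exps = [math.exp(c - top) for c in combined]
    return exps[0] / sum(exps)


def negative_log_likelihood(tasks):
    """Sum of -log p over all answered triplets; each task is a
    (view_sims, weights) pair whose answer was the pair (i, j)."""
    return -sum(math.log(pair_choice_probability(vs, w)) for vs, w in tasks)


# View 1 favors (i, j); view 2 favors (i, k).
sims = [(0.0, -5.0, -5.0), (-5.0, 0.0, -5.0)]
print(pair_choice_probability(sims, [1.0, 0.0]))  # high: view 1 dominates
print(pair_choice_probability(sims, [0.0, 1.0]))  # low: view 2 dominates
```

The example shows why the view weights matter: the same embeddings explain the answer well or poorly depending on which view the worker is assumed to use.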

EXPERIMENTS
First, we performed simulation experiments to verify whether our method could accurately learn the view preferences of simulated workers, because knowing exactly which view a real person uses for each triplet comparison task is challenging. Subsequently, we recruited workers from the crowdsourcing platform Lancers to collect triplet data and used multiple evaluation metrics to test whether our method performed better than baseline methods.

Datasets
We constructed a 10-color MNIST dataset by selecting 2000 images from the MNIST [11] dataset, using 1000 of them as the training set and the other 1000 as the test set. We chose 10 colors from the 12-color wheel by removing Red-orange and Yellow-orange, and used a single color in each image. Each of the training and test sets thus contained 1000 images, with 100 images for each number category (from 0 to 9) and 100 images for each color category (Red, Orange, Yellow, Yellow-green, Green, Blue-green, Blue, Blue-purple, Purple, and Red-purple). There were a total of 100 (number, color) pairs with 10 images in each pair. In the following experiments, we considered two images to belong to the same category if their numbers and colors were both the same.
The Stanford dog dataset [9] was used in the crowdsourcing experiments. The dataset contains 20580 images of dogs from 120 different breeds. We prepared a subset of 1000 images from 20 breeds. There were 50 images in each breed category, half of which were used for the training set and half for the test set. Therefore, there were 500 images in each of the training and test sets.

Simulation Settings
The data in the dog dataset could be interpreted from many uncertain hidden views, e.g., skin color, nose, eyes, and height, and finding specified criteria is challenging. Therefore, we only conducted experiments with human crowdsourcing participants for the dog dataset. For the 10-color MNIST dataset, because the data included only two possible views, we used simulated worker experiments to verify whether our proposed method could accurately find these two views. Our experiment included three different simulation settings.
In simulation setting 1, there were two workers, referred to as worker 1 and worker 2, who made decisions based on color and number, respectively. Taking the worker focusing on color as an example, the worker answered if and only if two images among the three in a triplet comparison task had the same color. Otherwise, the task was an invalid triplet query and was not included in the triplet dataset. The two workers produced 2000 triplets in each of the training and test sets.
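The behavior of a simulated worker in setting 1 can be sketched as follows. The function name is hypothetical, and the sketch assumes that a query is valid only when exactly one pair shares the attribute, so that a triplet in which all three images share it is treated as invalid.

```python
def same_attribute_answer(values):
    """Answer of a setting-1 style worker for one triplet.

    values: the attribute (e.g. the color) of each of the three images.
    Returns the index pair that shares the attribute, or None when the
    query is invalid. ASSUMPTION: exactly one matching pair is required.
    """
    pairs = [(0, 1), (0, 2), (1, 2)]
    matches = [(a, b) for a, b in pairs if values[a] == values[b]]
    return matches[0] if len(matches) == 1 else None


print(same_attribute_answer(["Red", "Red", "Blue"]))     # (0, 1)
print(same_attribute_answer(["Red", "Green", "Blue"]))   # None (invalid)
```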
In simulation setting 2, there were two workers, worker 1 and worker 2, who made decisions based on the distance of color and number, respectively, and the color or the number need not be exactly the same. We defined the distance of the color as the distance in the color wheel (after removing two colors), e.g., d(Red, Yellow-green) = 3. We defined the distance of the numbers as the absolute value of the difference between two numbers, e.g., d(9, 0) = d(0, 9) = 9. These two workers find the two images with the shortest distance among the three images. If two such images do not exist, the task is an invalid triplet query and is not included in the triplet dataset. The two workers produced 2000 triplets in each of the training and test sets.
In simulation setting 3, there were four workers, of which worker 1 and worker 4 were the same as the two workers in setting 2. The remaining two workers made decisions according to the weights of 0.3 for color and 0.7 for number, and 0.7 for color and 0.3 for number, respectively. Each of the four workers produced 1000 triplets in each of the training and test sets. There were 4000 triplets in each of the training and test sets in all three settings.
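The distance-based workers of settings 2 and 3 can be sketched as follows. Several points are assumptions made for illustration: the color distance is taken as the circular distance on the 10-color wheel, the workers of setting 3 are assumed to combine the two distances as a weighted sum, and a query is treated as invalid when the shortest distance is not unique. All names are hypothetical.

```python
import math

# The 10 colors in wheel order, as listed in the dataset description.
COLORS = ["Red", "Orange", "Yellow", "Yellow-green", "Green",
          "Blue-green", "Blue", "Blue-purple", "Purple", "Red-purple"]


def color_distance(a, b):
    """ASSUMPTION: circular distance on the reduced color wheel."""
    diff = abs(COLORS.index(a) - COLORS.index(b))
    return min(diff, len(COLORS) - diff)


def number_distance(a, b):
    """Absolute difference between two digits."""
    return abs(a - b)


def simulated_answer(images, color_weight, number_weight):
    """Pick the pair with the shortest weighted distance.

    images: three (number, color) tuples.
    Returns an index pair, or None if the shortest distance is not unique.
    ASSUMPTION: the two distances are combined as a weighted sum.
    """
    pairs = [(0, 1), (0, 2), (1, 2)]
    dists = [
        color_weight * color_distance(images[a][1], images[b][1])
        + number_weight * number_distance(images[a][0], images[b][0])
        for a, b in pairs
    ]
    best = min(dists)
    winners = [p for p, d in zip(pairs, dists)
               if math.isclose(d, best, abs_tol=1e-9)]
    return winners[0] if len(winners) == 1 else None


triplet = [(3, "Red"), (8, "Orange"), (3, "Green")]
print(simulated_answer(triplet, 1.0, 0.0))  # color-only worker
print(simulated_answer(triplet, 0.0, 1.0))  # number-only worker
print(simulated_answer(triplet, 0.3, 0.7))  # mixed worker of setting 3
```

Setting the weights to (1, 0) or (0, 1) gives the two workers of setting 2, so one function covers both settings under these assumptions.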
The difficulty of the three settings was gradually increased to verify the performance of the proposed method in different situations. Table 2 gives two examples of how simulated workers choose the more similar pair among three images (see also the appendix).

Real Crowdsourcing Setting
Eighty workers were recruited for each of the 10-color MNIST dataset and the dog dataset. Each worker was required to complete 100 random triplet comparison tasks, as shown in Fig. 3, for a reward of JPY220 on the Lancers platform; the images of 50 tasks were taken from the training set and those of the other 50 tasks from the test set. To avoid bias, we did not filter the data, except for workers who completed the tasks within an extremely short period, which strongly suggested low-quality submissions. Workers completed the tasks in roughly 10 minutes on average. The hourly pay was about JPY1300, which exceeds the minimum hourly wage in Japan, about JPY1000. We consider that if the choice of a given worker is inconsistent with that of the majority, this is because their view is in the minority, not because their choice is wrong.

Evaluation Metrics
Accuracy refers to the proportion of cases in which the network successfully assigns the highest similarity to the correct pair among the three objects for the triplet data. We used triplet data to train the network, but our goal was to train a neural network to obtain multiview embeddings, which means that we needed to evaluate the performance of the embeddings Y without using triplet data T. Therefore, the accuracy on the test set was not applicable. Several different evaluation metrics can be considered.
Clustering Evaluation: We clustered the embeddings of the test set without labels and evaluated the results. In our experiments, we used k-means and agglomerative clustering methods, with purity and normalized mutual information (NMI) as evaluation metrics.
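As an illustration of one of these metrics, purity can be computed as follows. This is a minimal sketch using the standard definition of purity (the fraction of samples that belong to the majority class of their cluster); the function name is hypothetical, and the cluster assignments are assumed to come from any clustering method.

```python
from collections import Counter


def purity(cluster_ids, labels):
    """Fraction of samples whose label equals the majority label of the
    cluster they were assigned to."""
    members = {}
    for cluster, label in zip(cluster_ids, labels):
        members.setdefault(cluster, []).append(label)
    majority_total = sum(
        Counter(cluster_labels).most_common(1)[0][1]
        for cluster_labels in members.values()
    )
    return majority_total / len(labels)


# Cluster 0 holds two "a" and one "b"; cluster 1 holds one "b".
print(purity([0, 0, 0, 1], ["a", "a", "b", "b"]))  # 0.75
```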
Linear Evaluation: We evaluated the performance of representation learning models in fine-tuning tasks [2,17,28]. A linear layer was trained on the embeddings and ground truth labels of the training set. Then, the linear layer was applied to the embeddings of the test set to predict the labels. The accuracy of classification was the evaluation score.
Table 2: Two examples for three simulation settings. The left part and the right part refer to two different triplet comparison tasks of three colored numbers. The column "d(i, ii)" refers to the distance between the first and the second number in the triplet query, and likewise for the remaining columns. The row "Triplet Query" refers to triplet comparison tasks, which give three images and ask which two of them are more similar.

K-anchors Evaluation: We evaluated the performance of predicting labels when only the labels of a small number of samples were known. K samples were randomly selected from each category as anchors in the test set. We then used the anchors to predict labels in the test set. The prediction for each sample was the category of the anchor sample with the minimum Euclidean distance. The accuracy of classification was the evaluation score.
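The procedure can be sketched in a few lines of Python. The function name, the seeding, and the choice to score all test samples (including the anchors themselves) are assumptions made for illustration.

```python
import math
import random


def k_anchors_accuracy(embeddings, labels, k, seed=0):
    """Pick k random anchors per category, then label every sample with the
    category of its nearest anchor (Euclidean distance).

    ASSUMPTION: anchors are also scored; every category needs at least k
    samples, otherwise random.sample raises ValueError.
    """
    rng = random.Random(seed)
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    anchors = [i for indices in by_label.values()
               for i in rng.sample(indices, k)]
    correct = 0
    for embedding, label in zip(embeddings, labels):
        nearest = min(anchors,
                      key=lambda a: math.dist(embedding, embeddings[a]))
        correct += labels[nearest] == label
    return correct / len(labels)


# Two well-separated groups: any choice of anchors classifies them correctly.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(k_anchors_accuracy(points, [0, 0, 1, 1], k=1))
```

Because the score depends on which anchors are drawn, averaging over several seeds gives a more stable estimate.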

Experimental Results
Our proposed method with 2 views performed better than the baselines on both the simulation experiments and the real crowdsourcing experiments in all the evaluation metrics, as shown in Tables 3 and 4, respectively. Experiments with a single view were used as the baseline, for which the neural network architecture was exactly the same as the common ResNet18. The results show that, compared to the single view, setting the number of views to 2 led to a significant improvement. Next, we discuss cases in which we used more views. Table 5 indicates that increasing the number of views further improved performance slightly, but not as much as introducing multiple views, i.e., increasing the number of views from 1 to 2.
Figure 4 shows the results of visualization using t-SNE [24] for the 10-color MNIST dataset. When the number of views was 1, samples were clustered by color in all three simulation settings, and by number in the real crowdsourcing experiments. None were clustered by color and number at the same time. However, with two views, views 1 and 2 corresponded to color and number, respectively, in simulation setting 1, and to number and color, respectively, in simulation setting 2, simulation setting 3, and the real crowdsourcing experiments. The embeddings also learned the distance relationships given by workers. For example, it may be observed that numbers were roughly ordered from 0 to 9 in view 1 (see the third row and the third column in Figure 4) and the color distribution roughly matched the 12-color wheel in view 2 (see the third row and the fourth column in Figure 4) in simulation setting 3. It may also be observed that 0s (samples of 0 in the figure) and 8s were close, and that 1s and 7s were close, in the real crowdsourcing experiments (see the fourth row and the third column in Figure 4), which reflects the decision standard of the crowdsourcing workers.
Table 6 indicates that our proposed model learned the settings of worker preferences in simulation experiments. For example, worker 2 used the weight of color as 0.7, and weight of number as 0.3 in simulation setting 3. It may be observed that view 1 corresponded

CONCLUSIONS
In this study, we investigated a new end-to-end framework designed to learn multiview representation embeddings from crowdsourced triplet data. Based on the hypothesis that different crowd workers may have different views and the same crowd worker may choose different views in different tasks, we adopted triplet entropy and worker models to give different views different weights.
We recruited 160 crowd workers in total to conduct experiments using two datasets. The results demonstrated that our proposed method performed better in terms of multiple evaluation metrics on both simulated worker experiments and human crowdsourcing experiments using two datasets. In our experiments, we chose ResNet18 as a baseline for comparison with our proposed approach. In future research, other network structures should be selected as baselines, and their performance with and without the multiview method should be compared. Moreover, we confirmed that our multiview embeddings focused on different attributes of objects separately on MNIST with color and learned the preference of workers in the simulation experiments, as shown in Figure 4. However, further study is needed to investigate the semantic meaning of multiview embeddings in datasets with ambiguous views, such as the dog dataset considered here. Better performance could be achieved on our experimental datasets by setting the number of views, a hyperparameter, to be greater than or equal to 2. However, methods of setting the number of views should also be investigated.
Our proposed crowd multiview method could become a typical solution for many other tasks in which humans might have multiple different views. For example, consider a user choosing a preferred movie from several presented movies when the user's other favorite movies are known. That is, the newly preferred movie might be more similar to the previous favorite movies than to the other presented movies. Such movie preference data are another kind of relative comparison. It is possible that users have multiple views when considering which movies are similar. Our method could be adopted by modifying the definition of relative comparison and adding multiple branches to other neural network structures.

Figure 1: An example of triplet similarity comparison with three samples. Crowd workers are asked to select two similar images among the given three images. In this example, there are two possible views: color and face.

Figure 2: Architecture of multiview representation learning framework.

Figure 3: The crowdsourcing task of triplet similarity comparison. Three images are put in a triangle shape to avoid the distance between them affecting decisions of crowd workers.

Figure 4: t-SNE visualization of embeddings from the 10-color MNIST dataset. Each number in the figure corresponds to a sample in the test set (zoom in to see colored numbers clearly). The second column shows the global visualization of the experiment with two views, and the third and fourth columns show Views 1 and 2, respectively. For the second column, the dimensions of the embeddings of each view were reduced from $d$ to 1. The horizontal and vertical axes indicate Views 1 and 2, respectively. For the first, third, and fourth columns, the dimensions of the embeddings were reduced from $d$ to 2. The horizontal and vertical axes indicate component 1 and component 2, respectively.

Table 1: Differences between the present work and existing methods. This table indicates the novelty and contribution of this work. "Multiple crowd workers" refers to whether crowdsourcing is involved and whether the difference between workers is considered. "Multiview" refers to whether data are processed in multiple views. "Annotation type" refers to the ground-truth data in supervised learning. "Output" refers to the final desired result. "Inference type" refers to the mode of learning: inductive learning refers to general learning tasks, in which general patterns are learned from training data and applied to testing data, and transductive learning refers to reasoning from training data to specific testing data.

Table 3: Results on simulation experiments on the 10-color MNIST test set. A larger value indicates better performance.

Table 4: Results on real crowdsourcing experiments on the 10-color MNIST and Dog test sets.

Table 5: Results of different numbers of views on the Dog test set.

Table 6: Results of how much workers prefer view 1 on simulation experiments on 10-color MNIST. The value of worker