TY - JOUR
T1 - Multi-Class Ground Truth Inference in Crowdsourcing with Clustering
AU - Zhang, Jing
AU - Sheng, Victor S.
AU - Wu, Jian
AU - Wu, Xindong
N1 - Funding Information:
This research has been supported by the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China, under grant IRT13059, the National 973 Program of China under grant 2013CB329604, the National Natural Science Foundation of China under grants 61229301 and 61402311, the start-up funding of Nanjing University of Science and Technology, and the US National Science Foundation under grant IIS-1115417.
Publisher Copyright:
© 2015 IEEE.
PY - 2016/4/1
Y1 - 2016/4/1
N2 - Due to the low quality of crowdsourced labelers, the integrated label of each example is usually inferred from the multiple noisy labels provided by different labelers. This paper proposes a novel algorithm, Ground Truth Inference using Clustering (GTIC), to improve the quality of integrated labels for multi-class labeling. For a K-class labeling task, GTIC uses the multiple noisy label sets of the examples to generate features. It then uses a K-Means algorithm to cluster all examples into K groups, each of which is mapped to a specific class. Examples in the same cluster are assigned the corresponding class label. We compare GTIC with four existing multi-class ground truth inference algorithms, majority voting (MV), Dawid & Skene's (DS), ZenCrowd (ZC), and Spectral DS (SDS), on one synthetic and eight real-world datasets. Experimental results show that GTIC significantly outperforms the others in terms of both accuracy and M-AUC. In addition, GTIC runs about twenty times faster than the complicated EM-based inference algorithms.
AB - Due to the low quality of crowdsourced labelers, the integrated label of each example is usually inferred from the multiple noisy labels provided by different labelers. This paper proposes a novel algorithm, Ground Truth Inference using Clustering (GTIC), to improve the quality of integrated labels for multi-class labeling. For a K-class labeling task, GTIC uses the multiple noisy label sets of the examples to generate features. It then uses a K-Means algorithm to cluster all examples into K groups, each of which is mapped to a specific class. Examples in the same cluster are assigned the corresponding class label. We compare GTIC with four existing multi-class ground truth inference algorithms, majority voting (MV), Dawid & Skene's (DS), ZenCrowd (ZC), and Spectral DS (SDS), on one synthetic and eight real-world datasets. Experimental results show that GTIC significantly outperforms the others in terms of both accuracy and M-AUC. In addition, GTIC runs about twenty times faster than the complicated EM-based inference algorithms.
KW - Clustering
KW - EM algorithm
KW - crowdsourcing
KW - ground truth inference
KW - multi-class labeling
UR - http://www.scopus.com/inward/record.url?scp=84963731561&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2015.2504974
DO - 10.1109/TKDE.2015.2504974
M3 - Article
AN - SCOPUS:84963731561
SN - 1041-4347
VL - 28
SP - 1080
EP - 1085
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 4
M1 - 7345572
ER -