TY - JOUR
T1 - Imbalanced multiple noisy labeling
AU - Zhang, Jing
AU - Wu, Xindong
AU - Sheng, Victor S.
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2015/2/1
Y1 - 2015/2/1
AB - Multiple noisy labels for the same object can easily be collected via Internet-based crowdsourcing systems. Labelers may be biased when labeling because of a lack of expertise or dedication, or because of personal preference. These factors cause Imbalanced Multiple Noisy Labeling. In most cases, we have no information about the labeling qualities of labelers or about the underlying class distributions, so it is important to design agnostic solutions that utilize these noisy labels for supervised learning. We first investigate how imbalanced multiple noisy labeling affects the class distributions of training sets and the performance of classification. Then, an agnostic algorithm, Positive LAbel frequency Threshold (PLAT), is proposed to deal with the imbalanced labeling issue. Simulations on eight UCI data sets with different underlying class distributions show that PLAT not only effectively handles imbalanced multiple noisy labeling problems that off-the-shelf agnostic methods cannot cope with, but also performs nearly the same as majority voting when there is no imbalance. We also apply PLAT to eight real-world data sets with imbalanced labels collected from Amazon Mechanical Turk; the experimental results show that PLAT is efficient and outperforms other ground truth inference algorithms.
KW - Imbalanced noisy labeling
KW - crowdsourcing
KW - imbalanced learning
KW - repeated labeling
UR - http://www.scopus.com/inward/record.url?scp=84920118410&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2014.2327039
DO - 10.1109/TKDE.2014.2327039
M3 - Article
AN - SCOPUS:84920118410
VL - 27
SP - 489
EP - 503
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
SN - 1041-4347
IS - 2
M1 - 6823124
ER -