Get another label? Improving data quality and data mining using multiple, noisy labelers

Victor S. Sheng; Foster Provost; Panagiotis G. Ipeirotis

doi:10.1145/1401890.1401965

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

895 Scopus citations

Abstract

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
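Result (i) above can be illustrated with a small calculation. The sketch below is an illustration only and rests on assumptions the abstract does not state: labels are binary, every labeler is independently correct with the same probability p, and the repeated labels for an item are merged by simple majority vote. The function name majority_vote_quality is hypothetical. Under those assumptions, the probability that the merged label is correct is a binomial tail sum.

```python
from math import comb


def majority_vote_quality(p: float, n_labels: int) -> float:
    """Probability that a majority vote over n_labels noisy labels is correct.

    Assumptions (illustrative, not taken from the abstract): binary labels,
    independent labelers, and every labeler correct with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n_labels < 1 or n_labels % 2 == 0:
        raise ValueError("use a positive odd number of labels so ties cannot occur")
    # The majority is correct when more than half of the labels are correct.
    k_min = n_labels // 2 + 1
    return sum(
        comb(n_labels, k) * p**k * (1.0 - p) ** (n_labels - k)
        for k in range(k_min, n_labels + 1)
    )


if __name__ == "__main__":
    # Compare single labeling (1 label) with repeated labeling (3, 5, 11 labels).
    for p in (0.4, 0.5, 0.7, 0.9):
        row = ", ".join(
            f"{n} labels: {majority_vote_quality(p, n):.3f}" for n in (1, 3, 5, 11)
        )
        print(f"p = {p}: {row}")
```

With p = 0.7, three labels give a merged-label quality of roughly 0.784, higher than a single label. With p = 0.5 the quality stays at 0.5 however many labels are collected, and with p below 0.5 it gets worse, which matches the "but not always" caveat in result (i).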

Original language	English
Title of host publication	KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Pages	614-622
Number of pages	9
DOIs	https://doi.org/10.1145/1401890.1401965
State	Published - 2008
Event	14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008 - Las Vegas, NV, United States
Duration	Aug 24 2008 → Aug 27 2008

Publication series

Name	Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference	14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008
Country/Territory	United States
City	Las Vegas, NV
Period	08/24/08 → 08/27/08

Keywords

Data preprocessing
Data selection

Cite this

Sheng, V. S., Provost, F., & Ipeirotis, P. G. (2008). Get another label? Improving data quality and data mining using multiple, noisy labelers. In KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 614-622). (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining). https://doi.org/10.1145/1401890.1401965

Sheng, Victor S.; Provost, Foster; Ipeirotis, Panagiotis G. / Get another label? Improving data quality and data mining using multiple, noisy labelers. KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008. pp. 614-622 (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining).

@inproceedings{3c1ec21fb1904279ba7ceca2c6622c89,

title = "Get another label? Improving data quality and data mining using multiple, noisy labelers",

abstract = "This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.",

keywords = "Data preprocessing, Data selection",

author = "Sheng, {Victor S.} and Provost, Foster and Ipeirotis, {Panagiotis G.}",

year = "2008",

doi = "10.1145/1401890.1401965",

language = "English",

isbn = "9781605581934",

series = "Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",

pages = "614--622",

booktitle = "KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",

note = "14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008 ; Conference date: 24-08-2008 Through 27-08-2008",

}

Sheng, VS, Provost, F & Ipeirotis, PG 2008, Get another label? Improving data quality and data mining using multiple, noisy labelers. in KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 614-622, 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008, Las Vegas, NV, United States, 08/24/08. https://doi.org/10.1145/1401890.1401965

Get another label? Improving data quality and data mining using multiple, noisy labelers. / Sheng, Victor S.; Provost, Foster; Ipeirotis, Panagiotis G.
KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008. p. 614-622 (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining).

TY - GEN

T1 - Get another label? Improving data quality and data mining using multiple, noisy labelers

AU - Sheng, Victor S.

AU - Provost, Foster

AU - Ipeirotis, Panagiotis G.

PY - 2008

Y1 - 2008

N2 - This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.

AB - This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.

KW - Data preprocessing

KW - Data selection

UR - http://www.scopus.com/inward/record.url?scp=65449144451&partnerID=8YFLogxK

U2 - 10.1145/1401890.1401965

DO - 10.1145/1401890.1401965

M3 - Conference contribution

AN - SCOPUS:65449144451

SN - 9781605581934

T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

SP - 614

EP - 622

BT - KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

T2 - 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008

Y2 - 24 August 2008 through 27 August 2008

ER -

Sheng VS, Provost F, Ipeirotis PG. Get another label? Improving data quality and data mining using multiple, noisy labelers. In KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008. p. 614-622. (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining). doi: 10.1145/1401890.1401965
