Repeated labeling using multiple noisy labelers

Panagiotis G. Ipeirotis; Foster Provost; Victor S. Sheng; Jing Wang

doi:10.1007/s10618-013-0306-1

Computer Science

Research output: Contribution to journal › Article › peer-review

121 Scopus citations

Abstract

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction of predictive models. With the outsourcing of small tasks becoming easier, for example via Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a set of robust techniques that combine different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
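Result (i) can be illustrated with a small calculation. The sketch below is not taken from the paper. It assumes binary labels, labelers who err independently with the same per-label accuracy `p`, and integration of an odd number `n` of labels by simple majority vote. The function name `majority_vote_quality` is hypothetical. Under those assumptions, the probability that the majority label is correct is a binomial tail sum.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that the majority of n labels is correct.

    Assumptions (illustrative, not from the source): binary labels,
    independent labelers, each label correct with probability p, n odd.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # Sum the binomial probabilities of getting more than n/2 correct labels.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    # Compare labelers that are worse than, equal to, and better than chance.
    for p in (0.4, 0.5, 0.7):
        qualities = [round(majority_vote_quality(p, n), 3) for n in (1, 3, 5, 11)]
        print(f"p={p}: {qualities}")
```

In this simplified model, additional labels raise the integrated quality only when `p` exceeds 0.5. At `p = 0.5` they change nothing, and below 0.5 they make the majority label worse, which is one way to read "but not always".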

Original language	English
Pages (from-to)	402-441
Number of pages	40
Journal	Data Mining and Knowledge Discovery
Volume	28
Issue number	2
DOIs	https://doi.org/10.1007/s10618-013-0306-1
State	Published - Mar 2014

Keywords

Active learning
Classification
Data preprocessing
Data selection
Human computation
Repeated labeling
Selective labeling

Cite this

@article{720a9e2696284448927efd93117b46c3,

title = "Repeated labeling using multiple noisy labelers",

abstract = "This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction of predictive models. With the outsourcing of small tasks becoming easier, for example via Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a set of robust techniques that combine different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.",

keywords = "Active learning, Classification, Data preprocessing, Data selection, Human computation, Repeated labeling, Selective labeling",

author = "Ipeirotis, {Panagiotis G.} and Foster Provost and Sheng, {Victor S.} and Jing Wang",

note = "Funding Information: This work was supported by the National Science Foundation under Grant No. IIS-0643846 and IIS-1115417, by an NSERC Postdoctoral Fellowship, by an NEC Faculty Fellowship, by a Google Focused Award, and a George Kellner Fellowship. Thanks to Carla Brodley, John Langford, and Sanjoy Dasgupta for enlightening discussions and comments.",

year = "2014",

month = mar,

doi = "10.1007/s10618-013-0306-1",

language = "English",

volume = "28",

pages = "402--441",

journal = "Data Mining and Knowledge Discovery",

issn = "1384-5810",

number = "2",

}

TY - JOUR

T1 - Repeated labeling using multiple noisy labelers

AU - Ipeirotis, Panagiotis G.

AU - Provost, Foster

AU - Sheng, Victor S.

AU - Wang, Jing

N1 - Funding Information: This work was supported by the National Science Foundation under Grant No. IIS-0643846 and IIS-1115417, by an NSERC Postdoctoral Fellowship, by an NEC Faculty Fellowship, by a Google Focused Award, and a George Kellner Fellowship. Thanks to Carla Brodley, John Langford, and Sanjoy Dasgupta for enlightening discussions and comments.

PY - 2014/3

Y1 - 2014/3

AB - This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction of predictive models. With the outsourcing of small tasks becoming easier, for example via Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a set of robust techniques that combine different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.

KW - Active learning

KW - Classification

KW - Data preprocessing

KW - Data selection

KW - Human computation

KW - Repeated labeling

KW - Selective labeling

UR - http://www.scopus.com/inward/record.url?scp=84893707563&partnerID=8YFLogxK

U2 - 10.1007/s10618-013-0306-1

DO - 10.1007/s10618-013-0306-1

M3 - Article

AN - SCOPUS:84893707563

SN - 1384-5810

VL - 28

SP - 402

EP - 441

JO - Data Mining and Knowledge Discovery

JF - Data Mining and Knowledge Discovery

IS - 2

ER -
