A hybrid machine-crowdsourcing approach for web table matching and cleaning

Chunhua Li; Pengpeng Zhao; Victor S. Sheng; Zhixu Li; Guanfeng Liu; Jian Wu; Zhiming Cui

doi:10.1007/978-3-319-39958-4_11

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Table matching and data cleaning are two crucial activities in integrating data from different web tables, which have traditionally been considered as separate activities. We show that data cleaning can effectively help us discover table matches, and vice versa. In this paper, we study a hybrid machine-crowdsourcing approach to handle the two activities together with a well-developed knowledge base. Understanding the semantics of tables is fundamental to both matching and cleaning. We select the most valuable columns for crowdsourcing validation and infer others by consolidating crowdsourcing results and machine-generated results. When resolving inconsistency between data and semantics, relative trust is taken into account to validate data or semantics via crowd. Our experimental results show the effectiveness of the proposed approach for matching and cleaning web tables using real-life datasets.

Original language	English
Title of host publication	Web-Age Information Management - 17th International Conference, WAIM 2016, Proceedings
Editors	Bin Cui, Xiang Lian, Dexi Liu, Nan Zhang, Jianliang Xu
Publisher	Springer-Verlag
Pages	132-144
Number of pages	13
ISBN (Print)	9783319399577
DOIs	https://doi.org/10.1007/978-3-319-39958-4_11
State	Published - 2016
Event	17th International Conference on Web-Age Information Management, WAIM 2016 - Nanchang, China Duration: Jun 3 2016 → Jun 5 2016

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	9659
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	17th International Conference on Web-Age Information Management, WAIM 2016
Country/Territory	China
City	Nanchang
Period	Jun 3 2016 → Jun 5 2016

Keywords

Crowdsourcing
Data cleaning
Table matching

Access to Document

10.1007/978-3-319-39958-4_11

Cite this

Li, C., Zhao, P., Sheng, V. S., Li, Z., Liu, G., Wu, J., & Cui, Z. (2016). A hybrid machine-crowdsourcing approach for web table matching and cleaning. In B. Cui, X. Lian, D. Liu, N. Zhang, & J. Xu (Eds.), Web-Age Information Management - 17th International Conference, WAIM 2016, Proceedings (pp. 132-144). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9659). Springer-Verlag. https://doi.org/10.1007/978-3-319-39958-4_11

Li, Chunhua ; Zhao, Pengpeng ; Sheng, Victor S. et al. / A hybrid machine-crowdsourcing approach for web table matching and cleaning. Web-Age Information Management - 17th International Conference, WAIM 2016, Proceedings. editor / Bin Cui ; Xiang Lian ; Dexi Liu ; Nan Zhang ; Jianliang Xu. Springer-Verlag, 2016. pp. 132-144 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{6aa8a48ce7714cb390ee4f135fc19672,
title = "A hybrid machine-crowdsourcing approach for web table matching and cleaning",
abstract = "Table matching and data cleaning are two crucial activities in integrating data from different web tables, which have traditionally been considered as separate activities. We show that data cleaning can effectively help us discover table matches, and vice versa. In this paper, we study a hybrid machine-crowdsourcing approach to handle the two activities together with a well-developed knowledge base. Understanding the semantics of tables is fundamental to both matching and cleaning. We select the most valuable columns for crowdsourcing validation and infer others by consolidating crowdsourcing results and machine-generated results. When resolving inconsistency between data and semantics, relative trust is taken into account to validate data or semantics via crowd. Our experimental results show the effectiveness of the proposed approach for matching and cleaning web tables using real-life datasets.",
keywords = "Crowdsourcing, Data cleaning, Table matching",
author = "Chunhua Li and Pengpeng Zhao and Sheng, {Victor S.} and Zhixu Li and Guanfeng Liu and Jian Wu and Zhiming Cui",
note = "Publisher Copyright: {\textcopyright} Springer International Publishing Switzerland 2016.; 17th International Conference on Web-Age Information Management, WAIM 2016 ; Conference date: 03-06-2016 Through 05-06-2016",
year = "2016",
doi = "10.1007/978-3-319-39958-4_11",
language = "English",
isbn = "9783319399577",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
volume = "9659",
publisher = "Springer-Verlag",
pages = "132--144",
editor = "Bin Cui and Xiang Lian and Dexi Liu and Nan Zhang and Jianliang Xu",
booktitle = "Web-Age Information Management - 17th International Conference, WAIM 2016, Proceedings",
}

Li, C, Zhao, P, Sheng, VS, Li, Z, Liu, G, Wu, J & Cui, Z 2016, A hybrid machine-crowdsourcing approach for web table matching and cleaning. in B Cui, X Lian, D Liu, N Zhang & J Xu (eds), Web-Age Information Management - 17th International Conference, WAIM 2016, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9659, Springer-Verlag, pp. 132-144, 17th International Conference on Web-Age Information Management, WAIM 2016, Nanchang, China, 06/3/16. https://doi.org/10.1007/978-3-319-39958-4_11

A hybrid machine-crowdsourcing approach for web table matching and cleaning. / Li, Chunhua; Zhao, Pengpeng; Sheng, Victor S. et al.
Web-Age Information Management - 17th International Conference, WAIM 2016, Proceedings. ed. / Bin Cui; Xiang Lian; Dexi Liu; Nan Zhang; Jianliang Xu. Springer-Verlag, 2016. p. 132-144 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9659).

TY - GEN

T1 - A hybrid machine-crowdsourcing approach for web table matching and cleaning

AU - Li, Chunhua

AU - Zhao, Pengpeng

AU - Sheng, Victor S.

AU - Li, Zhixu

AU - Liu, Guanfeng

AU - Wu, Jian

AU - Cui, Zhiming

N1 - Publisher Copyright: © Springer International Publishing Switzerland 2016.

PY - 2016

Y1 - 2016

AB - Table matching and data cleaning are two crucial activities in integrating data from different web tables, which have traditionally been considered as separate activities. We show that data cleaning can effectively help us discover table matches, and vice versa. In this paper, we study a hybrid machine-crowdsourcing approach to handle the two activities together with a well-developed knowledge base. Understanding the semantics of tables is fundamental to both matching and cleaning. We select the most valuable columns for crowdsourcing validation and infer others by consolidating crowdsourcing results and machine-generated results. When resolving inconsistency between data and semantics, relative trust is taken into account to validate data or semantics via crowd. Our experimental results show the effectiveness of the proposed approach for matching and cleaning web tables using real-life datasets.

KW - Crowdsourcing

KW - Data cleaning

KW - Table matching

UR - http://www.scopus.com/inward/record.url?scp=84976611819&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-39958-4_11

DO - 10.1007/978-3-319-39958-4_11

M3 - Conference contribution

AN - SCOPUS:84976611819

SN - 9783319399577

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 132

EP - 144

BT - Web-Age Information Management - 17th International Conference, WAIM 2016, Proceedings

A2 - Cui, Bin

A2 - Lian, Xiang

A2 - Liu, Dexi

A2 - Zhang, Nan

A2 - Xu, Jianliang

PB - Springer-Verlag

T2 - 17th International Conference on Web-Age Information Management, WAIM 2016

Y2 - 3 June 2016 through 5 June 2016

ER -

Li C, Zhao P, Sheng VS, Li Z, Liu G, Wu J et al. A hybrid machine-crowdsourcing approach for web table matching and cleaning. In Cui B, Lian X, Liu D, Zhang N, Xu J, editors, Web-Age Information Management - 17th International Conference, WAIM 2016, Proceedings. Springer-Verlag. 2016. p. 132-144. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-39958-4_11
