Extracting structured data from web pages with maximum entropy segmental Markov model

Susan Mengel, Yaoquin Jing

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Automated techniques can help to extract information from the Web. A new semi-automatic approach based on the maximum entropy segmental Markov model, therefore, is proposed to extract structured data from Web pages. It is motivated by two ideas: modeling sequences embedding structured data instead of their context to reduce the number of training Web pages and preventing the generation of too specific or too general models from the training data. The experimental results show that this approach has better performance than Stalker when only one training Web page is provided.

Original language: English
Title of host publication: Web Information Systems Engineering - WISE 2009
Subtitle of host publication: 10th International Conference, Proceedings
Pages: 219-226
Number of pages: 8
State: Published - 2009
Event: 10th International Conference on Web Information Systems Engineering, WISE 2009 - Poznan, Poland
Duration: Oct 5 2009 to Oct 7 2009

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5802 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 10th International Conference on Web Information Systems Engineering, WISE 2009
Country: Poland
City: Poznan
Period: 10/5/09 to 10/7/09

Keywords

  • HTML extraction
  • Markov Model
