Record-boundary discovery in Web documents

Bibliographic Details
Published in:	Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data 1999-06, Vol.28 (2), p.467-478
Main Authors:	Embley, D. W.; Jiang, Y.; Ng, Y.-K.
Format:	Article
Language:	English
Subjects:	Abstraction; Applied computing; Artificial intelligence; Computing methodologies; Document management and text processing; Document preparation; Document representation; Heuristic function construction; Human computer interaction (HCI); Human-centered computing; Information retrieval; Information systems; Interaction paradigms; Program reasoning; Search methodologies; Semantics and reasoning; Theory of computation; Web applications; Web services; Web-based interaction; World Wide Web

Description
Summary:	Extraction of information from unstructured or semistructured Web documents often requires a recognition and delimitation of records. (By "record" we mean a group of information relevant to some entity.) Without first chunking documents that contain multiple records according to record boundaries, extraction of record information will not likely succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (runs linearly for practical cases within the context of the larger data-extraction problem) and accurate (100% in the experiments we conducted).
ISSN:	0163-5808; 1943-5835
DOI:	10.1145/304181.304223
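
Illustrative sketch: The pipeline outlined in the Summary (build a tree of nested HTML tags, locate the subtree that holds the records, score candidate separator tags with independent heuristics, pick a consensus tag) can be approximated as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The record does not name the five heuristics or the combination rule, so the code uses three simplified stand-ins (tag count, spread of text size per segment, and a hypothetical priority list) combined by a plain rank sum. Choosing the subtree by highest fan-out is also an assumption, and every name here (`TagTreeBuilder`, `choose_separator`, `LIKELY_SEPARATORS`, `SAMPLE_HTML`) is hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from statistics import pstdev

VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}
# Hypothetical priority list of tags often placed between records (an assumption).
LIKELY_SEPARATORS = ["hr", "tr", "li", "p", "br"]


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of nested tags; text becomes '#text' leaf nodes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void tags never contain children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None:  # stray closing tags are ignored
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node("#text", self.current, text))


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def text_size(node):
    return len(node.text) + sum(text_size(c) for c in node.children)


def elements(node):
    return [c for c in node.children if c.tag != "#text"]


def record_subtree(root):
    # Assumption: the records sit under the node with the highest fan-out.
    return max(walk(root), key=lambda n: len(elements(n)))


def segment_spread(subtree, tag):
    """Std. deviation of text size per segment when splitting at `tag`."""
    sizes = []
    for child in subtree.children:
        if child.tag == tag:
            sizes.append(0)  # each occurrence of the tag opens a new segment
        if sizes:
            sizes[-1] += text_size(child)
    return pstdev(sizes) if len(sizes) > 1 else float("inf")


def choose_separator(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    subtree = record_subtree(builder.root)
    counts = Counter(c.tag for c in elements(subtree))
    if not counts:
        return None  # no tags at all, so no candidate separator
    spread = {t: segment_spread(subtree, t) for t in counts}

    def combined_rank(tag):
        by_count = sum(counts[o] > counts[tag] for o in counts)   # more is better
        by_spread = sum(spread[o] < spread[tag] for o in counts)  # even is better
        by_list = (LIKELY_SEPARATORS.index(tag)
                   if tag in LIKELY_SEPARATORS else len(LIKELY_SEPARATORS))
        return (by_count + by_spread + by_list, tag)  # tag name breaks ties

    return min(counts, key=combined_rank)


SAMPLE_HTML = """<html><body><h1>Obituaries</h1>
<div>
<hr><b>Ann Lee</b><br>died on May 1, aged 80.
<hr><b>Bob Ray</b><br>died on May 3, aged 79.
<hr><b>Cy Dunn</b><br>died on May 7, aged 91.
</div></body></html>"""

print(choose_separator(SAMPLE_HTML))  # hr
```

In the sample, `hr`, `b`, and `br` each occur three times, so the count heuristic alone cannot decide. The spread heuristic rules out `br`, and the priority list then separates `hr` from `b`. That is the reason for combining several weak signals instead of trusting any single one.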