An Automatic Label Extraction Technique for Domain-Specific Hidden Web Crawling (LEHW)

Bibliographic Details
Main Authors:	El-Desouky, A.I., Ali, H.A., El-Ghamrawy, S.M.
Format:	Conference Proceeding
Language:	English
Subjects:	Automatic control; Crawlers; Crawling; Data mining; Filling; Hidden Web; HTML; HTML search forms; Query generation; Radio control; Search engines; Technological innovation; Uniform resource locators; Web information extraction; Web pages

Description
Summary:	General-purpose search engines (e.g. Google and Yahoo) ignore valuable data that represent 80% of the content on the Web; this portion of the Web is called the hidden Web (HW). Pages in the hidden Web are dynamically generated in response to queries submitted via search forms. In this paper, a new algorithm for extracting labels from multi-attribute (M-A) search form fields is proposed. A technique for automatic query generation for single-attribute (S-A) search forms is also provided in order to enhance the overall performance of domain-specific hidden Web crawlers. The innovation of the LEHW algorithm is its ability to distinguish between S-A and M-A forms, so that it can handle both, unlike most hidden Web crawlers, which ignore one or the other. The embedding of the proposed algorithm within the overall framework of the HW crawler is evaluated through experiments using real Web sites. The preliminary results demonstrate the accuracy and precision of the proposed approach for most of the sites considered.
DOI:	10.1109/ICCES.2006.320490