Automatic classification model of semi-structured HTML text data based on State Grid cloud architecture

Bibliographic Details
Published in: Journal of Physics: Conference Series, 2021-05, Vol.1920 (1), p.12072
Main Authors: Zhang, Enjie; Zhang, Zhidong; Yan, Long; Li, Da
Format: Article
Language: English
Description
Summary: In the construction of the “State Grid Cloud” platform, each power grid business runs its own Web system. Data is scattered across these websites and resources are only loosely coupled, which readily creates information islands. This paper targets the semi-structured HTML text data in web pages under the State Grid cloud architecture platform and uses the Python-based Scrapy framework to collect semi-structured power data from various power business websites. We propose a semi-structured text data classification model based on a BiGRU neural network and a Bayesian classifier. The BiGRU neural network extracts text features, the Bayesian classifier performs the classification, and the TF-IDF algorithm assigns weights, addressing the large parameter count and long training time of traditional recurrent neural network models. We apply this method in simulation experiments on State Grid semi-structured HTML text data and compare it with a traditional neural network model. The experimental results show that the classification algorithm effectively improves the efficiency and accuracy of classifying semi-structured power text data.
ISSN: 1742-6588, 1742-6596
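
The Summary describes a pipeline of HTML text collection, TF-IDF weighting, and Bayesian classification. The following is a minimal sketch of the last two stages only, using the Python standard library. It is not the authors' model: the BiGRU feature extractor is replaced here by plain bag-of-words TF-IDF features, the Scrapy crawl is replaced by in-memory HTML strings, and every name (`TextExtractor`, `html_to_tokens`, `fit_idf`, `tfidf`, `TfidfNaiveBayes`) and the `alpha` smoothing parameter are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_tokens(html):
    """Strip markup and return lowercase alphanumeric tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower())


def fit_idf(token_docs):
    """Smoothed inverse document frequency for every training term."""
    n = len(token_docs)
    df = Counter(t for doc in token_docs for t in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    """TF-IDF weights for one document; terms unseen in training are dropped."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


class TfidfNaiveBayes:
    """Multinomial naive Bayes over TF-IDF weights (assumed stand-in for the
    paper's BiGRU + Bayesian classifier; not the authors' implementation)."""

    def fit(self, html_docs, labels, alpha=1.0):
        docs = [html_to_tokens(d) for d in html_docs]
        self.idf = fit_idf(docs)
        self.alpha = alpha  # additive (Laplace) smoothing
        self.class_counts = Counter(labels)
        self.weights = defaultdict(Counter)  # class -> summed term weights
        for tokens, label in zip(docs, labels):
            self.weights[label].update(tfidf(tokens, self.idf))
        self.totals = {c: sum(w.values()) for c, w in self.weights.items()}
        return self

    def predict(self, html):
        feats = tfidf(html_to_tokens(html), self.idf)
        n_docs = sum(self.class_counts.values())
        vocab = len(self.idf)
        scores = {}
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / n_docs)  # log prior
            denom = self.totals[c] + self.alpha * vocab
            for t, w in feats.items():
                score += w * math.log((self.weights[c][t] + self.alpha) / denom)
            scores[c] = score
        return max(scores, key=scores.get)
```

A caller would pass page bodies and their labels to `fit` and then call `predict` on a new page. In a setup closer to the one described in the Summary, the page bodies would come from a Scrapy spider and the features from a trained BiGRU encoder rather than from raw term counts.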