Polysemy Needs Attention: Short-Text Topic Discovery With Global and Multi-Sense Information

Bibliographic Details
Published in:	IEEE Access, 2021, Vol. 9, pp. 14918-14932
Main Authors:	Lu, Heng-Yang; Yang, Jun; Zhang, Yi; Li, Zuoyong
Format:	Article
Language:	eng
Subjects:	Context modeling; Data management; Data mining; Information retrieval; Internet; Manuals; Multiple senses; Natural languages; Semantics; short texts; Task analysis; Texts; topic model; word embeddings; Words (language)

Description
Summary:	Topic models have been widely applied in research domains such as information retrieval and data mining. They discover the topics of texts in an unsupervised way. In the early years, most research focused on long texts. With the emergence of the Internet, the number of short texts has grown rapidly. Most existing schemes for addressing the sparsity problem of short texts rely on data aggregation or model improvements. Among them, the Biterm Topic Model is one of the most representative. It introduced a new way to model topics based on document-level word pairs and has proven creative and effective. However, this strategy ignores word pairs that are semantically similar but rarely co-occur. Moreover, most research ignores the multi-sense phenomenon in natural languages. In this paper, we utilize multi-sense word vectors to extract similar word pairs from the whole corpus by considering multiple senses. Based on this idea, we introduce a novel short-text topic model, which disambiguates multiple senses of words and generates more reasonable global biterms. Experimental results on two open-source English datasets show its superiority over state-of-the-art topic models.
ISSN:	2169-3536
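
Illustration:	The summary contrasts document-level word pairs (biterms) with "global" pairs drawn from the whole corpus using multi-sense word vectors. The sketch below is only a toy illustration of that contrast, not the authors' model: the function names, the cosine-similarity threshold of 0.8, and the two-dimensional sense vectors are all assumptions made for this example, and the rule "compare the closest pair of senses" is a simple stand-in for whatever sense disambiguation the paper actually uses.

```python
from itertools import combinations
from math import sqrt


def document_biterms(tokens):
    """Return every unordered pair of distinct words from one short text."""
    return [tuple(sorted(pair))
            for pair in combinations(tokens, 2)
            if pair[0] != pair[1]]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def global_biterms(sense_vectors, threshold=0.8):
    """Pair words whose most similar senses exceed the threshold.

    sense_vectors maps each word to a list of vectors, one per sense.
    Pairs are found across the whole vocabulary, so two words can be
    paired even if they never co-occur in the same short text.
    The threshold value is an assumption for this illustration.
    """
    pairs = []
    for w1, w2 in combinations(sorted(sense_vectors), 2):
        best = max(cosine(u, v)
                   for u in sense_vectors[w1]
                   for v in sense_vectors[w2])
        if best >= threshold:
            pairs.append((w1, w2))
    return pairs


# Hypothetical toy data: "bank" has two senses (finance, river).
sense_vectors = {
    "bank": [[1.0, 0.0], [0.0, 1.0]],
    "loan": [[0.9, 0.1]],
    "shore": [[0.1, 0.9]],
}

local_pairs = document_biterms(["apple", "iphone", "price"])
global_pairs = global_biterms(sense_vectors)
```

With a single averaged vector for "bank", neither "loan" nor "shore" would necessarily clear the threshold; keeping one vector per sense lets each pairing match the relevant sense, which is the intuition the summary describes.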