Loading…

Developing a Large Benchmark Corpus for Urdu Semantic Word Similarity

The semantic word similarity task aims to quantify the degree of similarity between a pair of words. In literature, efforts have been made to create standard evaluation resources to develop, evaluate, and compare various methods for semantic word similarity. The majority of these efforts focused on...

Full description

Saved in:
Bibliographic Details
Published in:ACM transactions on Asian and low-resource language information processing 2023-03, Vol.22 (3), p.1-19, Article 83
Main Authors: Muneer, Iqra, Fatima, Ghazeefa, Khan, Muhammad Salman, Adeel Nawab, Rao Muhammad, Saeed, Ali
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:The semantic word similarity task aims to quantify the degree of similarity between a pair of words. In literature, efforts have been made to create standard evaluation resources to develop, evaluate, and compare various methods for semantic word similarity. The majority of these efforts focused on English and some other languages. However, the problem of semantic word similarity has not been thoroughly explored for South Asian languages, particularly Urdu. To fill this gap, this study presents a large benchmark corpus of 518 word pairs for the Urdu semantic word similarity task, which were manually annotated by 12 annotators. To demonstrate how our proposed corpus can be used for the development and evaluation of Urdu semantic word similarity systems, we applied two state-of-the-art methods: (1) a word embedding–based method and (2) a Sentence Transformer–based method. As another major contribution, we proposed a feature fusion method based on Sentence Transformers and word embedding methods. The best results were obtained using our proposed feature fusion method (the combination of best features of both methods) with a Pearson correlation score of 0.67. To foster research in Urdu (an under-resourced language), our proposed corpus will be free and publicly available for research purposes.
ISSN:2375-4699
2375-4702
DOI:10.1145/3566124