CIVILICA We Respect the Science
(Specialized publisher of the country's conferences / Publication license number from the Ministry of Culture and Islamic Guidance: 8971)

Topic Detection on COVID-19 Tweets: A Comparative Study on Clustering and Transfer Learning Models

Article title: Topic Detection on COVID-19 Tweets: A Comparative Study on Clustering and Transfer Learning Models
National article ID: JR_TJEE-52-4_007
Published in 1401 (Solar Hijri calendar)
Author details:

Elnaz Zafarani Moattar - Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Mohammad Reza Kangavari - Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
Amir Masoud Rahmani - Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran

Abstract:
Automatic topic detection has become essential in social media analysis because of the large volume of text that users generate. Clustering-based methods are among the most important and current approaches to topic detection, and this research surveys that category broadly. The paper studies the main components of clustering-based topic detection: embedding methods, distance metrics, and clustering algorithms. Transfer learning, and with it pretrained language models and word embeddings, has received considerable attention in recent years. Given the importance of embedding methods, the paper compares the efficiency of five embedding methods, ranging from earlier to recent ones. For the study, the authors implemented two commonly used distance metrics and five clustering algorithms that are important in topic detection. Because COVID-19 has become a trending topic on social networks in recent years, the study uses a dataset of one month of tweets collected with COVID-19-related hashtags. More than 7500 experiments were performed to tune the parameters. All combinations of embedding methods, distance metrics, and clustering algorithms (50 combinations) were then evaluated using the Silhouette metric. The results show that T5 strongly outperforms the other embedding methods, cosine distance is slightly better than the other distance metric, and DBSCAN is superior to the other clustering algorithms.
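
The evaluation loop the abstract describes (embed, pick a distance metric, cluster, score with Silhouette) can be illustrated with a minimal sketch. Nothing below is the authors' code: the toy 2-D vectors stand in for tweet embeddings, Euclidean distance is an assumed second metric (the abstract names only cosine), and the `emb_*` and `clu_*` names are hypothetical placeholders used only to show how 5 x 2 x 5 yields 50 combinations.

```python
import math
from itertools import product


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Assumed second metric; the abstract does not name it."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_score(points, labels, distance):
    """Mean silhouette coefficient over all points for a given labelling."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy "embeddings": two directions in 2-D standing in for two topics.
points = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
good_labels = [0, 0, 1, 1]  # groups vectors that point the same way
bad_labels = [0, 1, 0, 1]   # mixes the two directions

scores = {
    name: (
        silhouette_score(points, good_labels, fn),
        silhouette_score(points, bad_labels, fn),
    )
    for name, fn in [("cosine", cosine_distance), ("euclidean", euclidean_distance)]
}

# Hypothetical placeholder names, used only to count the experiment grid.
embeddings = [f"emb_{i}" for i in range(5)]
metrics = ["cosine", "euclidean"]
clusterers = [f"clu_{i}" for i in range(5)]
grid = list(product(embeddings, metrics, clusterers))

if __name__ == "__main__":
    for name, (good, bad) in scores.items():
        print(f"{name}: good={good:.3f} bad={bad:.3f}")
    print(len(grid), "combinations")
```

A higher Silhouette value indicates tighter, better-separated clusters, so the labelling that groups similar vectors should score above the mixed one under either metric. In a real run the clustering step would produce the labels, and each of the 50 grid entries would be scored this way.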

Keywords:
Topic Detection, Transfer Learning, Embedding Methods, Distance Metrics, Clustering Methods, COVID-19

Article page and full-text download: https://civilica.com/doc/1609760/