AWS: Automatic Webpage Segmentation

A webpage contains many blocks of data, which can be informative or non-informative. Content extraction methods distinguish informative data, such as the page title, headlines, news articles and post bodies, from non-informative data such as advertisements, sidebars and navigational menus. Content extraction is difficult because webpage structures vary widely. In this paper, we propose a content extraction method named Automatic Webpage Segmentation (AWS), which classifies the main content of a given webpage using a feature set consisting of structural and shallow text features. We use the DOM tree of each webpage for feature extraction. The results are promising: the proposed method is effective at classifying individual text elements of a webpage. In addition, feature selection methods, both wrapper and filter, are used to improve the performance of AWS.
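The abstract does not list the actual AWS features or classifier, so the following is only a minimal sketch of the general idea: walk the DOM, compute one structural feature (nesting depth) and two shallow text features (word count, link density) per text block, then label each block. The tag sets, the feature names, and the threshold rule in `looks_like_content` are assumptions made for illustration; the rule is a hand-written stand-in for a trained classifier, and the sketch assumes well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Tags treated as candidate text blocks (an assumption of this sketch).
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}
# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockFeatureExtractor(HTMLParser):
    """Collects simple per-block features while walking the HTML."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current nesting depth in the DOM
        self.in_link = 0    # > 0 while inside an <a> element
        self.blocks = []    # finished feature dicts, in closing order
        self._stack = []    # block elements that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        if tag == "a":
            self.in_link += 1
        if tag in BLOCK_TAGS:
            self._stack.append(
                {"tag": tag, "depth": self.depth, "words": 0, "link_words": 0}
            )

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in BLOCK_TAGS and self._stack and self._stack[-1]["tag"] == tag:
            block = self._stack.pop()
            if block["words"]:  # keep only blocks that directly hold text
                block["link_density"] = block["link_words"] / block["words"]
                self.blocks.append(block)
        if tag == "a" and self.in_link:
            self.in_link -= 1
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        n = len(data.split())
        if n and self._stack:
            # Text counts toward the innermost open block only.
            self._stack[-1]["words"] += n
            if self.in_link:
                self._stack[-1]["link_words"] += n


def extract_block_features(html):
    """Return one feature dict per non-empty text block."""
    parser = BlockFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def looks_like_content(block, min_words=10, max_link_density=0.3):
    """Hypothetical threshold rule standing in for a trained classifier."""
    return block["words"] >= min_words and block["link_density"] <= max_link_density
```

In a setup closer to the one the abstract describes, the feature dicts would be turned into vectors and passed to a trained classifier, with wrapper or filter feature selection applied to the feature set beforehand.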

Keywords:

Content Extraction, Web Information Extraction, Full-text Extraction, Web Document Modeling

Authors

Mohammad Mehdi Yadollahi

Social Networks Lab., Faculty of Electrical and Computer Engineering, University of Tehran, Tehran, Iran

Masoud Asadpour

Social Networks Lab., Faculty of Electrical and Computer Engineering, University of Tehran, Tehran, Iran

Permanent link to this paper:

https://civilica.com/doc/481676

National scientific document ID:

IRANWEB02_032

Indexing date: 9 Mordad 1395 (30 July 2016)

How to cite this paper:

To reference this paper in your own research, you can use the following entry in your references section:

Yadollahi, Mohammad Mehdi and Asadpour, Masoud, 1395, AWS: Automatic Webpage Segmentation, The Second International Conference on Web Research, Tehran, https://civilica.com/doc/481676

Within the text, wherever a statement or finding from this paper is mentioned, the following details are given in parentheses after it.
First citation: (1395, Yadollahi, Mohammad Mehdi; Masoud Asadpour)
Second and later citations: (1395, Yadollahi; Asadpour)
For a complete guide to citation style, see the Civilica help section (citation).

Scientometrics and paper ranking

Details of the institution that produced this paper:

Scientific ranking of the University of Tehran

Institution type: public university

Number of papers: 100,697
