MEDEIROS, JERRY FERNANDES. TagTheWeb: Using Wikipedia Categories to Automatically Categorize Text-Based Resources on the Web. 25/09/2018. 104 leaves. Master's in Informatics. Institution: Universidade Federal do Estado do Rio de Janeiro, Rio de Janeiro.
Dissertação Jerry Fernandes Medeiros.pdf
Master's Dissertation
TagTheWeb: Using Wikipedia Categories to Automatically Categorize Text-Based Resources on the Web
Author
Jerry Fernandes Medeiros (UNIRIO)
Abstract
Identifying the topics associated with a set of documents is a common requirement in many applications and can improve tasks involving documents on the Web, such as search, retrieval, recommendation, and clustering. Given the amount of information produced and made available today, it has become humanly impossible to organize and analyze it and to extract the knowledge embedded in it. Consequently, mechanisms that accomplish such tasks while removing, or at least reducing, the need for human intervention have gained importance in recent decades. One potential solution to the challenge of organizing and retrieving documents is the automated classification and categorization of Web information. This research proposes a generic classification method that automatically categorizes any text-based content on the Web according to the collective knowledge of Wikipedia contributors, using the semantic relations between nodes of the Wikipedia Category Graph. The approach consists of three steps: extracting named entities from the text, extracting the categories associated with those named entities, and finally representing and classifying the document. Computational experiments and a study involving users of a crowd-sourcing platform were used to validate the method. The results show that this approach can correctly categorize most documents in a way that real users can understand, without the effort and input of domain experts.
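The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the entity lexicon, the tiny category graph, the set of top-level topics, and the function names (`extract_entities`, `top_level_ancestors`, `categorize`) are all hypothetical stand-ins, and a real system would use a named-entity recognizer and the actual Wikipedia Category Graph. The sketch assumes that a document can be represented as a normalized count of the top-level categories reachable from its entities.

```python
from collections import Counter, deque

# Hypothetical toy data standing in for Wikipedia: entity -> direct categories.
ENTITY_CATEGORIES = {
    "Python": ["Programming languages"],
    "Linux": ["Operating systems"],
    "Mozart": ["Composers"],
}

# Hypothetical fragment of a category graph: category -> parent categories.
PARENTS = {
    "Programming languages": ["Computing"],
    "Operating systems": ["Computing"],
    "Computing": ["Technology"],
    "Composers": ["Music"],
    "Music": ["Arts"],
}

# Assumed set of top-level topics used as the final classes.
TOP_LEVEL = {"Technology", "Arts"}


def extract_entities(text):
    """Step 1 (stand-in): find known entities by simple word lookup."""
    words = {w.strip(".,;:!?") for w in text.split()}
    return sorted(e for e in ENTITY_CATEGORIES if e in words)


def top_level_ancestors(category):
    """Walk parent links breadth-first and collect reachable top-level topics."""
    seen = {category}
    queue = deque([category])
    found = set()
    while queue:
        node = queue.popleft()
        if node in TOP_LEVEL:
            found.add(node)
            continue
        for parent in PARENTS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found


def categorize(text):
    """Steps 2 and 3: map entities to categories, then build a topic vector."""
    counts = Counter()
    for entity in extract_entities(text):
        for category in ENTITY_CATEGORIES[entity]:
            counts.update(top_level_ancestors(category))
    total = sum(counts.values())
    # Normalize counts so the weights sum to 1; empty dict if nothing matched.
    return {topic: n / total for topic, n in counts.items()} if total else {}


if __name__ == "__main__":
    print(categorize("Python runs on Linux, and Mozart wrote music."))
```

In this toy setup the document is labeled with the highest-weighted topic in the returned dictionary; the choice of a normalized count is only one plausible representation.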
Keywords:
Text Classification, Wikipedia, Categories, Category Graph.