Iscte

Master's degree

Computer Engineering

Title

Classificação de emoções em redes sociais (Emotion classification in social networks)

Author

Filipe, Soraia Alexandra Cardoso

Abstract

This work focuses on the automatic detection of emotions in tweets written in Portuguese. Two approaches are used to classify each tweet according to the presence or absence of each of eight possible emotions. To evaluate the proposed approaches, a set of 1000 tweets was created and manually annotated with the presence or absence of the emotions considered. In the first approach, an existing lexicon is applied, along with different strategies to refine and improve it by means of machine translation and the incorporation of words aligned with the existing ones. The results suggest that better performance can be obtained both by improving a lexicon and by directly translating Portuguese tweets into English and then applying an existing English lexicon. As for the supervised approach, the aim is to create models that generalize better, based on large amounts of information. Initially, the available tweets are annotated based on the emotion lexicon. Next, the first 5 million tweets are used to train one model for each emotion. These models are then used to annotate 9 million tweets, which are in turn filtered based on the model's confidence. Tweets with a confidence above a given threshold are used to train the final models. All models are evaluated against the reference data, revealing that the second model generally shows greater success in predicting emotions than the first model. This approach achieves better results than the first.

This work focuses on the automatic detection of emotions in tweets written in Portuguese. To classify each tweet according to the presence or absence of each of eight possible emotions, two approaches are used: the first is based on an emotion lexicon; the second is supervised and is based on logistic regression models. To evaluate the two approaches, a set of 1000 tweets was created and manually annotated with the presence or absence of each emotion. In the first approach, an existing and widely used lexicon is applied, together with different strategies to refine and improve it by means of machine translation and the incorporation of additional words. The results suggest that better performance can be achieved both by improving a lexicon and by directly translating Portuguese tweets into English and then applying an existing English lexicon. In the supervised approach, the aim is to create models that generalize better, based on large amounts of information. Initially, the available tweets are annotated based on the emotion lexicon. Then, the first 5 million tweets are used to train one model for each emotion. These models are then used to annotate another 9 million tweets, which are in turn filtered based on the model’s confidence. Tweets with a confidence above a given threshold are used to train the final models. All models are evaluated against the reference data, revealing that the first model generally shows greater success in predicting emotions. The supervised approach achieves better results than the lexicon-based one.
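The supervised pipeline described in the abstract (lexicon-based annotation, a first model per emotion, confidence filtering, then a final model) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the toy lexicon entries, the bag-of-words features, the hand-written logistic regression, the definition of confidence as max(p, 1 - p), the threshold values, and all function names.

```python
import math
from collections import Counter

# Hypothetical toy lexicon (word -> emotions); not the lexicon used in the thesis.
TOY_LEXICON = {"feliz": {"joy"}, "adoro": {"joy"}, "triste": {"sadness"}, "medo": {"fear"}}


def lexicon_label(tweet, emotion, lexicon=TOY_LEXICON):
    """Return 1 if any token of the tweet is linked to `emotion` in the lexicon, else 0."""
    return int(any(emotion in lexicon.get(tok, ()) for tok in tweet.lower().split()))


def predict_proba(model, tweet):
    """Probability that the emotion is present, under a bag-of-words logistic model."""
    weights, bias = model
    counts = Counter(tweet.lower().split())
    z = bias + sum(weights.get(tok, 0.0) * n for tok, n in counts.items())
    z = max(-30.0, min(30.0, z))  # clamp so that exp() stays numerically safe
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(tweets, labels, epochs=200, lr=0.5):
    """Fit a binary logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for tweet, y in zip(tweets, labels):
            err = predict_proba((weights, bias), tweet) - y
            for tok, n in Counter(tweet.lower().split()).items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * n
            bias -= lr * err
    return weights, bias


def self_train(first_batch, second_batch, emotion, threshold=0.9):
    """Lexicon labels -> first model -> confidence-filtered labels -> final model."""
    first_labels = [lexicon_label(t, emotion) for t in first_batch]
    first_model = train_logreg(first_batch, first_labels)

    kept, kept_labels = [], []
    for tweet in second_batch:
        p = predict_proba(first_model, tweet)
        if max(p, 1.0 - p) > threshold:  # keep only confident predictions (assumed rule)
            kept.append(tweet)
            kept_labels.append(int(p >= 0.5))

    final_model = train_logreg(kept, kept_labels)
    return first_model, final_model, kept


# Toy usage: one model pair for the emotion "joy".
first_batch = ["feliz feliz", "adoro feliz", "triste triste", "medo triste"]
second_batch = ["feliz feliz feliz", "triste triste", "gato"]
first_model, final_model, kept = self_train(first_batch, second_batch, "joy", threshold=0.6)
```

In this sketch one pair of models is trained per emotion, so covering eight emotions would mean calling `self_train` eight times; evaluating both models against a manually annotated reference set is left out.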