Indonesian Text Classification

Models and Datasets for Indonesian Text Classification

Text classification is the process of labeling or organizing text data into groups. It forms a fundamental part of Natural Language Processing.

By Wilson Wongso, Steven Limcorn and AI-Research.id team

June 1, 2021

Models

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| Indo RoBERTa Emotion Classifier | An emotion classifier based on the Indo-roberta model, transfer-learned on the IndoNLU EmoT dataset. On the IndoNLU benchmark, the model achieves a macro F1 of 72.05%, accuracy of 71.81%, precision of 72.47%, and recall of 71.94%. | Steven Limcorn | HuggingFace |
| Indonesian RoBERTa Base Sentiment Classifier | A sentiment text classification model based on RoBERTa. It starts from the pre-trained Indonesian RoBERTa Base model, which was then fine-tuned on IndoNLU's SmSA dataset of Indonesian comments and reviews. | Wilson Wongso | HuggingFace |
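
The accuracy, precision, recall, and macro F1 figures quoted for the emotion classifier can be illustrated with a small pure-Python sketch. It assumes the common convention of averaging per-class scores with equal weight; the IndoNLU benchmark's own evaluation script may differ in detail. The function name `macro_scores` and the toy labels are made up for illustration and are not taken from either model.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Assumption: macro averaging gives every class equal weight,
    regardless of how many examples it has.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example (fabricated labels, not real model output).
y_true = ["anger", "fear", "happy", "love", "sadness", "happy"]
y_pred = ["anger", "fear", "happy", "happy", "sadness", "love"]
print(macro_scores(y_true, y_pred))
```

Because every class counts equally, a class the model never gets right (here `love`) pulls the macro F1 down even when overall accuracy looks acceptable.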

Datasets

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| HoASA | An aspect-based sentiment analysis dataset consisting of hotel reviews collected from the hotel aggregator platform AiryRooms. The dataset covers ten different aspects of hotel quality. Similar to the CASA dataset, each review is labeled with a single sentiment label for each aspect. There are four possible sentiment classes for each label: positive, negative, neutral, and positive-negative. The positive-negative label is given to a review that contains multiple sentiments for the same aspect but for different objects (e.g., cleanliness of bed and toilet). | A. N. Azhar, M. L. Khodra, and A. P. Sutiono | HuggingFace |
| Indonesian Clickbait Headlines | The CLICK-ID dataset is a collection of Indonesian news headlines collected from 12 local online news publishers: detikNews, Fimela, Kapanlagi, Kompas, Liputan6, Okezone, Posmetro-Medan, Republika, Sindonews, Tempo, Tribunnews, and Wowkeren. | Andika William and Yunita Sari | HuggingFace |
| CASA | An aspect-based sentiment analysis dataset consisting of around a thousand car reviews collected from multiple Indonesian online automobile platforms. The dataset covers six aspects of car quality. The task is defined as multi-label classification, where each label represents the sentiment for a single aspect with three possible values: positive, negative, and neutral. | Arfinda Ilmania, Abdurrahman, Samuel Cahyawijaya, and Ayu Purwarianti | HuggingFace |
| SmSA | A sentence-level sentiment analysis dataset of Indonesian comments and reviews obtained from multiple online platforms. The text was crawled and then annotated by several Indonesian linguists. There are three possible sentiments in the SmSA dataset: positive, negative, and neutral. | Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti | HuggingFace |
| WReTE | The Wiki Revision Edits Textual Entailment dataset consists of 450 sentence pairs constructed from Wikipedia revision history. The dataset contains pairs of sentences and binary semantic relations between the pairs. A pair is labeled as entailed when the meaning of the second sentence can be derived from the first one, and not entailed otherwise. | Ken Nabila Setya and Rahmad Mahendra | HuggingFace |
| EmoT | An emotion classification dataset collected from the social media platform Twitter. The dataset consists of around 4,000 colloquial Indonesian tweets covering five emotion labels: anger, fear, happy, love, and sadness. | Mei Silviana Saputri, Rahmad Mahendra, and Mirna Adriani | HuggingFace |
| SentiWS | This dataset contains sentiment lexicons for 81 languages, generated via graph propagation over a knowledge graph (a graphical representation of real-world entities and the links between them). | Yanqing Chen and Steven Skiena | HuggingFace |
| WiLI-2018 | WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced and a train-test split is provided. | Martin Thoma | HuggingFace |
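
The aspect-based datasets above (CASA and HoASA) assign one sentiment per aspect to each review. A minimal sketch of how such labels might be encoded as a fixed-length integer vector for a multi-label classifier follows. The aspect names in `ASPECTS` are hypothetical placeholders, not the datasets' actual aspect lists, and the helper names `encode_review` and `decode_review` are invented for this example.

```python
# Sentiment classes as described for CASA. HoASA would add a fourth
# class, "positive-negative", to this list.
SENTIMENTS = ["positive", "negative", "neutral"]
SENTIMENT_TO_ID = {name: i for i, name in enumerate(SENTIMENTS)}

# Hypothetical aspect names, for illustration only.
ASPECTS = ["comfort", "price", "service"]


def encode_review(aspect_sentiments, aspects=ASPECTS):
    """Map {aspect: sentiment} to a list of class ids, one per aspect."""
    missing = [a for a in aspects if a not in aspect_sentiments]
    if missing:
        raise KeyError(f"missing sentiment for aspects: {missing}")
    return [SENTIMENT_TO_ID[aspect_sentiments[a]] for a in aspects]


def decode_review(class_ids, aspects=ASPECTS):
    """Inverse of encode_review: map class ids back to sentiment names."""
    if len(class_ids) != len(aspects):
        raise ValueError("expected one class id per aspect")
    return {a: SENTIMENTS[i] for a, i in zip(aspects, class_ids)}


# One review, one sentiment per aspect (made-up annotation).
review_labels = {"comfort": "positive", "price": "negative", "service": "neutral"}
encoded = encode_review(review_labels)
print(encoded)
print(decode_review(encoded))
```

Keeping the aspect order fixed means each position in the vector always refers to the same aspect, so a model can use one output head per position.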