Indonesian Text Classification

Models and Datasets for Indonesian Text Classification

Text classification is the process of labeling or organizing text data into groups. It forms a fundamental part of Natural Language Processing.

By Wilson Wongso, Steven Limcorn and AI-Research.id team

June 1, 2021

Models

Name	Description	Author	Link
Indo RoBERTa Emotion Classifier	Indo RoBERTa Emotion Classifier is an emotion classifier based on the Indo-roberta model, which was transfer-learned on the IndoNLU EmoT dataset. On the IndoNLU benchmark, the model achieves a macro F1 of 72.05%, accuracy of 71.81%, precision of 72.47%, and recall of 71.94%.	Steven Limcorn	HuggingFace
Indonesian RoBERTa Base Sentiment Classifier	Indonesian RoBERTa Base Sentiment Classifier is a sentiment text-classification model based on the RoBERTa model. It starts from the pre-trained Indonesian RoBERTa Base model, which was then fine-tuned on IndoNLU's SmSA dataset of Indonesian comments and reviews.	Wilson Wongso	HuggingFace
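
The emotion classifier above is reported with accuracy and macro-averaged precision, recall, and F1. As a rough illustration of what those numbers mean, here is a minimal pure-Python sketch. It assumes the common definition in which each metric is computed per class and then averaged with equal weight. The function name `macro_scores` and the toy labels are made up for this example and are not taken from the model's evaluation code.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1).

    Assumes single-label classification; per-class scores are averaged
    with equal weight, and an undefined score (zero denominator) counts as 0.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels using the five EmoT emotion classes (invented for illustration).
y_true = ["anger", "anger", "fear", "happy", "love", "sadness"]
y_pred = ["anger", "fear", "fear", "happy", "love", "love"]
accuracy, precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"acc={accuracy:.4f} P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Because every class carries equal weight, a class the model never predicts correctly (here `sadness`) pulls the macro scores down even when overall accuracy looks reasonable.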

Datasets

Name	Description	Author	Link
HoASA	An aspect-based sentiment analysis dataset consisting of hotel reviews collected from the hotel aggregator platform, AiryRooms. The dataset covers ten different aspects of hotel quality. Similar to the CASA dataset, each review is labeled with a single sentiment label for each aspect. There are four possible sentiment classes for each sentiment label: positive, negative, neutral, and positive-negative. The positive-negative label is given to a review that contains multiple sentiments for the same aspect but for different objects (e.g., cleanliness of bed and toilet).	A. N. Azhar, M. L. Khodra, and A. P. Sutiono	HuggingFace
Indonesian Clickbait Headlines	The CLICK-ID dataset is a collection of Indonesian news headlines gathered from 12 local online news publishers: detikNews, Fimela, Kapanlagi, Kompas, Liputan6, Okezone, Posmetro-Medan, Republika, Sindonews, Tempo, Tribunnews, and Wowkeren.	Andika William and Yunita Sari	HuggingFace
CASA	An aspect-based sentiment analysis dataset consisting of around a thousand car reviews collected from multiple Indonesian online automobile platforms. The dataset covers six aspects of car quality. The task is defined as multi-label classification, where each label represents the sentiment for a single aspect with three possible values: positive, negative, and neutral.	Arfinda Ilmania, Abdurrahman, Samuel Cahyawijaya, Ayu Purwarianti	HuggingFace
SmSA	This sentence-level sentiment analysis dataset is a collection of comments and reviews in Indonesian obtained from multiple online platforms. The text was crawled and then annotated by several Indonesian linguists to construct this dataset. There are three possible sentiments on the SmSA dataset: positive, negative, and neutral.	Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti	HuggingFace
WReTE	The Wiki Revision Edits Textual Entailment dataset consists of 450 sentence pairs constructed from Wikipedia revision history. The dataset contains pairs of sentences and binary semantic relations between the pairs. The data are labeled as entailed when the meaning of the second sentence can be derived from the first one, and not entailed otherwise.	Ken Nabila Setya and Rahmad Mahendra	HuggingFace
EmoT	An emotion classification dataset collected from the social media platform Twitter. The dataset consists of around 4000 Indonesian colloquial language tweets, covering five different emotion labels: anger, fear, happy, love, and sadness.	Mei Silviana Saputri, Rahmad Mahendra, and Mirna Adriani	HuggingFace
SentiWS	This dataset provides sentiment lexicons for 81 languages, generated via graph propagation based on a knowledge graph, a graphical representation of real-world entities and the links between them.	Chen, Yanqing and Skiena, Steven	HuggingFace
WiLI-2018	WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced and a train-test split is provided.	Thoma, Martin	HuggingFace
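
The aspect-based datasets above (CASA and HoASA) attach one sentiment label to every aspect of a review, which is what makes them multi-label tasks. The sketch below shows one possible way to turn such a review into a fixed-order vector of class indices. The aspect names, the class ordering, and the function `encode_review` are assumptions for illustration only; the actual column names and label ids should be checked on each dataset card.

```python
# Assumed class ordering; the real label ids may differ per dataset.
CASA_SENTIMENTS = ("negative", "neutral", "positive")
HOASA_SENTIMENTS = ("negative", "neutral", "positive", "positive-negative")

# Illustrative names for six car-quality aspects (assumed, not verified).
CASA_ASPECTS = ("fuel", "machine", "others", "part", "price", "service")


def encode_review(aspect_sentiments, aspects, sentiments):
    """Map {aspect: sentiment} to a list of class indices, one per aspect.

    The output order follows `aspects`, so every review yields a vector
    of the same length regardless of dictionary ordering.
    """
    index = {name: i for i, name in enumerate(sentiments)}
    missing = [a for a in aspects if a not in aspect_sentiments]
    if missing:
        raise ValueError(f"missing sentiment for aspects: {missing}")
    unknown = [s for s in aspect_sentiments.values() if s not in index]
    if unknown:
        raise ValueError(f"unknown sentiment labels: {unknown}")
    return [index[aspect_sentiments[a]] for a in aspects]


# A made-up review annotation: praises fuel use, complains about price.
review = {
    "fuel": "positive",
    "machine": "neutral",
    "others": "neutral",
    "part": "neutral",
    "price": "negative",
    "service": "neutral",
}
print(encode_review(review, CASA_ASPECTS, CASA_SENTIMENTS))
```

The same function covers the four-class HoASA scheme by passing `HOASA_SENTIMENTS` and a tuple of ten aspect names.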