Indonesian Text Classification
Models and Datasets for Indonesian Text Classification
Text classification is the process of labeling or organizing text data into groups. It forms a fundamental part of Natural Language Processing.
By Wilson Wongso, Steven Limcorn and AI-Research.id team
June 1, 2021
Models
Name | Description | Author | Link |
---|---|---|---|
Indo RoBERTa Emotion Classifier | Indo RoBERTa Emotion Classifier is an emotion classifier based on the Indo-roberta model, which was transfer-learned on the IndoNLU EmoT dataset. On the IndoNLU benchmark, the model achieves a macro F1 of 72.05%, accuracy of 71.81%, precision of 72.47%, and recall of 71.94%. | Steven Limcorn | HuggingFace |
Indonesian RoBERTa Base Sentiment Classifier | Indonesian RoBERTa Base Sentiment Classifier is a sentiment classification model based on the RoBERTa model. It started from the pre-trained Indonesian RoBERTa Base model, which was then fine-tuned on IndoNLU's SmSA dataset of Indonesian comments and reviews. | Wilson Wongso | HuggingFace |
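The scores quoted for the emotion classifier (macro F1, accuracy, precision, recall) can be reproduced from gold and predicted labels along the following lines. This is a minimal sketch, not the IndoNLU evaluation script: the function name `macro_metrics` and the toy label lists are made up for illustration, and it assumes precision and recall are macro-averaged in the same way as F1.

```python
from collections import Counter


def macro_metrics(gold, pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")

    # Per-label counts of true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example using the five EmoT label names (the data itself is invented).
gold = ["anger", "fear", "happy", "love", "sadness", "happy"]
pred = ["anger", "happy", "happy", "love", "sadness", "sadness"]
metrics = macro_metrics(gold, pred)
print(metrics)
```

Macro averaging gives every class the same weight regardless of how many examples it has, so a class the model never predicts correctly (here, `fear`) pulls the average down noticeably.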
Datasets
Name | Description | Author | Link |
---|---|---|---|
HoASA | An aspect-based sentiment analysis dataset consisting of hotel reviews collected from the hotel aggregator platform, AiryRooms. The dataset covers ten different aspects of hotel quality. Similar to the CASA dataset, each review is labeled with a single sentiment label for each aspect. There are four possible sentiment classes for each sentiment label: positive, negative, neutral, and positive-negative. The positive-negative label is given to a review that contains multiple sentiments of the same aspect but for different objects (e.g., cleanliness of bed and toilet). | A. N. Azhar, M. L. Khodra, and A. P. Sutiono | HuggingFace |
Indonesian Clickbait Headlines | The CLICK-ID dataset is a collection of Indonesian news headlines collected from 12 local online news publishers: detikNews, Fimela, Kapanlagi, Kompas, Liputan6, Okezone, Posmetro-Medan, Republika, Sindonews, Tempo, Tribunnews, and Wowkeren. | Andika William and Yunita Sari | HuggingFace |
CASA | An aspect-based sentiment analysis dataset consisting of around a thousand car reviews collected from multiple Indonesian online automobile platforms. The dataset covers six aspects of car quality. We define the task to be a multi-label classification task, where each label represents a sentiment for a single aspect with three possible values: positive, negative, and neutral. | Arfinda Ilmania, Abdurrahman, Samuel Cahyawijaya, Ayu Purwarianti | HuggingFace |
SmSA | This sentence-level sentiment analysis dataset is a collection of comments and reviews in Indonesian obtained from multiple online platforms. The text was crawled and then annotated by several Indonesian linguists to construct this dataset. There are three possible sentiments on the SmSA dataset: positive, negative, and neutral. | Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti | HuggingFace |
WReTE | The Wiki Revision Edits Textual Entailment dataset consists of 450 sentence pairs constructed from Wikipedia revision history. The dataset contains pairs of sentences and binary semantic relations between the pairs. The data are labeled as entailed when the meaning of the second sentence can be derived from the first one, and not entailed otherwise. | Ken Nabila Setya and Rahmad Mahendra | HuggingFace |
EmoT | An emotion classification dataset collected from the social media platform Twitter. The dataset consists of around 4000 Indonesian colloquial language tweets, covering five different emotion labels: anger, fear, happy, love, and sadness. | Mei Silviana Saputri, Rahmad Mahendra, and Mirna Adriani | HuggingFace |
SentiWS | This dataset provides sentiment lexicons for 81 languages, generated via graph propagation based on a knowledge graph, a graphical representation of real-world entities and the links between them. | Chen, Yanqing and Skiena, Steven | HuggingFace |
WiLI-2018 | WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced and a train-test split is provided. | Thoma, Martin | HuggingFace |
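The aspect-based datasets listed above (CASA and HoASA) attach one sentiment label per aspect to each review, which makes them multi-label tasks. A minimal sketch of how such a record might be turned into class indices follows. The aspect names (`fuel`, `price`, `service`) and the function `encode_review` are hypothetical placeholders, not the datasets' actual column names or any library's API; only the sentiment class names come from the descriptions above.

```python
# Sentiment classes as described in the table above.
CASA_SENTIMENTS = ("positive", "negative", "neutral")
HOASA_SENTIMENTS = ("positive", "negative", "neutral", "positive-negative")


def encode_review(aspect_labels, aspects, sentiments):
    """Map per-aspect sentiment strings to class indices, in fixed aspect order."""
    encoded = []
    for aspect in aspects:
        if aspect not in aspect_labels:
            raise ValueError(f"missing label for aspect: {aspect}")
        sentiment = aspect_labels[aspect]
        if sentiment not in sentiments:
            raise ValueError(f"unknown sentiment: {sentiment}")
        encoded.append(sentiments.index(sentiment))
    return encoded


# Hypothetical aspect names, chosen only for illustration.
example_aspects = ("fuel", "price", "service")
example_review = {"fuel": "positive", "price": "negative", "service": "neutral"}

encoded = encode_review(example_review, example_aspects, CASA_SENTIMENTS)
print(encoded)
```

A model for this setup would typically emit one prediction per aspect, so the target for a review is a vector with one class index per aspect rather than a single label.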