Indonesian Token Classification

Models and Datasets for Indonesian Token Classification

Token classification is similar to text classification, except that each token in the text receives its own prediction rather than the text receiving a single label. A common use of this task is Named Entity Recognition (NER); part-of-speech (POS) tagging is another.

By Wilson Wongso, Steven Limcorn, and the AI-Research.id team

June 1, 2021

Models

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| Indonesian RoBERTa Base POSP Tagger | A part-of-speech token-classification model based on the RoBERTa model. It starts from the pre-trained Indonesian RoBERTa Base model, which was then fine-tuned on IndoNLU's POSP dataset, which consists of tag-labelled news. | Wilson Wongso | HuggingFace |

Datasets

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| BaPOS | This POS tagging dataset contains about 1000 sentences collected from the PAN Localization Project. Each word is tagged with one of 23 POS tag classes. The data split used in this benchmark follows the experimental setting of Kurniawan and Aji (2018). | Arawinda Dinakaramani, Fam Rashel, Andry Luthfi, and Ruli Manurung; Kemal Kurniawan and Alham Fikri Aji | HuggingFace |
| POSP | This Indonesian part-of-speech (POS) tagging dataset is collected from Indonesian news websites. It consists of around 8000 sentences with 26 POS tags. The POS tag labels follow the Indonesian Association of Computational Linguistics (INACL) POS Tagging Convention. | Devin Hoesen and Ayu Purwarianti | HuggingFace |
| NERP | This NER dataset (Hoesen and Purwarianti, 2018) contains texts collected from several Indonesian news websites. It has five labels: PER (name of person), LOC (name of location), IND (name of product or brand), EVT (name of the event), and FNB (name of food and beverage). Like the TermA dataset, NERP uses the IOB chunking format. | Devin Hoesen and Ayu Purwarianti | HuggingFace |
| KEPS | This keyphrase extraction dataset consists of Indonesian-language text from Twitter discussing banking products and services. A phrase containing important information is considered a keyphrase. A text may contain one or more keyphrases, since important phrases can occur at different positions. The dataset follows the IOB chunking format, which marks the position of each keyphrase. | Miftahul Mahfuzh, Sidik Soleman, and Ayu Purwarianti | HuggingFace |
| NERGrit | This NER dataset is taken from the Grit-ID repository, and the labels are spans in IOB chunking representation. It has three kinds of named entity tags: PERSON (name of person), PLACE (name of location), and ORGANIZATION (name of organization). | NERGrit Developers | HuggingFace |
| WikiANN | WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. | Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji; Afshin Rahimi, Yuan Li, and Trevor Cohn | HuggingFace |
| TermA | This span-extraction dataset is collected from the hotel aggregator platform AiryRooms. It consists of thousands of hotel reviews, each containing span labels for aspect and sentiment words that represent the reviewer's opinion on the corresponding aspect. The labels use the Inside-Outside-Beginning (IOB) tagging representation with two kinds of tags, aspect and sentiment. | Yosef Ardhito Winatmoko, Ali Akbar Septiandri, and Arie Pratama Sutiono; Jordhy Fernando, Masayu Leylia Khodra, and Ali Akbar Septiandri | HuggingFace |
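
Several of the datasets above store their labels in the IOB chunking format, where each token carries one tag and a `B-` prefix opens a span, `I-` continues it, and `O` marks a token outside any span. The snippet below is a minimal sketch of how such per-token tags can be grouped back into labelled spans. The function name `iob_to_spans` and the example sentence are hypothetical and are not taken from any of the datasets; the tag strings assume an IOB2-style `B-`/`I-` prefix combined with the NERGrit label names, so check the actual tag inventory of each dataset before relying on it.

```python
from typing import List, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (label, text) spans.

    Hypothetical helper for illustration only. A stray I- tag that does
    not continue the open span is treated leniently as the start of a
    new span.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[str, str]] = []
    label, words = None, []  # the span currently being built, if any

    for token, tag in zip(tokens, tags):
        # "B-PERSON" -> ("B", "PERSON"); "O" -> ("O", "")
        prefix, _, name = tag.partition("-")

        if prefix == "I" and name == label:
            # Continuation of the open span.
            words.append(token)
            continue

        # Any other tag closes the open span first.
        if label is not None:
            spans.append((label, " ".join(words)))

        if prefix in ("B", "I") and name:
            label, words = name, [token]
        else:
            label, words = None, []

    # Close a span that runs to the end of the sentence.
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Illustrative sentence, one tag per token (assumed label names).
tokens = ["Joko", "Widodo", "mengunjungi", "Jakarta", "."]
tags = ["B-PERSON", "I-PERSON", "O", "B-PLACE", "O"]
print(iob_to_spans(tokens, tags))
```

The same grouping applies to the keyphrase (KEPS) and aspect/sentiment (TermA) datasets, since only the label names after the prefix differ.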