Indonesian Language Model

Indonesian Language Models and their Datasets

A language model is a model that computes the probability of a sentence (a sequence of words), or the probability of the next word given the words that precede it; a short example is sketched below.

By Wilson Wongso, Steven Limcorn, and the AI-Research.id team

June 1, 2021
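As a quick illustration of the definition above, the sketch below scores candidate next words for an Indonesian prefix with a causal language model from the tables further down. It is a minimal example, not a recommended setup: it assumes the `transformers` and `torch` libraries are installed, and the Hub ID `flax-community/gpt2-small-indonesian` is assumed to point at the GPT2-small-indonesian model listed below; any other causal LM ID should work the same way.

```python
# Minimal sketch: next-word probabilities with a causal language model.
# The Hub ID below is an assumption; swap in any causal LM from the tables.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "flax-community/gpt2-small-indonesian"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prefix = "Ibu sedang memasak di"  # "Mother is cooking in the ..."
inputs = tokenizer(prefix, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Probability distribution over the next token, given the prefix.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>12s}  p = {prob.item():.3f}")
```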

Masked Language Models

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| IndoConvBERT Base Model | IndoConvBERT is a ConvBERT model pretrained on Indo4B. | Akmal | HuggingFace |
| Indonesian BERT Base 1.5G (uncased) | A BERT-base model pre-trained on Indonesian Wikipedia and Indonesian newspapers using a masked language modeling (MLM) objective. This model is uncased. | Cahya Wirawan | HuggingFace |
| Indonesian BERT Base 522M (uncased) | A BERT-base model pre-trained on Indonesian Wikipedia using a masked language modeling (MLM) objective. This model is uncased: it does not make a difference between indonesia and Indonesia. | Cahya Wirawan | HuggingFace |
| Indonesian RoBERTa Base 522M (uncased) | A RoBERTa-base model pre-trained on Indonesian Wikipedia using a masked language modeling (MLM) objective. This model is uncased: it does not make a difference between indonesia and Indonesia. | Cahya Wirawan | HuggingFace |
| Indonesian DistilBERT Base (uncased) | A distilled version of the Indonesian BERT base model. This model is uncased and is one of several language models pre-trained on Indonesian datasets. | Cahya Wirawan | HuggingFace |
| IndoELECTRA | IndoELECTRA is a pre-trained language model based on the ELECTRA architecture for the Indonesian language. This is the base version, which uses the electra-base config. | Christopher Albert Lorentius | HuggingFace |
| Indonesian RoBERTa Base | Indonesian RoBERTa Base is a masked language model based on the RoBERTa model. It was trained on the OSCAR dataset, specifically the unshuffled_deduplicated_id subset. | Flax Community | HuggingFace |
| Indonesian RoBERTa Large | Indonesian RoBERTa Large is a masked language model based on the RoBERTa model. It was trained on the OSCAR dataset, specifically the unshuffled_deduplicated_id subset. | Flax Community | HuggingFace |
| IndoBERT Base Model (phase1 - uncased) | IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model was trained using masked language modeling (MLM) and next sentence prediction (NSP) objectives. | Indo Benchmark | HuggingFace |
| IndoBERT Base Model (phase2 - uncased) | IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model was trained using masked language modeling (MLM) and next sentence prediction (NSP) objectives. | Indo Benchmark | HuggingFace |
| IndoBERT Large Model (phase1 - uncased) | IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model was trained using masked language modeling (MLM) and next sentence prediction (NSP) objectives. | Indo Benchmark | HuggingFace |
| IndoBERT Large Model (phase2 - uncased) | IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model was trained using masked language modeling (MLM) and next sentence prediction (NSP) objectives. | Indo Benchmark | HuggingFace |
| IndoBERT-Lite Base Model (phase1 - uncased) | IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model was trained using masked language modeling (MLM) and next sentence prediction (NSP) objectives. | Indo Benchmark | HuggingFace |
| IndoBERT-Lite Base Model (phase2 - uncased) | IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model was trained using masked language modeling (MLM) and next sentence prediction (NSP) objectives. | Indo Benchmark | HuggingFace |
| IndoBERT-Lite Large Model (phase1 - uncased) | IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model was trained using masked language modeling (MLM) and next sentence prediction (NSP) objectives. | Indo Benchmark | HuggingFace |
| IndoBERT-Lite Large Model (phase2 - uncased) | IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model was trained using masked language modeling (MLM) and next sentence prediction (NSP) objectives. | Indo Benchmark | HuggingFace |
| IndoBERT Base (uncased) | IndoBERT is the Indonesian version of the BERT model. The model was trained on over 220M words, aggregated from three main sources: Indonesian Wikipedia, news articles, and an Indonesian Web Corpus. | IndoLEM | HuggingFace |
| IndoBERT (Indonesian BERT Model) | IndoBERT is a pre-trained language model based on the BERT architecture for the Indonesian language. This is the base-uncased version, which uses the bert-base config. | Sarah Lintang | HuggingFace |
| Indo RoBERTa Small | Indo RoBERTa Small is a masked language model based on the RoBERTa model. It was trained on the latest (late December 2020) Indonesian Wikipedia articles. | Wilson Wongso | HuggingFace |
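Most of the models above can be tried directly with the `transformers` fill-mask pipeline. The snippet below is a minimal sketch: the Hub ID `cahya/bert-base-indonesian-522M` is assumed to correspond to the Indonesian BERT Base 522M model, the example sentence is illustrative, and the mask token depends on the tokenizer (`[MASK]` for BERT-style models, `<mask>` for RoBERTa-style models).

```python
# Minimal sketch: masked-word prediction with an Indonesian BERT model.
# The Hub ID is an assumption; swap in any masked LM from the table above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cahya/bert-base-indonesian-522M")

# BERT-style tokenizers use "[MASK]"; RoBERTa-style tokenizers use "<mask>".
for prediction in fill_mask("Ibu sedang bekerja di [MASK]."):
    print(f"{prediction['token_str']:>12s}  score = {prediction['score']:.3f}")
```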

Causal/Generative Language Models

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| GPT-2 Indonesian Small Kids Stories | GPT-2 Indonesian Small Kids Stories is a causal language model based on the OpenAI GPT-2 model. The model was originally the pre-trained GPT2 Small Indonesian model, which was then fine-tuned on Indonesian kids' stories from Room To Read and Let's Read. | Bookbot | HuggingFace |
| Indonesian GPT2 Small 522M | A GPT2-small model pre-trained on Indonesian Wikipedia using a causal language modeling (CLM) objective. This model is uncased: it does not make a difference between indonesia and Indonesia. | Cahya Wirawan | HuggingFace |
| GPT2-small-indonesian | A model pretrained on the Indonesian language using a causal language modeling (CLM) objective. The training data consists of the Indonesian subsets of OSCAR, mC4 and Wikipedia. | Flax Community | HuggingFace |
| GPT2-medium-indonesian | A model pretrained on the Indonesian language using a causal language modeling (CLM) objective. The training data consists of the Indonesian subsets of OSCAR, mC4 and Wikipedia. | Flax Community | HuggingFace |
| Indonesian GPT-2 finetuned on Indonesian academic journals | The Indonesian gpt2-small model fine-tuned on abstracts of Indonesian academic journals. All training was done on a TPUv2-8 VM sponsored by TPU Research Cloud. | Galuh | HuggingFace |
| Indonesian GPT-2-medium finetuned on Indonesian poems | The Indonesian gpt2-medium model fine-tuned on Indonesian poems. | Muhammad Agung Hambali | HuggingFace |
| Indonesian GPT-2 finetuned on Indonesian poems | The Indonesian gpt2-small model fine-tuned on Indonesian poems. | Muhammad Agung Hambali | HuggingFace |
| Indo GPT-2 Small | Indo GPT-2 Small is a language model based on the GPT-2 model. It was trained on the latest (late December 2020) Indonesian Wikipedia articles. | Wilson Wongso | HuggingFace |
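The generative models above plug into the `transformers` text-generation pipeline in the same way. Below is a minimal sketch, assuming `cahya/gpt2-small-indonesian-522M` is the Hub ID of the Indonesian GPT2 Small 522M model; the prompt and sampling parameters are illustrative choices, not recommendations.

```python
# Minimal sketch: free-form generation with an Indonesian GPT-2 model.
# The Hub ID and the sampling settings are assumptions.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="cahya/gpt2-small-indonesian-522M")
set_seed(42)  # make the sampled continuations reproducible

outputs = generator(
    "Pada suatu hari,",      # "Once upon a time,"
    max_length=50,
    do_sample=True,
    top_k=50,
    num_return_sequences=2,
)
for out in outputs:
    print(out["generated_text"], "\n---")
```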

Datasets

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| mC4-ID | The Indonesian subset of the multilingual C4 (mC4) dataset, filtered using the script provided by Clean Italian mC4. | Akmal, Samsul Rahmadani | HuggingFace |
| mC4-sampling | This dataset builds upon the AllenAI version of the original mC4 and adds sampling methods to perform perplexity-based filtering on the fly. Please refer to the BERTIN Project. | BERTIN Project; Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li and Peter J. Liu | HuggingFace |
| OPUS-100 | OPUS-100 is English-centric, meaning that all training pairs include English on either the source or target side. The corpus covers 100 languages (including English), selected based on the volume of parallel data available in OPUS. | Biao Zhang, Philip Williams, Ivan Titov and Rico Sennrich | HuggingFace |
| mC4 | A multilingual, colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). | Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li and Peter J. Liu | HuggingFace |
| Indonesian Newspapers 2018 | The dataset contains around 500K articles (136M words) from 7 Indonesian newspapers: Detik, Kompas, Tempo, CNN Indonesia, Sindo, Republika and Poskota. The articles are dated between 1 January 2018 and 20 August 2018 (with a few exceptions dated earlier). | Feryandi Nurdiantoro | HuggingFace |
| Indonesia Puisi | Puisi (poem) is an Indonesian poetic form. The dataset contains 7,223 Indonesian puisi with their titles and authors. | Ilham Firdausi Putra | HuggingFace |
| OSCAR | OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form. | Pedro Javier Ortiz Suárez, Laurent Romary and Benoit Sagot | HuggingFace |
| CC100 | This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing January-December 2018 Common Crawl snapshots. | Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin and Edouard Grave | HuggingFace |
| Wikipedia | Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dumps (https://dumps.wikimedia.org/), with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.). | Wikimedia Foundation | HuggingFace |
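The corpora above can be loaded through the `datasets` library. As a minimal sketch, the snippet below streams the Indonesian OSCAR subset referenced in the model descriptions (the unshuffled_deduplicated_id config); it assumes a `datasets` version with streaming support, and the field names match the OSCAR loader's schema.

```python
# Minimal sketch: loading the Indonesian OSCAR subset with the `datasets` library.
# Streaming avoids downloading the full corpus up front; the config name comes
# from the tables above (unshuffled_deduplicated_id).
from datasets import load_dataset

oscar_id = load_dataset(
    "oscar", "unshuffled_deduplicated_id", split="train", streaming=True
)

# Peek at a few documents without materialising the whole dataset.
for i, example in enumerate(oscar_id):
    print(example["text"][:200].replace("\n", " "), "...")
    if i == 2:
        break
```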