Javanese Language Model

Javanese Language Models and their Datasets

A language model is a model that computes the probability of a sentence (a sequence of words), or the probability of the next word given the words that precede it.

By Wilson Wongso, Steven Limcorn, and the AI-Research.id team

June 1, 2021
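To make the definition above concrete, a causal (generative) language model factorizes the probability of a sentence with the chain rule, predicting each word from the words before it:

$$
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
$$

The masked models listed first are trained on the complementary objective: predicting tokens that have been masked out, given their surrounding context, as in BERT.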

Masked Language Models

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| Javanese BERT Small | Masked language model based on BERT, trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
| Javanese DistilBERT Small | Masked language model based on DistilBERT, trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
| Javanese RoBERTa Small | Masked language model based on RoBERTa, trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
| Javanese BERT Small IMDB | Masked language model based on BERT, trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
| Javanese DistilBERT Small IMDB | Masked language model based on DistilBERT, trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
| Javanese RoBERTa Small IMDB | Masked language model based on RoBERTa, trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
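As a quick illustration of how these masked models can be used, the sketch below runs one of them through the HuggingFace fill-mask pipeline. The checkpoint name `w11wo/javanese-bert-small` is assumed here (the author's HuggingFace namespace); substitute whichever model from the table you want to try.

```python
from transformers import pipeline

# Checkpoint name is an assumption; swap in any masked model listed above.
fill_mask = pipeline("fill-mask", model="w11wo/javanese-bert-small")

# Use the tokenizer's own mask token so the same snippet works for
# BERT/DistilBERT ([MASK]) and RoBERTa (<mask>) checkpoints alike.
mask = fill_mask.tokenizer.mask_token

# "Aku tuku sega ing ___." -- "I buy rice at the ___."
for prediction in fill_mask(f"Aku tuku sega ing {mask}."):
    print(prediction["token_str"], prediction["score"])
```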

Causal/Generative Language Models

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| Javanese GPT-2 Small | Causal language model based on GPT-2, trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
| Javanese GPT-2 Small IMDB | Causal language model based on GPT-2, trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
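The causal models serve open-ended text generation in the same way. A minimal sketch, again assuming the checkpoint name `w11wo/javanese-gpt2-small` from the author's HuggingFace profile:

```python
from transformers import pipeline

# Checkpoint name is assumed; use the IMDB-tuned variant for review-style text.
generator = pipeline("text-generation", model="w11wo/javanese-gpt2-small")

# Continue a Javanese prompt ("One day, ..."); sampling keeps the output varied.
outputs = generator(
    "Ing sawijining dina,",
    max_length=50,
    do_sample=True,
    top_k=50,
)
print(outputs[0]["generated_text"])
```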

Datasets

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| mC4-sampling | Builds upon the AllenAI version of the original mC4 and adds sampling methods to perform perplexity-based filtering on the fly. See the BERTIN Project for details. | BERTIN Project; Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu | HuggingFace |
| mC4 | A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus (https://commoncrawl.org). | Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu | HuggingFace |
| OSCAR | Open Super-large Crawled ALMAnaCH coRpus: a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form. | Pedro Javier Ortiz Suárez, Laurent Romary, and Benoit Sagot | HuggingFace |
| CC100 | An attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages, including romanized languages (indicated by *_rom), and was constructed from the URLs and paragraph indices provided by the CC-Net repository, processing the January-December 2018 Common Crawl snapshots. | Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave | HuggingFace |
| Wikipedia | Cleaned Wikipedia articles in all languages, built from the Wikipedia dumps (https://dumps.wikimedia.org/) with one split per language. Each example contains one full article, cleaned to strip markup and unwanted sections (references, etc.). | Wikimedia Foundation | HuggingFace |
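Each of these corpora exposes a Javanese subset on HuggingFace. The sketch below loads two of them with the `datasets` library; the configuration names (`jv` for mC4, `unshuffled_deduplicated_jv` for OSCAR) are assumed from the datasets' usual language-code conventions, so check the dataset cards if they differ, and note that script-based loaders may need extra arguments depending on your `datasets` version.

```python
from datasets import load_dataset

# Javanese portion of mC4; the "jv" config name is assumed from mC4's
# language-code convention. Streaming avoids downloading the full corpus.
mc4_jv = load_dataset("mc4", "jv", split="train", streaming=True)
print(next(iter(mc4_jv))["text"][:200])

# Javanese portion of OSCAR, deduplicated variant; config name assumed.
oscar_jv = load_dataset("oscar", "unshuffled_deduplicated_jv", split="train")
print(oscar_jv[0]["text"][:200])
```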