Javanese Language Model
Javanese Language Models and Their Datasets
A language model is a model that computes the probability of a sentence (a sequence of words) or the probability of the next word in a sequence.
By Wilson Wongso, Steven Limcorn, and the AI-Research.id team
June 1, 2021
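As a brief aside on the definition above (standard textbook material, not specific to the models listed here), a causal language model factorizes the probability of a sentence with the chain rule, while a masked language model instead predicts a token that has been hidden from its surrounding context:

$$
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$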
Masked Language Models
Name | Description | Author | Link |
---|---|---|---|
Javanese BERT Small | Javanese BERT Small is a masked language model based on the BERT model. It was trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
Javanese DistilBERT Small | Javanese DistilBERT Small is a masked language model based on the DistilBERT model. It was trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
Javanese RoBERTa Small | Javanese RoBERTa Small is a masked language model based on the RoBERTa model. It was trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
Javanese BERT Small IMDB | Javanese BERT Small IMDB is a masked language model based on the BERT model. It was trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
Javanese DistilBERT Small IMDB | Javanese DistilBERT Small IMDB is a masked language model based on the DistilBERT model. It was trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
Javanese RoBERTa Small IMDB | Javanese RoBERTa Small IMDB is a masked language model based on the RoBERTa model. It was trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
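As a minimal usage sketch, the masked models above can be loaded through the HuggingFace `transformers` fill-mask pipeline. The repository ID and the Javanese example sentence below are assumptions; check each model's HuggingFace page for the exact name, and note that RoBERTa-based checkpoints use `<mask>` rather than `[MASK]`.

```python
from transformers import pipeline

# Hypothetical repository ID -- confirm the exact name on the model's
# HuggingFace page before running.
fill_mask = pipeline("fill-mask", model="w11wo/javanese-bert-small")

# Ask the model to fill in the masked token of a Javanese sentence.
# BERT/DistilBERT checkpoints use [MASK]; RoBERTa checkpoints use <mask>.
predictions = fill_mask("Aku lagi mangan sega ing [MASK].")
for p in predictions:
    print(f"{p['token_str']:>12}  {p['score']:.4f}")
```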
Causal/Generative Language Models
Name | Description | Author | Link |
---|---|---|---|
Javanese GPT-2 Small | Javanese GPT-2 Small is a causal language model based on the GPT-2 model. It was trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
Javanese GPT-2 Small IMDB | Javanese GPT-2 Small IMDB is a causal language model based on the GPT-2 model. It was trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
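A similarly minimal generation sketch uses the `transformers` text-generation pipeline. The repository ID and the Javanese prompt are assumptions, so verify them against the HuggingFace link in the table above.

```python
from transformers import pipeline, set_seed

# Hypothetical repository ID -- verify against the model's HuggingFace page.
generator = pipeline("text-generation", model="w11wo/javanese-gpt2-small")
set_seed(42)  # make sampling reproducible

outputs = generator(
    "Jenengku Budi, saben dina aku",  # example Javanese prompt
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=2,
)
for out in outputs:
    print(out["generated_text"])
```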
Datasets
Name | Description | Author | Link |
---|---|---|---|
mC4-sampling | This dataset builds upon the AllenAI version of the original mC4 and adds sampling methods to perform perplexity-based filtering on the fly. Please refer to the BERTIN Project for details. | BERTIN Project; Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu | HuggingFace |
mC4 | A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). | Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu | HuggingFace |
OSCAR | OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated forms. | Pedro Javier Ortiz Suárez, Laurent Romary, and Benoit Sagot | HuggingFace |
CC100 | This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January-December 2018 Common Crawl snapshots. | Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave | HuggingFace |
Wikipedia | Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.). | Wikimedia Foundation | HuggingFace |
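For reference, the corpora above can be pulled through the HuggingFace `datasets` library. The config names below ("jv" for Javanese and the OSCAR variant name) are assumptions based on the usual language-code conventions; confirm them on each dataset's HuggingFace page.

```python
from datasets import load_dataset

# Javanese subset of mC4, streamed so the full corpus is not downloaded
# up front. The "jv" config name is an assumption -- check the dataset card.
mc4_jv = load_dataset("mc4", "jv", split="train", streaming=True)
print(next(iter(mc4_jv))["text"][:200])

# Javanese portion of OSCAR, deduplicated variant (config name is an assumption).
oscar_jv = load_dataset("oscar", "unshuffled_deduplicated_jv", split="train")
print(oscar_jv[0]["text"][:200])
```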