Javanese Language Model
Javanese Language Models and Their Datasets
A language model is a model that computes the probability of a sentence (a sequence of words) or the probability of the next word in a sequence.
By Wilson Wongso, Steven Limcorn, and the AI-Research.id team
June 1, 2021
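As a brief aside on the definition above (standard textbook material, not specific to the models listed here), a causal language model factorizes the probability of a sentence with the chain rule, while a masked language model instead predicts a token that has been hidden from its surrounding context:

$$
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$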
Masked Language Models
Name | Description | Author | Link |
---|---|---|---|
Javanese BERT Small | Javanese BERT Small is a masked language model based on the BERT model. It was trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
Javanese DistilBERT Small | Javanese DistilBERT Small is a masked language model based on the DistilBERT model. It was trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
Javanese RoBERTa Small | Javanese RoBERTa Small is a masked language model based on the RoBERTa model. It was trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
Javanese BERT Small IMDB | Javanese BERT Small IMDB is a masked language model based on the BERT model. It was trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
Javanese DistilBERT Small IMDB | Javanese DistilBERT Small IMDB is a masked language model based on the DistilBERT model. It was trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
Javanese RoBERTa Small IMDB | Javanese RoBERTa Small IMDB is a masked language model based on the RoBERTa model. It was trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
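As a minimal usage sketch, the masked models above can be loaded through the HuggingFace `transformers` fill-mask pipeline. The repository ID and the Javanese example sentence below are assumptions; check each model's HuggingFace page for the exact name, and note that RoBERTa-based checkpoints use `<mask>` rather than `[MASK]`.

```python
from transformers import pipeline

# Hypothetical repository ID -- confirm the exact name on the model's
# HuggingFace page before running.
fill_mask = pipeline("fill-mask", model="w11wo/javanese-bert-small")

# Ask the model to fill in the masked token of a Javanese sentence.
# BERT/DistilBERT checkpoints use [MASK]; RoBERTa checkpoints use <mask>.
predictions = fill_mask("Aku lagi mangan sega ing [MASK].")
for p in predictions:
    print(f"{p['token_str']:>12}  {p['score']:.4f}")
```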
Causal/Generative Language Models
Name | Description | Author | Link |
---|---|---|---|
Javanese GPT-2 Small | Javanese GPT-2 Small is a causal language model based on the GPT-2 model. It was trained on the latest (late December 2020) Javanese Wikipedia articles. | Wilson Wongso | HuggingFace |
Javanese GPT-2 Small IMDB | Javanese GPT-2 Small IMDB is a causal language model based on the GPT-2 model. It was trained on Javanese IMDB movie reviews. | Wilson Wongso | HuggingFace |
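A similarly minimal generation sketch uses the `transformers` text-generation pipeline. The repository ID and the Javanese prompt are assumptions, so verify them against the HuggingFace link in the table above.

```python
from transformers import pipeline, set_seed

# Hypothetical repository ID -- verify against the model's HuggingFace page.
generator = pipeline("text-generation", model="w11wo/javanese-gpt2-small")
set_seed(42)  # make sampling reproducible

outputs = generator(
    "Jenengku Budi, saben dina aku",  # example Javanese prompt
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=2,
)
for out in outputs:
    print(out["generated_text"])
```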
Datasets
Name | Description | Author | Link |
---|---|---|---|
mC4-sampling | This dataset builds upon the AllenAI version of the original mC4 and adds sampling methods to perform perplexity-based filtering on the fly. Please refer to the BERTIN Project for details. | BERTIN Project; Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu | HuggingFace |
mC4 | A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). | Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu | HuggingFace |
OSCAR | OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated forms. | Pedro Javier Ortiz Suárez, Laurent Romary, and Benoit Sagot | HuggingFace |
CC100 | This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January-December 2018 Common Crawl snapshots. | Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave | HuggingFace |
Wikipedia | Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.). | Wikimedia Foundation | HuggingFace |
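For reference, the corpora above can be pulled through the HuggingFace `datasets` library. The config names below ("jv" for Javanese and the OSCAR variant name) are assumptions based on the usual language-code conventions; confirm them on each dataset's HuggingFace page.

```python
from datasets import load_dataset

# Javanese subset of mC4, streamed so the full corpus is not downloaded
# up front. The "jv" config name is an assumption -- check the dataset card.
mc4_jv = load_dataset("mc4", "jv", split="train", streaming=True)
print(next(iter(mc4_jv))["text"][:200])

# Javanese portion of OSCAR, deduplicated variant (config name is an assumption).
oscar_jv = load_dataset("oscar", "unshuffled_deduplicated_jv", split="train")
print(oscar_jv[0]["text"][:200])
```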