Sundanese Language Model
Sundanese Language Models and Their Datasets
A language model is a model that computes the probability of a sentence (a sequence of words) or, equivalently, the probability of the next word given the words that precede it (see the factorization below).
By Wilson Wongso, Steven Limcorn, and the AI-Research.id team
June 1, 2021
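As a quick formal sketch (standard chain-rule notation, not specific to any model in this post), a language model factorizes the probability of a sentence into next-word predictions:

```latex
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```

Training a language model thus amounts to estimating each conditional next-word distribution from raw text, which is what the corpora below provide for Sundanese.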
Datasets
Name | Description | Author | Link |
---|---|---|---|
mC4-sampling | This dataset builds upon the AllenAI version of the original mC4 and adds sampling methods to perform perplexity-based filtering on the fly. Please refer to the BERTIN Project. | BERTIN Project; Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu | [HuggingFace](https://huggingface.co/datasets/bertin-project/mc4-sampling) |
mC4 | A colossal, cleaned, multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). | Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu | [HuggingFace](https://huggingface.co/datasets/mc4) |
OSCAR | OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form. | Pedro Javier Ortiz Suárez, Laurent Romary, and Benoit Sagot | [HuggingFace](https://huggingface.co/datasets/oscar) |
CC100 | An attempt to recreate the dataset used for training XLM-R. The corpus comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing January-December 2018 Common Crawl snapshots. | Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave | [HuggingFace](https://huggingface.co/datasets/cc100) |
Wikipedia | Wikipedia dataset containing cleaned articles of all languages, built from the Wikipedia dumps (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.). | Wikimedia Foundation | [HuggingFace](https://huggingface.co/datasets/wikipedia) |
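As a minimal sketch of pulling the Sundanese slices of these corpora with the Hugging Face `datasets` library: the configuration names below are assumptions based on the usual Hub conventions ("su" is the ISO 639-1 code for Sundanese, and the Wikipedia dump date is an example snapshot), so check each dataset card before relying on them.

```python
from datasets import load_dataset

# Sundanese subsets of the corpora listed above. "su" is the ISO 639-1
# language code for Sundanese; all configuration names here are assumed
# from the dataset cards rather than taken from this post.

# mC4: one configuration per language; streaming avoids a full download.
mc4_su = load_dataset("mc4", "su", split="train", streaming=True)

# OSCAR: the deduplicated, unshuffled Sundanese shard.
oscar_su = load_dataset("oscar", "unshuffled_deduplicated_su", split="train")

# CC100: the loading script takes the language code as a keyword argument.
cc100_su = load_dataset("cc100", lang="su", split="train")

# Wikipedia: languages without a preprocessed config are built from a dump
# on the fly and require Apache Beam; the date is a snapshot identifier.
wiki_su = load_dataset(
    "wikipedia", language="su", date="20210501", beam_runner="DirectRunner"
)

# mC4-sampling exposes the same interface plus on-the-fly sampling options
# (e.g. perplexity-based filtering); see the BERTIN Project card for the
# exact arguments.

# Peek at one document from the streamed mC4 subset.
print(next(iter(mc4_su))["text"])
```

Streaming mC4 is deliberate: the full corpus is far too large to download just to inspect its Sundanese portion.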