Sundanese Language Model

Sundanese Language Models and their Datasets

A language model is a model that computes the probability of a sentence (a sequence of words) or the probability of the next word in a sequence.
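
Concretely, the two views are the same quantity: by the chain rule, the probability of a sentence factorizes into a product of next-word probabilities, which is what these models are trained to estimate.

$$
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$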

By Wilson Wongso, Steven Limcorn, and the AI-Research.id team

June 1, 2021

Datasets

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| mC4-sampling | Builds upon the AllenAI version of the original mC4 and adds sampling methods to perform perplexity-based filtering on the fly. Please refer to the BERTIN Project. | BERTIN Project; Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu | HuggingFace |
| mC4 | A colossal, cleaned multilingual version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). | Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu | HuggingFace |
| OSCAR | OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form. | Pedro Javier Ortiz Suárez, Laurent Romary, and Benoit Sagot | HuggingFace |
| CC100 | An attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, processing the January-December 2018 Common Crawl snapshots. | Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave | HuggingFace |
| Wikipedia | Wikipedia dataset containing cleaned articles in all languages, built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markdown and unwanted sections (references, etc.). | Wikimedia Foundation | HuggingFace |
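
As a quick illustration, the sketch below loads the Sundanese portions of some of these corpora with the Hugging Face `datasets` library. The config names and arguments ("su", "unshuffled_deduplicated_su", lang="su") are assumptions based on each dataset card's documented naming scheme, so verify them against the cards before use.

```python
# Minimal sketch: loading Sundanese subsets of the corpora above with Hugging Face `datasets`.
# Config names below are assumptions taken from each dataset card's naming scheme;
# double-check the cards before relying on them.
from datasets import load_dataset

# mC4: one config per language code; streaming avoids downloading the full corpus.
mc4_su = load_dataset("mc4", "su", split="train", streaming=True)

# OSCAR: configs follow the "unshuffled_deduplicated_<lang>" pattern.
oscar_su = load_dataset("oscar", "unshuffled_deduplicated_su", split="train")

# CC100: the language is selected via the `lang` argument.
cc100_su = load_dataset("cc100", lang="su", split="train")

# mC4-sampling and Wikipedia take extra arguments (sampling method, dump date);
# see their respective dataset cards for the exact options.

# Peek at one document from the streamed mC4 split.
print(next(iter(mc4_su))["text"][:200])
```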