Javanese Text Classification

Models and Their Datasets for Javanese Text Classification

Text classification is the process of labeling or organizing text data into groups. It is a fundamental part of Natural Language Processing.

By Wilson Wongso, Steven Limcorn, and the AI-Research.id team

June 1, 2021

Models

Name	Description	Author	Link
Javanese BERT Small IMDB Classifier	Javanese BERT Small IMDB Classifier is a movie-review classification model based on the BERT model. It was trained on IMDb movie reviews translated to Javanese.	Wilson Wongso	HuggingFace
Javanese DistilBERT Small IMDB Classifier	Javanese DistilBERT Small IMDB Classifier is a movie-review classification model based on the DistilBERT model. It was trained on IMDb movie reviews translated to Javanese.	Wilson Wongso	HuggingFace
Javanese GPT-2 Small IMDB Classifier	Javanese GPT-2 Small IMDB Classifier is a movie-review classification model based on the GPT-2 model. It was trained on IMDb movie reviews translated to Javanese.	Wilson Wongso	HuggingFace
Javanese RoBERTa Small IMDB Classifier	Javanese RoBERTa Small IMDB Classifier is a movie-review classification model based on the RoBERTa model. It was trained on IMDb movie reviews translated to Javanese.	Wilson Wongso	HuggingFace
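
As a rough illustration of how one of the classifiers above might be used, the sketch below loads a model through the `pipeline` helper of the Hugging Face `transformers` library and tallies the predicted labels. The model identifier is an assumption, since the table only links to "HuggingFace" without spelling out the ID, and the label names the model returns are not stated here, so the code treats them as opaque strings. The helper names `classify` and `count_labels` are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List

# Assumed Hub identifier (not given in the table above); replace it with the
# ID shown on the model page you actually want to use.
MODEL_ID = "w11wo/javanese-bert-small-imdb-classifier"


def classify(texts: List[str], model_id: str = MODEL_ID) -> List[Dict[str, Any]]:
    """Return one {"label": ..., "score": ...} dict per input text."""
    # Imported lazily so count_labels() works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    return classifier(texts)


def count_labels(predictions: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count how many predictions fall under each label name."""
    return dict(Counter(p["label"] for p in predictions))


# Example call (downloads the model on first use, so it is left commented out):
# predictions = classify(["Film iki apik banget."])
# print(count_labels(predictions))
```

Keeping the tallying step separate from the model call means it can be checked on hand-written predictions without downloading any weights. The same `classify` function should work for the other three models by passing a different `model_id`.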

Datasets

Name	Description	Author	Link
IMDb Javanese	Large Movie Review Dataset translated to Javanese. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing.	Wilson Wongso; Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts	HuggingFace
WiLI-2018	WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced and a train-test split is provided.	Martin Thoma	HuggingFace