Javanese Text Classification

Models and its Dataset for Javanese Text Classification

Text Classification is the processing of labeling or organizing text data into groups. It forms a fundamental part of Natural Language Processing.

By Wilson Wongso, Steven Limcorn and AI-Research.id team

June 1, 2021

Models

Name Description Author Link
Javanese BERT Small IMDB Classifier Javanese BERT Small IMDB Classifier is a movie-classification model based on the BERT model. It was trained on Javanese IMDB movie reviews. Wilson Wongso HuggingFace
Javanese DistilBERT Small IMDB Classifier Javanese DistilBERT Small IMDB Classifier is a movie-classification model based on the DistilBERT model. It was trained on Javanese IMDB movie reviews. Wilson Wongso HuggingFace
Javanese GPT-2 Small IMDB Classifier Javanese GPT-2 Small IMDB Classifier is a movie-classification model based on the GPT-2 model. It was trained on Javanese IMDB movie reviews. Wilson Wongso HuggingFace
Javanese RoBERTa Small IMDB Classifier Javanese RoBERTa Small IMDB Classifier is a movie-classification model based on the RoBERTa model. It was trained on Javanese IMDB movie reviews. Wilson Wongso HuggingFace

Datasets

Name Description Author Link
IMDb Javanese Large Movie Review Dataset translated to Javanese. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. Wilson Wongso & Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher HuggingFace
WiLI-2018 WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235000 paragraphs of 235 languages. The dataset is balanced and a train-test split is provided. Thoma, Martin HuggingFace