Indonesian Automatic Speech Recognition

Speech Recognition for Indonesian, Javanese and Sundanese.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text.

By Wilson Wongso, Steven Limcorn and AI-Research.id team

June 1, 2021

Models

Name | Description | Author | Link
--- | --- | --- | ---
Wav2Vec2-Large-XLSR-Indonesian | Fine-tuned facebook/wav2vec2-large-xlsr-53 on the Indonesian Artificial Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz. | Cahya Wirawan | HuggingFace
Wav2Vec2-Large-XLSR-Indonesian | Fine-tuned facebook/wav2vec2-large-xlsr-53 on the Indonesian Common Voice dataset and on synthetic voices generated using Artificial Common Voicer, which is in turn based on Google Text-to-Speech. When using this model, make sure that your speech input is sampled at 16 kHz. | Cahya Wirawan | HuggingFace
Wav2Vec2-Large-XLSR-Indonesian | Fine-tuned facebook/wav2vec2-large-xlsr-53 on the Indonesian Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz. | Cahya Wirawan | HuggingFace
Wav2Vec2-Large-XLSR-Indonesian | This is the model for Wav2Vec2-Large-XLSR-Indonesian, a facebook/wav2vec2-large-xlsr-53 model fine-tuned on the Indonesian Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz. | Galuh | HuggingFace
Wav2Vec2-Large-XLSR-Indonesian | This is the model for Wav2Vec2-Large-XLSR-Indonesian, a facebook/wav2vec2-large-xlsr-53 model fine-tuned on the Indonesian Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz. | Indonesian NLP | HuggingFace
Wav2Vec2-Large-XLSR-Indonesian | This is the baseline for Wav2Vec2-Large-XLSR-Indonesian, a facebook/wav2vec2-large-xlsr-53 model fine-tuned on the Indonesian Common Voice dataset. It was trained with the default hyperparameters for 2x30 epochs. When using this model, make sure that your speech input is sampled at 16 kHz. | Indonesian NLP | HuggingFace
Wav2Vec2-Large-XLSR-53-Indonesia | Fine-tuned facebook/wav2vec2-large-xlsr-53 on Indonesian using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz. | Muhammad Agung Hambali | HuggingFace
Wav2Vec2-Large-XLSR-53-Indonesia | Fine-tuned facebook/wav2vec2-large-xlsr-53 on Indonesian using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz. | Muhammad Agung Hambali | HuggingFace
XLSR-Indonesia | Wav2Vec2 fine-tuned on Common Voice ID Test. | Samsul Rahmadani | HuggingFace
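
Most of the models listed above expect speech input sampled at 16 kHz, so audio recorded at another rate has to be resampled first. The following is a minimal, dependency-free sketch of that step using linear interpolation. The function name `resample_linear` and the constant `TARGET_RATE` are illustrative assumptions and do not come from any of the model cards. In practice a dedicated routine such as `torchaudio.transforms.Resample` or `librosa.resample` is likely a better choice, because plain linear interpolation applies no anti-aliasing filter when downsampling.

```python
from typing import List, Sequence

# Sampling rate the listed models expect (taken from the descriptions above).
TARGET_RATE = 16_000


def resample_linear(
    samples: Sequence[float],
    orig_rate: int,
    target_rate: int = TARGET_RATE,
) -> List[float]:
    """Resample a mono signal by linear interpolation (illustrative only).

    Hypothetical helper: no low-pass filtering is applied, so downsampling
    real speech this way may introduce aliasing.
    """
    if orig_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_rate == target_rate or len(samples) == 0:
        return list(samples)

    # Number of output samples covering the same duration as the input.
    n_out = max(1, round(len(samples) * target_rate / orig_rate))
    # Distance, in input samples, between consecutive output samples.
    step = orig_rate / target_rate
    last = len(samples) - 1

    out: List[float] = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Weighted average of the two neighbouring input samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


if __name__ == "__main__":
    # One second of a synthetic 48 kHz ramp, reduced to 16 kHz.
    one_second = [float(i) for i in range(48_000)]
    print(len(resample_linear(one_second, 48_000)))
```

The resampled list can then be passed to whatever preprocessing the chosen model requires; that step depends on the specific model card and is not shown here.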

Datasets

Name | Description | Author | Link
--- | --- | --- | ---
Common Voice | Each entry in the Common Voice dataset consists of a unique MP3 and a corresponding text file. Many of the 9,283 recorded hours in the dataset also include demographic metadata such as age, sex, and accent, which can help improve the accuracy of speech recognition engines. | Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., Weber, G. | HuggingFace
VoxLingua107 | VoxLingua107 is a speech dataset for training spoken language identification models. The dataset consists of speech segments extracted from YouTube videos and post-processed. The Indonesian subset has 40 hours (3.8 GB). | Jörgen Valk, Tanel Alumäe | bark.phon.ioc.ee