Indonesian Automatic Speech Recognition

Speech Recognition for Indonesian, Javanese and Sundanese.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text.

By Wilson Wongso, Steven Limcorn and the AI-Research.id team

June 1, 2021

Models

Name	Description	Author	Link
Wav2Vec2-Large-XLSR-Indonesian	Fine-tuned facebook/wav2vec2-large-xlsr-53 on the Indonesian Artificial Common Voice dataset. When using this model, make sure that your speech input is sampled at 16kHz.	Cahya Wirawan	HuggingFace
Wav2Vec2-Large-XLSR-Indonesian	Fine-tuned facebook/wav2vec2-large-xlsr-53 on the Indonesian Common Voice dataset and synthetic voices generated using Artificial Common Voicer, which in turn is based on Google Text-to-Speech. When using this model, make sure that your speech input is sampled at 16kHz.	Cahya Wirawan	HuggingFace
Wav2Vec2-Large-XLSR-Indonesian	Fine-tuned facebook/wav2vec2-large-xlsr-53 on the Indonesian Common Voice dataset. When using this model, make sure that your speech input is sampled at 16kHz.	Cahya Wirawan	HuggingFace
Wav2Vec2-Large-XLSR-Indonesian	This is the model for Wav2Vec2-Large-XLSR-Indonesian, a fine-tuned facebook/wav2vec2-large-xlsr-53 model on the Indonesian Common Voice dataset. When using this model, make sure that your speech input is sampled at 16kHz.	Galuh	HuggingFace
Wav2Vec2-Large-XLSR-Indonesian	This is the model for Wav2Vec2-Large-XLSR-Indonesian, a fine-tuned facebook/wav2vec2-large-xlsr-53 model on the Indonesian Common Voice dataset. When using this model, make sure that your speech input is sampled at 16kHz.	Indonesian NLP	HuggingFace
Wav2Vec2-Large-XLSR-Indonesian	This is the baseline for Wav2Vec2-Large-XLSR-Indonesian, a fine-tuned facebook/wav2vec2-large-xlsr-53 model on the Indonesian Common Voice dataset. It was trained using the default hyperparameters for 2x30 epochs. When using this model, make sure that your speech input is sampled at 16kHz.	Indonesian NLP	HuggingFace
Wav2Vec2-Large-XLSR-53-Indonesia	Fine-tuned facebook/wav2vec2-large-xlsr-53 on Indonesian using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16kHz.	Muhammad Agung Hambali	HuggingFace
Wav2Vec2-Large-XLSR-53-Indonesia	Fine-tuned facebook/wav2vec2-large-xlsr-53 on Indonesian using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16kHz.	Muhammad Agung Hambali	HuggingFace
XLSR-Indonesia	Wav2Vec2 fine-tuned on Common Voice ID Test.	Samsul Rahmadani	HuggingFace
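
Most of the models above ask for speech input sampled at 16kHz. The following is a minimal sketch, using only the Python standard library, of how one might check the sampling rate of a WAV file and bring a mono signal to 16kHz. The names `wav_sample_rate` and `resample_linear` are hypothetical helpers written for this illustration and are not part of any of the listed models. Linear interpolation is used only to keep the example short; it applies no anti-aliasing filter, so a dedicated audio library with a proper resampler is likely a better choice for real recordings.

```python
import wave

TARGET_RATE = 16_000  # sampling rate (Hz) requested in the model descriptions


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in the header of a WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def resample_linear(samples, source_rate, target_rate=TARGET_RATE):
    """Resample a mono sequence of floats by linear interpolation.

    Illustrative only: no low-pass filtering is applied before
    downsampling, so high-frequency content may alias.
    """
    if source_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    if source_rate == target_rate or len(samples) < 2:
        return list(samples)

    # Number of output samples covering the same duration.
    n_out = max(1, round(len(samples) * target_rate / source_rate))
    step = source_rate / target_rate  # input samples per output sample
    last = len(samples) - 1

    resampled = []
    for i in range(n_out):
        pos = i * step                 # fractional position in the input
        left = min(int(pos), last)     # nearest input sample at or before pos
        right = min(left + 1, last)    # next input sample, clamped at the end
        frac = pos - left
        resampled.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return resampled
```

For example, one second of audio recorded at 48kHz (48,000 samples) comes back from `resample_linear(samples, 48_000)` as 16,000 samples, while input that is already at 16kHz is returned unchanged.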

Datasets

Name	Description	Author	Link
Common Voice	Each entry in the Common Voice dataset consists of a unique MP3 and a corresponding text file. Many of the 9,283 recorded hours in the dataset also include demographic metadata such as age, sex, and accent, which can help improve the accuracy of speech recognition engines.	Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M., and Weber, G.	HuggingFace
VoxLingua107	VoxLingua107 is a speech dataset for training spoken language identification models. The dataset consists of speech segments extracted from YouTube videos and then post-processed. The Indonesian subset contains 40 hours of speech (3.8G).	Jörgen Valk, Tanel Alumäe	bark.phon.ioc.ee