Indonesian Machine Translation

Models and Datasets for Indonesian Machine Translation

Machine Translation is the task of automatically converting one natural language into another, preserving the meaning of the input text, and producing fluent text in the output language.

By Wilson Wongso, Steven Limcorn and the AI-Research.id team

June 1, 2021

Models

| Name | Description | Author | Link |
|---|---|---|---|
| OPUS-MT-ID-EN | Machine translation from Indonesian to English. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-EN-ID | Machine translation from English to Indonesian. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-ES-ID | Machine translation from Spanish to Indonesian. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-ID-ES | Machine translation from Indonesian to Spanish. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-FR-ID | Machine translation from French to Indonesian. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-ID-FI | Machine translation from Indonesian to Finnish. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-FI-ID | Machine translation from Finnish to Indonesian. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-ID-SV | Machine translation from Indonesian to Swedish. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-ID-FR | Machine translation from Indonesian to French. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-SV-ID | Machine translation from Swedish to Indonesian. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-MUL-EN | Machine translation from multiple languages to English. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| OPUS-MT-EN-MUL | Machine translation from English to multiple languages. | Language Technology Research Group at the University of Helsinki | HuggingFace |
| mT5-Translate-EN-ID | mT5 machine translation from English to Indonesian. | Samsul Rahmadani | HuggingFace |
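
As a rough illustration of how the OPUS-MT entries above might be used, the sketch below loads one of them with the Hugging Face `transformers` library. It assumes that `transformers`, `torch` and `sentencepiece` are installed and that the checkpoints are published on the Hub under the `Helsinki-NLP/opus-mt-<src>-<tgt>` naming pattern. The helper names `opus_mt_model_id`, `with_target_token` and `translate` are hypothetical and chosen only for this example. For the multilingual-target model, the `>>code<<` prefix follows the Marian convention; the exact language code for Indonesian is an assumption and should be checked against the model card.

```python
from typing import List


def opus_mt_model_id(src: str, tgt: str) -> str:
    """Build a Hub repository ID, assuming the Helsinki-NLP naming pattern."""
    return f"Helsinki-NLP/opus-mt-{src.lower()}-{tgt.lower()}"


def with_target_token(text: str, lang_token: str) -> str:
    """Prefix a sentence with a Marian target-language token.

    Only needed for multilingual-target models such as OPUS-MT-EN-MUL.
    The right token for Indonesian is an assumption; see the model card.
    """
    return f">>{lang_token}<< {text}"


def translate(texts: List[str], src: str = "id", tgt: str = "en") -> List[str]:
    """Translate a batch of sentences with an OPUS-MT checkpoint."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    model_id = opus_mt_model_id(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(model_id)  # downloads on first use
    model = MarianMTModel.from_pretrained(model_id)

    # Pad to the longest sentence so the batch forms a rectangular tensor.
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["Selamat pagi."])` would download the Indonesian-to-English checkpoint on first use and return a list with one English string. The output depends on the model version, so none is shown here.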

Datasets

| Name | Description | Author | Link |
|---|---|---|---|
| Parallel Text Corpora for Multi-Domain Translation System | Parallel text corpora for a multi-domain translation system, created by BPPT (Indonesian Agency for the Assessment and Application of Technology) for the PAN Localization Project (A Regional Initiative to Develop Local Language Computing Capacity in Asia). The dataset contains around 24K sentences divided into 4 different topics (Economic, International, Science and Technology, and Sport). | Budiono, Hammam Riza, Chairil Hakim | HuggingFace |
| Bible Para | A multilingual parallel corpus created from translations of the Bible. | Christos Christodoulopoulos and Mark Steedman | HuggingFace |
| KDE4 | A parallel corpus of KDE4 localization files. | J. Tiedemann | HuggingFace |
| Gnome | A parallel corpus of GNOME localization files. | J. Tiedemann | HuggingFace |
| Ubuntu | A parallel corpus of Ubuntu localization files. | J. Tiedemann | HuggingFace |
| Tanzil | A collection of Quran translations compiled by the Tanzil project. | J. Tiedemann | HuggingFace |
| Tatoeba | A collection of translated sentences from Tatoeba. | J. Tiedemann | HuggingFace |
| Microsoft Terminology Collection | The Microsoft Terminology Collection can be used to develop localized versions of applications that integrate with Microsoft products. It can also be used to integrate Microsoft terminology into other terminology collections or serve as a base IT glossary for language development in the nearly 100 languages available. Terminology is provided in .tbx format, an industry standard for terminology exchange. | Microsoft & Leo Zhao and Quentin Lhoest | HuggingFace |
| Open Subtitles | A collection of translated movie subtitles from OpenSubtitles. | P. Lison and J. Tiedemann | HuggingFace |
| QED | The QCRI Educational Domain Corpus (formerly QCRI AMARA Corpus) is an open multilingual collection of subtitles for educational videos and lectures, collaboratively transcribed and translated over the AMARA web-based platform. | Qatar Computing Research Institute, Arabic Language Technologies Group | HuggingFace |
| Asian Language Treebank (ALT) | The ALT project aims to advance state-of-the-art Asian natural language processing (NLP) techniques through open collaboration on developing and using ALT. It was first conducted by NICT and UCSY, as described in Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch and Eiichiro Sumita (2016). It was then developed under ASEAN IVO, as described on the project's web page. | Hammam Riza, Michael Purwoadi, Teduh Uliniansyah, Aw Ai Ti, Sharifah Mahani Aljunied, Luong Chi Mai, Vu Tat Thang, Nguyen Phuong Thai, Vichet Chea, Sethserey Sam, and others | HuggingFace |
| The Universal Declaration of Human Rights (UDHR) | The Universal Declaration of Human Rights (UDHR) is a milestone document in the history of human rights. Drafted by representatives with different legal and cultural backgrounds from all regions of the world, it set out, for the first time, fundamental human rights to be universally protected. The Declaration was adopted by the UN General Assembly in Paris on 10 December 1948 during its 183rd plenary meeting. | UDHR & Joe Davison | HuggingFace |
| Web Inventory of Transcribed & Translated (WIT) TED Talks | The Web Inventory Talk is a collection of original TED talks and their translated versions. The translations are available in more than 109 languages, though the distribution is not uniform. | Mauro Cettolo, Christian Girardi, Marcello Federico | HuggingFace |
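
Many of the parallel corpora listed above are distributed through the Hugging Face `datasets` library, where each record typically carries a `translation` dictionary keyed by language code. The sketch below assumes that record layout and shows one way to turn such records into (source, target) sentence pairs for English and Indonesian. The function name `to_pairs`, the sample records, and the dataset identifier in the final comment are illustrative assumptions, not taken from any of the dataset cards.

```python
from typing import Iterable, List, Mapping, Tuple


def to_pairs(
    records: Iterable[Mapping], src: str = "en", tgt: str = "id"
) -> List[Tuple[str, str]]:
    """Extract (source, target) sentence pairs from translation-style records.

    Records missing either side, or with an empty side, are skipped.
    """
    pairs = []
    for record in records:
        translation = record.get("translation") or {}
        source = (translation.get(src) or "").strip()
        target = (translation.get(tgt) or "").strip()
        if source and target:
            pairs.append((source, target))
    return pairs


# Toy records written for this example; they are not drawn from any corpus.
sample = [
    {"id": "0", "translation": {"en": "Good morning.", "id": "Selamat pagi."}},
    {"id": "1", "translation": {"en": "Thank you.", "id": ""}},  # empty target
    {"id": "2", "translation": {"en": " See you. ", "id": "Sampai jumpa."}},
]
pairs = to_pairs(sample)
print(pairs)

# With the `datasets` library installed, a corpus could be loaded roughly like
# this (the identifier and keyword arguments are assumptions; check the card):
#   from datasets import load_dataset
#   corpus = load_dataset("tatoeba", lang1="en", lang2="id", split="train")
#   pairs = to_pairs(corpus)
```

Filtering out one-sided records up front keeps a downstream tokenizer from seeing empty strings, which is a common source of silent misalignment in parallel data.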