Indonesian Text Summarization

Models and Datasets for Indonesian Text Summarization

Text summarization is the task of producing a shorter version of one or more documents that preserves most of the input's meaning.

By Wilson Wongso, Steven Limcorn and the AI-Research.id team

June 1, 2021

Models

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| Indonesian T5 Summarization Base Model | The t5-base-indonesian-summarization-cased model is based on t5-base-bahasa-summarization-cased by huseinzol05, fine-tuned on the id_liputan6 dataset. | Cahya Wirawan | HuggingFace |
| Indonesian BERT2GPT Summarization Model | The bert2gpt-indonesian-summarization model is based on cahya/bert-base-indonesian-1.5G and cahya/gpt2-small-indonesian-522M by cahya, fine-tuned on the id_liputan6 dataset. | Cahya Wirawan | HuggingFace |
| Indonesian BERT2BERT Summarization Model | The bert2bert-indonesian-summarization model is based on cahya/bert-base-indonesian-1.5G by cahya, fine-tuned on the id_liputan6 dataset. | Cahya Wirawan | HuggingFace |
| Indonesian T5 Summarization Small Model | The t5-small-indonesian-summarization-cased model is based on t5-small-bahasa-summarization-cased by huseinzol05, fine-tuned on the indosum dataset. | Panggi Libersa Jasri Akadol | HuggingFace |
| Indonesian T5 Summarization Base Model | The t5-base-indonesian-summarization-cased model is based on t5-base-bahasa-summarization-cased by huseinzol05, fine-tuned on the indosum dataset. | Panggi Libersa Jasri Akadol | HuggingFace |
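
As a rough illustration of how one of the models listed above might be used, here is a minimal sketch built on the Hugging Face transformers library. The hub identifier `cahya/t5-base-indonesian-summarization-cased` is an assumption, since the table gives only the model name and its author. The helper names `clean_article` and `summarize` are hypothetical, and the decoding settings are illustrative rather than tuned values.

```python
# Assumption: hub identifier inferred from the model name and author in the table.
MODEL_ID = "cahya/t5-base-indonesian-summarization-cased"


def clean_article(text: str) -> str:
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return " ".join(text.split())


def summarize(article: str, model_id: str = MODEL_ID) -> str:
    """Generate an abstractive summary with a T5 checkpoint from the hub."""
    # Imported here so clean_article() stays usable without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Truncate long news articles to a 512-token input window.
    inputs = tokenizer(
        clean_article(article),
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        min_length=20,  # illustrative decoding settings, not tuned
        max_length=80,
        num_beams=4,
        no_repeat_ngram_size=2,
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling `summarize(article_text)` downloads the checkpoint on first use, so it needs network access plus the transformers, torch and sentencepiece packages. The model's own page on the hub remains the authoritative source for recommended settings.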

Datasets

| Name | Description | Author | Link |
| --- | --- | --- | --- |
| WikiLingua | A large-scale, multilingual dataset for the evaluation of crosslingual abstractive summarization systems. The authors extracted article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. | Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown | HuggingFace |
| Liputan6 | A large-scale Indonesian summarization dataset. The authors harvested articles from an online news portal and obtained 215,827 document-summary pairs. | Fajri Koto, Jey Han Lau and Timothy Baldwin | HuggingFace |
| XLSum | A comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 45 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. | Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman and Rifat Shahriyar | HuggingFace |
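
To show how article-summary pairs from one of these datasets might be read, here is a small sketch using the Hugging Face datasets library. The hub identifier `csebuetnlp/xlsum`, the `indonesian` configuration and the `text` and `summary` field names are assumptions that should be checked against the dataset card. `to_pairs` and `load_pairs` are hypothetical helper names.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumptions: hub identifier and configuration name for the Indonesian XLSum subset.
DATASET_ID = "csebuetnlp/xlsum"
CONFIG = "indonesian"


def to_pairs(
    records: Iterable[Dict[str, str]],
    document_key: str = "text",
    summary_key: str = "summary",
) -> Iterator[Tuple[str, str]]:
    """Yield (document, summary) tuples from dataset records."""
    for record in records:
        yield record[document_key], record[summary_key]


def load_pairs(split: str = "validation") -> Iterator[Tuple[str, str]]:
    """Download one split and expose it as (document, summary) pairs."""
    # Imported here so to_pairs() stays usable without datasets installed.
    from datasets import load_dataset

    dataset = load_dataset(DATASET_ID, CONFIG, split=split)
    return to_pairs(dataset)
```

The other datasets in the table use their own field names, so `document_key` and `summary_key` would need to be adjusted for them.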