Latxa Language Model for Basque


Published 24-09-2024
Naiara Perez, Julen Etxaniz, Oscar Sainz, Itziar Aldabe, German Rigau, Eneko Agirre, Ahmed Salem, Aitor Ormazabal, Mikel Artetxe, Aitor Soroa

Abstract

We introduce the Latxa family of Large Language Models (LLMs), currently the largest developed for Basque. Latxa models range from 7 to 70 billion parameters and are built on Llama 2 models, which we continued pretraining on 4.3 million documents and 4.2 billion tokens of Basque. To address the scarcity of high-quality evaluation benchmarks for Basque, we collected four new datasets: EusProficiency, comprising 5,169 Atarikoa test questions from EGA exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, with 1,715 general knowledge questions across 5 areas; and EusExams, comprising 16,774 questions from public office exams. We evaluated Latxa and other LLMs (both monolingual and multilingual), with results showing Latxa's superiority over previous open models. Latxa also obtains results competitive with the commercial GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa model family and our pretraining and evaluation data are publicly available under open licenses.



Section
Special Issue (Ale Berezia)