First Massive Artificial Intelligence System in the Spanish Language, MarIA, Begins to Summarize and Generate Texts

November 11, 2021

Nov. 11, 2021 — The MarIA project, a language model system created at the Barcelona Supercomputing Center (BSC) from the web archive of the National Library of Spain (BNE), framed within and financed by the Language Technologies Plan of the Secretary of State for Digitalization and Artificial Intelligence (SEDIA), has advanced in its development: its new version can summarize existing texts and generate new texts from headlines or a few words.

The MarIA project is the first massive artificial intelligence system expert in understanding and writing in the Spanish language. Due to its volume and capabilities, it places Spanish third among the languages that have massive open-access models, after English and Mandarin. It has been built from the digital documentary heritage of the National Library of Spain, which crawls and archives websites in Spanish, and it has been trained on the MareNostrum 4 supercomputer. It is published openly so that application developers, companies, research groups and society in general can use it in countless applications.

The latest advances of MarIA constitute a milestone toward the objectives of the National Strategy for Artificial Intelligence and the Recovery, Transformation and Resilience Plan, with which Spain intends to lead the development of tools, technologies and applications for the projection and use of the Spanish language in the fields where AI is applied. Specifically, the National Plan for Language Technologies, within which this project is framed, aims to promote the development of natural language processing, machine translation and conversational systems in Spanish and the co-official languages.

MarIA was created at the Barcelona Supercomputing Center and trained with more than 135 billion words from the web archive of the National Library of Spain.

Models to Understand the Language and Models to Generate Texts

A language model is an artificial intelligence system formed by a set of deep neural networks that have been trained to acquire an understanding of the language, its lexicon and the mechanisms it uses to express meaning, and to write at an expert level. These complex statistical models, which link words in texts in a systematic and massive way, are capable of “understanding” not only abstract concepts but also their context. With these models, developers of different applications can create tools for multiple uses, such as classifying documents or building proofreaders and translation tools.

The first version of MarIA was built with RoBERTa, a technology that creates “encoder”-type language models. Given a text sequence, this type of model generates an interpretation that can be used, for example, to classify documents, answer multiple-choice questions, find semantic similarities between texts, or detect the sentiments expressed in them.
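For readers who want to experiment, the sketch below shows how an encoder-type Spanish checkpoint published on the Hugging Face Hub could be queried for masked-word prediction. It is a minimal illustration, not part of the project's own documentation: the checkpoint identifier PlanTL-GOB-ES/roberta-base-bne and the `<mask>` token are assumptions about how the model is distributed.

```python
# Minimal sketch: querying an encoder-type (RoBERTa-style) Spanish model
# through the Hugging Face transformers pipeline. The checkpoint name below
# is an assumption; substitute whichever MarIA encoder checkpoint you use.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")

# The encoder predicts the masked word from its surrounding context.
# RoBERTa-style tokenizers usually use "<mask>" as the mask token.
for prediction in fill_mask("La capital de España es <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each prediction is a candidate word for the masked position together with its probability; this contextual interpretation is the kind of representation that downstream classifiers and similarity tools build on.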

The new version has been created with GPT-2, a more advanced technology that creates generative “decoder” models and adds new capabilities to the system. Given a text sequence, decoder models can generate new text. They can therefore be used, for example, to produce automatic summaries, simplify complicated wording for different user profiles, generate questions and answers, hold complex dialogues with users and even write full texts (which could appear to be written by humans) from a headline or a small number of words.
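A similarly hedged sketch for the generative side: given a headline, a GPT-2-style decoder checkpoint continues the text. The identifier PlanTL-GOB-ES/gpt2-base-bne and the sampling settings are illustrative assumptions, not details confirmed by the announcement.

```python
# Minimal sketch: generating Spanish text from a headline with a
# decoder-type (GPT-2-style) model. The checkpoint name is an assumption;
# swap in the MarIA generative checkpoint you actually use.
from transformers import pipeline

generator = pipeline("text-generation", model="PlanTL-GOB-ES/gpt2-base-bne")

headline = "La inteligencia artificial en español avanza"
outputs = generator(headline, max_length=60, do_sample=True,
                    top_p=0.95, num_return_sequences=1)
print(outputs[0]["generated_text"])
```

Summarization, question answering or dialogue would in practice be obtained by further adapting or prompting such a decoder on task-specific data rather than by raw continuation alone.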

These new capabilities make MarIA a tool that, with “ad hoc” training adapted to specific tasks, can be very useful for application developers, companies and public administrations. For example, models developed so far in English are used to generate text suggestions in writing applications, to summarize contracts or the complicated documents that detail a product’s benefits according to what each user wants to know, and to search large text databases for specific information and relate it to other relevant information.
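As an illustration of what such “ad hoc” training could look like in practice, the sketch below fine-tunes an assumed MarIA encoder checkpoint for a two-class document-classification task with the Hugging Face Trainer. The checkpoint id, the CSV files, the column names and the number of labels are placeholders, not details taken from the project.

```python
# Hedged sketch of "ad hoc" adaptation: fine-tuning an encoder checkpoint
# for document classification. All names below are illustrative assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "PlanTL-GOB-ES/roberta-base-bne"   # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# Any labelled Spanish corpus with "text" and "label" columns would do;
# these file names are placeholders for your own data.
dataset = load_dataset("csv", data_files={"train": "train.csv",
                                          "validation": "dev.csv"})

def tokenize(batch):
    # Pad/truncate every document to a fixed length for simple batching.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="maria-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```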

“With projects such as MarIA, which will be incorporated into the ‘PERTE for the development of a digital economy in Spanish,’ we are taking firm steps towards an artificial intelligence that thinks in Spanish, which will multiply economic opportunities for companies and the Spanish technology industry. Because language is much more than a means of communication: it is a projection of the way we see the world, also in the new digital reality,” says the Secretary of State for Digitalization and Artificial Intelligence, Carme Artigas.

“As the institution responsible for electronic legal deposit, the National Library of Spain (BNE) keeps millions of websites, millions of words repeated in a given context, the product of the many crawls of the Spanish web, both of the .es domain and selective, carried out over the years by the BNE teams, which together make up the great corpus of the Spanish spoken in our country today,” explains Ana Santos, director of the BNE. “For us, it is a great satisfaction that these files are useful for this pioneering project, based on artificial intelligence technologies, which will allow machines to understand and write in the Spanish language and which represents a milestone in the field of natural language processing.”

“We appreciate SEDIA’s initiative to promote forward-looking issues, such as the empowerment of the Spanish language in the digital world and the AI environment,” says the director of the BSC-CNS, Mateo Valero. “We are delighted to put our experts in natural language and artificial intelligence, and the computing capacity of our infrastructures, at the service of challenges that are relevant for society, such as the one posed by this initiative.”

The director of the BNE’s Digital Processes and Services Division, Mar Pérez Morillo, highlighted that “in the collections we focus on events that have influenced or marked society and its language.” Likewise, the BNE cooperates actively with the regional collection centers, which use the tools the BNE makes available to them. “We are in a race against time, developing strategies and tools to fight against what has been called the digital dark age,” explained Pérez Morillo.

Trained with Over 135 Billion Words and 9.7 Trillion Operations

In language models, the number of parameters with which the system is trained is what gives it the greatest capacity for generalization and, therefore, intelligence. The National Library data with which MarIA has been trained consist of more than 135 billion words (135,733,450,668, to be exact) and occupy a total of 570 gigabytes.

To create and train MarIA, BSC’s MareNostrum supercomputer was used, and a computing power of 9.7 trillion operations (969 exaflops) was required. A flop (floating-point operation) is the basic unit of computing work, and a supercomputer’s power is expressed in flops per second; exa is the prefix that denotes 10^18 (a trillion in the long scale used in Spanish).

Of these 969 exaflops, 201 were needed to process the data from the National Library: eliminating everything that was not well-formed text (page numbers, graphics, unfinished sentences, erroneous encodings, duplicate sentences, other languages, etc.) and keeping only correct Spanish text as it is actually used. The remaining 768 exaflops were used to train the neural networks of the GPT-2 model.
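For clarity, a short sanity check of the arithmetic quoted above, taking the figures exactly as reported (exa denoting 10^18):

```python
# Quick check of the reported figures: the preprocessing and training
# shares add up to the quoted total of 969 exaflops.
EXA = 10**18
cleaning = 201 * EXA   # filtering the BNE web-archive text
training = 768 * EXA   # training the GPT-2 neural networks
print((cleaning + training) // EXA)   # -> 969
```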

The current version of MarIA will now give rise to specialized versions for different application areas, including biomedicine and the legal domain, and will evolve to solve the specific problems mentioned above.

In parallel, PlanTL will continue to expand MarIA: adapting it to new technological developments in natural language processing (models more complex than the GPT-2 now implemented, trained with larger amounts of data); creating workspaces that make it easier for companies and research groups to use MarIA in appropriate computing environments; and embedding it in systems for evaluating and certifying the quality of the systems developed in different domains.


Source: Barcelona Supercomputing Center
