First Massive Artificial Intelligence System in the Spanish Language, MarIA, Begins to Summarize and Generate Texts

November 11, 2021

Nov. 11, 2021 — The MarIA project, a language model system created at the Barcelona Supercomputing Center (BSC) from the web archive of the National Library of Spain (BNE), framed within and financed by the Language Technologies Plan of the Secretary of State for Digitalization and Artificial Intelligence (SEDIA), has advanced in its development: its new version can summarize existing texts and generate new texts from headlines or a few words.

The MarIA project is the first massive artificial intelligence system expert in understanding and writing in the Spanish language. Due to its volume and capabilities, it places Spanish third among the languages that have massive open-access models, after English and Mandarin. It has been built from the digital documentary heritage of the National Library of Spain, which crawls and archives websites in Spanish, and it has been trained on the MareNostrum 4 supercomputer. It is published openly so that application developers, companies, research groups and society in general can use it in countless applications.

The latest advances of MarIA constitute a milestone toward the objectives of the National Strategy for Artificial Intelligence and the Recovery, Transformation and Resilience Plan, with which Spain intends to lead the development of tools, technologies and applications for the projection and use of the Spanish language in the fields where AI is applied. Specifically, the National Plan for Language Technologies, within which this project is framed, aims to promote the development of natural language processing, machine translation and conversational systems in Spanish and the co-official languages.

MarIA was created at the Barcelona Supercomputing Center and trained with more than 135 billion words from the web archive of the National Library of Spain.

Models to Understand the Language and Models to Generate Texts

A language model is an artificial intelligence system formed by a set of deep neural networks that have been trained to acquire an understanding of the language, its lexicon and the mechanisms it uses to express meaning, and to write at an expert level. These complex statistical models, which link words in texts in a systematic and massive way, are capable of “understanding” not only abstract concepts but also their context. With these models, developers of different applications can create tools for multiple uses, such as classifying documents or building proofreaders and translation tools.

The first version of MarIA was built with RoBERTa, a technology that creates “encoder”-type language models. Given a text sequence, this type of model generates an interpretation that can be used, for example, to classify documents, answer multiple-choice questions, find semantic similarities between texts, or detect the sentiments expressed in them.
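For readers who want to experiment, the sketch below shows how an encoder-type Spanish checkpoint published on the Hugging Face Hub could be queried for masked-word prediction. It is a minimal illustration, not part of the project's own documentation: the checkpoint identifier PlanTL-GOB-ES/roberta-base-bne and the `<mask>` token are assumptions about how the model is distributed.

```python
# Minimal sketch: querying an encoder-type (RoBERTa-style) Spanish model
# through the Hugging Face transformers pipeline. The checkpoint name below
# is an assumption; substitute whichever MarIA encoder checkpoint you use.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-bne")

# The encoder predicts the masked word from its surrounding context.
# RoBERTa-style tokenizers usually use "<mask>" as the mask token.
for prediction in fill_mask("La capital de España es <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each prediction is a candidate word for the masked position together with its probability; this contextual interpretation is the kind of representation that downstream classifiers and similarity tools build on.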

The new version has been created with GPT-2, a more advanced technology that creates generative “decoder” models and adds new capabilities to the system. Given a text sequence, decoder models can generate new text. They can therefore be used, for example, to produce automatic summaries, simplify complicated wording for different user profiles, generate questions and answers, hold complex dialogues with users and even write full texts (which could appear to be written by humans) from a headline or a small number of words.
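A similarly hedged sketch for the generative side: given a headline, a GPT-2-style decoder checkpoint continues the text. The identifier PlanTL-GOB-ES/gpt2-base-bne and the sampling settings are illustrative assumptions, not details confirmed by the announcement.

```python
# Minimal sketch: generating Spanish text from a headline with a
# decoder-type (GPT-2-style) model. The checkpoint name is an assumption;
# swap in the MarIA generative checkpoint you actually use.
from transformers import pipeline

generator = pipeline("text-generation", model="PlanTL-GOB-ES/gpt2-base-bne")

headline = "La inteligencia artificial en español avanza"
outputs = generator(headline, max_length=60, do_sample=True,
                    top_p=0.95, num_return_sequences=1)
print(outputs[0]["generated_text"])
```

Summarization, question answering or dialogue would in practice be obtained by further adapting or prompting such a decoder on task-specific data rather than by raw continuation alone.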

These new capabilities make MarIA a tool that, with “ad hoc” training adapted to specific tasks, can be very useful for application developers, companies and public administrations. For example, models developed so far in English are used to generate text suggestions in writing applications, to summarize contracts or the complicated documents that detail a product’s benefits according to what each user wants to know, and to search large text databases for specific information and relate it to other relevant information.
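As an illustration of what such “ad hoc” training could look like in practice, the sketch below fine-tunes an assumed MarIA encoder checkpoint for a two-class document-classification task with the Hugging Face Trainer. The checkpoint id, the CSV files, the column names and the number of labels are placeholders, not details taken from the project.

```python
# Hedged sketch of "ad hoc" adaptation: fine-tuning an encoder checkpoint
# for document classification. All names below are illustrative assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "PlanTL-GOB-ES/roberta-base-bne"   # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# Any labelled Spanish corpus with "text" and "label" columns would do;
# these file names are placeholders for your own data.
dataset = load_dataset("csv", data_files={"train": "train.csv",
                                          "validation": "dev.csv"})

def tokenize(batch):
    # Pad/truncate every document to a fixed length for simple batching.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="maria-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```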

“With projects such as MarIA, which will be incorporated into the ‘PERTE for the development of a digital economy in Spanish,’ we are taking firm steps towards an artificial intelligence that thinks in Spanish, which will multiply economic opportunities for companies and the Spanish technology industry. Because language is much more than a means of communication: it is a projection of the way we see the world, also in the new digital reality,” says the Secretary of State for Digitalization and Artificial Intelligence, Carme Artigas.

“As the institution responsible for electronic legal deposit, the National Library of Spain (BNE) keeps millions of websites, millions of words repeated in a given context, the product of the many crawls of the Spanish web, both of the .es domain and selective, carried out over the years by the BNE teams, which together make up the great corpus of the Spanish spoken in our country today,” explains Ana Santos, director of the BNE. “For us, it is a great satisfaction that these files are useful for this pioneering project, based on artificial intelligence technologies, which will allow machines to understand and write in the Spanish language and which represents a milestone in the field of natural language processing.”

“We appreciate SEDIA’s initiative to promote forward-looking issues, such as the empowerment of the Spanish language in the digital world and the AI environment,” says the director of the BSC-CNS, Mateo Valero. “We are delighted to put our experts in natural language and artificial intelligence, and the computing capacity of our infrastructures, at the service of challenges that are relevant for society, such as the one posed by this initiative.”

The director of the BNE’s Digital Processes and Services Division, Mar Pérez Morillo, highlighted that “in the collections we focus on events that have influenced or marked society and its language.” Likewise, the BNE cooperates actively with the regional collection centers, which use the tools the BNE makes available to them. “We are in a race against time, developing strategies and tools to fight against what has been called the digital dark age,” explained Pérez Morillo.

Trained with Over 135 Billion Words and 9.7 Trillion Operations

In language models, the number of parameters with which the system is trained is what gives it the greatest capacity for generalization and, therefore, intelligence. The National Library data with which MarIA has been trained consist of more than 135 billion words (135,733,450,668, to be exact) and occupy a total of 570 gigabytes.

To create and train MarIA, BSC’s MareNostrum supercomputer was used, and a computing power of 9.7 trillion operations (969 exaflops) was required. A flop (floating-point operation) is the basic unit of computing work, and a supercomputer’s power is expressed in flops per second; exa is the prefix that denotes 10^18 (a trillion in the long scale used in Spanish).

Of these 969 exaflops, 201 were needed to process the data from the National Library: eliminating everything that was not well-formed text (page numbers, graphics, unfinished sentences, erroneous encodings, duplicate sentences, other languages, etc.) and keeping only correct Spanish text as it is actually used. The remaining 768 exaflops were used to train the neural networks of the GPT-2 model.
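For clarity, a short sanity check of the arithmetic quoted above, taking the figures exactly as reported (exa denoting 10^18):

```python
# Quick check of the reported figures: the preprocessing and training
# shares add up to the quoted total of 969 exaflops.
EXA = 10**18
cleaning = 201 * EXA   # filtering the BNE web-archive text
training = 768 * EXA   # training the GPT-2 neural networks
print((cleaning + training) // EXA)   # -> 969
```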

The current version of MarIA will now give rise to specialized versions for different application areas, including biomedicine and the legal domain, and will evolve to solve the specific problems mentioned above.

In parallel, PlanTL will continue to expand MarIA: adapting it to new technological developments in natural language processing (models more complex than the GPT-2 now implemented, trained with larger amounts of data); creating workspaces that make it easier for companies and research groups to use MarIA in appropriate computing environments; and embedding it in systems for evaluating and certifying the quality of the systems developed in different domains.


Source: Barcelona Supercomputing Center
