Vietnam develops open-source Vietnamese-language dataset for AI training

The ViGen project, launched by Vietnam’s National Innovation Center, aims to improve AI’s understanding of Vietnamese with high-quality open-source datasets.

Vietnam currently has very little Vietnamese-language data available for training artificial intelligence (AI) models. The ViGen project aims to create high-quality datasets to improve AI's ability to understand and use Vietnamese effectively.

Launching a Vietnamese-language dataset for AI development

The Innovation Challenge 2025 program is launched to drive AI development in Vietnam. Photo: NIC

On March 14, the National Innovation Center (NIC) launched the "Innovation Challenge 2025" program to drive AI development in Vietnam.

The core of this initiative is the ViGen project, which focuses on building an open-source, high-quality Vietnamese-language dataset to train and evaluate large language models (LLMs).

By developing these datasets, ViGen aims to help AI models better understand Vietnamese culture, context, and linguistic nuances. The project is expected to strengthen the presence of Vietnamese in AI development while contributing to the digital economy.

ViGen: A collaboration to enhance AI in Vietnamese

ViGen is a collaborative effort between Meta, NIC, and the organization "AI for Vietnam." NIC serves as the lead agency, ensuring that the project aligns with Vietnam’s national AI development goals.

The project's mission is for AI models to support Vietnamese naturally and comprehensively at their core, unlocking AI's potential for Vietnam.

ViGen will develop large-scale, high-quality open-source Vietnamese datasets to train and assess AI models. Additionally, it will help ensure that AI development in Vietnam aligns with cultural values and ethical standards, fostering a responsible and locally adapted AI ecosystem.

To support the project, Meta will contribute its open-source datasets, including mobility and social connectivity data, as well as AI-assisted population mapping data.

The urgency of developing AI-ready Vietnamese datasets

Tran Viet Hung, founder and CEO of AI for Vietnam. Photo: NIC

According to Vo Xuan Hoai, Deputy Director of the National Innovation Center, AI is transforming the world. As a result, developing large-scale, high-quality open-source Vietnamese datasets for AI training and evaluation has become an urgent priority.

"The ViGen project aligns with Resolution 57 of the Politburo, which calls for breakthroughs in science, technology, innovation, and national digital transformation. With joint efforts from policymakers, researchers, developers, experts, and users, we will make AI a powerful tool for every Vietnamese citizen and position Vietnam as a global AI powerhouse," Hoai stated.

Overcoming the lack of Vietnamese data in AI training

Although Vietnamese is spoken by over 100 million people, its representation in AI training datasets remains extremely low, at less than 1%. As a result, current AI models can generate text in Vietnamese, but the output often lacks natural expression and does not fully convey the richness of the language, which reduces its usability and effectiveness.

Tran Viet Hung, founder and CEO of AI for Vietnam, highlighted that ViGen will contribute large, high-quality Vietnamese-language datasets to address this issue.

"ViGen will help ensure that Vietnamese is no longer an underrepresented language in AI," Hung emphasized.

Additionally, the project demonstrates the power and value of open-source models like LLaMA, which enable innovative AI solutions tailored to Vietnamese-language contexts.

In Vietnam, several AI-powered virtual assistants have already been developed using LLaMA-based large language models. For instance, MISA has introduced an AI assistant for automating information retrieval, while Viettel has developed a legal virtual assistant.

These early applications demonstrate AI's growing role in Vietnamese daily life, particularly in the public sector.

Trong Dat
