Naver unveils its hyperscale AI platform HyperCLOVA X on Aug. 24, 2023
A picture is worth a thousand words, as the old adage goes, stressing the power of vision over text.
People also say the eyes are windows to the soul, underscoring how much humans rely on visual information.
Naver Corp., a leading South Korean tech giant, said on Thursday it has trained the brains of its latest artificial intelligence platform, HyperCLOVA X, to understand images on top of text.
On Aug. 27, Naver plans to unveil HyperCLOVA X Vision (HCX Vision), an upgraded version of HyperCLOVA X trained on large amounts of text and image data to process visual information, including documents.
“We are adding image capabilities to HyperCLOVA X without compromising on its text capabilities,” the company said in a statement.
Naver said HCX Vision has migrated from a large language model (LLM) to a large vision-language model (LVLM).
Trained on wide-ranging visual and language data, HCX Vision supports text and image modalities and performs tasks in various scenarios, such as recognizing documents and understanding text within images, it said.
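As a rough illustration of how such a vision-language model is typically consumed, the sketch below submits an image and a text prompt in a single request. The endpoint, payload fields and model name are hypothetical placeholders for illustration, not Naver’s published interface.

```python
# Minimal sketch of a vision-language request: one image plus a text
# prompt in a single call. The URL, field names and model identifier
# are hypothetical placeholders, not Naver's documented API.
import base64
import json
import urllib.request

API_URL = "https://example.com/v1/chat"  # placeholder endpoint


def ask_about_image(image_path: str, question: str) -> str:
    # Encode the image so it can travel inside a JSON payload.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "model": "hcx-vision",  # hypothetical model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image", "data": image_b64},
            ],
        }],
    }
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["reply"]


# Example: document understanding from a single image input.
# print(ask_about_image("receipt.png", "What is the total amount?"))
```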
SCORED HIGHER THAN GPT-4o
Naver said it used more than 30 benchmarks to track the performance of HCX Vision relative to OpenAI’s commercial models GPT-4V and GPT-4o.
One benchmark Naver used to showcase its model’s Korean capabilities was the Korean General Educational Development (K-GED) tests, the country’s equivalency exams for primary and secondary school diplomas.
The benchmark consisted of 1,480 four-option multiple-choice questions. Tested with image inputs, HCX Vision correctly answered 83.8% of them, surpassing both the K-GED’s 60% pass threshold and the 77.8% scored by GPT-4o, according to Naver.
Under the image captioning category, it said HCX Vision can accurately identify and describe small details in an image without using a separate object detection model.
HCX Vision can name historical figures, landmarks, products and food from image inputs alone, and it can reason about an image to predict the next step.
UNDERSTANDING CHARTS, TABLES AND GRAPHS
Naver said the AI model also understands charts, tables and data in an Excel file.
“If the data is a screenshot of an image, getting responses for your prompts is more complicated because the model must first recognize text and understand how the numbers are related,” it said.
HCX Vision supports documents in Korean, English, Japanese, and Chinese, it said.
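To make the quoted point concrete, the snippet below contrasts the two paths. When numbers arrive as structured text, they can be parsed and summed directly; when the same table is a screenshot, the model must first recognize the text and reconstruct the layout before any reasoning can begin. The sample data is invented for illustration.

```python
# Why a chart screenshot is harder than the same data as text.
import csv
import io

# Path 1: structured text, where the numbers are directly addressable.
table = "quarter,revenue\nQ1,120\nQ2,135\nQ3,150\n"
rows = list(csv.DictReader(io.StringIO(table)))
total = sum(int(row["revenue"]) for row in rows)
print(f"Total revenue: {total}")  # -> Total revenue: 405

# Path 2: a screenshot of the same table. None of the above works
# until the text has been read out of the pixels: detect the
# characters, reconstruct rows and columns, then relate the numbers.
# An LVLM such as HCX Vision folds those recognition and layout
# steps into the model itself, which is what the quoted passage
# refers to.
```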
Naver said HCX Vision has been trained on large numbers of image-text pairs and can even understand humor and memes.
Other capabilities include understanding equations; generating code from shapes, charts or graphs; solving math problems that involve shapes; and creative writing such as poems.
“Right now, HyperCLOVA X Vision can understand one image at a time. But soon, with context length support in the millions, we expect HCX Vision to understand hours-long movies and video streams,” Naver said.
SPEECH X
On Thursday, Naver also unveiled Speech X, a voice synthesis technology based on its HyperCLOVA X.
Naver said Speech X is more advanced than its existing voice recognition and synthesis technology, boasting improved accuracy in language structure and pronunciation. It can also express emotions like a person, Naver said.
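As a rough sketch of how a synthesis model of this kind is commonly invoked, the fragment below sends text plus a voice and emotion setting and receives audio back. The endpoint, parameter names and voice identifier are assumptions for illustration; Speech X’s actual interface is not described in the announcement.

```python
# Minimal text-to-speech sketch. The endpoint, parameter names and
# voice identifier are hypothetical, not Speech X's real interface.
import json
import urllib.request

TTS_URL = "https://example.com/v1/tts"  # placeholder endpoint


def synthesize(text: str, voice: str = "ko-female-1",
               emotion: str = "neutral") -> bytes:
    payload = {"text": text, "voice": voice, "emotion": emotion}
    req = urllib.request.Request(
        TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # raw audio bytes, e.g. WAV or MP3


# audio = synthesize("Hello", emotion="happy")
# open("greeting.wav", "wb").write(audio)
```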
The company has already proven its technological competitiveness with various voice AI services, such as the AI voice recording app Clova Note, the AI phone service Clova Care Call and the AI voice synthesis tool Clova Dubbing.
“HCX, which started as a large-scale language model, is evolving into a massive visual language model with added image understanding capabilities, and further into a voice multimodal language model,” said Sung Nako, head of Hyperscale AI Technology at Naver Cloud Corp., the AI affiliate of Naver Corp.
“We will expand our HCX ecosystem by applying HCX’s advanced capabilities to various Naver services, including CLOVA X.”
By Seung-Woo Lee
leeswoo@hankyung.com
In-Soo Nam edited this article.