Recently, new types of models that allow image input into Large Language Models (LLMs) such as ChatGPT have been appearing one after another. Because these models can recognize images as well as text, they are being used in a wide variety of settings. Many people, however, may be unsure how to choose the best model for the task they want to carry out. In this article, we explain model evaluation methods and points to keep in mind when making a selection, using concrete examples.
VLM (Vision Language Model) is a general term for AI models that can receive both images and text as input and process them integrally. While LLMs (Large Language Models) handle only text information, VLMs are characterized by their ability to understand visual information such as images, and they are attracting attention as AI that can reason by combining "words" and "visual information" much like a human.
By leveraging these characteristics, VLMs are capable of handling various tasks that utilize both vision and language. Here are some representative examples:
They also have the ability to process many other image- and language-related tasks.
Although VLMs can handle many tasks, they do not always return accurate outputs. For example, they may misread text in an image or misunderstand its content, leading to incorrect judgments about the situation. It is therefore important to quantitatively evaluate how accurately they perform and in which scenarios.
The evaluation can be conducted from [three perspectives](https://mmmu-benchmark.github.io/): Perception, Knowledge, and Reasoning. Below, we introduce what is evaluated from these three perspectives and what evaluation datasets have been released, using specific examples.
Perception is the perspective that evaluates how accurately elements of an image or video (objects, colors, numbers, spatial relationships, etc.) can be recognized. For example, evaluation is based on whether appropriate answers can be provided for questions like the following:
Example 1
Question: How many traffic poles are there?
Correct Answer Example: There are 2 traffic poles.
Example 2
Question: What color is the bus?
Correct Answer Example: The bus is yellow.
The following datasets have been released mainly for the purpose of perception evaluation:
OCRBench: Evaluates the recognition of printed characters and handwritten characters.
COCO Caption: Evaluates the ability to generate concise captions that reflect the content of an image.
Knowledge is the perspective that evaluates whether the model possesses background knowledge regarding the people or elements shown in an image or video. For example, evaluation is based on whether appropriate answers can be provided for questions like the following:
Example 1
Question: What crop is being grown in this field?
Correct Answer Example: Rice plants are being grown in this field.
Example 2
Question: What make is the red car?
Correct Answer Example: The red car is a Nissan.
The following datasets have been released mainly for the purpose of knowledge evaluation:
OK-VQA: Evaluates whether the model has general knowledge regarding images in a wide range of genres such as science, culture, and commercial products.
VQA-RAD: Evaluates whether the model has medical knowledge using images and questions specialized for healthcare.
Reasoning is the perspective that evaluates the ability to logically derive conclusions based on elements within an image or video and background knowledge. For example, evaluation is based on whether appropriate answers can be provided for questions like the following:
Example 1
Question: Why are concrete blocks placed here?
Correct Answer Example: To keep the sheets from blowing away.
Example 2
Question: What direction will the silver car in front go next?
Correct Answer Example: The silver car in front will turn left.
The following datasets have been released mainly for the purpose of reasoning evaluation:
PCA-Bench: Evaluates whether the model can reason the next action to take regarding images of autonomous driving, household robots, and open-world games.
MathVista: Evaluates mathematical reasoning ability.
In addition to datasets specialized for one of these three perspectives, datasets that can comprehensively evaluate perception, knowledge, and reasoning have also been released.
MMMU: A dataset of university-level problems spanning wide-ranging fields such as engineering, medicine, art, and the social sciences; solving them requires image recognition, specialized knowledge of each field, and reasoning ability.
SEED-Bench-H: Evaluates the three perspectives in a balanced manner through 34 tasks, including Text Recognition (Perception), Science Knowledge (Knowledge), and Visual Reasoning (Reasoning).
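To make this concrete, the following is a minimal sketch of how accuracy on a multiple-choice benchmark of this kind might be computed. The `ask_vlm` callable and the field names (`image`, `question`, `options`, `answer`) are hypothetical placeholders; real benchmarks such as MMMU or SEED-Bench define their own data formats and official evaluation scripts, which should be preferred when available.

```python
def evaluate_accuracy(ask_vlm, dataset):
    """Compute simple accuracy of a VLM on a multiple-choice benchmark.

    ask_vlm : hypothetical callable (image, prompt) -> answer string
    dataset : iterable of dicts with "image", "question", "options", "answer"
              (field names vary between real benchmarks)
    """
    letters = "ABCDEFGH"
    correct, total = 0, 0
    for sample in dataset:
        # Present the options as lettered choices, e.g. "A. Left lane".
        options = "\n".join(
            f"{letters[i]}. {opt}" for i, opt in enumerate(sample["options"])
        )
        prompt = (
            f"Question: {sample['question']}\n{options}\n"
            "Please answer directly with the correct option's letter.\nAnswer:"
        )
        prediction = ask_vlm(sample["image"], prompt).strip()
        # Count the answer as correct if the first predicted letter matches the label.
        if prediction[:1].upper() == sample["answer"].strip().upper():
            correct += 1
        total += 1
    return correct / total if total else 0.0
```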
As introduced in the previous section, a variety of datasets targeting different tasks and perspectives have been released. Start by consulting benchmark results on datasets close to the task you want to execute; this will make the candidate models clearer. However, there are several other important factors to consider when choosing the optimal model for a task.
To process tasks with a VLM, there are two options: using a model you deployed yourself, or using a model provided by a third party. It is important to decide which to choose based on task requirements and available resources.
For example, an open-source model such as LLaVA that you deploy yourself is well suited to security-focused tasks, because you can avoid the risk of your data being used for training. However, self-deployment also has disadvantages that must be taken into account.
A third-party model such as GPT-4o provided by OpenAI, on the other hand, gives you access to cutting-edge, very large models, making it suitable for accuracy-focused tasks. However, it also comes with points that require attention.
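As a rough illustration of these two options, the sketch below shows one way to query a self-hosted open-source model through Hugging Face Transformers and a third-party model through the OpenAI API. The checkpoint name, prompt template, and decoding settings are illustrative assumptions rather than recommendations, and error handling is omitted.

```python
import base64

import torch
from openai import OpenAI
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

QUESTION = "What lane is the ego car in this image?"

# Option 1: self-hosted open-source model (example: a LLaVA 1.5 checkpoint).
def ask_local_llava(image_path: str) -> str:
    model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint, for illustration only
    model = LlavaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)
    prompt = f"USER: <image>\n{QUESTION} ASSISTANT:"  # LLaVA-1.5 chat format
    inputs = processor(
        images=Image.open(image_path), text=prompt, return_tensors="pt"
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=64)
    return processor.decode(output[0], skip_special_tokens=True)

# Option 2: third-party hosted model (example: GPT-4o via the OpenAI API).
def ask_gpt4o(image_path: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": QUESTION},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

In practice the self-hosted model would be loaded once and reused across queries; it is reloaded inside the function here only to keep the example self-contained.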
Inference speed is an important factor, especially for tasks that require real-time execution. VLM inference is generally slow, so it can also be effective to consider lightweight models specialized for the specific task instead of a VLM.
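A simple way to compare candidate models on this point is to measure the average per-query latency on a handful of representative samples, as in the minimal sketch below; `ask_vlm` stands for any hypothetical callable that sends one image and one question to the model under test.

```python
import time

def average_latency(ask_vlm, samples, n_warmup=2):
    """Average per-query latency (in seconds) over representative samples.

    ask_vlm : callable (image_path, question) -> answer string
    samples : list of (image_path, question) pairs drawn from the target task
    """
    # Warm-up runs exclude one-time costs such as model loading or connection setup.
    for image_path, question in samples[:n_warmup]:
        ask_vlm(image_path, question)

    start = time.perf_counter()
    for image_path, question in samples:
        ask_vlm(image_path, question)
    return (time.perf_counter() - start) / len(samples)
```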
Model performance depends heavily on the training data. It is therefore important that the model was trained on data that adequately covers the elements required for the task.
For example, a model trained on data specialized for Western culture may not be able to provide accurate answers when asked about Japanese culture.
In this way, because models operate on the knowledge acquired during training, they cannot deliver the expected performance unless the knowledge required for the target task was part of that training.
Many models set an upper limit on the resolution of input images. If this limit is exceeded, the image is downscaled, which can cause fine details to be lost and make accurate answers difficult.
However, recently, models like Qwen2.5-VL have appeared that increase or decrease the number of image tokens according to resolution, allowing for detailed capture of fine features in high-resolution images.
Especially for tasks that need to analyze details of high-resolution images, it is best to pay attention to how input image resolution is handled and select an appropriate model.
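One practical precaution is to downscale images yourself, in a controlled way, to whatever pixel budget the chosen model documents, rather than relying on its internal resizing. The sketch below does this with Pillow; the default `max_pixels` value is only a placeholder, and the actual limit (or, for models such as Qwen2.5-VL, the token-scaling behavior) should be taken from each model's documentation.

```python
from PIL import Image

def fit_to_pixel_budget(image_path: str, max_pixels: int = 1_000_000) -> Image.Image:
    """Downscale an image to stay within a pixel budget while keeping its aspect ratio.

    max_pixels is a placeholder; real limits differ per model.
    """
    image = Image.open(image_path)
    width, height = image.size
    if width * height <= max_pixels:
        return image  # already within budget, keep full resolution
    scale = (max_pixels / (width * height)) ** 0.5
    new_size = (max(1, int(width * scale)), max(1, int(height * scale)))
    # LANCZOS resampling preserves fine detail better than the default filter.
    return image.resize(new_size, Image.LANCZOS)
```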
Even among models with similar benchmark scores, each model can show distinct characteristics in its perception, knowledge, and reasoning capabilities due to differences in architecture or training data. It is best to first verify these characteristics on a small amount of data and then select the model that fits the task.
We actually verified these per-model characteristics by running several models on a VQA task about autonomous driving. Details of the autonomous-driving VQA and the models used are as follows:
Autonomous Driving VQA
The image data uses the VehicleOrientationDataset. Questions and answers consist of 150 questions in total, with 50 created for each of the perspectives: Perception, Knowledge, and Reasoning.
Image Example
Prompt Example
Question: What lane is the ego car in this image?
A. Left lane
B. Right lane
Please answer directly with the correct option's letter and explain why you chose it in the following format: 'A. Because <REASON>.'.
Replace 'A' with the correct option and <REASON> with the reason you chose the option.
Answer:
Correct Answer Example
A. Because there is a sidewalk to the left of the current lane.
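For reference, answers in this format can be scored by extracting the option letters that appear before the first period and comparing them with the ground truth. The sketch below is a simplified illustration of such a check, not the exact script used to produce the results shown later.

```python
import re

def parse_choices(answer: str) -> set:
    """Extract option letters from answers like 'A. Because ...' or 'A, C. A: Because ...'."""
    head = answer.split(".", 1)[0]  # text before the first period
    return set(re.findall(r"\b([A-D])\b", head))

def is_correct(answer: str, ground_truth: set) -> bool:
    # A question with multiple correct options counts as correct only on an exact match.
    return parse_choices(answer) == ground_truth

# Example with the prompt above:
print(is_correct("A. Because there is a sidewalk to the left of the current lane.", {"A"}))  # True
```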
Models Used
From the closed-source models, we used Claude 3.7 Sonnet, which is reported to have strong reasoning capability. From the open-source models, we used Ovis2 and Qwen2.5-VL, which performed well among models with fewer than 20B parameters as of March 2025, are commercially usable, and which we confirmed run correctly in our environment. Both Ovis2 and Qwen2.5-VL use the Qwen2.5 LLM but differ in their vision encoders (ViT) and training data.
The results of the verification (accuracy rate) are as shown in the figures below:
*: Multiple-choice questions (one or more correct options per question)
Looking at the accuracy rates in the figures, it can be seen that Claude 3.7 Sonnet achieves high accuracy on the task "Which city do you reach if you proceed in a certain direction?".
Image Example
Prompt
Question: According to the sign, where is the destination for going straight?
A. Shin-ohashi
B. Harumi
C. Ojima
D. Ginza
Please answer directly with all the correct options' letters and explain why you chose them in the following format: 'A, B, ... A: Because <REASON>, B: Because <REASON>, ...'.
Replace 'A, B, ...' with the correct options and <REASON> with the reason you chose each option.
Answer:
Correct Options
A, C
Claude 3.7 Sonnet Answer
A, C.
A: Because the blue directional sign shows an upward-pointing arrow that indicates going straight leads to "新大橋" (Shin-ohashi), which is written on the top row of the sign.
C: Because the blue directional sign shows the same upward-pointing arrow that indicates going straight also leads to "大島" (Ojima), which is also written on the top row of the sign alongside Shin-ohashi.
A likely reason for the lower accuracy of the models other than Claude 3.7 Sonnet is that they fail to map arrows to place names correctly, reflecting weaker recognition of spatial relationships. In addition, where only the place name directly above an arrow should be associated with it, they appear to misrecognize the sign and also link the surrounding place names, as indicated by the red boxes in the figure.
For the other tasks, the models' strengths and weaknesses largely coincide, but distinctive characteristics can still be found by examining the outputs in detail. One example is the following two images for the task "If there is an obstacle on the road, state the type of obstacle." In the first image, a vehicle parked on the street is an obstacle, while in the second image there is no obstacle.
Claude 3.7 Sonnet judged that the vehicles in both images were moving and concluded that no obstacles existed. In contrast, the open-source models Ovis2 and Qwen2.5-VL judged that the vehicles in both images were parked and treated them as obstacles blocking passage.
This difference is likely because Claude 3.7 Sonnet tends to recognize vehicles as "moving objects," whereas Ovis2 and Qwen2.5-VL tend to recognize them as "stationary objects." Since Ovis2 and Qwen2.5-VL also support video input, feeding them single still images may make them more inclined to treat vehicles as stationary.
There is also a case that shows how well a model grasps the entire situation and makes complex judgments. The following image is a scene during a traffic jam in which the traffic light is green but the vehicle ahead has its brake lights on.
When asked "May the ego car proceed?" for this image, both Claude 3.7 Sonnet and Qwen2.5-VL answered "Yes, because the light is green."
On the other hand, the Ovis2 16B model also paid attention to the brake lights of the vehicle ahead in addition to the green light, and judged that the car should not proceed. Note that the Ovis2 8B model also answered that the car should not proceed while mentioning only the green light, which may reflect the limits of the smaller model's capability.
This difference suggests that while Claude 3.7 Sonnet and Qwen2.5-VL are strongly drawn to prominent cues such as the traffic light, Ovis2 may be attending to other elements as well. However, it is also possible that Ovis2's tendency to treat vehicles as stationary happened to produce the correct judgment in this scene.
In this way, characteristics in perception, knowledge, and reasoning capabilities can appear even in models with similar structures.
In this article, we introduced evaluation criteria and selection considerations for VLMs, which can process images and text together. When selecting a VLM, first evaluating the elements the task requires (Perception, Knowledge, Reasoning) makes it easier to narrow down suitable models. The execution environment, inference speed, and the content of the training data are also important selection factors. By choosing the optimal VLM on this basis, you will be able to make more effective use of it.