
How Do You Assess the Quality of Training Data? What Are the Methods and Technical Levels to Improve Accuracy and Quality?

 


 

Training data is data that has been labelled in order to teach AI. AI learns rules and patterns from this training data and applies them when analysing new data. The quality of the training data is directly linked to the accuracy of the AI, so high-quality training data is essential for advanced AI systems. But how do you actually judge the 'quality' of training data? In this article, we will look at the factors that are necessary for creating high-quality training data.

 

 

 

1. What are the key factors that determine the quality of training data?



Annotation is the process of attaching labels, metadata and other information to large amounts of data. The quality of the training data is affected by this annotation process, so the accuracy of the annotation specifications is important.

Related article: What is annotation and why is it necessary for AI? Explanation of the process and tasks 
https://annotation.nextremer.com/blog/annotation-overview

To improve the accuracy of annotation specifications, it’s important to consider the following:

 

  • Define the tasks
  • Consistent annotation rules
  • Purpose of using AI (detection purpose)

 

We will explain each of these points.

 

Define the tasks

In order for multiple annotators to be able to perform the same annotation tasks, you need to define the content of the annotation tasks in advance. The content of the annotation tasks will differ depending on 'what the training data is being created for'. For example, if you want to make the system recognise an image of an apple as 'this is an apple', you will need to label the image data with the word 'apple'.

There are also different annotation methods for image recognition, including the following:

  • Object detection: locating the objects in the image data, typically with bounding boxes, and labelling them
  • Image segmentation: identifying which parts of the image are the target objects
  • Landmark annotation: labelling specific points such as joints on the body or parts of the face for humans or animals

It is important to define the purpose of the training data and how it will be used to achieve your objectives.
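As a rough illustration of how the three annotation methods above differ, the sketch below models each one as a small Python record. The class names, field names and pixel-coordinate convention are assumptions made for this example, not the format of any particular annotation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Object detection: a labelled rectangle in pixel coordinates."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass(frozen=True)
class Polygon:
    """Image segmentation: a labelled outline given as (x, y) vertices."""
    label: str
    vertices: tuple[tuple[float, float], ...]


@dataclass(frozen=True)
class Landmark:
    """Landmark annotation: a single named point, such as a joint."""
    name: str
    x: float
    y: float


def box_area(box: BoundingBox) -> float:
    """Return the area of a bounding box, or 0 if its corners are inverted."""
    width = max(0.0, box.x_max - box.x_min)
    height = max(0.0, box.y_max - box.y_min)
    return width * height


# Hypothetical records for one image.
apple_box = BoundingBox("apple", 10, 20, 110, 70)
apple_outline = Polygon("apple", ((10, 20), (110, 20), (60, 70)))
left_elbow = Landmark("left_elbow", 42.0, 88.5)
```

Writing the record shape down before work starts makes it explicit which task is being defined, since each method asks the annotator for a different kind of output.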


Consistent annotation rules

Without consistent annotation rules, it is impossible to improve the quality of training data. By standardising the rules, all annotators can follow the same guidelines, ensuring consistency of work and leading to improved data quality.

You need to set clear rules so that multiple annotators can make the same judgements. For example, you might consider setting rules like the following:

 

  • How far should images be enlarged and how much detail should be filled in during segmentation
  • Whether to separate bicycles and electric bicycles using bounding boxes
  • What colour level is significant for rust and colour loss on steel towers

As some of these judgements may be subjective, it's important to carry out pre-tests to ensure that everyone can make the same decisions as consistently as possible.
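One way to put a number on a pre-test is to have two annotators label the same items and measure how often they agree beyond what chance would produce. The sketch below uses Cohen's kappa for this. The article does not prescribe a metric, so the choice of kappa, the function name and the sample labels are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Share of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labelled at random
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label for everything.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pre-test on the bicycle / electric bicycle rule.
annotator_1 = ["bicycle", "bicycle", "e-bike", "e-bike"]
annotator_2 = ["bicycle", "bicycle", "e-bike", "bicycle"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A value of 1 means the annotators agreed on every item, and a value near 0 means they agreed no more often than chance. A low score on a pre-test suggests the rule needs to be reworded or illustrated with more examples before full-scale annotation begins.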


Purpose of using AI (detection purpose)

The rules for annotation need to be aligned with the purpose of using the training data. For example, if the goal is to recognise images of apples, annotating them as 'fruit', 'food' or 'red object' is not appropriate.

By creating rules and sharing them before the annotation work, you can create accurate and high-quality data that aligns with the detection objectives.
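A shared rule set can also be checked mechanically. The minimal sketch below assumes that the agreed labels are kept as a fixed set and that each annotation is an (item ID, label) pair. Both the label set and the record shape are hypothetical.

```python
# Hypothetical label set agreed before annotation starts.
ALLOWED_LABELS = frozenset({"apple", "pear"})


def find_off_spec(annotations, allowed=ALLOWED_LABELS):
    """Return the (item_id, label) pairs whose label is not in the agreed set."""
    return [(item_id, label) for item_id, label in annotations if label not in allowed]


# Hypothetical batch containing labels that do not match the detection purpose.
batch = [("img1", "apple"), ("img2", "fruit"), ("img3", "red object")]
off_spec = find_off_spec(batch)
```

Here `off_spec` contains the 'fruit' and 'red object' entries, which can be sent back for correction before they reach the training set.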

 

Nextremer offers data annotation services to achieve highly accurate AI models. If you are considering outsourcing annotation, free consultation is available. Please feel free to contact us.

 

2. What are the elements of annotated data that affect the quality of the training data?



The volume and variety of annotation data required will vary depending on the purpose of the AI. These key factors play a significant role in determining the quality of the training data.

Data Volume




The volume of annotation data required will vary depending on the field, but a minimum of around 1,000 annotated items is generally assumed, and 5,000 to 10,000 is often considered standard. You will need to decide how much data to prepare and annotate, taking into account the purpose and deadline.


Data variety



When it comes to image recognition, AI can make judgements based on the colours, patterns and shapes of the data it has learned so far. However, even if it is taught a large number of images of similar red apples, it may not be able to recognise a green apple as an apple.

It may also be necessary to annotate a wide range of data, such as tagging images of pears that look like apples as 'this is not an apple' or 'this is a pear', or extracting only apples from images that contain other objects.

Therefore, it is also important to decide in advance how extensive a range of data is required, depending on the purpose of the AI.
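Both volume and variety can be checked with a simple count of labels. The sketch below assumes the dataset is available as a flat list of labels. The 1,000-item default follows the minimum mentioned above, while the 5% share used to flag under-represented labels is an arbitrary assumption that should be tuned to the purpose of the AI.

```python
from collections import Counter


def summarise_labels(labels, min_total=1000, min_share=0.05):
    """Summarise dataset size and flag labels that make up a small share of it."""
    counts = Counter(labels)
    total = sum(counts.values())
    rare = sorted(
        label for label, count in counts.items() if count / total < min_share
    )
    return {
        "total": total,
        "enough_data": total >= min_total,
        "counts": dict(counts),
        "rare_labels": rare,
    }


# Hypothetical dataset dominated by red apples.
labels = ["red_apple"] * 950 + ["green_apple"] * 30 + ["pear"] * 20
summary = summarise_labels(labels)
```

In this example the total meets the minimum, but green apples and pears are flagged as rare, which matches the risk described above: a model trained mostly on red apples may fail on the other cases.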

 


 

3. Technical skills required for annotators



Even if you define the annotation work and create rules, it is ultimately the annotators who perform the work. The quality of the training data is also affected by the technical level of the workers.

If there is inconsistency among the annotators, this will lead to variation in the detection results, so it is important that the annotation data is consistent as a whole. To achieve data consistency, annotators need the following skills:

 

  • Ability to follow rules accurately
  • Suitability for work
  • Smooth communication skills: Report, Communicate and Clarify
  • Ability to acquire domain knowledge


We will explain each of these points.

 

Ability to follow rules accurately

The purpose of defining annotation rules is to ensure that all annotators can perform the task in the same way. No matter how well the rules are set, the annotation data will be inconsistent if they are not followed accurately. Therefore, the ability to understand and follow the rules accurately is crucial for creating high-quality, consistent annotation data.


Suitability for work

Annotation work involves labelling a huge amount of data. Even if you have the skills to understand and follow the rules, it can be easy to lose focus and make mistakes. Depending on the purpose of the AI, there may be cases where not all annotators are suitable for such tasks. In addition, the reviewer must be able to thoroughly check the work of the annotators. It is important to assess the types of tasks involved and ensure that personnel with appropriate skills are assigned.


Smooth communication skills: Report, Communicate and Clarify

Even if you think you have annotated everything according to the rules, there is a possibility that it may not be suitable for the purpose. Since annotation involves processing large quantities of data, unnoticed mistakes can lead to the need for extensive corrections. Additionally, despite having detailed rules, there will be times when you encounter things that you don't understand or are unsure about. The ability to 'report, communicate and clarify' with managers and other annotators is essential to ensure consistent and accurate work.


Ability to acquire domain knowledge

Domain knowledge refers to specialised knowledge about an industry or business, as well as insights and knowledge about trends. In order to tag not only general information, but also additional industry-specific information, it is necessary to have people with domain knowledge.

For example, in addition to understanding images of apples and similar fruits, there may be cases where the AI needs to learn about the different varieties of apples and their unique attributes. This can help the system to recommend apples that match individual preferences, or allow users to identify the variety and characteristics of an apple simply by taking a photo, which can help with purchasing decisions. Additionally, by training the AI on multiple states of the same apple variety, such as ripeness and colour, it becomes possible to automatically sort and categorise apples during selection or shipping.

In addition to clearly defining the annotation rules in advance, it may be necessary to provide training to bring all annotators to the same skill level. This ensures the creation of more sophisticated training data.

If you outsource to offshore companies or individuals (such as crowdsourced workers) to reduce labour costs, the annotators may lack domain knowledge. If you have an annotation system that retains domain knowledge as in-house know-how, you can create high-quality training data with added value.

 

 

4. Managing as an annotating organisation


Having the annotation specifications and the people to perform the annotation work is not enough on its own. In order to maintain high-quality annotation work, you also need to consider the following aspects:


Ability to maintain close and prompt communication with annotators

In order for all workers to understand and carry out the work, it is necessary to communicate the tasks, methods and policies. Whether before the work starts, or when the rules or policies change, it is necessary to share information carefully and in detail, and to communicate so that the work can be carried out accurately without any inconsistencies or errors.

It is also important to check with the workers to make sure they understand what you have told them. You must be constantly aware of whether they are still unsure about anything or whether they are still following the objectives, and you need to manage the process so that you can communicate with the annotators as you go along and ensure that the annotation is carried out correctly.


Ability to quickly address and resolve human errors

Human errors in annotation have a direct impact on the AI’s detection accuracy. Even if you improve the skill level of annotators, have them acquire domain knowledge, and have managers oversee them to prevent errors from occurring, it is impossible to completely prevent them. In order to create highly accurate training data, it is important to be able to react quickly and resolve any errors that occur.

If you have a system in place to identify the causes of errors, know how to respond to them, and have a clear chain of command, you will be able to respond quickly if an error occurs.

 

Quality control and workflow management system

In order to create high-precision training data, it is also important to have a system in place for quality control. You need to ensure that annotations made by annotators are never used in their initial state without being reviewed by a competent supervisor. Having supervisors who can communicate effectively with annotators and check for errors is essential.

It is also important to have more than one person responsible for reviewing the large number of annotations. 

The following is an explanation of how reviewers are assigned and why multiple reviewers are necessary.


Can the supervisors thoroughly review the work?

 



Just like the annotators, the people in charge must also check the detailed annotations. They need to check more carefully than the annotators to see whether the work has been done in line with the work content and rules, whether the labelling is correct, and so on. Also, if they can check the trends in errors for each annotator and provide guidance, the data will be even more accurate.
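Checking error trends for each annotator can be as simple as tallying review results. The sketch below assumes that each review is recorded as an (annotator, error type) pair, with `None` meaning the annotation passed. The record shape and the error type names are hypothetical.

```python
from collections import Counter, defaultdict


def error_trends(reviews):
    """Summarise, per annotator, the error rate and the most frequent error type."""
    reviewed = Counter()
    errors = defaultdict(Counter)
    for annotator, error_type in reviews:
        reviewed[annotator] += 1
        if error_type is not None:
            errors[annotator][error_type] += 1
    summary = {}
    for annotator, count in reviewed.items():
        error_counts = errors[annotator]
        most_common = error_counts.most_common(1)
        summary[annotator] = {
            "reviewed": count,
            "error_rate": sum(error_counts.values()) / count,
            "top_error": most_common[0][0] if most_common else None,
        }
    return summary


# Hypothetical review log.
reviews = [
    ("A", None),
    ("A", "missed_label"),
    ("A", "missed_label"),
    ("A", "wrong_class"),
    ("B", None),
    ("B", None),
]
trends = error_trends(reviews)
```

In this example annotator A most often misses labels, so guidance for A can focus on that specific habit rather than on a general reminder to be careful.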


Is there a system in place that allows more than one person to review?

 



Annotation is a huge task that requires multiple people to complete, but if there is only one person responsible for reviewing, it may be impossible to check the data accurately due to the sheer volume of work. There are also cases where a different perspective can spot things that one person would not. 

Having multiple supervisors check the annotations for mistakes or missed labels helps prevent errors.
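A multi-reviewer check could be organised along the lines of the sketch below, which sorts items by whether the reviewers agree. It assumes each reviewer records a simple pass or fail verdict per item. The function name and the four categories are assumptions made for illustration.

```python
def triage(verdicts):
    """Sort items into buckets based on the pass/fail verdicts of their reviewers.

    verdicts maps an item ID to a list of booleans, one per reviewer
    (True means the reviewer found the annotation correct).
    """
    result = {"accepted": [], "disputed": [], "rejected": [], "unreviewed": []}
    for item_id, votes in verdicts.items():
        if not votes:
            # Nobody has checked this item yet, so it must not be used.
            result["unreviewed"].append(item_id)
        elif all(votes):
            result["accepted"].append(item_id)
        elif not any(votes):
            result["rejected"].append(item_id)
        else:
            # Reviewers disagree: one of them saw something the other did not.
            result["disputed"].append(item_id)
    return result


# Hypothetical verdicts from two reviewers.
buckets = triage({
    "img1": [True, True],
    "img2": [True, False],
    "img3": [False, False],
    "img4": [],
})
```

Disputed items are the ones where a second perspective caught something, so they are good candidates for discussion and for refining the annotation rules.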

 

 

5. Summary

 

Training data is essential for AI, and its accuracy directly affects the quality of the AI. There are a number of factors needed to produce highly accurate training data, such as high-level technical annotation, comprehensive domain knowledge, and a strong quality control system.

Annotation work may seem like a simple task at first glance, since it involves labelling large amounts of data, so in some cases, companies try to keep labour costs down by outsourcing to individual crowd workers. However, in such cases, even if the work content and rules are decided, the accuracy of the annotation may be low due to the annotators' lack of understanding or knowledge, and in some cases, additional labour costs may be incurred for the correction work. In addition to highly skilled annotators, it is also necessary to create an organisation that can manage them with effective communication and achieve objectives and deadlines.

Furthermore, even with high levels of skill and quality control in creating training data, eliminating human error entirely is not easy. A prompt and effective response to errors as they occur is crucial for creating reliable data.

By addressing these factors, you can create high-quality training data and, in turn, build highly accurate AI.

 

 


 

 

Author

 


 

Toshiyuki Kita
Nextremer VP of Engineering

After graduating from the Graduate School of Science at Tohoku University in 2013, he joined Mitsui Knowledge Industry Co., Ltd. As an engineer in the SI and R&D departments, he was involved in time series forecasting, data analysis, and machine learning. Since 2017, he has been involved in system development for a wide range of industries and scales as a machine learning engineer at a group company of a major manufacturer. Since 2019, he has been in his current position as manager of the R&D department, responsible for the development of machine learning systems such as image recognition and dialogue systems.

 
