Training an AI with highly accurate data is essential to building a highly accurate AI. In order for the AI to learn effectively, it is necessary to have data that has been annotated appropriately.
Of course, annotation is an important process, but equally important is the quality of the original 'data' being annotated. No matter how well the annotation is done, if the quality of the original data is poor, the accuracy of the AI will not improve.
In this article, we will introduce six key considerations when collecting data. By reading through to the end, you will learn about data collection from the basics to advanced applications, including how to choose data and how to prevent overfitting.
In the field of AI, annotation refers to the process of adding information labels to images, videos, audio and other data. Since AI learns from annotated data, annotation plays a crucial role in determining the accuracy of AI.
The quality of the original data matters for every type of annotation. For example, if you want to build an AI to analyse the movement of dogs, but you collect only images of cats, or of dogs in the same pose, then most of that data will be useless.
Furthermore, if you want to analyse the behaviour of sick dogs but only train the AI with videos of healthy dogs, it won't be able to understand the behavioural patterns of sick dogs. Bias in the data leads to inaccurate AI. Therefore, even if you have trained with a lot of data, if the original data is biased and of poor quality, the AI will be less accurate.
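Here are six points to keep in mind when collecting data:
1. Clarify the purpose of data collection
2. Collect a sufficient amount of data
3. Collect data that suits the operating environment
4. Cover the patterns required for the AI's purpose
5. Avoid bias in the data
6. Handle personal information with care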
Let’s see each point in detail.
The ultimate goal of data collection is to "improve the accuracy of the AI". However, the purpose of data collection changes depending on the stage of completion of the AI, and the characteristics of the data collected also change as follows.
| No. | Development stage | Purpose of data collection |
|---|---|---|
| 1 | Initial testing stage | Measure the initial accuracy of AI |
| 2 | Model deployment | Run the AI in real conditions |
| 3 | Model improvement | Improve areas where accuracy is lower than expected |
During the "Initial testing stage", you need to collect the minimum data required to measure the AI's initial accuracy. This process is necessary to determine how accurate the AI is when it is first built.
In the "Model deployment" stage, you need to collect the data required to actually run the AI. This is the stage where the accuracy of the AI is determined, and a large amount of data is required.
In the "Model improvement" stage, data is collected with a focus on areas where accuracy is lower than initially expected. For example, if the accuracy of detecting moving vehicles in image analysis for automated driving is low, it may be necessary to train the AI on additional vehicle image data.
Thus, the type and amount of data to collect differ for each of the three stages, so it is very important to be clear about the objective of data collection at each stage and to consider carefully which data to collect and train on.
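For the model improvement stage, one simple way to find the areas that need more data is to measure accuracy separately for each class. The following is a minimal sketch, not a definitive method: the function names, the sample labels and the 0.8 target are all assumptions made for illustration.

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels):
    """Return {class: fraction of that class's samples predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def classes_needing_data(true_labels, predicted_labels, target=0.9):
    """List the classes whose accuracy falls below the (assumed) target."""
    accuracy = per_class_accuracy(true_labels, predicted_labels)
    return sorted(label for label, acc in accuracy.items() if acc < target)


# Hypothetical evaluation results, for illustration only.
truth = ["car", "car", "car", "car", "person", "person", "bicycle", "bicycle"]
preds = ["car", "car", "car", "person", "person", "person", "car", "bicycle"]

weak = classes_needing_data(truth, preds, target=0.8)
print(weak)
```

Classes that appear in the result are candidates for additional, targeted data collection.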
If the number of data samples is too small, the AI won’t be able to learn adequately, and the prediction accuracy of the AI will be low.
AI learns using training data (annotated data). For this reason, it is much more challenging to analyse new situations that are not included in the training data. With a small amount of data and limited examples, the AI will not be able to make accurate predictions or analyses.
For the AI to recognise data accurately, it is crucial to collect data that reflects the actual operating environment. If the system is to be used in a specific or unique environment, data must be collected on-site or selected to reflect those real-world situations.
When selecting and collecting data suitable for the operating environment, it is important to pay attention to the types of data. Here we will explain the most commonly used formats.
Images and videos
For images and videos, you need to train the AI with the same perspective and resolution that it will see in operation. So, if you want a highly accurate AI, you will need to install a camera at the actual location to collect the relevant data.
For example, if you want to capture a car from a bridge, you need data from the angle of view from the bridge. Images of a car taken from the front on the road would be irrelevant, because the image of a car seen from the front is different from the image of a car seen from above.
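If each image is stored with capture metadata, the idea above can be applied as a simple filter. This is a sketch under stated assumptions: the `ImageRecord` fields, the `camera_angle` labels and the file names are hypothetical, and real datasets may record this information differently.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    path: str
    width: int
    height: int
    camera_angle: str  # hypothetical label such as "overhead" or "front"


def matches_deployment(record, resolution, angle):
    """True if the image has the resolution and viewpoint used in operation."""
    return (record.width, record.height) == resolution and record.camera_angle == angle


# Hypothetical records, for illustration only.
records = [
    ImageRecord("bridge_001.jpg", 1920, 1080, "overhead"),
    ImageRecord("road_014.jpg", 1920, 1080, "front"),
    ImageRecord("bridge_002.jpg", 640, 480, "overhead"),
]

# Keep only images taken from the bridge viewpoint at the operating resolution.
usable = [r.path for r in records if matches_deployment(r, (1920, 1080), "overhead")]
print(usable)
```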
Audio
For audio data, it is necessary to teach the AI to understand filler words such as "um" or "uh".
For instance, an AI that hasn’t been trained on conversational speech won’t be able to distinguish between "uhh, the house..." and "that house...". However, an AI trained to understand interjections will recognise the former "uhh" as meaningless.
In addition, dialects, regional variations, slang and jargon are also features unique to the spoken language. If the speaker of the audio you want to analyse has a strong dialect, the AI should be trained accordingly.
Text
It might seem that all text is the same, but for advanced analysis, data must be collected according to the specific purpose.
For example, if you want to analyse texts with unique styles, such as classical literature or old newspapers, you will need to collect data suitable for each of these. Similarly, if you are analysing academic papers with many technical terms, you will need to collect data that includes not only the technical terms and their context, but also the equipment mentioned in the text.
In such cases, it is recommended to collect transcribed text data to train the AI for this particular environment.
For AI to work effectively, the data used for training needs to include a variety of patterns that are relevant to the AI's purpose. If you want the AI to analyse more detailed information, you will need a wider range of data.
For example, if the goal of the AI is to detect moving vehicles in videos, you may want it to not only recognise the vehicles but also analyse the "passenger capacity" or "type of vehicle". To achieve this, you need to train the AI with a balanced selection of images showing both 5-passenger and 7-passenger cars so that it can learn these distinctions.
However, if there is a lack of data on 7-passenger vehicles, or if they are not included at all, the AI's accuracy in recognising this type of vehicle will be lower. So even the same AI may require different data patterns depending on its purpose. Moreover, if you need to measure the type of vehicle but do not need to identify the gender of pedestrians, detailed training on human images will not be necessary, but a wide variety of vehicle image patterns will be.
Make sure that your data collection covers all the necessary patterns based on the objectives.
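A coverage check like this can be sketched in a few lines. The example below assumes each sample is described by a dictionary of attributes. The `seats` attribute follows the 5-passenger and 7-passenger example, while `lighting` is a hypothetical second attribute added only for illustration.

```python
from collections import Counter
from itertools import product


def missing_patterns(samples, required_values, min_count=1):
    """Return the attribute combinations that have fewer than min_count samples.

    samples: list of dicts, one per data sample
    required_values: {attribute name: list of values that must be covered}
    """
    attributes = list(required_values)
    counts = Counter(tuple(sample[a] for a in attributes) for sample in samples)
    all_combinations = product(*(required_values[a] for a in attributes))
    return [combo for combo in all_combinations if counts[combo] < min_count]


# Hypothetical dataset description, for illustration only.
samples = [
    {"seats": 5, "lighting": "day"},
    {"seats": 5, "lighting": "night"},
    {"seats": 7, "lighting": "day"},
]
required = {"seats": [5, 7], "lighting": ["day", "night"]}

gaps = missing_patterns(samples, required)
print(gaps)
```

Any combination listed in `gaps` is a pattern the dataset does not yet cover.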
Thinking "it doesn't matter what it is, it's better to have lots of data to learn from" is not a good way to build a high-quality AI. While it's true that more data is generally better for training, the AI will become less accurate if the data is not diverse and balanced. Here are two key considerations for avoiding data bias:
If the AI is trained with too much data from one specific category (class), its ability to predict other categories will be affected.
Let's say you collect image data and train an AI to measure the flow of people, cars and bicycles. However, if the data contains many images of people and cars, but very few of bicycles, the AI may become good at recognising people and cars, but struggle to distinguish between bicycles and motorcycles.
In this way, a bias towards a particular class of data can leave you with an AI that is highly accurate for that class, but not very accurate overall. To avoid this, it's important to collect data in a balanced way.
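Before training, class balance can be checked by counting the labels. The following sketch uses the people, cars and bicycles example. The label counts are invented for illustration, and matching the largest class is just one possible target, not a rule.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the dataset and the largest/smallest ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {label: n / total for label, n in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio


def additional_samples_needed(labels):
    """Samples each class would need in order to match the largest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest - n for label, n in counts.items()}


# Hypothetical label counts, for illustration only.
labels = ["person"] * 50 + ["car"] * 40 + ["bicycle"] * 10

shares, ratio = class_balance(labels)
needed = additional_samples_needed(labels)
print(shares, ratio, needed)
```

A high ratio, or a class with a very small share, suggests that more data should be collected for that class.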
Generalisation refers to 'the ability of AI to accurately predict new, unseen data'. For example, even when seeing an animal for the first time, the human brain can determine which part is the leg and predict how it will move with a high degree of accuracy. This ability to generalise based on the rules of nature is what makes the human brain so powerful. If the data you collect is too biased, the AI's ability to generalise can be weakened.
If the dataset is too small or biased, the AI will learn from this limited data, which can lead to overfitting. Overfitting occurs when the AI learns the unique features of the biased data that don't apply to real-world situations. As a result, the AI may make predictions based on these biased patterns rather than general rules.
If the data is heavily skewed, the AI may start treating the biased data as normal, so adding more data with the same skew can make prediction accuracy worse rather than better. Once this happens, further training on that data is unlikely to recover the accuracy, and you will end up wasting your time and effort.
When collecting data, avoid any bias in both the amount and the content of the data.
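A common way to notice the overfitting described above is to hold back part of the collected data for validation and compare the accuracy on both parts. This technique is not described in the article itself. The sketch below assumes a simple random split, and the function names and the 0.1 gap threshold are arbitrary choices for illustration.

```python
import random


def holdout_split(samples, validation_fraction=0.2, seed=0):
    """Shuffle the samples and set aside a fraction of them for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def looks_overfit(train_accuracy, validation_accuracy, max_gap=0.1):
    """Flag a model whose training accuracy far exceeds its validation accuracy."""
    return (train_accuracy - validation_accuracy) > max_gap


# Hypothetical sample IDs, for illustration only.
train_set, validation_set = holdout_split(range(10))
print(len(train_set), len(validation_set))

# Hypothetical accuracy figures: high on training data, much lower on held-out data.
print(looks_overfit(0.99, 0.70))
```

Note that if the whole dataset shares the same bias, the validation part shares it too, so this check does not replace collecting data from the real operating environment.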
Image and sound data can contain personal information. If a specific individual can be identified from the data, it counts as personal information, so you need to be careful about how you collect the data and how you handle it afterwards.
Also, if you outsource data collection or annotation to a specialist company, make sure you choose one that maintains a high level of security. If individuals can be identified, even audio and text data can be considered personal information, so it's important to be careful with all types of data.
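As one small example of careful handling, names stored alongside the data can be replaced with pseudonyms before the data is shared for annotation. This is only a sketch: the record fields, the file name and the key are hypothetical. It also does not make the data anonymous by itself, because faces and voices in the recordings can still identify a person.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Replace an identifier with a keyed hash.

    The same person always maps to the same pseudonym, so records can still
    be grouped per speaker without storing the raw name.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical key and record, for illustration only.
# In practice the key would be kept secret and stored separately from the data.
key = b"example-key-do-not-use"
record = {"speaker": "Taro Yamada", "audio_file": "rec_001.wav"}

shared_record = {**record, "speaker": pseudonymise(record["speaker"], key)}
print(shared_record)
```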
In this article, we have introduced some points to keep in mind when collecting data.
A well-constructed AI can make highly accurate predictions and analyses that humans cannot, but if the very foundation of the AI - data collection - is not done properly, the result will be an inaccurate and ineffective AI. Make sure that someone with a solid understanding of AI does the data collection.
By outsourcing to a company that has the right expertise and resources for data collection, you can collect high-quality data and also avoid spending on staff and training costs.