What is a bounding box? How is it used in YOLO? A thorough explanation of the advantages and disadvantages of object detection methods

Written by Toshiyuki Kita | Jan 16, 2026 11:12:10 AM

AI-based image analysis technologies, such as AI-driven image generation and autonomous driving, are now being utilized in earnest. However, many people may not yet fully understand the mechanisms through which image analysis is performed.

In this article, we will provide a detailed explanation of everything from the benefits and drawbacks of bounding boxes, which are frequently used in object detection for image analysis, to representative methods of object detection.

By reading to the end, you will understand the overview of bounding boxes and the characteristics of YOLO, a representative object detection method that utilizes them.

【Table of Contents】

Bounding Boxes Are Rectangles Used for Object Detection
Representation Methods for Bounding Boxes
Three Benefits of Bounding Boxes
Three Drawbacks of Bounding Boxes
Common Methods for Object Detection in Images
Features of YOLO and the Use of Bounding Boxes
Summary | Bounding Boxes Are Advantageous in Terms of Man-Hours and Cost

1. Bounding Boxes Are Rectangles Used for Object Detection

A bounding box is a rectangle drawn around an object for object detection. It allows image annotation to be performed through a simpler process than polygons (multi-sided outlines) or segmentation (pixel-level labeling).

Bounding boxes are generally used to determine the position of a specific object within an image and its class, in other words, what that object is.

Classification categorizes an object by type. For example, in a photo containing a person and a dog, the two would be classified as "Person" and "Dog" respectively. The probability that each subject is a person or a dog can also be calculated.
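
As a rough illustration of how class probabilities can be calculated, the sketch below applies a softmax function to raw class scores. The softmax choice, the class list, and the score values are assumptions made for this example; actual detectors differ in how they produce probabilities.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtract the maximum score first for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores a model might output for one detected region.
class_names = ["Person", "Dog"]
raw_scores = [2.0, 0.5]

probs = softmax(raw_scores)
for name, p in zip(class_names, probs):
    print(f"{name}: {p:.2f}")
```

The class with the highest probability is then reported as the label of the bounding box.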

Going further, utilizing bounding boxes allows for the identification of exactly where a specific object exists within an image. This is used to recognize "what is where" and plays an extremely important role in a wide variety of fields, from autonomous vehicles to security camera systems.

2. Representation Methods for Bounding Boxes

Although bounding boxes appear to be simple rectangles at first glance, there are three representation methods, which differ in how they encode position and size and in whether they are 2D or 3D.

① Representation via two diagonal points
② Representation via coordinates
③ 3D representation via cubes

Here, we explain the mechanism of each representation method.

① Representation via two diagonal points

Representation via two diagonal points describes a rectangle by the coordinates of two diagonally opposite corners, typically the top-left and bottom-right. Because the sides of the box are parallel to the image axes, the other two corners follow from these two points; thus, a bounding box can be represented using only two coordinate pairs.

The primary benefit of the two-diagonal-point representation method is its conciseness: a bounding box is completely described by just two sets of coordinates.

② Representation via coordinates

Representation via coordinates describes a rectangle through three pieces of information: the coordinates of the center point, the height, and the width. Since this is also a 2D representation, it carries the same information as the two-diagonal-point representation, and each can be converted into the other.
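
The two 2D representations can be converted into each other with simple arithmetic. The following is a minimal sketch; the function names and the sample box are our own, and it assumes pixel coordinates with the origin at the top-left of the image.

```python
def corners_to_center(x_min, y_min, x_max, y_max):
    """Two-diagonal-point format -> (center x, center y, width, height)."""
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    w = x_max - x_min
    h = y_max - y_min
    return cx, cy, w, h

def center_to_corners(cx, cy, w, h):
    """(center x, center y, width, height) -> two-diagonal-point format."""
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

# Hypothetical box: top-left corner (50, 40), bottom-right corner (150, 100).
box = (50, 40, 150, 100)
center_box = corners_to_center(*box)
print(center_box)                      # (100.0, 70.0, 100, 60)
print(center_to_corners(*center_box))  # (50.0, 40.0, 150.0, 100.0)
```

Annotation tools and detection models do not always agree on the format, so a conversion like this is often needed when preparing training data.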

③ 3D representation via cubes

3D representation via cubes represents an object three-dimensionally, enclosing it in a box-shaped solid (a cuboid) rather than a flat rectangle. In this case, an object cannot be represented by a single rectangle alone.

The main benefit is that it allows for more detailed object representation by taking the 3D environment into account. This representation method is used when utilizing 3D information obtained from sensors such as LiDAR. Since the object is described by eight vertices instead of four, the representation is more precise.

However, a drawback is the increased computational complexity, requiring more computing resources.
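
To make the 3D case concrete, the sketch below computes the eight vertices of a 3D box. The parameterization used here (center, length, width, height, and a rotation angle around the vertical axis) is an assumption chosen for illustration; datasets and tools define 3D boxes in different ways.

```python
import math

def cuboid_corners(cx, cy, cz, length, width, height, yaw):
    """Return the 8 corners of a 3D box rotated by `yaw` (radians) around the vertical z axis."""
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-width / 2, width / 2):
            for dz in (-height / 2, height / 2):
                # Rotate the horizontal offset (dx, dy) by yaw, then shift to the box center.
                x = cx + dx * math.cos(yaw) - dy * math.sin(yaw)
                y = cy + dx * math.sin(yaw) + dy * math.cos(yaw)
                z = cz + dz
                corners.append((x, y, z))
    return corners

# Hypothetical car-sized box, rotated 30 degrees.
corners = cuboid_corners(10.0, 5.0, 1.0, length=4.0, width=2.0, height=1.5,
                         yaw=math.radians(30))
print(len(corners))  # 8
```

Compared with the four numbers of a 2D box, this needs seven parameters and trigonometric operations per box, which hints at why 3D representations cost more to compute.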

3. Three Benefits of Bounding Boxes

Here are three benefits of bounding boxes.

① Low annotation man-hours
② Low annotation costs
③ Utilization in YOLO for rapid object detection

We will explain each benefit.

① Low annotation man-hours

Bounding boxes can be annotated just by specifying two vertices. Therefore, they can be annotated with less effort than other representative methods such as polygons or segmentation.

However, while man-hours are low, accuracy will not improve unless the quality of the annotation work is high. Low-quality annotation lowers the training precision of the AI, ultimately causing increased costs through additional corrections and rework.

Therefore, bounding box annotation must strike a balance between efficiency and quality. Make sure the annotation is performed with high precision.

② Low annotation costs

Bounding boxes have relatively low annotation costs because they require fewer man-hours compared to other annotation methods.

The market prices for image annotation are as follows:

Annotation Type	Market Price
Bounding Box (Rectangle)	10 yen (per object)
Polygon	20–50 yen (per object)
Segmentation	100–300 yen (per image)
Landmark	5–10 yen (per point)

In this way, bounding boxes tend to keep costs lower compared to other types of annotation. Especially for startups looking to minimize initial investment as much as possible or projects handling large-scale image data, bounding boxes are an extremely useful method.
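
For a rough sense of scale, the sketch below applies the unit prices in the table above to a hypothetical project. The project size (10,000 images with 3 objects each) is an assumption for illustration only, and actual quotes will vary by vendor and task.

```python
# Unit prices in yen from the table above, as (low, high) ranges.
PRICES = {
    "bounding_box": (10, 10),    # per object
    "polygon": (20, 50),         # per object
    "segmentation": (100, 300),  # per image
}

def estimate_cost(num_images, objects_per_image):
    """Return {annotation type: (low, high)} total cost in yen."""
    num_objects = num_images * objects_per_image
    # Bounding boxes and polygons are billed per object, segmentation per image.
    units = {
        "bounding_box": num_objects,
        "polygon": num_objects,
        "segmentation": num_images,
    }
    return {name: (low * units[name], high * units[name])
            for name, (low, high) in PRICES.items()}

# Hypothetical project: 10,000 images with 3 objects each.
costs = estimate_cost(10_000, 3)
for name, (low, high) in costs.items():
    print(f"{name}: {low:,} to {high:,} yen")
```

Because segmentation is priced per image while bounding boxes are priced per object, the gap narrows as the number of objects per image grows.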

③ Utilization in YOLO for rapid object detection

Bounding boxes are also used in a high-speed object detection algorithm called YOLO. YOLO is an object detection method capable of high-speed and high-precision prediction. Its name is an acronym of "You Only Look Once," referring to its ability to detect objects by looking at the image just once.

YOLO is used in fields where real-time performance is required, such as autonomous driving and industrial robotics, and its object detection method utilizes bounding boxes.

4. Three Drawbacks of Bounding Boxes

While bounding boxes offer many benefits due to their convenience and efficiency, there are also drawbacks that must be considered. Below are the main drawbacks of bounding boxes.

① Can only be used for partial object detection
② Difficulty with detailed classification within classes
③ Accuracy may drop depending on the target shape

Understanding the scenarios in which bounding boxes are suitable and those in which they are not is necessary for selecting the optimal image annotation method. By choosing the best method for each purpose and need, you can achieve efficient and high-precision AI training.

We will explain each drawback.

① Can only be used for partial object detection

A bounding box is a method for detecting specific objects within an image. Therefore, it is not suitable for classifying an image as a whole or for detecting objects with complex shapes.

For example, it is not suitable for tasks such as determining the season from an image of natural scenery.

As a result, when the image must be evaluated as a whole, select an appropriate method, for example by combining image classification with bounding boxes.

② Difficulty with detailed classification within classes

While bounding boxes allow for the classification of classes (types), classification of fine attributes or states within them is difficult, and identifying those requires more detailed labeling.

For example, when recognizing human facial expressions, bounding boxes can identify the class as "Person." However, identifying fine states such as a "Happy Person" or "Angry Person" is difficult.

To achieve detailed classification, it becomes necessary to introduce more detailed labeling or other image recognition technologies, such as facial recognition or expression recognition. Keypoint annotation, for example, is better suited to identifying human movements or emotions.

③ Accuracy may drop depending on the target shape

If the target object has a complex shape, it is difficult for a bounding box to completely capture the object's outline. Compared to annotation methods that can capture more detailed shapes, such as polygons or segmentation, accuracy may be inferior.

For example, when detecting plant leaves, the shapes of the leaves are varied and complex. Enclosing a leaf in a bounding box will also enclose parts other than the leaf, resulting in lower accuracy. In such cases, methods that allow for pixel-level labeling, such as segmentation, are more suitable.
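
The amount of non-object area inside a box can be measured directly. The sketch below computes what fraction of the tightest bounding box is actually covered by the object; the 6x6 mask standing in for a thin diagonal leaf is a made-up example.

```python
def bounding_box_of_mask(mask):
    """Return (x_min, y_min, x_max, y_max), inclusive, of the tightest box around all 1-pixels."""
    ys = [y for y, row in enumerate(mask) for v in row if v]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return min(xs), min(ys), max(xs), max(ys)

def fill_ratio(mask):
    """Fraction of the bounding box area that belongs to the object."""
    x_min, y_min, x_max, y_max = bounding_box_of_mask(mask)
    box_area = (x_max - x_min + 1) * (y_max - y_min + 1)
    object_area = sum(sum(row) for row in mask)
    return object_area / box_area

# Hypothetical mask of a thin, diagonal leaf (1 = leaf pixel, 0 = background).
leaf_mask = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
print(f"{fill_ratio(leaf_mask):.2f}")  # 0.44
```

In this example more than half of the box is background, which is the kind of case where polygons or segmentation describe the object more faithfully.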

5. Common Methods for Object Detection in Images

There are various methods for object detection in images, such as the following:

① YOLO
② R-CNN
③ Fast R-CNN
④ Faster R-CNN
⑤ SSD
⑥ DCN
⑦ DETR

We will explain each method.

① YOLO (You Only Look Once)

YOLO is one of the representative object detection methods that uses bounding boxes, and it processes the entire image at once. It divides the image into multiple cells and simultaneously predicts the probability that each cell contains the center of a bounding box, the position and size of that bounding box, and the object class.

This allows object detection to be performed by looking at the entire image only once. While YOLO is extremely fast, it tends to struggle with detecting small objects.

Detection accuracy for small objects, and for large objects that overlap in the image, has improved gradually with successive versions of YOLO.

We will explain the characteristics of YOLO in detail in Section 6.
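
The grid division described above can be sketched as follows: given a box in center format, find the cell that contains its center and express the box relative to that cell and to the image. The 7x7 grid and 448x448 image size follow the original YOLO paper and are assumptions here; later versions use different encodings.

```python
def assign_to_grid(box, image_w, image_h, grid_size=7):
    """Map a center-format box (cx, cy, w, h in pixels) to its grid cell.

    Returns (row, col, x_offset, y_offset, w_rel, h_rel), where the offsets are
    the center position inside the cell (0 to 1) and w_rel / h_rel are the box
    size relative to the whole image.
    """
    cx, cy, w, h = box
    cell_w = image_w / grid_size
    cell_h = image_h / grid_size
    # Clamp so a center exactly on the right or bottom edge stays inside the grid.
    col = min(int(cx / cell_w), grid_size - 1)
    row = min(int(cy / cell_h), grid_size - 1)
    x_offset = cx / cell_w - col
    y_offset = cy / cell_h - row
    return row, col, x_offset, y_offset, w / image_w, h / image_h

# Hypothetical box centered at (100, 70) in a 448x448 image (each cell is 64 px wide).
target = assign_to_grid((100, 70, 100, 60), 448, 448)
print(target[:4])  # (1, 1, 0.5625, 0.09375)
```

Each cell is responsible only for objects whose center falls inside it, which is one reason several small objects crowded into the same cell are hard to detect.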

② R-CNN

R-CNN (Regions with CNN features) is a method that proposes candidate regions for objects and classifies them using a CNN (Convolutional Neural Network). Since R-CNN appeared in 2013, object detection development utilizing deep learning has become highly active.

Specifically, it proposes candidate regions where objects are likely to exist in the image using an algorithm called Selective Search and treats them as bounding boxes. Subsequently, features are extracted by a CNN for the parts within each bounding box, and classification is performed.

In R-CNN, around 2,000 candidate regions are selected before analysis and features are then extracted from each one, which requires an enormous amount of computation. To address this drawback, derivative methods such as Fast R-CNN and Cascade R-CNN have been developed.

③ Fast R-CNN

Fast R-CNN is an improved version of R-CNN. Instead of running the CNN separately on every candidate region, it runs the CNN once over the whole image and extracts the features for each candidate region from the shared feature map, which makes it much faster than R-CNN. Additionally, Fast R-CNN learns the object's position and the object's class simultaneously in a single network.

In Fast R-CNN, bounding boxes are used as representations of object candidate regions, just like in R-CNN.

④ Faster R-CNN

Faster R-CNN is a further improvement of Fast R-CNN, which introduces a network called RPN (Region Proposal Network) to perform the proposal of object candidate regions.

This improved the accuracy of object candidate region proposals and achieved further speed increases.

Faster R-CNN also uses bounding boxes, but it differs in that it proposes bounding boxes directly from the CNN's feature maps.
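
As a supplement, the Faster R-CNN paper describes the RPN as scoring a fixed set of reference boxes, called anchors, at each position of the feature map. The sketch below generates such anchors for one position. The scales and aspect ratios are the values reported in that paper, while the function name and the center point are our own assumptions.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return corner-format anchors centered at (cx, cy), one per (scale, ratio) pair."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio = height / width
            # Keep the area at scale * scale while changing the aspect ratio.
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# Hypothetical feature map position that corresponds to pixel (300, 200).
anchors = make_anchors(300, 200)
print(len(anchors))  # 9
```

The RPN then predicts, for each anchor, whether it contains an object and how the anchor should be adjusted to fit it.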

⑤ SSD

SSD (Single Shot MultiBox Detector), like YOLO, is a method that performs object detection in a single inference. However, while the original YOLO predicts from a single-scale grid, SSD differs in that it utilizes feature maps of multiple scales. (Later versions of YOLO also adopted multi-scale prediction.)

This feature allows SSD to effectively detect objects of various sizes.
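
To show what multiple scales means in practice, the sketch below counts the candidate boxes in the SSD300 configuration reported in the SSD paper. The layer sizes and boxes per location come from that paper, not from this article, so treat them as an assumption about one specific model variant.

```python
# (feature map side length, default boxes per location) for the SSD300 configuration.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

def count_default_boxes(layers):
    """Total number of candidate boxes across all feature maps."""
    return sum(side * side * boxes for side, boxes in layers)

for side, boxes in SSD300_LAYERS:
    print(f"{side}x{side} map: {side * side * boxes} boxes")
print("total:", count_default_boxes(SSD300_LAYERS))  # total: 8732
```

The fine 38x38 map contributes most of the boxes and handles small objects, while the coarse maps near the end cover objects that fill much of the image.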

⑥ DCN

DCN (Deformable Convolutional Networks) was developed to capture objects that could not be captured well by bounding boxes alone. It allows object detection to be performed with high precision even if the target object's shape is irregular.

In DCN, the sampling positions of the convolutional layer are not restricted to a fixed rectangular grid; the network learns offsets that shift them to follow the object. As a result, objects with unusual shapes can also be detected with high precision. Bounding boxes and DCN can be said to have a complementary relationship.

⑦ DETR

DETR (Detection Transformer) is a new approach that applies the Transformer architecture to object detection tasks. Normally, R-CNN, YOLO, SSD, etc., have been primarily used for object detection, but DETR takes a completely different approach.

DETR utilizes a Transformer consisting of an encoder and a decoder to detect objects within an image. The encoder extracts features from the image, and the decoder estimates the position and class of objects from those features. This allows for bounding box and class prediction to be performed simultaneously.

However, since the output is still a rectangular bounding box, there are limitations if the object's shape is complex. On the other hand, DETR utilizes the characteristics of the Transformer to detect objects while taking the overall context of the image into account. This overcomes some of the challenges faced by traditional object detection methods.

The article "How does object detection work? A thorough explanation of the amount of data required, use cases, and construction steps!" provides a detailed explanation of object detection mechanisms and application examples. Reading it alongside this article will further deepen your understanding.

6. Features of YOLO and the Use of Bounding Boxes

YOLO (You Only Look Once) detects objects by "looking at the image only once," as its name suggests. Therefore, real-time object detection becomes possible, and it is used in fields requiring high-speed object detection, such as autonomous driving technology.

YOLO, which performs object detection using bounding boxes, has three features that overcome traditional challenges.

① Ability to perform object detection in real-time
② Reduction of false detections
③ License-free for commercial use

Here, we explain each feature.

① Ability to perform object detection in real-time

YOLO detects all objects simultaneously. In other words, by looking at the entire image only once, it predicts the positions and classes of the objects contained within it; thus, it is an object detection method that has achieved real-time performance and high precision.

YOLO does not run a separate stage to propose candidate regions. A single pass through the network performs object localization and identification simultaneously, and predictions unlikely to contain a target object are discarded based on their confidence scores, so rapid object detection is possible. This is in contrast to two-stage detection algorithms, which first propose regions and then classify them one by one.

Because objects can now be recognized in a very short time, object detection has become practical even in fields requiring rapid processing, such as autonomous driving and the medical industry.

② Reduction of false detections

YOLO divides the image into a grid and predicts bounding boxes and classes for each cell. Because each prediction is made with the whole image as context, and predictions unlikely to contain a target object are discarded, the risk of falsely detecting blank backgrounds or scenery as target objects can be reduced.

Additionally, because it performs inference through a combination of grid cells and bounding boxes, ranges and objects can be predicted with high precision.
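
One common way to discard unlikely or redundant predictions is to apply a confidence threshold and then non-maximum suppression (NMS). The sketch below shows both steps for a single class. NMS is a widely used post-processing step rather than something described in this article, and the thresholds and sample detections are hypothetical.

```python
def iou(a, b):
    """Intersection over Union of two corner-format boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_detections(detections, conf_threshold=0.5, iou_threshold=0.5):
    """detections: list of (box, confidence). Returns the detections that survive."""
    # Step 1: drop low-confidence predictions, which are mostly background.
    candidates = sorted((d for d in detections if d[1] >= conf_threshold),
                        key=lambda d: d[1], reverse=True)
    kept = []
    # Step 2: keep a box only if it does not overlap heavily with a stronger kept box.
    for box, conf in candidates:
        if all(iou(box, k[0]) < iou_threshold for k in kept):
            kept.append((box, conf))
    return kept

detections = [
    ((50, 40, 150, 100), 0.90),
    ((55, 45, 155, 105), 0.80),    # near-duplicate of the first box
    ((300, 200, 380, 300), 0.70),
    ((10, 10, 30, 30), 0.10),      # low confidence, likely background
]
kept = filter_detections(detections)
print(len(kept))  # 2
```

In a multi-class detector this filtering is typically applied per class, so that overlapping objects of different classes are not suppressed.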

③ License-free for commercial use

YOLO is provided as open source and can be used for commercial purposes. As of June 2023, versions up to YOLOv8 have been released, and they can be used for free as long as the license terms of the specific implementation, including rules regarding copyright and attribution, are followed. Because license terms differ between versions and implementations, check them before commercial use.

Anyone with a Python environment can try YOLO on their own. If you are interested, it may be worth trying it out.

7. Summary | Bounding Boxes Are Advantageous in Terms of Man-Hours and Cost

In this article, we explained the overview of bounding boxes and object detection methods that utilize them.

A bounding box is a rectangle that encloses a specific object in an image and is a core part of object detection. Thanks to its efficiency and simplicity, it is used in a wide range of machine learning models.

Bounding boxes are used in various object detection methods such as YOLO and R-CNN, but each method has its own characteristics, so the optimal method differs depending on the scenario and purpose of use. Choosing a method that does not fit the scenario can significantly change annotation man-hours and costs, as well as the precision of the AI system.

Because bounding boxes are so widely used in the field of image analysis, understanding their advantages and disadvantages allows you to use them more efficiently.

