<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Object-Detection on Oriol Alàs Cercós</title><link>https://oriolac.github.io/tags/object-detection/</link><description>Recent content in Object-Detection on Oriol Alàs Cercós</description><generator>Hugo -- 0.150.0</generator><language>en-us</language><copyright>2024 Oriol Alàs Cercós</copyright><lastBuildDate>Sat, 25 Apr 2026 20:10:23 +0100</lastBuildDate><atom:link href="https://oriolac.github.io/tags/object-detection/index.xml" rel="self" type="application/rss+xml"/><item><title>Reviewing YOLO: You Only Look Once</title><link>https://oriolac.github.io/posts/20260501-yolo/</link><pubDate>Sat, 25 Apr 2026 20:10:23 +0100</pubDate><guid>https://oriolac.github.io/posts/20260501-yolo/</guid><description>&lt;p&gt;Object detection is one of the most popular tasks in computer vision, since it can be applied to a wide range of
applications: robotics, autonomous driving or fault detection. In this post, we will try to give a brief overview of
the YOLO algorithm and the components that make it work.&lt;/p&gt;
&lt;p&gt;To do that, I have classified the main components of the algorithm into three categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Characteristics based on the &lt;strong&gt;model architecture&lt;/strong&gt;: how YOLO-based models improved performance by using a new architecture, and what those improvements are.&lt;/li&gt;
&lt;li&gt;Strategies based on the &lt;strong&gt;model training&lt;/strong&gt;, such as the loss function or data augmentation.&lt;/li&gt;
&lt;li&gt;Methods for &lt;strong&gt;post-processing the output&lt;/strong&gt; of the model, such as non-maximum suppression (NMS) and the confidence threshold.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="two-stage-vs-one-stage-detectors"&gt;Two-stage vs One-stage Detectors&lt;/h2&gt;
&lt;p&gt;Before YOLO, SoTA detectors were based on a &lt;strong&gt;two-stage detector&lt;/strong&gt;: the first stage detects the bounding boxes, and the second stage classifies them. These models are called region-based detectors, because they need the region proposals before running the classification.&lt;/p&gt;</description><content:encoded><![CDATA[<p>Object detection is one of the most popular tasks in computer vision, since it can be applied to a wide range of
applications: robotics, autonomous driving or fault detection. In this post, we will try to give a brief overview of
the YOLO algorithm and the components that make it work.</p>
<p>To do that, I have classified the main components of the algorithm into three categories:</p>
<ul>
<li>Characteristics based on the <strong>model architecture</strong>: how YOLO-based models improved performance by using a new architecture, and what those improvements are.</li>
<li>Strategies based on the <strong>model training</strong>, such as the loss function or data augmentation.</li>
<li>Methods for <strong>post-processing the output</strong> of the model, such as non-maximum suppression (NMS) and the confidence threshold.</li>
</ul>
<h2 id="two-stage-vs-one-stage-detectors">Two-stage vs One-stage Detectors</h2>
<p>Before YOLO, SoTA detectors were based on a <strong>two-stage detector</strong>: the first stage detects the bounding boxes, and the second stage classifies them. These models are called region-based detectors, because they need the region proposals before running the classification.</p>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/yolo/rcnn.png#center"
         alt="Region-based Convolutional Neural Network" width="80%"/> <figcaption>
            <p>Fig. 1. RCNN architecture. <a href="https://www.listendata.com/2022/06/region-proposal-network" target="_blank" rel="noopener">Check image source</a></p>
        </figcaption>
</figure>

<p>In contrast, YOLO is a <strong>one-stage detector</strong>: YOLO models skip the first stage, run directly over a dense sampling of possible locations, and give the bounding boxes and the classification all at once. The first idea behind YOLO was to reduce the computational cost of the region-based models (increasing the FPS) while maintaining, or only slightly decreasing, the performance. The Single Shot MultiBox Detector (SSD), introduced in 2016, follows a similar single-shot idea.</p>
<blockquote>
<p>Frames per second (FPS) is a measure of the number of frames that can be processed per second, a critical factor for real-time applications.</p></blockquote>
<h2 id="the-architecture-backbone-neck-and-head">The architecture: Backbone, neck and head</h2>
<p>A YOLO model usually has three main parts:</p>
<ul>
<li>The <strong>backbone network</strong>, which extracts features from the input image. The backbone progressively reduces spatial
resolution while increasing semantic richness.</li>
<li>The <strong>neck</strong>, which combines feature maps from different stages of the backbone, usually with convolutional layers. By mixing scales, the neck helps the detector handle objects of different sizes.</li>
<li>The <strong>head</strong>, which produces the final prediction.</li>
</ul>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/yolo/arch.png#center"
         alt="YOLO-based network architecture" width="80%"/> <figcaption>
            <p>Fig. 2. YOLO-based network architecture. <a href="https://www.mdpi.com/2072-4292/15/16/3963" target="_blank" rel="noopener">Check image source</a></p>
        </figcaption>
</figure>

<p>These parts are strongly connected to the idea of <strong>multi-scale detection</strong>. YOLO does not predict bounding boxes from only one feature map. Instead, it predicts objects at different scale resolutions. For example, if the input image has size 640 × 640, a YOLO model may produce <strong>three detection scales</strong>:</p>
<pre tabindex="0"><code>P3 -&gt;  80 x 80  -&gt; small objects
P4 -&gt;  40 x 40  -&gt; medium objects
P5 -&gt;  20 x 20  -&gt; large objects
</code></pre><p>At the beginning of the <strong>backbone</strong>, the feature maps have high spatial resolution and contain low-level information,
such
as edges, corners, textures, and small visual patterns. As the image passes through deeper layers, the spatial
resolution decreases, but the semantic meaning of the features increases. The backbone network gives intermediate
feature maps to the neck according to the idea of multi-scale detection. Usually the backbone is already pre-trained on
a large dataset.</p>
<blockquote>
<p>Since the backbone is meant to extract features from the input image at different scales, the downsampling is typically done with strided convolutions or pooling layers. Each downsampling step trades spatial resolution for semantic richness, so a backbone that downsamples too aggressively loses the fine detail needed to detect small objects.</p></blockquote>
<p>The neck takes the feature maps from the backbone and mixes these features so that each detection scale benefits from both high-resolution spatial detail and high-level semantic information.</p>
<p>The detection heads are the final prediction layers. Usually, YOLO has one detection head per scale. Each head predicts
bounding boxes (4 values), confidence score (1 value), and class probabilities (C values) for its corresponding feature
map. The final prediction output
of each head has the shape \(S \times S \times (5B + C) \). If there are 3 different scales, then the output will have
three tensors with the previous shape.</p>
<blockquote>
<p>B is the number of bounding boxes predicted per grid cell. If B is 1, then the model predicts only one bounding box per grid cell.</p></blockquote>
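<p>As a quick sanity check on these shapes, here is a minimal sketch (with hypothetical values of B and C) that prints the head output size for the three scales of a 640 × 640 input:</p>
<div class="highlight"><pre tabindex="0"><code class="language-python" data-lang="python"># hypothetical configuration: 2 boxes per cell, 80 classes (COCO-sized)
B, C = 2, 80
for name, S in [("P3", 80), ("P4", 40), ("P5", 20)]:
    print(f"{name}: {S} x {S} x {5 * B + C}")  # e.g. P3: 80 x 80 x 90
</code></pre></div>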
<h3 id="the-stride">The stride</h3>
<p>The <strong>stride</strong> tells you how much the input image has been downsampled: a feature map of size \(S \times S\) on a 640 × 640 input corresponds to a stride of \(640 / S\). Therefore, each cell in the feature map corresponds to a region of the original image. The \(5B + C\) part of the output contains, for each box, the coordinates (x, y, width and height) and a confidence value, plus the C class probabilities.</p>
<blockquote>
<p>This is important because YOLO predicts object centers relative to grid cells. Small objects need high-resolution
feature maps, so they are usually predicted at lower stride, such as stride 8.</p></blockquote>
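<p>As a small illustration of this mapping (with made-up coordinates), the grid cell that owns an object is found by integer-dividing its center by the stride:</p>
<div class="highlight"><pre tabindex="0"><code class="language-python" data-lang="python"># object centered at (cx, cy) = (244.4, 163.2) on a stride-8 feature map
cx, cy, stride = 244.4, 163.2, 8
j, i = int(cx // stride), int(cy // stride)
print(i, j)  # cell (i, j) = (20, 30)
</code></pre></div>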
<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/yolo/strides.png#center"
         alt="Example of strides using different scales" width="100%"/> <figcaption>
            <p>Fig. 3. Example of strides using different scales, with the centroid of the bounding box to determine which is the stride cell of the image to predict.</p>
        </figcaption>
</figure>

<h3 id="the-confidence-score">The confidence score</h3>
<p>A YOLO head usually predicts something like: <code>tx, ty, tw, th, objectness, class probabilities</code>.
While <code>tx</code> and <code>ty</code> are the predicted center offset of the bounding box and <code>tw</code> and <code>th</code> are the predicted width and
height of the bounding box, the <code>objectness</code> value is the probability that an object exists in this prediction.
The <strong>confidence score</strong> is commonly calculated by multiplying the <code>objectness</code> value with the class probabilities.</p>
\[\text{Confidence score} = \text{Objectness} \times \text{Class probability}\]<p>The confidence score is used to remove weak predictions from the output of the model and reduces the number of low-quality detections.</p>
<h3 id="the-anchor-boxes">The anchor boxes</h3>
<p>YOLO models can be divided into two families:</p>
<ul>
<li>Anchor-free models.</li>
<li>Anchor-based models.</li>
</ul>
<p>Anchors are predefined box shapes. YOLO models use anchors to predict offsets relative to these boxes. In an anchor-based model with three scales, we can have the following anchors:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>p3 <span style="color:#f92672">=</span> [(<span style="color:#ae81ff">62</span>, <span style="color:#ae81ff">66</span>), (<span style="color:#ae81ff">45</span>, <span style="color:#ae81ff">213</span>), (<span style="color:#ae81ff">105</span>, <span style="color:#ae81ff">104</span>)]
</span></span><span style="display:flex;"><span>p4 <span style="color:#f92672">=</span> [(<span style="color:#ae81ff">196</span>, <span style="color:#ae81ff">76</span>), (<span style="color:#ae81ff">153</span>, <span style="color:#ae81ff">143</span>), (<span style="color:#ae81ff">96</span>, <span style="color:#ae81ff">316</span>)]
</span></span><span style="display:flex;"><span>p5 <span style="color:#f92672">=</span> [(<span style="color:#ae81ff">266</span>, <span style="color:#ae81ff">266</span>), (<span style="color:#ae81ff">350</span>, <span style="color:#ae81ff">465</span>), (<span style="color:#ae81ff">420</span>, <span style="color:#ae81ff">500</span>)]
</span></span></code></pre></div><p>Anchors are usually defined by clustering the box dimensions of the objects in the training set with an unsupervised clustering algorithm, such as k-means.</p>
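<p>As an illustration, here is a minimal sketch of this procedure using scikit-learn's KMeans on stand-in data (note that YOLOv2 originally clustered with a 1 − IoU distance rather than the Euclidean distance used here):</p>
<div class="highlight"><pre tabindex="0"><code class="language-python" data-lang="python">import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
wh = rng.uniform(10, 500, size=(1000, 2))  # stand-in for ground-truth (w, h) pairs

kmeans = KMeans(n_clusters=9, n_init=10).fit(wh)
anchors = sorted(kmeans.cluster_centers_.round().tolist(), key=lambda a: a[0] * a[1])
p3, p4, p5 = anchors[:3], anchors[3:6], anchors[6:]  # smallest anchors go to P3
</code></pre></div>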
<p>When using anchors, the output of the model is a tensor of shape \(S \times S \times A \times (5 + C) \), since each anchor predicts its own bounding box. Suppose we are at grid cell <code>(i, j)</code> on a feature map with stride <code>s</code>. The model predicts raw values: <code>tx, ty, tw, th</code>. Hence, a classical YOLO-style decoding example with anchor <code>(w, h) = (40, 60)</code> is:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>bx <span style="color:#f92672">=</span> (sigmoid(tx) <span style="color:#f92672">+</span> j) <span style="color:#f92672">*</span> stride  <span style="color:#75715e"># e.g. (0.55 + 30) × 8 = 244.4</span>
</span></span><span style="display:flex;"><span>by <span style="color:#f92672">=</span> (sigmoid(ty) <span style="color:#f92672">+</span> i) <span style="color:#f92672">*</span> stride  <span style="color:#75715e"># e.g. (0.40 + 20) × 8 = 163.2</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>bw <span style="color:#f92672">=</span> anchor_w <span style="color:#f92672">*</span> exp(tw)  <span style="color:#75715e"># e.g. 40 × 1.10 = 44.0</span>
</span></span><span style="display:flex;"><span>bh <span style="color:#f92672">=</span> anchor_h <span style="color:#f92672">*</span> exp(th)  <span style="color:#75715e"># e.g. 60 × 0.82 = 49.2</span>
</span></span></code></pre></div><p>In modern anchor-free YOLO variants such as YOLOX, anchors may not be used explicitly. Instead, the model directly
predicts box distances or center-based boxes.</p>
<h2 id="model-training">Model training</h2>
<p>Unlike image classification, where the model predicts one label for the whole image, object detection requires solving
<strong>several problems at the same time</strong>: deciding whether an object exists in a given location, estimating the coordinates
of
its bounding box, and assigning the correct class. For this reason, YOLO training is usually based on a multi-part loss
function that combines localization, objectness, and classification terms.</p>
<h3 id="intersection-over-union-iou">Intersection over Union (IoU)</h3>
<p><strong>Intersection over Union (IoU)</strong> is a measure of the similarity between two bounding boxes. It is the ratio between the area of the intersection and the area of the union.</p>
\[IoU = \frac{\text{Area of Intersection}}{\text{Area of union}}\]<p>It is used in several steps of the training process:</p>
<ul>
<li>Training loss.</li>
<li>Anchor assignment.</li>
<li>Post-processing techniques such as non-maximum suppression (NMS).</li>
<li>Evaluation metrics such as mAP.</li>
</ul>
<figure class="align-center ">
    <img loading="lazy" src="/posts/2026/yolo/IoU.png#center"
         alt="Intersection over Union (IoU) formula" width="50%"/> <figcaption>
            <p>Fig. 4. Intersection over Union (IoU) formula.</p>
        </figcaption>
</figure>
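<p>In code, the formula above can be written as a small helper (a sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates):</p>
<div class="highlight"><pre tabindex="0"><code class="language-python" data-lang="python">def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # 2500 / 17500 ≈ 0.143
</code></pre></div>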

<blockquote>
<p><strong>mAP (mean Average Precision)</strong> is a metric used to evaluate the performance of a detection model across multiple classes and IoU thresholds. It is computed as the mean of the Average Precision (AP) values across all classes. For one class, YOLO sorts all predictions by their confidence score; sweeping a threshold over these scores traces a precision-recall curve, and the AP is the area under that curve (AUC). The mAP is then the average of these APs across all classes.
</p>
\[mAP = \frac{1}{C}\sum_{c=1}^{C}AP_c\]<p>
</p>
\[AP_c = \sum_{n} (R_n - R_{n-1}) P_n\]<p>
where \(R_n\) is the recall at the nth confidence threshold and \(P_n\) is the best precision achieved at that recall level.</p></blockquote>
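<p>As a rough sketch of the \(AP_c\) formula (ignoring the IoU-based matching and precision interpolation that real mAP implementations also perform):</p>
<div class="highlight"><pre tabindex="0"><code class="language-python" data-lang="python">def average_precision(preds, n_gt):
    """preds: (confidence, is_true_positive) pairs; n_gt: number of ground-truth boxes."""
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, is_tp in sorted(preds, key=lambda p: p[0], reverse=True):
        tp, fp = tp + is_tp, fp + (not is_tp)
        recall, precision = tp / n_gt, tp / (tp + fp)
        ap += (recall - prev_recall) * precision  # (R_n - R_{n-1}) * P_n
        prev_recall = recall
    return ap

print(average_precision([(0.9, True), (0.8, False), (0.7, True)], n_gt=2))  # ≈ 0.83
</code></pre></div>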
<h3 id="loss-function">Loss function</h3>
<p>The loss function tells the model how wrong its predictions are during training. The YOLO loss trains the model to solve
three tasks at the same time:</p>
<ul>
<li>The <strong>box loss</strong>: localize the object correctly.</li>
<li>The <strong>objectness loss</strong>: predict whether an object exists.</li>
<li>The <strong>class loss</strong>: classify the object correctly.</li>
</ul>
<p>A simplified YOLO loss can be written as:</p>
\[
L = \lambda_{\text{box}}L_{\text{box}} + \lambda_{\text{obj}}L_{\text{obj}} + \lambda_{\text{cls}}L_{\text{cls}}
\]<p>where \(\lambda_{\text{box}}\), \(\lambda_{\text{obj}}\), and \(\lambda_{\text{cls}}\) are weighting factors used to
balance the contribution of each term.</p>
<h4 id="the-box-loss">The box loss</h4>
<p>The <strong>box loss</strong> measures how well the predicted bounding box matches the ground-truth box. Older YOLO versions used
mean squared error (MSE) over the box coordinates, but modern YOLO models usually use IoU-based losses because they are
more directly aligned with the object detection objective.
A simple IoU loss can be defined as:</p>
\[
L_{\text{box}} = 1 - IoU(b, \hat{b})
\]<p>where \(b\) is the ground-truth box and \(\hat{b}\) is the predicted box.</p>
<p>However, modern detectors often use more advanced variants such as GIoU, DIoU, or CIoU. For example, CIoU includes not
only the overlap between boxes, but also the distance between their centers and their aspect-ratio consistency:</p>
\[
L_{\text{CIoU}} = 1 - IoU + \frac{\rho^2(b, \hat{b})}{c^2} + \alpha v
\]<p>where \(\rho^2(b, \hat{b})\) is the squared distance between the centers of the predicted and ground-truth boxes,
\(c^2\) is the squared diagonal length of the smallest enclosing box, and \(\alpha v\) penalizes differences in aspect
ratio.</p>
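<p>A direct translation of this formula into code might look like the following sketch (it reuses the iou() helper defined above; the 1e-9 term only avoids a division by zero):</p>
<div class="highlight"><pre tabindex="0"><code class="language-python" data-lang="python">import math

def ciou_loss(b, b_hat):
    """CIoU loss sketch for boxes given as (x1, y1, x2, y2)."""
    i = iou(b, b_hat)
    # squared distance between the box centers
    cx, cy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    cx_h, cy_h = (b_hat[0] + b_hat[2]) / 2, (b_hat[1] + b_hat[3]) / 2
    rho2 = (cx - cx_h) ** 2 + (cy - cy_h) ** 2
    # squared diagonal of the smallest enclosing box
    c2 = (max(b[2], b_hat[2]) - min(b[0], b_hat[0])) ** 2 \
       + (max(b[3], b_hat[3]) - min(b[1], b_hat[1])) ** 2
    # aspect-ratio consistency term
    w, h = b[2] - b[0], b[3] - b[1]
    w_h, h_h = b_hat[2] - b_hat[0], b_hat[3] - b_hat[1]
    v = (4 / math.pi ** 2) * (math.atan(w / h) - math.atan(w_h / h_h)) ** 2
    alpha = v / ((1 - i) + v + 1e-9)
    return 1 - i + rho2 / c2 + alpha * v
</code></pre></div>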
<h4 id="the-objectness-loss">The objectness loss</h4>
<p>The <strong>objectness loss</strong> teaches the model whether a prediction contains an object.
YOLO makes thousands of predictions per image, and most of them are background. Therefore, it is important to keep the negative predictions (potential <strong>false positives</strong>) from dominating the loss. How? By weighting negative predictions according to whether their scale contains a positive object. For example, a training implementation may use two different weights:</p>
<ul>
<li><code>neg_obj_weight_with_pos</code>: the weight applied to negative predictions in a scale where at least one positive object
exists.</li>
<li><code>neg_obj_weight_no_pos</code>: the weight applied to negative predictions in a scale where no positive object exists.</li>
</ul>
<p>This distinction is useful in multi-scale YOLO training. Suppose that an image contains a small object assigned to the
P3 scale, but no objects are assigned to P4 or P5. In that case, P3 contains both positive and negative samples, while
P4 and P5 contain only negative samples. If the loss gives too much weight to all negative predictions, <strong>the model may
learn to predict background everywhere and become too conservative</strong>. These weights help balance the objectness loss so
that negative examples are useful but do not dominate the training signal.</p>
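<p>A minimal PyTorch sketch of this weighting scheme (the function name and the default weight values are made up for illustration):</p>
<div class="highlight"><pre tabindex="0"><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

def objectness_loss(pred_logits, target, scale_has_positive,
                    neg_obj_weight_with_pos=0.5, neg_obj_weight_no_pos=0.1):
    """target is 1.0 for positive cells and 0.0 for background cells."""
    neg_w = neg_obj_weight_with_pos if scale_has_positive else neg_obj_weight_no_pos
    weights = torch.where(target &gt; 0, torch.ones_like(target),
                          torch.full_like(target, neg_w))
    return F.binary_cross_entropy_with_logits(pred_logits, target, weight=weights)
</code></pre></div>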
<h4 id="the-class-loss">The class loss</h4>
<p>The <strong>class loss</strong> teaches the model which class is present in a positive prediction. There are two common ways to
compute it. If each object belongs to exactly one class, the model can use a softmax activation followed by categorical
cross-entropy:</p>
\[
L_{\text{cls}} = - \sum_{c=1}^{C} y_c \log(\hat{p}_c)
\]<p>where \(y_c\) is the ground-truth class indicator and \(\hat{p}_c\) is the predicted probability for class \(c\).</p>
<p>However, many YOLO implementations use binary cross-entropy independently for each class:</p>
\[
L_{\text{cls}} = - \sum_{c=1}^{C}
\left[
y_c \log(\hat{p}_c) + (1-y_c)\log(1-\hat{p}_c)
\right]
\]<p>This formulation treats class prediction as \(C\) independent binary classification problems. It is especially useful
when multi-label classification is possible, although it is also commonly used in single-label YOLO detectors.</p>
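<p>To make the difference concrete, here is a small PyTorch sketch comparing the two options (the logits are made-up values):</p>
<div class="highlight"><pre tabindex="0"><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

logits = torch.tensor([[-1.2, 2.3, 0.1]])  # hypothetical scores for C = 3 classes
target = torch.tensor([[0.0, 1.0, 0.0]])   # ground truth: class 1

softmax_ce = F.cross_entropy(logits, target.argmax(dim=1))  # one class per object
bce = F.binary_cross_entropy_with_logits(logits, target)    # C binary problems
</code></pre></div>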
<blockquote>
<p>Note that each cell of the \(S \times S\) grid is assigned at most one positive prediction. Therefore, the model cannot learn to predict two objects in the same location.</p></blockquote>
<h3 id="data-augmentation">Data Augmentation</h3>
<p><strong>Data augmentation</strong> is another important part of YOLO training. Its goal is to expose the model to more visual
variation
without manually collecting more data. Common augmentations include random scaling, cropping, horizontal flipping, color
jittering, mosaic augmentation, and MixUp.</p>
<p><strong>MixUp</strong> combines two images and their labels into a single training example. The resulting image is a weighted
combination of both images:</p>
\[
\tilde{x} = \lambda x_1 + (1-\lambda)x_2
\]<p>where \(x_1\) and \(x_2\) are two training images, and \(\lambda\) controls how much each image contributes to the
final mixed image.</p>
<blockquote>
<p>MixUp is used for example in YOLOX models and it has been found to be more effective in larger models.</p></blockquote>
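<p>A minimal MixUp sketch for detection (the alpha default is a made-up value; \(\lambda\) is commonly drawn from a Beta distribution, and the boxes of both images are kept):</p>
<div class="highlight"><pre tabindex="0"><code class="language-python" data-lang="python">import numpy as np

def mixup(img1, boxes1, img2, boxes2, alpha=1.5):
    """Blend two images and concatenate their box lists."""
    lam = np.random.beta(alpha, alpha)
    mixed = lam * img1.astype(np.float32) + (1 - lam) * img2.astype(np.float32)
    return mixed, boxes1 + boxes2  # boxes given as plain Python lists
</code></pre></div>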
<h2 id="post-processing">Post-processing</h2>
<p>After the model produces its raw predictions, these outputs still need to be converted into final detections. A YOLO
model usually predicts many candidate boxes for the same object, many low-confidence boxes, and sometimes overlapping
detections from different scales. Post-processing transforms these dense predictions into a clean final set of bounding
boxes by applying confidence filtering, decoding the predicted coordinates, and removing duplicated detections with
non-maximum suppression.</p>
<h3 id="non-maximum-suppression-nms">Non-maximum suppression (NMS)</h3>
<p><strong>Non-Maximum Suppression (NMS)</strong> is used to remove duplicate detections in bounding box prediction. YOLO often predicts
many boxes around the same object. NMS keeps the strongest one and removes highly overlapping boxes.</p>
<p>The process is as follows:</p>
<ol>
<li>Sort the predictions by their confidence score.</li>
<li>For each class:
<ol>
<li>Keep the box with the highest confidence score.</li>
<li>Remove the other boxes whose IoU overlap with the kept box exceeds a threshold.</li>
</ol>
</li>
<li>Repeat until all boxes have been processed.</li>
</ol>
<blockquote>
<p><strong>NMS</strong> can be class-agnostic or not. If it is class-agnostic, then there is no need to consider the class of the
object.</p></blockquote>
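<p>A minimal single-class sketch of this procedure, reusing the iou() helper defined earlier:</p>
<div class="highlight"><pre tabindex="0"><code class="language-python" data-lang="python">def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS over one class; boxes given as (x1, y1, x2, y2)."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [k for k in order if iou(boxes[best], boxes[k]) &lt;= iou_threshold]
    return keep  # indices of the surviving boxes
</code></pre></div>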
<hr>
<p>D. Bhalla, “Region Proposal Network (RPN): A Complete Guide,” ListenData, Jun. 2022. [Online].
Available: <a href="https://www.listendata.com/2022/06/region-proposal-network.html" target="_blank" rel="noopener">https://www.listendata.com/2022/06/region-proposal-network.html</a></p>
]]></content:encoded></item></channel></rss>