Most tutorials on YOLO training show you how to train on a toy dataset with 200 images and a perfect annotation file that someone else prepared. Then they show you 85% accuracy and call it done.
This post does not do that.
Here you will learn how to build a real detection system with accuracy above 95%. You will understand what data you actually need, how much of it, why annotation quality matters more than annotation quantity, and what the difference between fine-tuning and training from scratch means in practice. At the end there is a complete industrial POC you can download and run today.
The business context throughout this post is factory floor safety monitoring. It is a real use case that Bithost has deployed in manufacturing facilities. The model detects faces, helmets, and safety vests in real time from a camera feed and logs every violation. But everything you read here applies equally to retail footfall analysis, hospital patient monitoring, construction site compliance, and any other domain where you need a model that recognises specific objects reliably.
1. What YOLO is and why it became the industry standard
YOLO stands for You Only Look Once. The name describes what makes it different from earlier detection approaches.
Before YOLO, the dominant approach was a two-stage pipeline. The first stage proposed regions of the image that might contain objects. The second stage classified each proposed region. This worked well but it was slow. Running at 7 frames per second is fine for analysing a photograph. It is useless for monitoring a factory floor in real time.
YOLO treats detection as a single regression problem. It looks at the entire image once, divides it into a grid, and predicts bounding boxes and class probabilities for every grid cell simultaneously. The first version ran at 45 frames per second on a GPU in 2016. The current version, YOLOv8 maintained by Ultralytics, runs at over 200 frames per second on modern hardware and achieves accuracy that was impossible even two years ago.
YOLOv8 is what this post uses. It is not the only option but it has the best combination of documentation, community support, ease of training, and deployment flexibility. The API is clean Python, the training loop handles augmentation automatically, and exporting to ONNX or TensorRT for production is a single function call.
YOLO family quick reference:

| Version | Notes |
| YOLOv5 | Older but battle-tested. Large community. Still used widely. |
| YOLOv8 | Current standard. Better accuracy than v5 at equal speed. |
| YOLOv9 | Newer architecture with better gradient flow. Slightly higher accuracy. |
| YOLOv10 | Very recent. No NMS post-processing step. Faster at inference. |
| RT-DETR | Transformer-based. Best accuracy available. Slowest inference. |
For most industrial projects, YOLOv8 medium (yolov8m) is the right choice. It gives you strong accuracy without requiring an A100 GPU. That is what the POC at the end of this post uses.
2. Training from scratch versus fine-tuning
This is the question most people get wrong in the planning stage of a project.
Training from scratch means initialising the model with random weights and letting it learn everything from your dataset. It requires very large amounts of labelled data — typically 50,000 or more images per class for a detection model — and takes days or weeks of GPU time. Almost no commercial project should do this.
Fine-tuning means starting from a model that was already trained on a large general dataset (COCO, in the case of YOLO's detection weights) and continuing the training process on your specific dataset. The model already knows how to detect edges, textures, shapes, and general object boundaries. You are teaching it to apply those skills to your specific classes in your specific environment.
Fine-tuning needs far less data (500 to 2,000 images per class is enough for most cases), takes hours rather than weeks, and consistently outperforms training from scratch at low data volumes. There is essentially no reason to train from scratch for a custom detection project.
What you will do in this post is fine-tune YOLOv8m on your own dataset. The base model weights download automatically the first time you call YOLO("yolov8m.pt").
3. What data you need — the honest answer
The data question is where most projects either succeed or fail before training begins. Here is the honest breakdown.
How many images
For a new class that YOLO has never seen before (something genuinely unusual that would not appear in COCO), you need a minimum of 300 annotated images to start seeing reasonable detection and roughly 1,000 to 2,000 to hit consistent performance above 90%.
For classes that are visually similar to things already in COCO (human faces, protective helmets, vests, vehicles, tools), the base model already has strong features. 200 to 500 high-quality annotated images per class can get you above 90% if the annotations are clean. Above 1,000 images per class with clean annotations, you should expect 93 to 97% mAP@0.5.
The number that matters most is not the total image count. It is the number of unique, diverse annotated instances of each class.
1,000 photos taken from the same angle in the same lighting conditions of the same person is far less valuable than 300 photos from varied angles, varied lighting conditions, varied distances, and varied people. Diversity is more important than volume.
What diversity means in practice
For a factory safety application, diversity means:
| Variable | What to capture |
| Lighting | Daylight, fluorescent, night shift, direct glare, backlit |
| Distance | Workers at 1m, 3m, 8m, 15m from camera |
| Angle | Front-facing, side profile, partial occlusion by machinery |
| Helmet types | Multiple helmet colours and brands used at your facility |
| Vest types | All vest colours used — yellow, orange, green, red |
| People | Diverse ages, heights, skin tones, and body types |
| Backgrounds | All zones of the actual facility you will deploy in |
If your validation set is captured under the same conditions as your training set, your validation accuracy will look good and your production accuracy will be much lower. This is the most common accuracy problem in deployed vision systems.
Data types and formats
YOLO expects images in JPEG or PNG format. There is no strong preference between the two. Avoid low-quality JPEG compression (quality below 80) because compression artefacts degrade the fine-grained features the model relies on for small object detection.
Label files are plain text files in YOLO format. Each label file has the same name as its corresponding image, with a .txt extension. Each line in the file describes one object:
class_id x_center y_center width height
The four coordinate values are normalised to the range 0.0 to 1.0 relative to the image dimensions; class_id is a zero-based integer index. This format is what every major annotation tool exports when you select YOLO format.
An example label file for an image containing one face and one helmet:
```
0 0.512 0.234 0.180 0.310
1 0.508 0.115 0.192 0.175
```
Class 0 is face. Class 1 is helmet. The face centre is at 51.2% across and 23.4% down the image, and the bounding box is 18% of image width by 31% of image height.
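To make the geometry concrete, here is a small helper (hypothetical, not part of the POC) that converts a YOLO label line back to pixel coordinates:

```python
def yolo_to_pixels(line: str, img_w: int, img_h: int):
    """Convert one YOLO label line to (class_id, x1, y1, x2, y2) in pixels."""
    class_id, cx, cy, w, h = line.split()
    cx, w = float(cx) * img_w, float(w) * img_w
    cy, h = float(cy) * img_h, float(h) * img_h
    x1 = cx - w / 2
    y1 = cy - h / 2
    return int(class_id), round(x1), round(y1), round(x1 + w), round(y1 + h)

# The face line from the example above, on a 1920x1080 image
print(yolo_to_pixels("0 0.512 0.234 0.180 0.310", 1920, 1080))
# → (0, 810, 85, 1156, 420)
```

Running it on the example label confirms the face box sits in the upper-middle of the frame, which is a quick sanity check worth doing on a few annotations before training.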
What about face detection specifically
Human face detection is one of the best-supported detection tasks in computer vision. You do not necessarily need to annotate thousands of your own face images from scratch. High-quality open datasets are available for free:
- WiderFace — 32,000 images with 400,000 face annotations across varied occlusion, pose, and scale conditions. This is the standard benchmark dataset for face detection.
- FDDB — 5,000 images with 10,000 face annotations. Older but clean.
- CelebA — 200,000 celebrity face images. Strong diversity of facial features.
- Open Images v7 — Google's dataset includes millions of annotated objects including faces. Free to download.
For a face-plus-PPE system, the practical approach is to start with WiderFace for the face class, annotate your own facility images for helmet and vest classes (since those are facility-specific), and combine them into a single dataset.
4. Data collection strategy for real projects
For the factory safety use case, here is the collection strategy that gives you the most value per image captured.
Phase 1 — Controlled capture (Day 1).
Set up a camera at each position where the final system will be installed. Capture 30-second video clips at different times of day — early morning, midday, evening shift. Ask workers to walk through naturally without posing. Extract frames at 1 frame per 3 seconds to avoid near-duplicate images. Target 50 to 100 images from this phase.
Phase 2 — Edge case collection (Week 1).
Look at what Phase 1 images contain and identify gaps. If you have no images of workers at the edge of frame, capture those. If all workers in Phase 1 wore helmets, find or stage a few images without helmets (for the no_helmet class). The model needs to see negative examples to learn that class. Target 100 to 200 additional images.
Phase 3 — Augmentation planning (before annotation).
Before annotating everything, look at your image set and determine which conditions are underrepresented. You do not necessarily need to capture more images — the training pipeline's augmentation settings can simulate some of this. But severe underrepresentation (for example, zero nighttime images if the system will run 24 hours) must be addressed with real data.
Phase 4 — Supplement with public data.
Download WiderFace for the face class. Convert its annotation format to YOLO format using the conversion script below. Mix these with your own images — aim for no more than 70% public data in your training set, or the model will optimise for the public dataset's conditions rather than yours.
Converting WiderFace annotations to YOLO format
"""
Convert WiderFace annotation format to YOLO format.
Run once after downloading the WiderFace dataset.
"""
from pathlib import Path
def convert_widerface_to_yolo(
wider_annotation_file: str,
image_root: str,
output_label_dir: str
):
"""
WiderFace format:
path/to/image.jpg
N (number of faces)
x1 y1 w h blur expr ill pose invalid ... (repeated N times)
YOLO format:
class_id cx cy w h (normalised 0 to 1)
"""
out_dir = Path(output_label_dir)
out_dir.mkdir(parents=True, exist_ok=True)
with open(wider_annotation_file) as f:
lines = f.read().splitlines()
i = 0
converted = 0
while i < len(lines):
# Image path line
img_rel_path = lines[i].strip()
i += 1
# Read image size to normalise coordinates
from PIL import Image
img_path = Path(image_root) / img_rel_path
if not img_path.exists():
i += 1 # skip count line
continue
img = Image.open(img_path)
iw, ih = img.size
num_faces = int(lines[i])
i += 1
yolo_lines = []
for _ in range(max(num_faces, 1)):
parts = lines[i].split()
i += 1
if num_faces == 0:
break
x1, y1, w, h = int(parts[0]), int(parts[1]), int(parts[2]), int(parts[3])
invalid = int(parts[7]) if len(parts) > 7 else 0
if invalid or w <= 0 or h <= 0:
continue
cx = (x1 + w / 2.0) / iw
cy = (y1 + h / 2.0) / ih
nw = w / iw
nh = h / ih
# Clamp to valid range
cx = max(0.0, min(1.0, cx))
cy = max(0.0, min(1.0, cy))
nw = max(0.0, min(1.0, nw))
nh = max(0.0, min(1.0, nh))
yolo_lines.append(f"0 {cx:.6f} {cy:.6f} {nw:.6f} {nh:.6f}")
if yolo_lines:
label_path = out_dir / (Path(img_rel_path).stem + ".txt")
label_path.write_text("\n".join(yolo_lines))
converted += 1
print(f"Converted {converted} images to YOLO format.")
5. Annotation — the step that determines your ceiling
Your model accuracy cannot exceed the quality of your annotations. This is not a figure of speech. If your annotators draw boxes that cut off 20% of a face, the model will learn to predict boxes that cut off 20% of a face. If your annotators mark some helmet violations as helmets, the model will learn to ignore some violations. Garbage annotations produce garbage models regardless of how much compute time you throw at the training process.
Annotation tools worth using
Roboflow is the fastest option for teams new to annotation. You upload your images, draw bounding boxes in the browser, and export in YOLO format with one click. The free tier handles up to 10,000 source images. It also provides dataset health statistics and duplicate detection.
Label Studio is an open-source annotation tool you can self-host. Better for teams with data privacy requirements who cannot upload images to a third-party service. More setup required but all data stays on your infrastructure.
CVAT (Computer Vision Annotation Tool) is an open-source tool originally developed at Intel and built for annotation at scale. It supports video annotation (labelling directly on video rather than extracted frames), which saves significant time for camera-based systems.
Annotation guidelines for high accuracy
Write a one-page annotation guide before your annotators start. These are the rules that matter most.
Tight boxes, not loose boxes. The bounding box should touch the edges of the object. A box that includes 30% background around a face gives the model an incorrect spatial relationship to learn from. Train your annotators to pull the box edges flush with the object boundary.
Partial occlusion has rules. When a face is 50% occluded by machinery or another person, still annotate it if you can identify it as a face. When it is more than 80% occluded, skip it. Annotating heavily occluded objects creates contradictory training signal because the visible portion rarely generalises.
Annotate every instance. A missed annotation is as harmful as a wrong annotation. If there are five workers in a frame and an annotator labels four of them, the model learns that the fifth person is background. It will consistently miss detections in the conditions where that fifth person appeared.
Crowd distance cutoff. When workers are very distant and appear smaller than 10×10 pixels in the original image, skip them. The model cannot learn useful features from objects that small. Set a minimum size rule — for a 1080p image, a face must be at least 20×20 pixels to annotate.
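The minimum size rule can be enforced mechanically before training. This is a sketch (the function name and defaults are illustrative) that flags any box smaller than the cutoff at a given image resolution:

```python
from pathlib import Path


def find_tiny_boxes(label_dir: str, img_w: int, img_h: int, min_px: int = 20):
    """Return (filename, line_number) pairs for boxes under min_px on either side."""
    flagged = []
    for txt in sorted(Path(label_dir).glob("*.txt")):
        for n, line in enumerate(txt.read_text().splitlines(), start=1):
            parts = line.split()
            if len(parts) != 5:
                continue  # skip malformed lines
            w_px = float(parts[3]) * img_w   # normalised width -> pixels
            h_px = float(parts[4]) * img_h   # normalised height -> pixels
            if w_px < min_px or h_px < min_px:
                flagged.append((txt.name, n))
    return flagged
```

Run it over the label folder and either delete the flagged lines or send those images back for re-review.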
Annotation quality checks
After annotation, run a sample review. Take 10% of your annotated images, have a second person review the boxes, and count disagreements. If disagreement rate on box tightness is above 20%, run a calibration session with your annotators before continuing.
Roboflow and CVAT both have annotation review workflows built in. Use them.
6. Setting up your environment
Hardware requirements
| Scenario | Hardware | Training time (100 epochs, 1,000 images) |
| Development machine | NVIDIA RTX 3060 or better | 2 to 4 hours |
| Good training setup | NVIDIA RTX 4090 or A10G | 40 to 90 minutes |
| Fast training setup | NVIDIA A100 or H100 | 15 to 30 minutes |
| No GPU available | CPU only | 20 to 40 hours (not recommended) |
For a first project, the most practical option is renting a GPU instance from AWS, Google Cloud, or Paperspace. An NVIDIA A10G on AWS g5.xlarge costs roughly $1.00 per hour. A 100-epoch training run costs $1.00 to $2.00 total. That is the most cost-effective path if you do not already own a GPU.
Python environment setup
```bash
# Create a clean virtual environment
python3 -m venv yolo_env
source yolo_env/bin/activate    # Linux and Mac
yolo_env\Scripts\activate       # Windows

# Install YOLOv8 (this installs PyTorch as a dependency)
pip install ultralytics

# Verify the installation and check CUDA availability
python3 -c "
from ultralytics import YOLO
import torch

print('Ultralytics installed correctly')
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
    print('VRAM:', round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1), 'GB')
"
```
If CUDA is available and PyTorch recognises your GPU, you are ready to train. If CUDA is not available and you are on a machine that has a GPU, reinstall PyTorch with the CUDA version matching your driver:
```bash
# Find your CUDA version first
nvidia-smi

# Install PyTorch with CUDA 12.1 support (adjust version as needed)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
```
7. Dataset structure and the YAML config
YOLOv8 expects a specific folder structure. The prepare_data.py script in the POC zip creates this automatically, but it is useful to understand the layout.
```
data/
  images/
    train/   image0001.jpg  image0002.jpg  ...
    val/     image0501.jpg  image0502.jpg  ...
    test/    image0601.jpg  image0602.jpg  ...
  labels/
    train/   image0001.txt  image0002.txt  ...
    val/     image0501.txt  image0502.txt  ...
    test/    image0601.txt  image0602.txt  ...
```
Every image in images/train/ must have a corresponding .txt file in labels/train/ with the exact same filename (different extension). Images with no objects get an empty .txt file — these are called background images and they teach the model to not fire false positives on empty scenes.
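The prepare_data.py script in the POC handles this, but a minimal pairing check looks like the following sketch. Note that it creates empty label files for unlabelled images, which is only correct if those images are genuinely background; the function name is illustrative:

```python
from pathlib import Path


def verify_pairing(data_root: str, split: str = "train"):
    """Check that every image has a label file.

    Creates an empty label (a background image) for any image without one
    and returns the list of image names that were missing labels."""
    img_dir = Path(data_root) / "images" / split
    lbl_dir = Path(data_root) / "labels" / split
    lbl_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for img in sorted(img_dir.glob("*")):
        if img.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        label = lbl_dir / (img.stem + ".txt")
        if not label.exists():
            label.touch()   # empty label file = background image
            missing.append(img.name)
    return missing
```

If the returned list is long, stop and investigate: those images may simply never have been annotated.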
The dataset YAML file tells YOLOv8 where to find the data and what your classes are:
```yaml
# configs/dataset.yaml
path: ../data        # Root folder
train: images/train
val: images/val
test: images/test

nc: 5                # Number of classes
names:
  0: face
  1: helmet
  2: no_helmet
  3: safety_vest
  4: no_vest
```
The class order in the names section must exactly match the integer class IDs in your .txt label files. If class 0 in your label files is helmet but your YAML says class 0 is face, the model will learn backwards and every face will be predicted as a helmet.
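A cheap sanity check before training is to scan the label files for class IDs outside the range declared in the YAML. A sketch (the function name is illustrative):

```python
from pathlib import Path


def check_class_ids(label_dir: str, nc: int):
    """Return the set of class IDs found in label files outside the range 0..nc-1."""
    bad = set()
    for txt in Path(label_dir).glob("*.txt"):
        for line in txt.read_text().splitlines():
            parts = line.split()
            if parts:
                cid = int(parts[0])
                if not 0 <= cid < nc:
                    bad.add(cid)
    return bad
```

An empty set means the IDs are at least in range; it cannot catch swapped class names, so spot-check a few annotated images visually as well.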
Split ratios
A standard split for a 1,000-image dataset:
- Train: 750 images (75%). The model learns from these.
- Val: 150 images (15%). Evaluated after every epoch to track progress.
- Test: 100 images (10%). Evaluated once at the very end for your final accuracy number.
The test set must be completely unseen during training and validation. Never use your test set to make decisions about hyperparameters. If you do, your test accuracy will be optimistic and your production accuracy will disappoint you.
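The POC's prepare_data.py performs the split; a deterministic version of the same logic looks roughly like this (function name and seed are illustrative):

```python
import random


def split_dataset(filenames, seed=42, ratios=(0.75, 0.15, 0.10)):
    """Deterministically shuffle file names and split into train/val/test."""
    names = sorted(filenames)          # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)
    n_train = round(len(names) * ratios[0])
    n_val = round(len(names) * ratios[1])
    return {
        "train": names[:n_train],
        "val": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],
    }
```

Fixing the seed means the same images land in the test set on every run, which is what keeps the test set genuinely unseen across retraining cycles.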
8. Fine-tuning YOLOv8 step by step
With data prepared and the YAML config in place, fine-tuning is three lines of Python:
```python
from ultralytics import YOLO

model = YOLO("yolov8m.pt")  # Downloads base weights automatically
results = model.train(
    data="configs/dataset.yaml",
    epochs=100,
)
```
That is the minimum. In practice, you want control over the training settings to get above 95% accuracy. The train.py in the POC zip shows the full configuration with every parameter documented. Here are the parameters that matter most.
Model size selection
YOLO("yolov8n.pt") # Nano — 3.2M parameters Fast on any hardware
YOLO("yolov8s.pt") # Small — 11.2M parameters Good balance
YOLO("yolov8m.pt") # Medium — 25.9M parameters Best for industrial use
YOLO("yolov8l.pt") # Large — 43.7M parameters High accuracy, slow
YOLO("yolov8x.pt") # XL — 68.2M parameters Highest accuracy
For a factory floor camera running at 1080p, yolov8m.pt runs at 35 to 50 FPS on an RTX 3060. That is fast enough for real-time monitoring. yolov8n.pt runs at 120 FPS but loses meaningful accuracy on small objects like distant faces.
Key training parameters explained
```python
model.train(
    data="configs/dataset.yaml",
    epochs=100,
    imgsz=640,           # Resize all images to 640x640 during training
    batch=16,            # Reduce to 8 if you hit GPU memory errors
    device="0",          # GPU index. Use "cpu" if no GPU available.
    optimizer="AdamW",   # Better convergence than SGD for fine-tuning
    lr0=0.001,           # Learning rate at epoch 0
    lrf=0.01,            # Learning rate at final epoch = lr0 * lrf
    patience=20,         # Stop early if no improvement for 20 epochs
    save_period=10,      # Save a checkpoint every 10 epochs
    plots=True,          # Generate accuracy and loss plots

    # Augmentation — the single biggest lever for accuracy above 90%
    mosaic=1.0,          # Combines 4 training images into one. Teaches the
                         # model to handle small and partial objects.
    mixup=0.15,          # Blends two images with random weights.
    copy_paste=0.1,      # Copies objects from one image into another.
    flipud=0.1,          # Random vertical flip (realistic for ceiling cameras)
    degrees=10.0,        # Random rotation up to 10 degrees
    scale=0.5,           # Random scale change up to 50%
    hsv_v=0.4,           # Random brightness change (handles lighting variation)
    hsv_s=0.7,           # Random saturation change
)
```
Image size and small object detection
The default imgsz=640 works well for objects that take up at least 3 to 5% of the image area. For detecting faces at distance (workers 10 to 15 metres from a camera, where faces appear as 20 to 40 pixel regions in a 1080p image), train with imgsz=1280 instead. This roughly doubles training time and GPU memory usage but significantly improves small object recall.
```python
# For systems with distant subjects
model.train(
    data="configs/dataset.yaml",
    epochs=100,
    imgsz=1280,
    batch=4,  # Reduce batch size because larger images need more VRAM
)
```
9. Reading your training results and knowing when to stop
After each training run, YOLOv8 saves a results.png plot in your output folder. Understanding this plot is how you diagnose training problems before spending more GPU time.
The metrics you are watching
mAP@0.5 (mean Average Precision at IoU 0.5)
The primary accuracy metric. This measures how accurately the model predicts both the location and the class of objects, requiring at least 50% overlap between the predicted box and the ground truth box. A production system targeting above 95% accuracy should reach mAP@0.5 above 0.94 to 0.96 on the validation set.
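The 50% overlap requirement is a statement about intersection-over-union. A minimal IoU function makes the criterion concrete:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# A prediction shifted 20 px from a 100x100 ground-truth box still overlaps well
print(iou((0, 0, 100, 100), (20, 0, 120, 100)))  # → 0.6666666666666666
```

A predicted box counts as a true positive at mAP@0.5 if its IoU with a ground-truth box of the same class is at least 0.5, so the shifted box above would still count as a hit.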
mAP@0.5:0.95
A stricter version that averages accuracy across multiple IoU thresholds from 0.5 to 0.95. This is harder to achieve. For face detection, aim for above 0.75. For safety equipment detection, above 0.70 is solid.
Precision
Of all the detections the model made, what fraction were correct? High precision means few false positives. In the safety context, high precision means the system does not alarm when workers are wearing their equipment correctly.
Recall
Of all the real objects in the images, what fraction did the model find? High recall means few missed detections. For safety monitoring, recall matters more than precision. A missed helmet violation is more dangerous than a false alarm.
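Both metrics reduce to simple ratios over true positives (TP), false positives (FP), and false negatives (FN). A tiny helper for intuition:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 90 correct detections, 5 false alarms, 10 missed objects:
# precision ≈ 0.947, recall = 0.9
print(precision_recall(90, 5, 10))
```

In the safety context, the 10 missed objects in this example are missed violations, which is why recall is the number to push up first.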
What the loss curves tell you
| Observation | Interpretation |
| Box loss decreasing smoothly | Good. Model is learning object locations. |
| Cls loss decreasing smoothly | Good. Model is learning class distinctions. |
| DFL loss decreasing smoothly | Good. Model is learning distribution refinement. |
| Loss plateaus early (epoch 20) | Learning rate may be too low. The model stopped learning. |
| Loss spikes then recovers | Normal during augmentation. Not a problem. |
| Validation loss rises while training loss keeps falling | Classic overfitting. The training set is memorised. Reduce epochs or increase regularisation. |
| Both losses stop improving but accuracy is below target | You have hit the data ceiling. More and better annotated data is needed. |
When to stop training
YOLOv8's patience=20 setting handles early stopping automatically. If the validation mAP does not improve for 20 consecutive epochs, training stops and saves the best checkpoint found so far.
If you reach epoch 100 and mAP is still climbing slowly, extend to 150 or 200 epochs. Many industrial datasets have complex enough variation that the model continues to improve past 100 epochs.
If mAP plateaus at 80% and does not budge with additional epochs, the problem is almost certainly the data, not the training settings. Go back to Section 3 and add more diverse annotated images.
10. How to push accuracy above 95%
Hitting 95% mAP@0.5 is achievable but it requires addressing five specific things in order. Do not skip to step 5.
Step 1 — Fix your annotations
Run a review of your annotation files before doing anything else. Look for:
- Boxes that are too loose (leave significant background inside the box)
- Objects that were missed entirely
- Incorrect class labels (a vest labelled as a helmet)
- Very small objects annotated at less than 10×10 pixels (remove these)
A 10% improvement in annotation quality typically produces a 3 to 6 point improvement in mAP with no other changes.
Step 2 — Increase dataset diversity, not just size
If you have 500 images from one shift and one lighting condition, adding 500 more from the same shift and lighting produces diminishing returns. Adding 100 images from night shift, 100 from a different camera angle, and 50 images with partial occlusion will outperform doubling your identical-condition dataset.
Step 3 — Train longer with the right augmentation
Enable mosaic, mixup, and copy-paste augmentation. These are the three augmentation methods with the strongest empirical impact on small-object detection accuracy. Mosaic is on by default in YOLOv8, but mixup and copy_paste default to 0.0 and must be set explicitly:
```python
model.train(
    mosaic=1.0,
    mixup=0.15,
    copy_paste=0.1,
    epochs=150,  # Train longer with strong augmentation
)
```
Step 4 — Tune the confidence threshold for your use case
The default confidence threshold for inference is 0.25. A model reporting 91% mAP@0.5 at threshold 0.25 might report 95% precision at threshold 0.45 with acceptable recall for your use case. Run a threshold sweep on your validation set:
```python
from ultralytics import YOLO
import numpy as np

model = YOLO("models/runs/safety_detector/weights/best.pt")

# Evaluate at multiple confidence thresholds
for conf in np.arange(0.25, 0.75, 0.05):
    metrics = model.val(conf=conf, iou=0.45, verbose=False)
    print(
        f"conf={conf:.2f} "
        f"mAP50={metrics.box.map50:.4f} "
        f"P={metrics.box.mp:.4f} "
        f"R={metrics.box.mr:.4f}"
    )
```
Choose the threshold where precision meets recall at the point your application needs. For safety monitoring, favour recall. For a counting application, favour precision.
Step 5 — Use test-time augmentation for final evaluation
Test-time augmentation (TTA) runs inference on multiple augmented versions of the same input image and averages the predictions. It adds latency but improves accuracy by 1 to 3 mAP points. For a final accuracy report or a non-real-time batch processing job, enable it:
```python
metrics = model.val(augment=True)  # Enable TTA during validation
```
TTA is not suitable for real-time camera feeds because of the added latency. Use it for the final accuracy measurement and for batch processing offline video.
Industrial POC — factory safety monitor
The POC in the download zip is a complete, runnable safety detection system. It does not require a trained model to explore the code — you can swap in any YOLOv8 .pt file, including the pre-trained yolov8m.pt itself, to see the inference pipeline running on your camera.
What the system does
Every camera feed is processed in real time. Each frame runs through the YOLO model and returns bounding boxes with class labels. The system draws annotated boxes on screen, maintains a live HUD with FPS and violation count, and writes every safety violation to a JSON Lines log file.
The log file format is designed for integration with existing ERP or SCADA systems. Each line is a valid JSON object:
{"timestamp": "2026-02-14T09:23:11.442Z", "camera_id": "CAM-01", "violation": "no_helmet", "confidence": 0.871, "session_total": 3}
Any system that can read a file or tail a log can consume this output. Connecting it to a Slack alert, an email trigger, or a database insert is a standard automation task.
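As a sketch of that integration, the following function (hypothetical, not part of the POC) reads any new complete lines from the log and returns them as parsed events, so a polling loop can hand them to an alerting hook:

```python
import json
from pathlib import Path


def read_new_events(log_path: str, offset: int = 0):
    """Read complete JSON lines written after `offset`; return (events, new_offset)."""
    path = Path(log_path)
    events = []
    if not path.exists():
        return events, offset
    with path.open() as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line:
                break
            if not line.endswith("\n"):
                break  # partially written line; pick it up on the next poll
            events.append(json.loads(line))
            offset = f.tell()
    return events, offset


# Polling loop sketch: check the log once per second and print helmet violations
# while True:
#     new, offset = read_new_events("violation.jsonl", offset)
#     for event in new:
#         print(f"[{event['camera_id']}] {event['violation']}")
#     time.sleep(1.0)
```

Carrying the offset between polls means each violation is processed exactly once, even if the detector and the consumer restart independently (persist the offset to disk for that case).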
Project structure
```
poc/
  src/
    detect.py          Main inference script
    train.py           Fine-tune YOLOv8 on your dataset
    prepare_data.py    Split and verify annotated data
  configs/
    dataset.yaml       Dataset paths and class names
  requirements.txt
```
Running the system
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Run with your webcam (source 0) and a trained model
python src/detect.py \
    --model models/safety_yolov8.pt \
    --source 0 \
    --conf 0.50 \
    --camera CAM-ENTRANCE-01

# Run on a video file instead of a live camera
python src/detect.py \
    --model models/safety_yolov8.pt \
    --source videos/factory_floor_recording.mp4

# Run on an IP camera RTSP stream
python src/detect.py \
    --model models/safety_yolov8.pt \
    --source "rtsp://admin:password@192.168.1.100:554/stream"
```
Running the full training pipeline
```bash
# Step 1 — Prepare your dataset
python src/prepare_data.py \
    --raw_images data/raw/images \
    --raw_labels data/raw/labels \
    --output data

# Step 2 — Verify the split looks correct
python src/prepare_data.py \
    --raw_images data/raw/images \
    --raw_labels data/raw/labels \
    --verify

# Step 3 — Train
python src/train.py \
    --data configs/dataset.yaml \
    --model yolov8m.pt \
    --epochs 100 \
    --batch 16 \
    --device 0

# Step 4 — Validate the trained model
python src/train.py \
    --data configs/dataset.yaml \
    --validate

# Step 5 — Run inference with the trained model
python src/detect.py \
    --model models/runs/safety_detector/weights/best.pt \
    --source 0
```
What a real deployment looks like
The POC above runs on a laptop or workstation. A production deployment on a factory floor typically looks like this:
```
IP cameras (RTSP) ──→ NVIDIA Jetson AGX Orin (on-site edge device)
                                ↓
                  Python inference service (systemd daemon)
                                ↓
                      violation.jsonl log file
                                ↓
                  ┌─────────────────────────┐
                  │      Local network      │
                  │    REST API or MQTT     │
                  └─────────────────────────┘
                                ↓
                  Factory MES / SCADA system
                  Supervisor Slack alerts
                  Daily violation report email
```
The Jetson AGX Orin runs the ONNX-exported model via TensorRT, which gives 4 to 6× faster inference than the PyTorch model on the same hardware. Export is one line:
model = YOLO("models/safety_yolov8.pt")
model.export(format="tensorrt", imgsz=640, half=True)
# Generates safety_yolov8.engine — load this with YOLO("safety_yolov8.engine")
Deploying to production
Inference on a trained model is simpler than training but there are decisions to make that affect reliability, latency, and cost.
Choosing your inference runtime
```python
# PyTorch (.pt) — simplest, use for development and testing
model = YOLO("best.pt")

# ONNX — use for cross-platform production deployment
model = YOLO("best.onnx")

# TensorRT (.engine) — use for NVIDIA GPU production deployment
# Gives 3 to 5× speedup versus PyTorch on the same GPU
model = YOLO("best.engine")
```
Running multiple cameras
For a facility with 8 to 16 cameras, the most practical architecture is one Python process per camera running on the same machine, with each process writing to its own log file. A single NVIDIA RTX 3090 can handle 8 simultaneous camera streams at 1080p and 30 FPS with headroom to spare.
```python
import multiprocessing

from src.detect import SafetyDetector


def run_camera(camera_config: dict):
    detector = SafetyDetector(
        model_path=camera_config["model"],
        camera_id=camera_config["id"],
    )
    detector.run(camera_config["source"])


if __name__ == "__main__":
    cameras = [
        {"id": "CAM-01", "source": "rtsp://192.168.1.101/stream", "model": "models/best.pt"},
        {"id": "CAM-02", "source": "rtsp://192.168.1.102/stream", "model": "models/best.pt"},
        {"id": "CAM-03", "source": "rtsp://192.168.1.103/stream", "model": "models/best.pt"},
    ]
    with multiprocessing.Pool(len(cameras)) as pool:
        pool.map(run_camera, cameras)
```
Handling model updates without downtime
When you retrain with new data, you want to update the model without stopping the camera feeds. The cleanest approach is blue-green deployment at the process level: start the new inference processes pointed at the new model weights, verify they are running correctly, then stop the old processes. All of this can be scripted in a 20-line shell script and run as a cron job or triggered by your CI pipeline when a new best.pt is uploaded to S3.
How Bithost can help
Getting a YOLO model to 75% accuracy is straightforward. Getting it to 95% in a real production environment involves solving a chain of problems that are not covered in most tutorials: annotation quality at scale, dealing with lighting variation across shifts, handling edge cases that only appear after the system goes live, and integrating the detection output into the systems your operations team already uses.
Bithost has deployed computer vision systems in manufacturing facilities, logistics warehouses, and construction sites across India. The pattern is the same across all of them: data collection and annotation take three to four times longer than the teams expected, training takes less time than they feared, and production integration takes longer than both if it is not planned from the start.
What we offer for computer vision projects:
Scoping and feasibility assessment. Before you spend time collecting data, we will tell you whether the accuracy you need is achievable with the budget you have, and what data strategy will get you there fastest. This is a one-week engagement and is often the most valuable step in the whole project.
Annotation pipeline setup. We set up Roboflow or CVAT with your annotation guidelines, train your annotators, and implement a quality review process. We have seen too many projects fail because annotation quality was treated as an afterthought.
Model training and optimisation. We handle the training loop, hyperparameter tuning, augmentation strategy, and the threshold calibration needed to hit your accuracy target. We also handle edge cases: what the model does at night, in rain, with new PPE types not in the original dataset.
Edge deployment. If your facility needs on-site processing rather than cloud inference (for latency, data privacy, or connectivity reasons), we deploy to NVIDIA Jetson or other edge hardware and set up the RTSP ingestion, inference service, and alerting integration.
Sovereign AI deployment. If your facility handles sensitive operations or operates in a regulated sector, we deploy the entire stack on your own infrastructure with no data leaving your network. The violation logs, camera feeds, and model weights never touch a third-party server.
The right time to talk to us is before you start annotating, not after you have hit a wall on accuracy.
Email: sales@bithost.in
Visit: bithost.in/ai-integration-service
We respond within 48 hours.