Playing with surveillance cameras, part 2

(If you need context, check out Part 1 of this adventure.)

Generate highlight videos

It is humanly impossible to review 24x7 footage of surveillance cameras. On the other hand, if nobody ever takes a look, the whole thing is pointless. But technology can lend a hand by generating highlight videos based on motion detection techniques. I have tried DVR-Scan with very good results for a no-ML solution.

Here goes my "scanner" script. If you plan to use it, get acquainted with DVR-Scan parameters and adjust for your particular needs.

Video processing is a CPU hog, and the load is proportional to the product FPS x resolution. Don't even think about running DVR-Scan on a Raspberry Pi, unless you want to splurge and buy one Pi per camera. An old PC may be enough for a handful of cameras, if a) footage is judicious in FPS and resolution, and b) DVR-Scan is correctly configured given the limited processing power.

I use an old mini-PC and it manages to do 80-90 FPS on HD videos (1MP), enough for my current setup. The trick is to use the parameter -df 3, which downscales each frame by a factor of 3 before analysis. If you go in the opposite direction and purchase a big-box PC in order to support high-res, high-FPS cameras, get an NVIDIA GPU and invest some time to make DVR-Scan work with CUDA.
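To get a feel for why -df 3 helps so much, here is a back-of-the-envelope sketch. It assumes the analysis cost scales with pixels per second, and that the downscale factor divides both width and height. The 1280x720 frame size and the 15 FPS figure are illustrative numbers only.

```python
def pixel_rate(width, height, fps, downscale=1):
    """Pixels per second the motion detector has to inspect."""
    # The factor shrinks both dimensions, so the pixel count
    # drops by roughly the square of the factor.
    return (width // downscale) * (height // downscale) * fps

full = pixel_rate(1280, 720, 15)
reduced = pixel_rate(1280, 720, 15, downscale=3)

print(f"full resolution: {full / 1e6:.1f} Mpx/s")
print(f"with -df 3:      {reduced / 1e6:.1f} Mpx/s")
print(f"reduction:       {full / reduced:.1f}x")
```

The reduction is roughly the square of the factor, which is why a small value already makes a big difference.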

The sensitivity level will take some experimentation, and the optimal level differs per camera location. Start with the default (0.15), which is pretty sensitive, and increase it slowly until results are best. In my own setup, every spot has its own value, between 0.2 and 0.4.

The easiest way to play with DVR-Scan is using the mode -m opencv -o somefile.avi which saves all motion events in a single file. This has three drawbacks: a) only the .AVI format is supported (which cannot be played directly by a browser); b) OpenCV transcodes the video to XviD (it is fast but is also not supported by any browser); c) the final bitrate may be higher than the original, which nullifies our efforts to control the bitrate of the original footage.

At first I transcoded the .AVI output from DVR-Scan to .MP4. But the double transcoding added mystery defects and artifacts. A more effective solution was to use the mode -m copy, which simply splices from the original stream and generates a separate .MP4 for each event. Then, concatenate them using ffmpeg -c copy. This way, the final deliverable is almost perfect: same bitrate as the original, can be played in a browser, no CPU cycles spent on transcoding, no artifacts.
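The concatenation step can be sketched as follows, using ffmpeg's concat demuxer, which reads a text file with one "file" line per clip. This is a minimal sketch, not my actual script: the function names are made up for illustration, and it assumes the event file names sort in chronological order.

```python
def concat_list(event_files):
    """Build the text expected by ffmpeg's concat demuxer."""
    lines = []
    for name in sorted(str(f) for f in event_files):
        # A single quote inside a file name must be written as '\''
        escaped = name.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"

def concat_command(list_path, output_path):
    """ffmpeg command that joins the listed clips without transcoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", str(list_path), "-c", "copy", str(output_path)]

print(concat_list(["cam1-0002.mp4", "cam1-0001.mp4"]), end="")
print(" ".join(concat_command("events.txt", "highlights.mp4")))
```

Write the output of concat_list() to a file, then hand the command to subprocess.run(). The -safe 0 switch is there so that absolute paths in the list are accepted.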

A drawback of -m copy is the splice "precision". A splice must start on an I-frame. When the events are concatenated, the camera clock sometimes "goes backwards" when two events are too close in time. But there is a more insidious problem.

For example, given an I-frame interval of 3s, and given an event starting at 0:45, the splice may start anywhere between 0:42 and 0:45. (DVR-Scan may start too soon, but will never start too late.) Even with -tb 0, the splice will be 1.5s early on average. But DVR-Scan does not compensate for this at the other end. The splice will terminate too early, by up to 3s, 1.5s on average. Depending on the -tp parameter, the splice may not contain the event of interest at all! Short-duration events are particularly at risk.

The workaround is to make -tp equal to the I-frame interval (3s in my case) to make sure the splice always contains the event of interest. Of course, this will add up to 3s of extra footage per event. Another mitigation is to increase the I-frame rate (that is, shorten the interval between I-frames), reducing the error range at the cost of a higher bitrate. If neither solution is acceptable, you can resort to -m ffmpeg and bear the associated costs of transcoding.
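The arithmetic of the splice problem can be illustrated with a toy model. This is only a simplified model of the behavior described above, not DVR-Scan's actual code: it assumes the cut snaps back to the previous I-frame and that the same shift is lost at the tail. The function name and the sample numbers are hypothetical.

```python
import math

def copy_mode_splice(event_start, event_end, iframe_interval, time_post=0.0):
    """Return (start, end) of the splice, in seconds, under the toy model."""
    # The cut can only begin on the I-frame at or before the event start
    start = math.floor(event_start / iframe_interval) * iframe_interval
    shift = event_start - start
    # The early start is not compensated, so the tail moves early too
    end = event_end + time_post - shift
    return start, end

# A one-second event at 44s..45s, I-frame every 3s
print(copy_mode_splice(44.0, 45.0, 3.0))                 # no padding
print(copy_mode_splice(44.0, 45.0, 3.0, time_post=3.0))  # -tp = I-frame interval
```

Without padding, the modeled splice ends before the event even begins. With the padding equal to the I-frame interval, the end of the splice can never fall before the end of the event, since the shift is always smaller than the interval.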

Even with optimal sensitivity adjustment, highlight videos generated by DVR-Scan will contain mostly false positives and are quite boring to watch. You will use 2x or 4x playback all the time. You can bake this into the video using ffmpeg by changing the nominal FPS without discarding any footage. For example, a 5 FPS video can be marked as 20 FPS, and interesting parts can be played at 0.5x or 0.25x to restore the normal speed, without any loss.
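One possible way to bake the speed-up in without re-encoding is ffmpeg's -itsscale input option, which rescales the input timestamps while the frames themselves are copied untouched. The sketch below only builds the command line. It is an assumption that this recipe suits your footage, so test it on a sample first; the function name is hypothetical, and audio is dropped (-an) because it would no longer match the video.

```python
def speedup_command(src, dst, factor):
    """ffmpeg command that plays every frame 'factor' times faster."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Timestamps are multiplied by 1/factor; no frame is discarded
    scale = f"{1.0 / factor:g}"
    return ["ffmpeg", "-itsscale", scale, "-i", str(src),
            "-c", "copy", "-an", str(dst)]

# 5 FPS footage played 4x faster behaves like a 20 FPS video
print(" ".join(speedup_command("highlights.mp4", "highlights-4x.mp4", 4)))
```

Note that -itsscale must come before the -i it applies to, since it is an input option.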

Watch highlight videos

At first, we published videos in a Web site with autoindex. Now they go to an S3 bucket published as a static website. Since the highlights are .MP4, the browser can play them back without any backend page or media service. It is crude, but works for us. The wife is not complaining; this is a sure sign it is good enough :)

Cloud mass storage like EBS (the "disk" of an EC2 instance) is expensive, so the cost-conscious options are a) an edge server or DVR, accessible via reverse NAT or reverse proxy; or b) go serverless using S3. S3 does charge by storage and by traffic, but the rates are pretty low.

Artificial intelligence

DVR-Scan generates many false positives, to the point of being unbearable sometimes. The IR illumination attracts a lot of insects at night, and nights with windy fog will drive DVR-Scan crazy. The highlight videos will contain almost 100% of the original footage.

So, the next logical step is to try to improve event detection using AI/ML. I have been using Ultralytics YOLO with very promising results. The thing really works, and reaches a decent FPS rate even without a GPU. Of course it is much slower than DVR-Scan, but it is a 20x difference, not 200x or 500x as one would expect.

The highest upfront costs of ML are a) getting a good library (in this case, an image library) to train the network, and b) the CPU/GPU effort of the training process. But YOLO offers, free of charge, some pre-trained models that recognize almost a hundred commonplace objects (people, pets, cars, fruits, etc.). These models are more than enough for amateur use.

Moreover, YOLO offers the tools to create your own model. One of the virtues of YOLO is that it needs far fewer samples than other frameworks. A friend of mine asked the maid to take pictures of every dog poo, in order to create a Roomba-style robot dedicated to collecting the poos. (I find it more effective not to have dogs at all, but that's me.)

I feel YOLO is not precise enough to replace an alarm system (perhaps there are other models out there that are good enough for that?). Sometimes the pre-trained model took a concrete mixer for a sheep, and a bush for a person, particularly at night with B&W IR footage. (That bush really looked like a ghost at a quick glance! It was our "EVP" moment.) But it is very effective in finding video highlights. I run the YOLO detection on motion events pre-filtered by DVR-Scan.

Information about YOLO install and initial usage can be found at this site (FreeCodeCamp).

Examples

The example below is a slight improvement over some code found in the YOLO documentation. It accepts a video as a parameter, and generates another video with only the frames that contain some object:

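import cv2, sys
from ultralytics import YOLO

cap = cv2.VideoCapture(sys.argv[1])
if not cap.isOpened():
    sys.exit("Cannot open " + sys.argv[1])
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
# Keep the fractional part (e.g. 29.97); VideoWriter accepts a float
fps = cap.get(cv2.CAP_PROP_FPS)

# COCO class ids: person, bicycle, car, motorcycle, bus, truck
classes = [0, 1, 2, 3, 5, 7]

# Output goes next to the input, as VP8 in a .webm container
out = cv2.VideoWriter(sys.argv[1] + '.webm',
        cv2.VideoWriter_fourcc('V', 'P', '8', '0'),
        fps,
        (width, height))

model = YOLO("./yolov8n.pt")

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        break
    results = model.predict(frame, conf=0.45, classes=classes,
                            verbose=False)
    # Keep only the frames with at least one detection
    if len(results[0].boxes):
        # plot() returns a copy of the frame with boxes and labels drawn
        annotated_frame = results[0].plot()
        # out.write(frame)
        out.write(annotated_frame)

cap.release()
out.release()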

Coupling YOLO with OpenCV is a dynamite thing: so powerful and so easy to use.

The next logical step is to detect only the object classes that make sense for our environment, reducing the chance of false positives (e.g. there will never be elephants or sheep around my house, let alone inside it).

The following script simply judges whether a video contains some object of interest:

import cv2, sys
from ultralytics import YOLO

# COCO class ids: person, bicycle, car, motorcycle, bus, truck
classes = [0, 1, 2, 3, 5, 7]
model = YOLO("./yolov8n.pt")

# Exit status: 0 = object found, 1 = nothing found, 2 = cannot read video
cap = cv2.VideoCapture(sys.argv[1])
if not cap.isOpened():
    sys.exit(2)

fps = cap.get(cv2.CAP_PROP_FPS)
if fps <= 0:
    sys.exit(2)
fpstime = 1.0 / fps

# Minimum time, in seconds of footage, between two analyzed frames
interval = 0.5
# Force analysis of first frame
elapsed_time = 999999.9

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        break

    # Skip frames until enough footage time has passed
    elapsed_time += fpstime
    if elapsed_time < interval:
        continue
    elapsed_time = 0.0

    results = model.predict(frame, conf=0.5, classes=classes,
                            verbose=False)
    if len(results[0].boxes):
        sys.exit(0)

sys.exit(1)

We run this script against every motion-detected event video, spliced by DVR-Scan from the raw footage. Since every video contains a single event, they tend to be very short.

We employ a heuristic that may or may not be adequate for you: the script evaluates at most 2 frames per second (one every 0.5s, the interval variable) to save CPU. We assume that an object of interest will show up in several adjacent frames. A colleague of mine even suggested looking at just 1 frame per second.

Another heuristic that we sometimes use, and sometimes remove because we are not 100% sure it is sound, is to skip more and more frames as we progress. The idea is, when DVR-Scan splices a big video (longer than e.g. 15 seconds), the motion-detected event is likely a false positive due to insects, rain or wind. If there is something interesting, it would be at the beginning anyway.
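The "skip more and more frames" idea could be sketched as a generator of the timestamps to analyze, with the gap growing geometrically up to a ceiling. This is an illustration only: the function name, the growth factor and the ceiling are assumptions, not tuned values.

```python
def sample_times(duration, base_interval=0.5, growth=1.25, max_interval=5.0):
    """Yield the footage timestamps (in seconds) worth analyzing."""
    t = 0.0
    interval = base_interval
    while t < duration:
        yield t
        t += interval
        # Each gap is 25% longer than the previous one, up to a ceiling
        interval = min(interval * growth, max_interval)

times = list(sample_times(15.0))
print(len(times), "frames analyzed instead of", int(15.0 / 0.5))
print([round(t, 2) for t in times])
```

In the detection loop, a frame would be analyzed whenever the footage clock passes the next timestamp produced by the generator.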

Machine learning pipeline

My pipeline of highlight video generation is:

DVR-Scan does motion detection, generating a mini-video per event.
YOLO further filters these events.
A video is compiled with the filtered events. Since very few make it, footage from all cameras is used (making a movie-like video that is more entertaining to watch).
A secondary, per-camera highlights video, using events filtered solely by DVR-Scan, is still generated for "cold storage".

The main reason for doing this two-stage analysis is to reduce YOLO's workload. A minor reason is that we use the motion-detected events as "cold storage" since they are much smaller than the original footage, and we can store them for months instead of days.
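The two stages can be glued together roughly as below. This is a sketch under assumptions, not my production script: the function names and the detect.py file name are hypothetical, the threshold and padding values are just examples, and the second stage relies on the detection script returning exit status 0 when it finds an object.

```python
import subprocess
import sys

def dvr_scan_command(footage, out_dir, threshold=0.3, downscale=3,
                     time_post="3.0s"):
    """Stage 1: command that cuts raw footage into one clip per motion event."""
    return ["dvr-scan", "-i", str(footage), "-d", str(out_dir), "-m", "copy",
            "-t", str(threshold), "-df", str(downscale), "-tp", time_post]

def yolo_finds_object(clip, script="detect.py"):
    """Stage 2: run the detection script; exit status 0 means 'object found'."""
    return subprocess.run([sys.executable, script, str(clip)]).returncode == 0

def select_highlights(event_clips, has_object=yolo_finds_object):
    """Keep only the clips that pass the second filter, in the original order."""
    return [clip for clip in event_clips if has_object(clip)]

print(" ".join(dvr_scan_command("cam1.mp4", "events/cam1")))
```

All clips from stage 1 go to the per-camera "cold storage" video, while only the output of select_highlights() goes to the main highlights video.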

Sometimes the "dumb" detection of DVR-Scan ends up being smarter than YOLO. Suppose a) a thief disguised as a sheep (false negative) or b) a parked car overnight (false positive).

I can envision YOLO replacing DVR-Scan entirely in the future, but two things need to happen. First, I need to get a big-box PC with a GPU to do real-time processing. Second, we need to add object tracking to the YOLO script, e.g. trigger events when objects of interest show up or move, but ignore them while stationary.
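The "ignore them while stationary" part could start as crudely as comparing box centers between two analyzed frames. The sketch below is not real object tracking (Ultralytics ships a tracking mode that would be the proper tool), just an illustration of the idea. It assumes boxes given as (x1, y1, x2, y2) pixel coordinates, and the function names and the 20-pixel threshold are arbitrary.

```python
import math

def box_center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def is_new_or_moved(box, previous_boxes, min_shift=20.0):
    """True if no box of the previous frame sits at about the same place."""
    cx, cy = box_center(box)
    for prev in previous_boxes:
        px, py = box_center(prev)
        if math.hypot(cx - px, cy - py) < min_shift:
            return False  # Same spot as before: treat as stationary
    return True

parked_car = (100, 100, 200, 200)
print(is_new_or_moved((102, 101, 202, 201), [parked_car]))  # detection jitter
print(is_new_or_moved((300, 100, 400, 200), [parked_car]))  # something else
```

A parked car would trigger once, when it first shows up, and then stay quiet until it leaves its spot.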

Then we can go even further: read car plates, do some face recognition, use YOLO as an auxiliary sensor for the alarm, and so on.

ML hardware

My YOLO sample scripts start with the following line:

model = YOLO("./yolov8n.pt")

The file yolov8n.pt is a pre-trained model offered by the YOLO project. Actually, this is the smallest of the available models. It is often used in tutorials since it is usable even without a GPU. But this is not optimal; YOLO is much faster with a GPU, and using bigger models will improve the precision.

As a basis of comparison, the model above can achieve a rate of 5 FPS on a Mini PC (Intel NUC J4125). Yes, it is slow. On the same computer, DVR-Scan does 110 FPS.