Playing with surveillance cameras

I've finally installed some surveillance cameras at my new home. Besides the deterrence effect, it is a way to gain some experience with video, a subject I knew almost nothing about. Being an open-source enthusiast and a follower of the UNIX way of life, I am trying to avoid the "wide gates and broad ways" of purchasing a DVR or using a ready-made cloud service.

The hardware

My standard camera is a local Brazilian brand, Intelbras VIP 1230. It is a PoE IP camera, reasonable price, weather-resistant, wide-angle, with 100ft of IR illumination and secondary low-res stream capability. I think this camera actually runs some embedded Linux. The Web interface is good and it is pretty stable. Perhaps I will experiment with some HikVision cam in the future, since they are cheaper.

First disillusionment

Since I hadn't bought an NVR, and my homemade solution wasn't ready, I would just use the camera app to have remote access. Unfortunately, the cloud solution of this manufacturer seems to have been sunset. So I had to rush my own solution: publish the video streams on the Web. This cost a lot more work than expected.

Wish list

I guess that whoever has surveillance cameras at home has the following wish list:

Watch live streams from anywhere, using a computer or a phone.
Record the streams in a durable way, resistant to Murphy-like and also to Machiavelli-like incidents.
Cloud/off-site storage for even more durability.
Auto generation of videos with highlights that are short and pleasant to watch, and also small enough to function as long-term cold storage.
Be able to watch these compact videos from anywhere.
Automated event detection beyond motion detection, possibly with machine learning.
Realtime automated event detection coupled to alarm.

Every one of these wishes presents non-trivial technical challenges. The biggest problem of handling video, particularly 24x7 footage, is the sheer amount of data. A single camera with a bitrate of 1Mbps generates roughly 10GB per day, 300GB per month. Ten cameras mean 100GB per day, 3TB per month. Transmission uses a lot of bandwidth; processing (transcoding, motion detection, etc.) uses a lot of CPU and/or will need GPUs.
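As a back-of-the-envelope check of these figures, here is a small Python sketch. The function name and its parameters are made up for illustration; it assumes decimal units (1GB = 10^9 bytes) and a perfectly steady bitrate.

```python
def storage_gb(bitrate_mbps: float, cameras: int = 1, days: float = 1.0) -> float:
    """Approximate footage size in decimal gigabytes for a steady bitrate."""
    seconds = days * 24 * 60 * 60
    bits = bitrate_mbps * 1_000_000 * seconds * cameras
    return bits / 8 / 1_000_000_000


print(f"1 camera, 1 day:     {storage_gb(1):.1f} GB")
print(f"1 camera, 30 days:   {storage_gb(1, days=30):.0f} GB")
print(f"10 cameras, 30 days: {storage_gb(1, cameras=10, days=30) / 1000:.2f} TB")
```

The exact results are a little above the rounded numbers in the text (10.8GB per day rather than 10GB), which does not change the conclusion.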

A DVR or NVR will fulfill some of these needs. It is the go-to solution if you don't want to bother with the technology. I didn't purchase a DVR because I am a masochist (**).

Another avenue is to purchase a cloud service. But they are not exactly cheap (US$ 12/cam/month on average) and a lot of upstream bandwidth will be necessary. The guaranteed upload of most residential Internet is just 10Mbps (*).

On the other hand, a direct connection from the Internet to your cameras is difficult if your ISP uses CGNAT. Some sort of cloud "reflector" service will be necessary, and such a service has significant operating costs, so you won't find one for free, not even from the camera's manufacturer.

This is a niche where the cloud computing failed to fulfill the promise of abundant and cheap services. The expectation is to watch your livestreams from YouTube, detect events using ChatGPT and store years of footage on RedShift, and this whole bundle costing 5 cents per month. In the real world, the only way to watch livestreams is to use a crummy, outdated app that doesn't even work.

The solution is to bring back the good old local network server, rechristened as "edge server".

Death and resurrection of the client-server architecture. Credits: Daniel Stori, site Turnoff.us (https://turnoff.us/geek/edge/). By the way, my edge server is an old Mini PC that used to be a NAS.

Last but not least, the cloud depends on a functional Internet connection. Between Murphy and Machiavelli, it is certain the ISP will fail you when you need it most. Even if you use a cloud service, there should be an edge device like a DVR, as backup.

But enough of this talk, from now on it is all open source and free software.

Watch livestreams from everywhere

The easiest way to make a livestream available is to publish it as a Web page, since every device and their mother can open a Web page, out of the box. The downside is having to set up a Web server in the cloud.

A Web page cannot consume an RTSP stream directly. In the past, one would employ a Flash or ActiveX plug-in. Now, in the HTML5 era, there are native solutions in two major flavors: transmit the stream via HTTP, or use WebRTC.

We started with streaming via HTTP, since it is easier to set up and works everywhere, every time, even with restrictive firewalls and proxies in the way. If you can watch YouTube, you can watch HTTP streaming. But it has higher latency and needs some sort of RTSP–HTTP converter on the server side.

The most well-known open-source project is MediaMTX. It has an embedded Web server and the RTSP converter. It serves a ready-made page to show a livestream on your browser. If all you want is to publish a livestream, MediaMTX is a one-stop solution.

I prefer to "hide" it behind an NGINX reverse proxy, because it is easier to configure HTTPS on NGINX. My first setup ran MediaMTX on a cloud server, and the RTSP cameras were reached through a VPN. For reasons to be explained later, we have moved MediaMTX to an edge server on the same LAN as the cameras.

Let's first explore how HTTP streaming works.

The gory details of HTTP streaming

There are two methods of HTTP streaming: Media Source Extensions (MSE) and HTTP Live Streaming (HLS).

MSE is conceptually very simple. It is a JavaScript API that allows a page to feed <video> and <audio> elements programmatically. It is the oldest method and works in almost every browser, with the glaring exception of the iPhone.

A <video> element expects to be fed with an H.264 stream (or any other codec supported by the browser). Where does JavaScript get this stream from? MSE does not specify that. The site (i.e. you) has to come up with some protocol to send H.264 media over HTTP in segments (continuous-data HTTP exists, but it has many caveats, so one can't count on it).

Note that <video> won't accept an RTSP stream; it expects a pure H.264 stream. The media stream must be extracted from the RTSP encapsulation at some point, and that's why we need the RTSP–HTTP converter. Fortunately, it is already bundled with MediaMTX.

HLS was proposed by Apple, is obviously supported by iPhone, and is the dominant standard now. The specification affects both clients and servers. For the webmaster, it could not be simpler: just point the src of the <video> element to some URL with extension .m3u8.

The m3u8 format is a playlist: it lists the media segments available via HTTP. Here starts the magic. For a movie, the playlist could be a static file. For live TV, or a livestream of surveillance cameras, the playlist must be dynamically generated and updated.

The HLS media segments also have a predefined format. They must encapsulate the H.264 stream as MPEG-TS or MP4. This encapsulation is also a task for the RTSP–HTTP converter. For a movie, the segments may be static files (perhaps even distributed by a CDN). For livestreams, new segments must be generated in realtime (and the playlist must be updated accordingly). Past livestream segments may be discarded, or may be kept around if the site offers rewind capabilities.
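To make the "dynamically generated playlist" idea concrete, here is a minimal Python sketch that renders a sliding-window live playlist. It is not how MediaMTX does it internally; the function name, the segment file names and the window of three segments are assumptions for illustration. Only the tags themselves come from the HLS specification.

```python
import math


def live_playlist(first_sequence: int, segments: list[tuple[str, float]]) -> str:
    """Render a sliding-window HLS media playlist.

    segments: (uri, duration in seconds) of the segments currently on offer.
    first_sequence: sequence number of the first listed segment; it grows
    every time an old segment drops out of the window.
    """
    target = max(math.ceil(duration) for _, duration in segments)
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target}",
        f"#EXT-X-MEDIA-SEQUENCE:{first_sequence}",
    ]
    for uri, duration in segments:
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(uri)
    # The end-of-list tag is deliberately absent: that is what tells the
    # player the stream is live and the playlist must be polled again.
    return "\n".join(lines) + "\n"


# Hypothetical window of three 2-second segments
print(live_playlist(42, [("seg42.ts", 2.0), ("seg43.ts", 2.0), ("seg44.ts", 1.96)]))
```

Each time a new segment is finished, the server would drop the oldest entry, append the new one and increment the media sequence number; the client simply reloads the playlist periodically.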

There are JavaScript polyfills that implement m3u8 media support in MSE-only browsers. The server side can be HLS-only and still support every browser. By the way, this is exactly how MediaMTX does it.

HLS pitfalls

On my first try implementing HLS using MediaMTX, I managed to make it work in 15 minutes, for desktop computers and for Android. Then it took 3 more hours to make it work for the iPhone...

First problem was, not every combination of browser and platform supports the H.265 codec, so I had to reconfigure the cameras to H.264. Pity, since H.265 compression is vastly superior. (If the camera were H.265 only, we'd have to resort to transcoding, and bear the CPU costs.) In time, all browsers and platforms will eventually support H.265 and then we switch back.

The second issue was that playback was difficult to start on the iPhone. It took two or three attempts, and even the cameras came under suspicion. But it was rather a combination of HLS characteristics, MediaMTX configuration, and camera configuration:

MediaMTX was configured to get the RTSP stream on-demand, when someone was watching, in order to save bandwidth (at this point, I still ran MediaMTX in the cloud).
An HLS client needs to receive at least 3 media segments before playback starts.
The media segment size is configurable at MediaMTX, the default is 1s, but every segment must have at least one I-frame.
The I-frame is a video frame that contains a complete image, like a JPEG picture. An H.264 or H.265 stream emits an I-frame every "n" differential frames.
My cameras offer an "intelligent compression" mode where the I-frame rate is kept as low as possible. This saves a lot of bandwidth and avoids the headache of configuring the I-frame rate manually.
Since the cameras generated I-frames few and far between, the HLS segments were getting big, with several seconds worth of footage.
Since MediaMTX delayed the RTSP streaming until a client was connected, it could take almost a minute until the first three HLS segments could be produced and delivered.
iPhone's Safari has shorter timeouts than other browsers, failing the playback before the initial HLS segments were ready.
MediaMTX does not close an on-demand RTSP stream immediately upon client disconnection. So, if the failed page was reloaded soon, then there would be HLS segments for immediate delivery.

The initial workaround was to turn off the "intelligent compression" and configure the I-frame rate to 1 per second. Unfortunately, this increased the camera bitrate brutally. We had to reduce resolution from Full HD to HD, and the bitrate was still around 2.3Mbps. Just to make it work for the wife's iPhone...

A better solution was to move MediaMTX from the cloud to an edge server on the same LAN as the cameras. MediaMTX can stay connected to RTSP streams continuously for free, and HLS is always ready to go for the clients.

Still, there remains a tradeoff between I-frame rate and video delay. For example, if there is one I-frame every 3 seconds, the HLS playback will be delayed by 9s. With the "intelligent compression" enabled, this delay could be as high as 60s, so we had to pay the price of a fixed I-frame rate (one every 2s is our current configuration).
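The arithmetic behind these delays can be sketched in a few lines of Python. This is a simplified model, not a description of MediaMTX internals: it assumes every segment starts on an I-frame, that the player buffers three segments before starting, and that the configured segment duration is the 1s default mentioned earlier. The function and parameter names are illustrative.

```python
import math


def hls_delay_seconds(iframe_interval_s: float,
                      configured_segment_s: float = 1.0,
                      segments_needed: int = 3) -> float:
    """Estimate HLS playback delay for a given I-frame interval."""
    # A segment can only be cut at an I-frame, so its real duration is the
    # configured duration rounded up to a whole number of I-frame intervals.
    intervals = math.ceil(configured_segment_s / iframe_interval_s)
    segment_s = intervals * iframe_interval_s
    return segments_needed * segment_s


for interval in (1.0, 2.0, 3.0, 20.0):
    print(f"I-frame every {interval:>4}s -> about {hls_delay_seconds(interval):.0f}s delay")
```

With one I-frame every 3s the model gives the 9s quoted above, and an I-frame every 20s (plausible with "intelligent compression") gives the 60s worst case.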

In a surveillance camera, a long delay is detrimental. Many things can happen in 9 seconds. On the other hand, WebRTC has near-zero delay. For WebRTC, the only delay associated with the I-frame is at playback start.

Resolution, bitrate, FPS, CBR and VBR

Big dilemma: which is the ideal configuration for a surveillance camera? In a dream scenario we would max out all parameters. But we need to keep the bitrate under control, because it takes a toll on storage and on upload. We also need to limit resolution and FPS if we want automated video analysis. First, let's concentrate on the bitrate.

CBR x VBR

Should you use CBR (constant bitrate) or VBR (variable bitrate)? People think CBR guarantees a fixed quality for a fixed bitrate, but actually the stream quality is constantly adjusted to keep the bitrate around the target. If you use CBR with a bitrate just enough to encode a simple/static scene, the quality will drop as soon as the scene becomes dynamic and/or more complex, since it has to fit in the same bitrate.

On the other hand, VBR guarantees a certain quality by varying the bitrate as necessary. Each camera expresses this "quality" in a different unit: average bitrate, average/maximum bitrate, or a unitless number. In my cameras, it is a value 1..6, which translates to a tenfold variation of typical bitrate. I use level 4, which is 1/3 of the maximum bitrate.

Given the same footage quality, VBR translates to much smaller average bitrates than CBR, but the spot bitrate may flare up to 3x the average. If your network can handle such spikes, VBR is best. If the bitrate must absolutely not flare, CBR is best. Note that the ideal bitrate for CBR will be near the maximum VBR bitrate for the same subjective quality; a low CBR bitrate will deliver blurry footage exactly when it shouldn't i.e. when something is actually happening. I can't see why someone would use CBR at all.

I-frame rate

For surveillance cameras, the I-frame is the biggest "villain" of bitrate. The final bitrate is almost linearly proportional to the I-frame rate. Increasing it from e.g. 1:15 to 1:5 triples the bitrate for VBR, or causes a severe image quality drop for CBR.

As mentioned earlier, the I-frame rate has impact on latency. For the livestream consumer, the more I-frames, the better. What the "intelligent compression" does is simply generate as few I-frames as possible, which is not good for livestreaming. You should weigh the trade-offs and choose the I-frame rate manually.

Personally, I use a rate of 1:10 and 5 FPS, which means an I-frame every 2s.

Resolution

The relationship between bitrate and resolution is complex. It can be sublinear, linear or even superlinear, depending on other factors.

For example, in my cameras, the relationship is sublinear with VBR quality level 4, linear for quality 5, and superlinear for quality 6 (max). Assuming that most people will select an average VBR, it means sublinear.

FPS

Thanks to the efficiency of H.264 and H.265 compression, the relationship between bitrate and FPS is definitely sublinear. Increasing FPS 4x increases bandwidth by only 50%. (Of course, the I-frame rate per unit of time must stay the same, so if your I-frame rate is one every 2s, this means 1:10 for 5 FPS, so it must be reduced to 1:40 if FPS is increased to 20.)
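The conversion between the "1:n" I-frame ratio, the FPS and the time between I-frames is simple but easy to get backwards, so here is a tiny Python sketch of it. The function names are made up for this example.

```python
def iframe_interval_seconds(frames_per_iframe: int, fps: float) -> float:
    """Time between I-frames for a 1:n ratio at a given frame rate."""
    return frames_per_iframe / fps


def frames_per_iframe_for(interval_s: float, fps: float) -> int:
    """The 'n' in 1:n that keeps one I-frame every interval_s seconds."""
    return round(interval_s * fps)


print(iframe_interval_seconds(10, 5))   # 1:10 at 5 FPS
print(frames_per_iframe_for(2, 20))     # keep 2s between I-frames at 20 FPS
```

The first call yields 2.0 seconds and the second yields 40, matching the 1:10 and 1:40 figures above.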

On the other hand, resolution and FPS have big impacts on post-processing (transcoding, motion detection, etc.) The processing cost is linearly proportional to the product FPS × resolution, regardless of bitrate. We will revisit this later, but consider yourself warned.
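To get a feeling for this warning, the sketch below compares the processing load of two configurations, taking the product FPS × resolution at face value. The baseline (HD at 5 FPS) and the function name are choices made for this illustration.

```python
def relative_processing_cost(width: int, height: int, fps: float,
                             base: tuple[int, int, float] = (1280, 720, 5)) -> float:
    """Pixels per second to process, relative to a baseline configuration."""
    base_w, base_h, base_fps = base
    return (width * height * fps) / (base_w * base_h * base_fps)


# Full HD at 20 FPS compared with HD at 5 FPS
print(relative_processing_cost(1920, 1080, 20))
```

Going from HD at 5 FPS to Full HD at 20 FPS multiplies the pixel throughput by 9, even though the bitrate grows far less than that.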

H.264 x H.265

As mentioned earlier, H.265 is much better than H.264, but unfortunately H.265 is still not supported by every browser and other software, so we had to settle for H.264. In the future, we hope to go back to H.265. The compression gain is higher for higher-quality videos: bitrate is -25% for VBR 4, and an incredible -40% for VBR 6 (max).

How to find a bitrate for CBR

Even though we use VBR, we did some experiments with CBR. The best "tool" to find the ideal CBR bitrate is to set the camera to VBR and monitor the bitrate using some network tool like iptraf. Make sure you test for static and dynamic scenes (ask someone to wave their hands, etc.). Your ideal CBR bitrate is the maximum observed VBR bitrate.
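If you prefer numbers to eyeballing a monitoring tool, the same procedure can be sketched in Python: sample a cumulative byte counter while the camera streams in VBR, convert consecutive samples into bitrates, and take the maximum. How the counter is sampled (interface statistics, a proxy, etc.) is left open; the sample values below are invented.

```python
def bitrates_mbps(samples: list[tuple[float, int]]) -> list[float]:
    """Convert (timestamp in seconds, cumulative bytes) samples to Mbps."""
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        rates.append((b1 - b0) * 8 / (t1 - t0) / 1_000_000)
    return rates


def suggested_cbr_mbps(samples: list[tuple[float, int]]) -> float:
    """The rule of thumb from the text: use the peak observed VBR bitrate."""
    return max(bitrates_mbps(samples))


# Invented samples: a quiet second, a busy second, a quiet second
samples = [(0, 0), (1, 125_000), (2, 500_000), (3, 625_000)]
print(bitrates_mbps(samples))
print(suggested_cbr_mbps(samples))
```

Short sampling intervals exaggerate the peaks caused by individual I-frames, so intervals of at least one I-frame period are probably a sensible choice.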

WebRTC

We were content with HLS, with no plans to mess with WebRTC. But when we were at home, it grew old to watch the surveillance streams with some 10-second delay. We could resort to the Web interface of the camera itself, but it shows only one stream, it asks for a login every time you open it, etc. By this time I had already put together a single "dashboard" page with all cameras and did want to preserve the UX.

Thanks to MediaMTX, it is quite easy to serve WebRTC streaming along with HLS. Just activate it in the configuration. The edge server was already on the LAN, so there was no entanglement with NAT. Zero-latency video with zero effort!

It is best to "hide" MediaMTX behind a reverse proxy like NGINX. It should be correctly configured to handle WebSockets. You can follow the instructions for other products that also use WebSockets e.g. Grafana (that's exactly what I did).

Now, the next step is to make WebRTC work when cameras are watched from the Internet. This is more complicated and will imply some monetary costs. From now on, assume the MediaMTX runs in a server behind one or more NAT routers.

The simplest option by far is to lease a fixed IP address from your ISP. (My ISP uses CGNAT but offers public, fixed IPv4 addresses for a reasonable extra.) Configure reverse NAT in your router, so incoming connections go to your edge server.

This way, your setup may even work without a TURN server; the public Google STUN server is enough. Example of relevant config section for MediaMTX:

    webrtcICEServers: [stun:stun.l.google.com:19302]
    webrtcICEHostNAT1To1IPs: [fixed IPv4, fixed IPv6 if any]
    webrtcICEUDPMuxAddress: :8885
    webrtcICETCPMuxAddress: :8885

In the example above the incoming connections to port 8885 TCP and UDP have to be accepted by the reverse NAT router and forwarded to the edge server running MediaMTX.

If getting a fixed IP is not an option, you need either to a) run a TURN server yourself, or b) purchase this service from a third party. The MediaMTX configuration would then look like

    webrtcICEServers: [turn:user:password:name:port]
    webrtcICEHostNAT1To1IPs: []
    webrtcICEUDPMuxAddress: :8885
    webrtcICETCPMuxAddress: :8885

A professional site with streaming would do both: configure a public IP and also run a TURN server, to ensure the widest coverage possible for their clients.

Even then, WebRTC may fail if the client is behind a very restrictive firewall and/or transparent proxy, found in hotels or cafés. I keep the HLS streaming available as a secondary option for these cases.

Durable recording

Recording RTSP is simple. All it takes is ffmpeg. Since most RTSP cameras can serve multiple clients, one may have a secondary or even a tertiary computer for recording, depending on the durability you need. If you don't use transcoding, the ffmpeg recording is easy on the CPU. Even a Raspberry Pi Zero can manage a handful of streams.

This script tries to be robust while using ffmpeg for recording. Since ffmpeg may (and does) break or hang during recording, the script must handle these situations gracefully.

The MP4 format stores metadata at header *and* trailer of the file. If ffmpeg is interrupted, even using Ctrl-C, the result is a corrupted file. Recording to MKV or WebM file won't corrupt when interrupted, so prefer these formats.
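To illustrate the kind of robustness involved, here is a minimal Python sketch of a supervised recorder. It is written for this explanation and is not the script mentioned above. It records to MKV in fixed-length segments without transcoding, and restarts ffmpeg whenever it exits or after a maximum run time, which is a crude but effective guard against silent hangs. The camera URL, the output directory and the timing values are placeholders.

```python
import subprocess
import time


def ffmpeg_command(rtsp_url: str, out_dir: str, segment_seconds: int = 600) -> list[str]:
    """Build an ffmpeg command that copies an RTSP stream into MKV segments."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "warning",
        "-rtsp_transport", "tcp",        # RTSP over a single TCP connection
        "-i", rtsp_url,
        "-c", "copy",                    # no transcoding, so CPU usage stays low
        "-f", "segment",
        "-segment_time", str(segment_seconds),
        "-segment_format", "matroska",   # MKV survives an abrupt interruption
        "-reset_timestamps", "1",
        "-strftime", "1",                # expand the date pattern in the file name
        f"{out_dir}/%Y%m%d-%H%M%S.mkv",
    ]


def record_forever(rtsp_url: str, out_dir: str, max_run_seconds: int = 3600) -> None:
    """Keep ffmpeg running; restart it when it dies or runs for too long."""
    while True:
        proc = subprocess.Popen(ffmpeg_command(rtsp_url, out_dir))
        try:
            proc.wait(timeout=max_run_seconds)
        except subprocess.TimeoutExpired:
            # Periodic restart: also recovers from hangs that never exit
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
        time.sleep(5)  # brief pause so a dead camera does not cause a busy loop
```

It would be started with something like record_forever("rtsp://192.0.2.10/stream", "/srv/footage"), where both arguments are hypothetical. A more careful version would also watch whether the output files keep growing, instead of relying only on the periodic restart.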

Regardless of the size of your HDD or SSD, recording video 24x7 will eventually fill up any storage. This is a suggestion of script to remove old footage.
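The general idea can be sketched in Python as follows (again an illustration written for this explanation, not the script referred to above): delete the oldest recordings until the total size fits within a quota. The directory layout, the .mkv pattern and the quota are assumptions.

```python
from pathlib import Path


def prune_footage(directory: str, max_bytes: int, pattern: str = "*.mkv") -> list[Path]:
    """Delete the oldest recordings until the total size fits in max_bytes.

    Returns the list of files that were removed, oldest first.
    """
    files = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_mtime)
    total = sum(p.stat().st_size for p in files)
    removed = []
    for path in files:
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        removed.append(path)
    return removed
```

Run from cron or a timer every few minutes, with a quota comfortably below the disk size, this keeps a rolling window of the most recent footage. Pruning by size rather than by age has the advantage of coping with bitrate changes automatically.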

If you record 10 cameras in 2 computers, plus livestreaming, you have about 60Mbps of traffic (assuming 2Mbps per stream). This is a lot, and dangerously near 100Mbps which is the limit of many cheaper PoE switches. Also, I don't trust having so many RTSP clients connected to the cameras; these IoT devices tend to be buggy and act up outside the vanilla usage patterns.

Again, MediaMTX saves the day: it doubles as an RTSP repeater, so it can forward an RTSP stream to other clients while still serving HLS and WebRTC. This is particularly useful if the MediaMTX server is also a recording device; it makes sense not to connect twice to the same camera from the same computer. For secondary or tertiary recorders, it may or may not make sense, because then MediaMTX becomes a SPOF.
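The effect of the repeater on the traffic leaving the cameras can be estimated with a one-line formula, sketched here in Python. It uses the same assumptions as the text (10 cameras, 2Mbps per stream, two recorders plus livestreaming); the function name is made up, and the traffic between MediaMTX and its own clients is not counted.

```python
def camera_side_mbps(cameras: int, clients_per_camera: int, stream_mbps: float = 2.0) -> float:
    """Total traffic leaving the cameras when each client pulls its own copy."""
    return cameras * clients_per_camera * stream_mbps


direct = camera_side_mbps(10, 3)         # two recorders + livestream per camera
via_repeater = camera_side_mbps(10, 1)   # only MediaMTX connects to each camera
print(direct, via_repeater)
```

With every consumer connecting directly, the cameras emit 60Mbps in total; with MediaMTX as the only RTSP client, they emit 20Mbps and each camera handles a single connection.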

The RTSP protocol

The de facto standard for surveillance cameras is the RTSP protocol, so it is good to know its basics.

Whoever had the displeasure to deal with RTP, H.323 or SIP/SDP knows how difficult it is to put VoIP and media streaming to work across NATs. Whoever developed these standards tried to cover every possible scenario, and created monsters. But NAT was not one of their preoccupations.

To be fair, these protocols were created in the 1990s, and the future looked rosier back then. We envisioned an IPv6-only, NAT-free Internet with pervasive multicast support and peer-to-peer communication. Nobody foresaw the bleak Internet of 2023 with CGNAT, IPv6 still struggling, and 99% of the user facetime concentrated on just 3 or 4 apps.

In the case of RTP, three protocols must work in concert: RTP, RTCP, and RTSP. RTP carries the media proper. RTCP does synchronization and QoS. RTSP is the "remote control". Imagine the difficulty of making all these protocols pass through firewalls and NATs, to achieve a single video playback.

But RTSP has two interesting features: it can use TCP transport and can multiplex packets from the other two protocols. This is very handy in a NAT-ridden network. One playback = one TCP connection. And so it became a de facto standard.

An RTSP server can also function as an application-level proxy, or switchboard, which allows e.g. MediaMTX to forward and distribute an RTSP stream to other clients.

The media is still encapsulated in RTP packet format, which are in turn multiplexed in the RTSP stream. At receiver side, someone must do the unpacking. In our setup, it is MediaMTX.
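The multiplexing itself is simple. In RTSP over TCP, each RTP or RTCP packet is prefixed by a 4-byte header: a "$" character, a one-byte channel number, and a two-byte big-endian length. Here is a minimal Python sketch of the unpacking step. It assumes the buffer contains only interleaved frames; a real client must also handle RTSP text responses arriving on the same connection. The function name is illustrative.

```python
import struct


def split_interleaved(buffer: bytes) -> tuple[list[tuple[int, bytes]], bytes]:
    """Split a TCP byte buffer into (channel, payload) frames.

    Returns the complete frames, plus the leftover bytes of a frame that
    has not fully arrived yet (to be prepended to the next read).
    """
    frames = []
    pos = 0
    while len(buffer) - pos >= 4:
        magic, channel, length = struct.unpack_from("!BBH", buffer, pos)
        if magic != 0x24:  # ASCII '$'
            raise ValueError("not an interleaved frame")
        if len(buffer) - pos - 4 < length:
            break  # payload incomplete, wait for more data
        frames.append((channel, buffer[pos + 4:pos + 4 + length]))
        pos += 4 + length
    return frames, buffer[pos:]


data = b"$\x00\x00\x03abc" + b"$\x01\x00\x02hi" + b"$\x00\x00\x05ab"
frames, rest = split_interleaved(data)
print(frames)  # two complete frames, on channels 0 and 1
print(rest)    # the incomplete third frame
```

The channel numbers are negotiated during RTSP setup; typically an even channel carries RTP and the next odd channel carries the matching RTCP.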

WebRTC in detail

FWIW we will describe how WebRTC works, in order to understand the challenges:

The server of a Web site is used as a reflector for a signalling channel, so the client browser can easily exchange information with the media server, generally over WebSockets.
Client and server exchange SDP messages through the signalling channel (yes, that bloody SDP from VoIP SIP).
The SDP messages contain bids of codecs and peer-to-peer channels (IP addresses, ports, protocols) to carry the media or data content.
Both sides agree on a codec and a channel, provided there is common ground.
The peer-to-peer channel is opened, and content starts to flow.

The sequence is not unlike a SIP VoIP call, in which SIP is responsible for the signalling.

The biggest issue of SDP is that it sends IP addresses to the peer. This is not protocol kosher (the application layer should never delve into details of the network layer) and it creates problems when hosts are behind NAT, since SDP will find and bind non-public IP addresses. Working around this problem drove many people crazy back then, when SIP VoIP started to get popular and people tried to make it work with NAT.

In time, the protocols STUN, TURN and ICE were developed to automate the NAT piercing steps. STUN is a way for a host to discover its own public IP, which may or may not be enough to make a peer-to-peer connection, depending on the NAT type. It is certainly not enough when there is CGNAT.
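For the curious, a STUN query is a very small thing. The Python sketch below builds a Binding Request and decodes the XOR-MAPPED-ADDRESS attribute that the server sends back with the public IP and port it saw. It only handles IPv4, does no networking, and skips everything else in the protocol (attribute parsing, retransmission, authentication). The constants come from the STUN specification; the function names are made up.

```python
import os
import socket
import struct

STUN_BINDING_REQUEST = 0x0001
STUN_MAGIC_COOKIE = 0x2112A442


def stun_binding_request() -> bytes:
    """Build a 20-byte STUN Binding Request with no attributes."""
    transaction_id = os.urandom(12)
    # message type, attributes length (0), magic cookie, transaction ID
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, 0, STUN_MAGIC_COOKIE)
    return header + transaction_id


def decode_xor_mapped_address(value: bytes) -> tuple[str, int]:
    """Decode the value of an IPv4 XOR-MAPPED-ADDRESS attribute."""
    _reserved, family, xport, xaddr = struct.unpack("!BBHI", value[:8])
    if family != 0x01:
        raise ValueError("only IPv4 is handled in this sketch")
    # Port and address are XORed with the magic cookie so that NAT devices
    # which rewrite IP addresses found in payloads do not tamper with them.
    port = xport ^ (STUN_MAGIC_COOKIE >> 16)
    addr = socket.inet_ntoa(struct.pack("!I", xaddr ^ STUN_MAGIC_COOKIE))
    return addr, port
```

The request would be sent over UDP to a STUN server such as the public Google one mentioned earlier, and the attribute extracted from the response tells the host how it looks from the outside.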

TURN is a big step-up from STUN. A TURN server can be a "reflector" i.e. a traffic proxy. Running a TURN server with public IP addresses guarantees the peer-to-peer connection can always happen, regardless of how many NATs are in the way.

But operating a TURN server is costly, so nobody does it for free. Every streaming service must supply its own. The most well-known open-source implementation is coturn. You can also purchase cloud services like Twilio.

ICE is a consolidation of the techniques above, and uses STUN and TURN to identify the best peer-to-peer channel. The candidates are given priorities e.g. a LAN candidate is better than STUN, STUN is better than TURN, UDP is better than TCP, and so on.

Now, the ace in the hole: ICE actively tests every possible combination of local and remote peers, sending STUN queries directly to each candidate (not to the STUN server). If some query gets a response, it means the local-remote peer pair is a capable communication channel.

The downside is the combinatorial explosion. If one side has 3 candidates and the other side has 2, each side has 6 pairs to test, meaning a grand total of 12 tests. The viable combinations are compiled and the most desirable is selected as media channel.
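The pairing and ranking can be sketched in Python. The priority formulas and the recommended type preferences (host 126, server-reflexive 100, relayed 0) are the ones from the ICE specification; the rest is simplified for illustration: candidates are represented only by their type, with the same local preference and component, and the local side is assumed to be the controlling agent.

```python
from itertools import product

# Recommended type preferences: host (LAN), server-reflexive (STUN), relayed (TURN)
TYPE_PREFERENCE = {"host": 126, "srflx": 100, "relay": 0}


def candidate_priority(kind: str, local_preference: int = 65535, component: int = 1) -> int:
    """Priority of a single candidate."""
    return (TYPE_PREFERENCE[kind] << 24) + (local_preference << 8) + (256 - component)


def pair_priority(controlling: int, controlled: int) -> int:
    """Priority of a candidate pair, from the priorities of its two members."""
    g, d = controlling, controlled
    return (min(g, d) << 32) + 2 * max(g, d) + (1 if g > d else 0)


def ranked_pairs(local_kinds: list[str], remote_kinds: list[str]) -> list[tuple[str, str]]:
    """All local/remote combinations, most desirable first."""
    def priority(pair: tuple[str, str]) -> int:
        return pair_priority(candidate_priority(pair[0]), candidate_priority(pair[1]))
    return sorted(product(local_kinds, remote_kinds), key=priority, reverse=True)


pairs = ranked_pairs(["host", "srflx", "relay"], ["host", "srflx"])
print(len(pairs))
for pair in pairs:
    print(pair)
```

With 3 local and 2 remote candidates this lists the 6 pairs mentioned above, with the LAN-to-LAN pair first and the pairs involving the TURN relay last. ICE then tests the pairs and picks the best one that actually works.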

ICE can almost always find the best path, even figuring out when the peers are in the same LAN. Since WebRTC includes ICE, it Just Works™ for the user. For the website operator, making WebRTC happen will cost money — lots of it for a high-traffic site.

Check out the Part 2 of this adventure.

Notes

(*) Here I am wearing the hat of the ISP. GPON shares 1.25Gbps of upload among 128 users at the most. If everyone in your neighborhood decides to upload camera streams to the cloud, 10Mbps is all the upstream band everyone will get.

(**) DVRs and NVRs are best for controlled environments, with security guards, etc. In a residential house or unattended place, the DVR is the first thing a burglar will look for. The DVR is still useful at home, but we need something else, like a second hidden DVR, or recording to the cloud, to have durable recording.