Conversation

@jholveck
Contributor

@jholveck jholveck commented Jan 17, 2026

Changes proposed in this PR

It seems that there are a number of users who use MSS for video encoding. Add a demo showing how to do this.

This again highlights multithreaded pipelining, like the TinyTV demo, but is more accessible (as few users have a TinyTV). It also shows a number of pitfalls that are common when encoding video, such as failing to correctly record timestamps.

  • Tests added/updated - N/A
  • Documentation updated
  • Changelog entry added
  • ./check.sh passed

I'll probably break this into a simple and advanced version too.
I may have to take out the audio code.

This also currently uses some of my work in the (unmerged) feat-buffer
branch, so I'll need to switch it to use what's available now.
Very much incomplete, sometimes stopping mid-sentence.  But I've
written enough that I don't want to lose it, so here's an intermediate
commit.
Also reformat the comments.
@BoboTiG
Owner

BoboTiG commented Jan 17, 2026

Nice!!

@BoboTiG
Owner

BoboTiG commented Jan 17, 2026

This is really good information here, priceless!

If I wanted to use h265, is it as easy as setting "h265", or does it need special handling?

@jholveck
Contributor Author

You could pass --codec libx265 on the command line. The thing you use there is the same as ffmpeg's -c:v flag.

You can get a list of available codecs in the PyAV build using python3 -m av --codecs; libx265 is among those.

You'd need to comment out the "profile":"high" in CODEC_OPTIONS: libx265 doesn't recognize the high profile. Most, if not all, of the features in the H.264 "high" profile are already part of the H.265 "main" (default) profile.

You can look at other flags for libx265 using ffmpeg --help encoder=libx265, if your ffmpeg build has libx265 compiled in. The libx265 library is GPL-only, so some builds might not include it, but the one included in PyAV does.
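For reference, a minimal sketch of that swap (the option values here are illustrative, not necessarily what the demo ships with):

# Hypothetical H.265 setup; see `ffmpeg -help encoder=libx265` for the real option names.
codec = "libx265"  # same effect as passing --codec libx265 on the command line
CODEC_OPTIONS = {
    # "profile": "high",  # libx264-only; libx265's default "main" profile already covers these features
    "preset": "fast",     # x265 presets mirror x264's ultrafast..placebo scale
    "crf": "28",          # x265's default; lower means better quality and a bigger file
}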

I'll add some comments to this effect.

@bboudaoud-nv

This is really good information here, priceless!

If I wanted to use h265, is it as easy as setting "h265", or does it need special handling?

@BoboTiG codec names can be a bit finicky here. IIRC on Linux x265 should work for encode. On Windows I believe it's hevc. The optional parameters passed in, as well as the supported encoding formats for frames, may change with the codec depending on the system, though. Currently h264 encoding is definitely the safe choice for multi-platform.

You could also experiment with codec names like mpeg or h264 to see if you get better multiplatform support here.
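If it helps, one way to probe that at runtime might be something like this (a sketch; the candidate list is only an example, not a recommended order):

import av

def pick_codec(candidates=("h264_nvenc", "hevc", "libx265", "libx264", "mpeg4")):
    # Return the first codec name this PyAV build can open for encoding ("w").
    for name in candidates:
        try:
            av.Codec(name, "w")  # raises if the codec isn't available as an encoder
        except Exception:
            continue
        return name
    raise RuntimeError("no usable encoder found among the candidates")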

@BoboTiG
Owner

BoboTiG commented Jan 20, 2026

Currently h264 encoding is definitely the safe choice for multi-platform.

Fully agree on that.

Thank you for the useful review :)

@BoboTiG
Owner

BoboTiG commented Jan 20, 2026

@jholveck should we do checks for third-party modules?

I was testing and was missing multiple modules; I needed to look them up on PyPI, a process which could be improved.

Should we check at import time, something like this?

import sys

try:
    import av
except ImportError:
    print("The PyAV module is missing, run: `python -m pip install av`")
    sys.exit(1)

try:
    import si_prefix
except ImportError:
    print("The si-prefix module is missing, run: `python -m pip install si-prefix`")
    sys.exit(1)

This feels like a horrible solution haha, I'm simply putting words on an idea.

@BoboTiG
Owner

BoboTiG commented Jan 20, 2026

Overall, it works great! I see a small difference in colors, on Linux, but I guess this is due to the JPEG compression.

@jholveck
Contributor Author

Overall, it works great! I see a small difference in colors, on Linux, but I guess this is due to the JPEG compression.

You might try turning on DISPLAY_IS_SRGB.

@jholveck
Contributor Author

@jholveck should we do checks for third-party modules?

I think that doing explicit checks like that might be more mess than success. But I can easily add a comment above where we import third-party modules giving the right pip command to install them.

@jholveck
Contributor Author

Overall, it works great! I see a small difference in colors, on Linux, but I guess this is due to the JPEG compression.
You might try turning on DISPLAY_IS_SRGB.

By the way, this relates to #207. I think that tagging screenshots with the display's colorspace would be a useful future addition to MSS, for just this sort of thing.

@BoboTiG
Owner

BoboTiG commented Jan 20, 2026

@jholveck should we do checks for third-party modules?

I think that doing explicit checks like that might be more mess than success. But I can easily add a comment above where we import third-party modules giving the right pip command to install them.

Yes, let's keep it simple then: at the top of the file, one line to install everything.
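Something like this at the top of each demo, then (the module list is a guess at what the demos import; adjust as needed):

# Third-party requirements for this demo. Install them all in one go with:
#   python -m pip install mss av numpy si-prefix
import av
import numpy as np
import si_prefix
from mss import mss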

@BoboTiG
Owner

BoboTiG commented Jan 20, 2026

Out of curiosity, do you plan to add more stuff in that PR? Wondering if you keep it as a draft for a specific reason :)

@jholveck
Contributor Author

jholveck commented Jan 21, 2026

Out of curiosity, do you plan to add more stuff in that PR? Wondering if you keep it as a draft for a specific reason :)

Not specifically, but I've asked my colleague, @bboudaoud-nv, to review this. He's worked with a number of internal devs who use MSS and other libraries for video encoding, so I wanted to get any insights he might have. He's already provided comments on the simple demo; I've asked him to look at the full one too.

Once he's done his review, I expect that I'll be ready for your final review and commit.

Edit: Actually, I'll probably also fix what's needed to make it Windows-compatible, as he mentioned in his review comments.

@BoboTiG
Owner

BoboTiG commented Jan 21, 2026

Overall, it works great! I see a small difference in colors, on Linux, but I guess this is due to the JPEG compression.

You might try turning on DISPLAY_IS_SRGB.

Actually, DISPLAY_IS_SRGB being True or False changes nothing, and that's OK; I was just noting the fact. No need to provide a script to handle all cases.

# Timestamps (PTS/DTS)
# --------------------
#
# Every frame has a *presentation timestamp* (PTS): when the viewer should see it.

@bboudaoud-nv bboudaoud-nv Jan 21, 2026


Might be worth noting a PTS is an integer value scaled by the time base (and maybe introduce the time base before this section for logical flow).
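For example, a tiny sketch of that relationship (the 1/90000 time base is just a common illustration, not the demo's value):

from fractions import Fraction

time_base = Fraction(1, 90_000)   # each tick is 1/90000 of a second
t_seconds = 2.5                   # when the viewer should see the frame
pts = int(t_seconds / time_base)  # 225000 ticks
assert float(pts * time_base) == 2.5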

# Constant Frame Rate (CFR) and Variable Frame Rate (VFR)
# -------------------------------------------------------
#
# Many video files run at a fixed frame rate, like 30 fps. Each frame is shown at 1/30 sec intervals. This is called


Might be good to point out that in cases like this the time base can just be the frame time (1/fps), and then the PTS values are simply integers increasing by 1 at each frame.
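A quick sketch of that CFR special case (illustrative values only):

from fractions import Fraction

fps = 30
time_base = Fraction(1, fps)                          # one tick per frame
pts_values = list(range(5))                           # 0, 1, 2, 3, 4
seconds = [float(p * time_base) for p in pts_values]  # 0.0, 0.0333..., 0.0667..., 0.1, 0.1333...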


# Keep running this loop until the main thread says we should stop.
while not shutdown_requested.is_set():
    # Wait until we're ready. This should, ideally, happen every 1/fps second.


Same as in the other script, I'd be tempted to capture, then sleep for the remaining time, to retain per-call perf / avoid dropping below the target rate.

Contributor Author


I don't quite get that concept. Can you elaborate on the difference?

One reason for doing the sleep right before the capture is because the capture is what we want to do at precise intervals. Putting the sleep here means that jitter in the time taken by other steps (such as blocking in the yield for the mailbox to become empty) doesn't translate into the capture interval.

I'm still not clear on why you are saying that this would impact per-call perf and slow the overall rate. If we were sleeping 1/30 sec each time, I'd agree with you, but here we're sleeping until 1/30 sec since the previous frame's target time. Is there something I'm missing?


This is the same idea as above: the goal for video is to have each frame be "captured" 1/fps apart. Falling behind by one frame and then getting many short frames to catch up is less desirable than simply resuming regularly spaced frames afterward. As you point out, technically, which side of the capture this is on isn't all that important. But if you want to maintain frame rate as best as possible, I'd do something like:

dt = 1 / fps
while not shutdown_requested.is_set():
    now = time.time()
    screenshot = sct.grab()
    dur = time.time() - now  # how long the grab itself took
    yield screenshot, now
    remaining = dt - dur
    if remaining > 0:
        time.sleep(remaining)

This way each frame is timed independently and one long frame doesn't impact other frames (e.g., the fps will resume, not "catch up", on the other side of the hitch).


One thing I'm assuming here is that there is more runtime variation in the sct.grab() call duration than there is in the sleep() call here. That may not be true on all platforms.

Comment on lines +236 to +238
ndarray = np.frombuffer(screenshot.bgra, dtype=np.uint8)
ndarray = ndarray.reshape(screenshot.height, screenshot.width, 4)
frame = av.VideoFrame.from_numpy_buffer(ndarray, format="bgra")


I'm not 100% sure what the type of screenshot.bgra is but you can do direct plane assignment to speed this up a bit/avoid the av.VideoFrame.from_[x] call.

This is an option; a more performant model might be something like:

frame = av.VideoFrame(format="bgr24")
frame.width = screenshot.width
frame.height = screenshot.height
frame.planes[0] = screenshot.bgra[:,:,0] (or similar)
frame.planes[1] = screenshot.bgra[:,:,1]
...

If you can do this, you avoid the conversion to/back from the np array and speed up encode slightly.

This doesn't allow things like reformatting or using the numpy conversion, but is the fastest model I've seen up to now.

Contributor Author


To answer your question about the type, it's bytes. (I'm looking at making it a memoryview in the future to avoid the blit.) It's just a packed linear array of BGRA bytes. We do also have other ways to access the data: for instance, using the NumPy-compatible __array_interface__ will give a HWC shape (without a copy; it's just giving that as the array shape specification).

frame.planes is a tuple; you can't assign to its elements. You can't assign to frame.planes itself either. In fact, planes is a @property that will create a new tuple of new VideoPlane objects each time you access it. (I've checked in PyAV 15.1.0 and 16.1.0, and a quick look at the git history seems to confirm that it's been read-only for over seven years.)

Additionally, bgr24 is a packed, not planar, format. It expects a single plane of BGR packed pixels, not three planes. gbrp is the usual planar RGB format, although planar formats are much more common for YUV.

In [15]: av.VideoFormat("bgr24").is_planar
Out[15]: False

In [16]: frame = av.VideoFrame.from_image(Image.new(mode="RGB", size=(16, 16), color="red"))

In [17]: frame.format
Out[17]: <av.VideoFormat rgb24, 16x16>

In [18]: frame.planes
Out[18]: (<av.VideoPlane 768 bytes; buffer_ptr=0x279a000; at 0x7fb7c6c08450>,)

In [19]: list(memoryview(frame.planes[0])[:10])
Out[19]: [255, 0, 0, 255, 0, 0, 255, 0, 0, 255]

I do have vague memories of creating an empty VideoFrame and populating the plane, but I can't find those experiments (I probably deleted them at some point), and I can't remember the specifics of how I did it; possibly using copy_bytes_to_plane. I did find, to my surprise at the time, that it was slower than the alternatives.

The main reason I'm using NumPy as an intermediary is to avoid copying the data at all. While from_ndarray will blit the data, the undocumented from_numpy_buffer will reuse pointers into the ndarray. I haven't found any other way to avoid copying the pixel data when using PyAV. A 4k screenshot is a 32 MB buffer, and it takes about 3.12 ms just to memcpy that much data (on my Ryzen 7 2700X with DDR4-2133). Additionally, even with non-temporal stores (which Python won't be using for a list slicing operation), it's still going to blow away even the L3 cache (or whatever your LLC is), and hog the DRAM bandwidth from all the other cores. When using from_numpy_buffer, it can become just a pointer copy.

I did verify that using from_numpy_buffer is faster than even using VideoFrame.from_bytes directly (which we can't actually use because it only supports RGBA - not BGRA - and Bayer formats, but it's the lowest-overhead mechanism other than from_numpy_buffer).

The easiest way to test this is in video-capture-simple.py: since that's single-threaded, the time spent creating the frame shows up very directly in the FPS numbers. It's easy enough to test different ways of creating the VideoFrame there.
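For instance, a standalone micro-benchmark along these lines works too (a sketch; numbers vary by machine, and from_numpy_buffer is the undocumented call discussed above):

import time

import av
import numpy as np

h, w = 2160, 3840                                          # a 4k BGRA frame, ~32 MB
bgra = np.zeros((h, w, 4), dtype=np.uint8)

t0 = time.perf_counter()
for _ in range(50):
    av.VideoFrame.from_ndarray(bgra, format="bgra")        # blits the pixel data
t1 = time.perf_counter()
for _ in range(50):
    av.VideoFrame.from_numpy_buffer(bgra, format="bgra")   # reuses the ndarray's memory
t2 = time.perf_counter()
print(f"from_ndarray:      {(t1 - t0) / 50 * 1e3:.2f} ms/frame")
print(f"from_numpy_buffer: {(t2 - t1) / 50 * 1e3:.2f} ms/frame")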

This can really be a significant part of the time, and this is how we'll be recommending end users use MSS, so it's worth making sure that we've got it right. We want our end users to get the best performance they can.

Do you have the code that you're thinking of, to verify the mechanism?


@bboudaoud-nv bboudaoud-nv Jan 22, 2026


The exact details aren't hanging around too well, but I recall something like:

# This is a 4 channel bgra image as uint8
image: ndarray 
# Create the frame and update its planes
frame = VideoFrame(width=w, height=h, format="bgra")
frame.planes[0].update(image.tobytes())

Contributor Author


Ah, gotcha. Yeah, that incurs a memcpy, which is what I was trying to avoid. I'll test to be sure, since this is an important thing to accurately present to end users. One of the few things in this whole demo that's actually directly related to MSS is showing the best way to transfer pixel data from MSS to PyAV and other libraries!

if first_frame_at is None:
    first_frame_at = timestamp
frame.pts = int((timestamp - first_frame_at) / TIME_BASE)
frame.time_base = TIME_BASE


I don't believe you need to set the individual frame time bases here, just the PTS value. In fact I think you may even be able to set the frame.time and let it handle the conversion to PTS here, but I haven't tried that lately.

# The rate= parameter here is just the nominal frame rate: some tools (like file browsers) might display
# this as the frame rate. But we actually control timing via the pts and time_base values on the frames
# themselves.
video_stream = avmux.add_stream(codec, rate=fps, options=CODEC_OPTIONS)


The use of the rate arg here with the fps may in some cases set a different time base/assume a CFR encoding. I tend to avoid rate here as it's not required.

Contributor Author


Can you elaborate on those circumstances? I haven't seen that happen when I also provide an explicit time base on the stream, as long as I assign the time base after setting the rate.

As the comment mentions, the rate= parameter does set a nominal frame rate for the file, which some tools use for quick display or estimation, or to convert to CFR. It seems to make sense to include that in the metadata if possible. But if it causes problems, then yeah, I'll eliminate it.
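Roughly, the ordering I'm describing is (a sketch; TIME_BASE and fps are the values defined elsewhere in the demo):

video_stream = avmux.add_stream(codec, rate=fps, options=CODEC_OPTIONS)
video_stream.time_base = TIME_BASE  # assigned after add_stream(), per the ordering above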


I think this is more confusing to folks than actually a problem. I believe what happens here is the time base gets set using rate, then this just gets overridden later in the process, and PTS values are assigned to each frame. I'm not sure if adding rate to this constructor has any other effect on the encoded video.

stage_mux.start()
stage_show_stats.start()
stage_video_process.start()
stage_video_encode.start()

@bboudaoud-nv bboudaoud-nv Jan 21, 2026


In theory these could probably be a single thread. You do get some speedup here, but I'd guess encode is the cycle hog and the rest are fairly fast (process may have a bit less work to do if you directly access video frame planes). Definitely not required; more a readability suggestion than a perf one (this version is better for perf).

Contributor Author

@jholveck jholveck Jan 22, 2026


Yes, encode is definitely the bottleneck. But the performance difference multithreading makes can be significant, in the range of 10-25%.

Capturing a 4k screen on my home computer while idle:

  • libx264, single-threaded: 13.5 fps
  • libx264, multi-threaded: 15.3 fps (13% improvement over single-threaded)
  • h264_nvenc, single-threaded: 35.6 fps
  • h264_nvenc, multi-threaded: 43.5 fps (22% improvement over single-threaded)

The video-capture-simple.py demo is single-threaded, for readability. The video-capture.py demo is multi-threaded, to show best practices for serious applications.

Edit: The reason I separate encode from the others is because it's the long pole. The reason I don't combine capture and process is to have one thread, exclusively used for capture, to help minimize the timing variance. The reason I have a separate muxing stage is because an earlier version of the code had an audio stream also feeding the same mux stage, and it seems like something end users might want to add themselves. (I don't have an audio stream in this demo because audio capture is quite platform-specific.) But really, there are certainly many ways to structure the pipeline that would work equally well.

I've thought about adding some code to record the percentage of wall time that a thread is processing, blocked on send, and blocked on receive. That would give a bit more information about structuring things. But right now, the CPU% is a reasonable proxy for that, so I just print the thread IDs so you can run top -H and see.
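The thread-ID printing itself is tiny, something along these lines (a sketch, not necessarily the demo's exact code):

import threading

# Printed at the start of each stage so it can be matched against `top -H` output.
print(f"{threading.current_thread().name}: native thread id {threading.get_native_id()}")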

yield video_stream.encode(frame)
# Our input has run out. Flush the frames that the encoder still is holding internally (such as to compute
# B-frames).
yield video_stream.encode(None)


You want to be careful w/ flushing the stream here: IIRC, if you just run out of frames and call this flush midstream, it may write some sort of end-of-file signal to the video. May want to trigger this not when there are no frames, but on a shutdown signal?

Contributor Author


If we run out of frames, then by definition, we're not midstream. We're at the end of the video stream. The for loop doesn't exit until the source exits and closes the mailbox. In other words, the source's StopIteration is acting as a shutdown signal.

IOW: if the next frame isn't ready at the top of the for loop, then the input iterator will block, not terminate.


Ah okay, I think I'm not totally up on how these threads are coordinated then. I guess this gets triggered by the end of the process call here, and the in mailbox cannot be empty during normal operation?

Contributor Author


It can be empty, but in that case the input iterator will block until something does come available. The comments at the top of common/pipeline.py explain that part.

It's similar to doing something like for line in input_file: when input_file is a socket or pipe or something. If there's presently no data (but more is presumably on the way), it'll just sit there and wait for more data to arrive. It doesn't end the iteration until the sender signals that no more is forthcoming.
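If it helps, here's a generic sketch of that pattern (not the demo's actual common/pipeline.py, just the idea): a mailbox-backed iterator that blocks when empty and only stops once the sender posts a sentinel:

import queue

_DONE = object()  # sentinel posted by the sender when no more items are coming

def mailbox_iter(mailbox: queue.Queue):
    while True:
        item = mailbox.get()   # blocks while the mailbox is empty
        if item is _DONE:
            return             # ends the iteration (StopIteration for the consumer)
        yield item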

Comment on lines +176 to +177
while (now := time.monotonic()) < next_frame_at:
    time.sleep(next_frame_at - now)


This seems biased towards oversleeping the goal. Not a problem for relaxed CFR video, but certainly something to consider. On many platforms a threaded sleep is a promise to sleep for "at least [X]", but not exactly/under [X]. For example, if you sleep to within, say, < 1 ms of the goal, you might just wanna call that good enough here.

Contributor Author


Yeah, I thought about your precision sleep library when I wrote that. I decided it wasn’t worth the added complexity for this purpose. But to speak to your point, I'm afraid I'll need to get a little pedantic.

I believe we've already accounted for bias, although not variance. This loop structure is designed to eliminate bias (since we only care about the capture frequency, not phase).

Suppose the sleep implementation (just as an example) tends to oversleep by a mean of 2 ms, with a standard deviation of 0 ms: it’s all bias, no variance.

The first loop iteration, let’s say, starts at midnight. Since next_frame_at increments by 1/30 sec each frame, regardless of how long anything in the loop took (including the sleep itself), it will schedule frames at 0 s, 0.033 s, 0.067 s, and 0.100 s.

Now, let’s look at when the sleeps exit, and the captures happen. The sleep implementation doesn’t return until 2 ms after it’s scheduled. That means the captures will be at 0.002 s, 0.035 s, 0.069 s, and 0.102 s.

These are still at 1/30 sec intervals. They’re not the exact times that we sent to sleep, but they’re still at the desired frequency, just at a different phase.

Now, let’s add in variance. By definition, the variance term averages to 0. For now, let’s assume that the sleep still has a mean of 2 ms, but now its variance is always ±1 ms (easier to write the numbers than σ²=1 ms). It can overshoot by 1 or 3 ms. So what happens then?

We still are using the same sequence of deadlines, of course: 0 s, 0.033 s, 0.067 s, 0.100 s. Now, the sleep exits at 0.003 s, 0.034 s, 0.070 s, and 0.101 s.

Our frames are still scheduled at a mean frequency of exactly 1/30 sec, although now with a frequency error of 0 ± 1 ms: jitter. Unfortunately, that’s not something that we can get rid of without a precision sleep implementation like yours. However, that timing noise is a lot less of a problem than a systematic error, since that would cause issues with A/V sync, seeking, etc.

I believe the loop structure — incrementing the target time by 1/30 sec each iteration — eliminates the systematic error. There’s no bias in the frequency. This is different than the most common naïve implementations I’ve seen: either adding 1/30 sec to the previous loop’s actual start time (now in our code), or worse yet, sleeping 1/30 sec each loop. Both of those will show a bias in the frame timings.

If this code does still allow a bias, a systematic error that has a non-zero mean, then I'd want to look for ways to eliminate that. But I think the present implementation, as it stands, does eliminate the bias, and ensures that any timing errors in sleep only get propagated to the frame timings with their variance, not bias.
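A toy back-of-the-envelope version of that argument, in case it helps (pure arithmetic, no real sleeping; a sleep that is always 2 ms late shifts the phase but not the frequency):

frame_interval = 1 / 30
oversleep = 0.002
deadlines = [i * frame_interval for i in range(5)]           # 0, 0.0333, 0.0667, 0.1, 0.1333
captures = [d + oversleep for d in deadlines]                # 0.002, 0.0353, 0.0687, ...
intervals = [b - a for a, b in zip(captures, captures[1:])]
print(intervals)  # every interval is still ~0.0333 s: a constant phase shift, no frequency bias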

@bboudaoud-nv

I'm definitely picking nits at this point, this code is very well written and should be a nice example of how to (actually) screen capture video with pyav as opposed to the weird muxing example they provide in their docs!

@jholveck
Contributor Author

I'm definitely picking nits at this point, this code is very well written and should be a nice example of how to (actually) screen capture video with pyav as opposed to the weird muxing example they provide in their docs!

I'm so glad to hear it! Thank you for your diligent review! At this point, I think I'll review your comments again and see if there are ways I can improve the comments to explain some of the sticking points, and then @BoboTiG can merge it.
