Conversation

@jholveck
Contributor

@jholveck jholveck commented Jan 17, 2026

Changes proposed in this PR

It seems that there are a number of users who use MSS for video encoding. Add a demo showing how to do this.

This again highlights multithreaded pipelining, like the TinyTV demo, but is more accessible (as few users have a TinyTV). It also shows a number of pitfalls that are common when encoding video, such as failing to correctly record timestamps.

  • Tests added/updated - N/A
  • Documentation updated
  • Changelog entry added
  • ./check.sh passed

I'll probably break this into a simple and advanced version too.
I may have to take out the audio code.

This also currently uses some of my work in the (unmerged) feat-buffer
branch, so I'll need to switch it to use what's available now.
Very much incomplete, sometimes stopping mid-sentence.  But I've
written enough that I don't want to lose it, so here's an intermediate
commit.
Also reformat the comments.
@BoboTiG
Owner

BoboTiG commented Jan 17, 2026

Nice!!

@BoboTiG
Owner

BoboTiG commented Jan 17, 2026

This is really good information here, priceless!

If I wanted to use h265, is it as easy as setting "h265", or does it need special handling?

@jholveck
Contributor Author

You could pass --codec libx265 on the command line. The thing you use there is the same as ffmpeg's -c:v flag.

You can get a list of available codecs in the PyAV build using python3 -m av --codecs; libx265 is among those.

You'd need to comment out the "profile":"high" in CODEC_OPTIONS: libx265 doesn't recognize the high profile. Most, if not all, of the features in the H.264 "high" profile are already part of the H.265 "main" (default) profile.

You can look at other flags for libx265 using ffmpeg --help encoder=libx265, if your ffmpeg build has libx265 compiled in. The libx265 library is GPL-only, so some builds might not include it, but the one included in PyAV does.
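For reference, a minimal sketch of that swap (the option values here are illustrative, not necessarily what the demo ships with):

# Hypothetical H.265 setup; see `ffmpeg -help encoder=libx265` for the real option names.
codec = "libx265"  # same effect as passing --codec libx265 on the command line
CODEC_OPTIONS = {
    # "profile": "high",  # libx264-only; libx265's default "main" profile already covers these features
    "preset": "fast",     # x265 presets mirror x264's ultrafast..placebo scale
    "crf": "28",          # x265's default; lower means better quality and a bigger file
}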

I'll add some comments to this effect.

@bboudaoud-nv

This is really good information here, priceless!

If I wanted to use h265, is it as easy as setting "h265", or does it need special handling?

@BoboTiG codec names can be a bit finicky here. IIRC on Linux x265 should work for encode. On Windows I believe it's hevc. The optional parameters passed in, as well as the supported encoding formats for frames, may change with the codec depending on the system, though. Currently h264 encoding is definitely the safe choice for multi-platform.

You could also experiment with codec names like mpeg or h264 to see if you get better multiplatform support here.
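If it helps, one way to probe that at runtime might be something like this (a sketch; the candidate list is only an example, not a recommended order):

import av

def pick_codec(candidates=("h264_nvenc", "hevc", "libx265", "libx264", "mpeg4")):
    # Return the first codec name this PyAV build can open for encoding ("w").
    for name in candidates:
        try:
            av.Codec(name, "w")  # raises if the codec isn't available as an encoder
        except Exception:
            continue
        return name
    raise RuntimeError("no usable encoder found among the candidates")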

@BoboTiG
Owner

BoboTiG commented Jan 20, 2026

Currently h264 encoding is definitely the safe choice for multi-platform.

Fully agree on that.

Thank you for the useful review :)

@BoboTiG
Owner

BoboTiG commented Jan 20, 2026

@jholveck should we do checks for third-party modules?

I was testing and was missing multiple modules; I needed to look them up on PyPI, a process which could be improved.

Should we check at import time, something like this?

import sys

try:
    import av
except ImportError:
    print("The PyAV module is missing, run: `python -m pip install av`")
    sys.exit(1)

try:
    import si_prefix
except ImportError:
    print("The si-prefix module is missing, run: `python -m pip install si-prefix`")
    sys.exit(1)

This feels like a horrible solution haha, I'm simply putting words on an idea.

@BoboTiG
Owner

BoboTiG commented Jan 20, 2026

Overall, it works great! I see a small difference in colors, on Linux, but I guess this is due to the JPEG compression.

@jholveck
Contributor Author

Overall, it works great! I see a small difference in colors, on Linux, but I guess this is due to the JPEG compression.

You might try turning on DISPLAY_IS_SRGB.

@jholveck
Contributor Author

@jholveck should we do checks for third-party modules?

I think that doing explicit checks like that might be more mess than success. But I can easily add a comment above where we import third-party modules giving the right pip command to install them.

@jholveck
Contributor Author

Overall, it works great! I see a small difference in colors, on Linux, but I guess this is due to the JPEG compression.
You might try turning on DISPLAY_IS_SRGB.

By the way, this relates to #207. I think that tagging screenshots with the display's colorspace would be a useful future addition to MSS, for just this sort of thing.

@BoboTiG
Owner

BoboTiG commented Jan 20, 2026

@jholveck should we do checks for third-party modules?

I think that doing explicit checks like that might be more mess than success. But I can easily add a comment above where we import third-party modules giving the right pip command to install them.

Yes, let's keep it simple then: at the top of the file, one line to install everything.
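Something like this at the top of each demo, then (the module list is a guess at what the demos import; adjust as needed):

# Third-party requirements for this demo. Install them all in one go with:
#   python -m pip install mss av numpy si-prefix
import av
import numpy as np
import si_prefix
from mss import mss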

@BoboTiG
Owner

BoboTiG commented Jan 20, 2026

Out of curiosity, do you plan to add more stuff in that PR? Wondering if you keep it as a draft for a specific reason :)

@jholveck
Contributor Author

jholveck commented Jan 21, 2026

Out of curiosity, do you plan to add more stuff in that PR? Wondering if you keep it as a draft for a specific reason :)

Not specifically, but I've asked my colleague, @bboudaoud-nv, to review this. He's worked with a number of internal devs who use MSS and other libraries for video encoding, so I wanted to get any insights he might have. He's already provided comments on the simple demo; I've asked him to look at the full one too.

Once he's done his review, I expect that I'll be ready for your final review and commit.

Edit: Actually, I'll probably also fix what's needed to make it Windows-compatible, as he mentioned in his review comments.

@BoboTiG
Owner

BoboTiG commented Jan 21, 2026

Overall, it works great! I see a small difference in colors, on Linux, but I guess this is due to the JPEG compression.

You might try turning on DISPLAY_IS_SRGB.

Actually, DISPLAY_IS_SRGB being True or False changes nothing, and that's OK; I was just noting the fact. No need to provide a script to handle all cases.

# Timestamps (PTS/DTS)
# --------------------
#
# Every frame has a *presentation timestamp* (PTS): when the viewer should see it.

@bboudaoud-nv bboudaoud-nv Jan 21, 2026


Might be worth noting a PTS is an integer value scaled by the time base (and maybe introduce the time base before this section for logical flow).
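For example, a tiny sketch of that relationship (the 1/90000 time base is just a common illustration, not the demo's value):

from fractions import Fraction

time_base = Fraction(1, 90_000)   # each tick is 1/90000 of a second
t_seconds = 2.5                   # when the viewer should see the frame
pts = int(t_seconds / time_base)  # 225000 ticks
assert float(pts * time_base) == 2.5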

# Constant Frame Rate (CFR) and Variable Frame Rate (VFR)
# -------------------------------------------------------
#
# Many video files run at a fixed frame rate, like 30 fps. Each frame is shown at 1/30 sec intervals. This is called


Might be good to point out that in cases like this the time base can just be the frame time (1/fps), and then the PTS values are simply integers increasing by 1 at each frame.
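A quick sketch of that CFR special case (illustrative values only):

from fractions import Fraction

fps = 30
time_base = Fraction(1, fps)                          # one tick per frame
pts_values = list(range(5))                           # 0, 1, 2, 3, 4
seconds = [float(p * time_base) for p in pts_values]  # 0.0, 0.0333..., 0.0667..., 0.1, 0.1333...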


# Keep running this loop until the main thread says we should stop.
while not shutdown_requested.is_set():
    # Wait until we're ready. This should, ideally, happen every 1/fps second.


Same as in the other script, I'd be tempted to capture, then sleep for the remaining time, to retain per-call perf / avoid dropping below the target rate.

Contributor Author


I don't quite get that concept. Can you elaborate on the difference?

One reason for doing the sleep right before the capture is because the capture is what we want to do at precise intervals. Putting the sleep here means that jitter in the time taken by other steps (such as blocking in the yield for the mailbox to become empty) doesn't translate into the capture interval.

I'm still not clear on why you are saying that this would impact per-call perf and slow the overall rate. If we were sleeping 1/30 sec each time, I'd agree with you, but here we're sleeping until 1/30 sec since the previous frame's target time. Is there something I'm missing?


This is the same idea as above: the goal for video is to have each frame be "captured" 1/fps apart. Falling behind by one frame and then getting many short frames to catch up is less desirable than simply resuming regularly spaced frames afterward. As you point out, technically, which side of the capture this is on isn't all that important. But if you want to maintain frame rate as best as possible, I'd do something like:

dt = 1 / fps
while not shutdown_requested.is_set():
    now = time.time()
    screenshot = sct.grab()
    dur = time.time() - now  # how long the grab itself took
    yield screenshot, now
    remaining = dt - dur
    if remaining > 0:
        time.sleep(remaining)

This way each frame is timed independently and one long frame doesn't impact other frames (e.g., the fps will resume, not "catch up", on the other side of the hitch).


One thing I'm assuming here is that there is more runtime variation in the sct.grab() call duration than there is in the sleep() call here. That may not be true on all platforms.

Comment on lines +236 to +238
ndarray = np.frombuffer(screenshot.bgra, dtype=np.uint8)
ndarray = ndarray.reshape(screenshot.height, screenshot.width, 4)
frame = av.VideoFrame.from_numpy_buffer(ndarray, format="bgra")


I'm not 100% sure what the type of screenshot.bgra is but you can do direct plane assignment to speed this up a bit/avoid the av.VideoFrame.from_[x] call.

This is an option; a more performant model might be something like:

frame = av.VideoFrame(format="bgr24")
frame.width = screenshot.width
frame.height = screenshot.height
frame.planes[0] = screenshot.bgra[:,:,0] (or similar)
frame.planes[1] = screenshot.bgra[:,:,1]
...

If you can do this, you avoid the conversion to/back from the np array and speed up encode slightly.

This doesn't allow things like reformatting or using the numpy conversion, but is the fastest model I've seen up to now.

Contributor Author


To answer your question about the type, it's bytes. (I'm looking at making it a memoryview in the future to avoid the blit.) It's just a packed linear array of BGRA bytes. We do also have other ways to access the data: for instance, using the NumPy-compatible __array_interface__ will give a HWC shape (without a copy; it's just giving that as the array shape specification).

frame.planes is a tuple; you can't assign to its elements. You can't assign to frame.planes itself either. In fact, planes is a @property that will create a new tuple of new VideoPlane objects each time you access it. (I've checked in PyAV 15.1.0 and 16.1.0, and a quick look at the git history seems to confirm that it's been read-only for over seven years.)

Additionally, bgr24 is a packed, not planar, format. It expects a single plane of BGR packed pixels, not three planes. gbrp is the usual planar RGB format, although planar formats are much more common for YUV.

In [15]: av.VideoFormat("bgr24").is_planar
Out[15]: False

In [16]: frame = av.VideoFrame.from_image(Image.new(mode="RGB", size=(16, 16), color="red"))

In [17]: frame.format
Out[17]: <av.VideoFormat rgb24, 16x16>

In [18]: frame.planes
Out[18]: (<av.VideoPlane 768 bytes; buffer_ptr=0x279a000; at 0x7fb7c6c08450>,)

In [19]: list(memoryview(frame.planes[0])[:10])
Out[19]: [255, 0, 0, 255, 0, 0, 255, 0, 0, 255]

I do have vague memories of creating an empty VideoFrame and populating the plane, but I can't find those experiments (I probably deleted them at some point), and I can't remember the specifics of how I did it; possibly using copy_bytes_to_plane. I did find, to my surprise at the time, that it was slower than the alternatives.

The main reason I'm using NumPy as an intermediary is to avoid copying the data at all. While from_ndarray will blit the data, the undocumented from_numpy_buffer will reuse pointers into the ndarray. I haven't found any other way to avoid copying the pixel data when using PyAV. A 4k screenshot is a 32 MB buffer, and it takes about 3.12 ms just to memcpy that much data (on my Ryzen 7 2700X with DDR4-2133). Additionally, even with non-temporal stores (which Python won't be using for a list slicing operation), it's still going to blow away even the L3 cache (or whatever your LLC is), and hog the DRAM bandwidth from all the other cores. When using from_numpy_buffer, it can become just a pointer copy.

I did verify that using from_numpy_buffer is faster than even using VideoFrame.from_bytes directly (which we can't actually use because it only supports RGBA - not BGRA - and Bayer formats, but it's the lowest-overhead mechanism other than from_numpy_buffer).

The easiest way to test this is in video-capture-simple.py: since that's single-threaded, the time spent creating the frame shows up very directly in the FPS numbers. It's easy enough to test different ways of creating the VideoFrame there.
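For instance, a standalone micro-benchmark along these lines works too (a sketch; numbers vary by machine, and from_numpy_buffer is the undocumented call discussed above):

import time

import av
import numpy as np

h, w = 2160, 3840                                          # a 4k BGRA frame, ~32 MB
bgra = np.zeros((h, w, 4), dtype=np.uint8)

t0 = time.perf_counter()
for _ in range(50):
    av.VideoFrame.from_ndarray(bgra, format="bgra")        # blits the pixel data
t1 = time.perf_counter()
for _ in range(50):
    av.VideoFrame.from_numpy_buffer(bgra, format="bgra")   # reuses the ndarray's memory
t2 = time.perf_counter()
print(f"from_ndarray:      {(t1 - t0) / 50 * 1e3:.2f} ms/frame")
print(f"from_numpy_buffer: {(t2 - t1) / 50 * 1e3:.2f} ms/frame")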

This can really be a significant part of the time, and this is how we'll be recommending end users use MSS, so it's worth making sure that we've got it right. We want our end users to get the best performance they can.

Do you have the code that you're thinking of, to verify the mechanism?


@bboudaoud-nv bboudaoud-nv Jan 22, 2026


The exact details aren't hanging around too well, but I recall something like:

# This is a 4 channel bgra image as uint8
image: ndarray 
# Create the frame and update its planes
frame = VideoFrame(width=w, height=h, format="bgra")
frame.planes[0].update(image.tobytes())

Contributor Author


Ah, gotcha. Yeah, that incurs a memcpy, which is what I was trying to avoid. I'll test to be sure, since this is an important thing to accurately present to end users. One of the few things in this whole demo that's actually directly related to MSS is showing the best way to transfer pixel data from MSS to PyAV and other libraries!

if first_frame_at is None:
    first_frame_at = timestamp
frame.pts = int((timestamp - first_frame_at) / TIME_BASE)
frame.time_base = TIME_BASE


I don't believe you need to set the individual frame time bases here, just the PTS value. In fact I think you may even be able to set the frame.time and let it handle the conversion to PTS here, but I haven't tried that lately.

# The rate= parameter here is just the nominal frame rate: some tools (like file browsers) might display
# this as the frame rate. But we actually control timing via the pts and time_base values on the frames
# themselves.
video_stream = avmux.add_stream(codec, rate=fps, options=CODEC_OPTIONS)


The use of the rate arg here with the fps may in some cases set a different time base/assume a CFR encoding. I tend to avoid rate here as it's not required.

Contributor Author


Can you elaborate on those circumstances? I haven't seen that happen when I also provide an explicit time base on the stream, as long as I assign the time base after setting the rate.

As the comment mentions, the rate= parameter does set a nominal frame rate for the file, which some tools use for quick display or estimation, or to convert to CFR. It seems to make sense to include that in the metadata if possible. But if it causes problems, then yeah, I'll eliminate it.
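Roughly, the ordering I'm describing is (a sketch; TIME_BASE and fps are the values defined elsewhere in the demo):

video_stream = avmux.add_stream(codec, rate=fps, options=CODEC_OPTIONS)
video_stream.time_base = TIME_BASE  # assigned after add_stream(), per the ordering above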


I think this is more confusing to folks than actually a problem. I believe what happens here is the time base gets set using rate, then this just gets overridden later in the process, and PTS values are assigned to each frame. I'm not sure if adding rate to this constructor has any other effect on the encoded video.

stage_mux.start()
stage_show_stats.start()
stage_video_process.start()
stage_video_encode.start()

@bboudaoud-nv bboudaoud-nv Jan 21, 2026


In theory these could probably be a single thread. You do get some speedup here, but I'd guess encode is the cycle hog and the rest are fairly fast (process may have a bit less work to do if you directly access video frame planes). Definitely not required; more a readability suggestion than a perf one (this version is better for perf).

Contributor Author

@jholveck jholveck Jan 22, 2026


Yes, encode is definitely the bottleneck. But the performance difference multithreading makes can be significant, in the range of 10-25%.

Capturing a 4k screen on my home computer while idle:

  • libx264, single-threaded: 13.5 fps
  • libx264, multi-threaded: 15.3 fps (13% improvement over single-threaded)
  • h264_nvenc, single-threaded: 35.6 fps
  • h264_nvenc, multi-threaded: 43.5 fps (22% improvement over single-threaded)

The video-capture-simple.py demo is single-threaded, for readability. The video-capture.py demo is multi-threaded, to show best practices for serious applications.

Edit: The reason I separate encode from the others is because it's the long pole. The reason I don't combine capture and process is to have one thread, exclusively used for capture, to help minimize the timing variance. The reason I have a separate muxing stage is because an earlier version of the code had an audio stream also feeding the same mux stage, and it seems like something end users might want to add themselves. (I don't have an audio stream in this demo because audio capture is quite platform-specific.) But really, there are certainly many ways to structure the pipeline that would work equally well.

I've thought about adding some code to record the percentage of wall time that a thread is processing, blocked on send, and blocked on receive. That would give a bit more information about structuring things. But right now, the CPU% is a reasonable proxy for that, so I just print the thread IDs so you can run top -H and see.
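The thread-ID printing itself is tiny, something along these lines (a sketch, not necessarily the demo's exact code):

import threading

# Printed at the start of each stage so it can be matched against `top -H` output.
print(f"{threading.current_thread().name}: native thread id {threading.get_native_id()}")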

yield video_stream.encode(frame)
# Our input has run out. Flush the frames that the encoder still is holding internally (such as to compute
# B-frames).
yield video_stream.encode(None)


You want to be careful w/ flushing the stream here: IIRC, if you just run out of frames and call this flush midstream, it may write some sort of end-of-file signal to the video. May want to trigger this not when there are no frames, but on a shutdown signal?

Contributor Author


If we run out of frames, then by definition, we're not midstream. We're at the end of the video stream. The for loop doesn't exit until the source exits and closes the mailbox. In other words, the source's StopIteration is acting as a shutdown signal.

IOW: if the next frame isn't ready at the top of the for loop, then the input iterator will block, not terminate.


Ah okay, I think I'm not totally up on how these threads are coordinated then. I guess this gets triggered by the end of the process call here, and the in mailbox cannot be empty during normal operation?

Contributor Author


It can be empty, but in that case the input iterator will block until something does come available. The comments at the top of common/pipeline.py explain that part.

It's similar to doing something like for line in input_file: when input_file is a socket or pipe or something. If there's presently no data (but more is presumably on the way), it'll just sit there and wait for more data to arrive. It doesn't end the iteration until the sender signals that no more is forthcoming.
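If it helps, here's a generic sketch of that pattern (not the demo's actual common/pipeline.py, just the idea): a mailbox-backed iterator that blocks when empty and only stops once the sender posts a sentinel:

import queue

_DONE = object()  # sentinel posted by the sender when no more items are coming

def mailbox_iter(mailbox: queue.Queue):
    while True:
        item = mailbox.get()   # blocks while the mailbox is empty
        if item is _DONE:
            return             # ends the iteration (StopIteration for the consumer)
        yield item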

Comment on lines +176 to +177
while (now := time.monotonic()) < next_frame_at:
    time.sleep(next_frame_at - now)


This seems biased towards oversleeping the goal. Not a problem for relaxed CFR video, but certainly something to consider. On many platforms a threaded sleep is a promise to sleep for "at least [X]", but not exactly/under [X]. For example, if you sleep to within, say, < 1 ms of the goal, you might just wanna call that good enough here.

Contributor Author


Yeah, I thought about your precision sleep library when I wrote that. I decided it wasn’t worth the added complexity for this purpose. But to speak to your point, I'm afraid I'll need to get a little pedantic.

I believe we've already accounted for bias, although not variance. This loop structure is designed to eliminate bias (since we only care about the capture frequency, not phase).

Suppose the sleep implementation (just as an example) tends to oversleep by a mean of 2 ms, with a standard deviation of 0 ms: it’s all bias, no variance.

The first loop iteration, let’s say, starts at midnight. Since next_frame_at increments by 1/30 sec each frame, regardless of how long anything in the loop took (including the sleep itself), it will schedule frames at 0 s, 0.033 s, 0.067 s, and 0.100 s.

Now, let’s look at when the sleeps exit, and the captures happen. The sleep implementation doesn’t return until 2 ms after it’s scheduled. That means the captures will be at 0.002 s, 0.035 s, 0.069 s, and 0.102 s.

These are still at 1/30 sec intervals. They’re not the exact times that we sent to sleep, but they’re still at the desired frequency, just at a different phase.

Now, let’s add in variance. By definition, the variance term averages to 0. For now, let’s assume that the sleep still has a mean of 2 ms, but now its variance is always ±1 ms (easier to write the numbers than σ²=1 ms). It can overshoot by 1 or 3 ms. So what happens then?

We still are using the same sequence of deadlines, of course: 0 s, 0.033 s, 0.067 s, 0.100 s. Now, the sleep exits at 0.003 s, 0.034 s, 0.070 s, and 0.101 s.

Our frames are still scheduled at a mean frequency of exactly 1/30 sec, although now with a frequency error of 0 ± 1 ms: jitter. Unfortunately, that’s not something that we can get rid of without a precision sleep implementation like yours. However, that timing noise is a lot less of a problem than a systematic error, since that would cause issues with A/V sync, seeking, etc.

I believe the loop structure — incrementing the target time by 1/30 sec each iteration — eliminates the systematic error. There’s no bias in the frequency. This is different than the most common naïve implementations I’ve seen: either adding 1/30 sec to the previous loop’s actual start time (now in our code), or worse yet, sleeping 1/30 sec each loop. Both of those will show a bias in the frame timings.

If this code does still allow a bias, a systematic error that has a non-zero mean, then I'd want to look for ways to eliminate that. But I think the present implementation, as it stands, does eliminate the bias, and ensures that any timing errors in sleep only get propagated to the frame timings with their variance, not bias.
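A toy back-of-the-envelope version of that argument, in case it helps (pure arithmetic, no real sleeping; a sleep that is always 2 ms late shifts the phase but not the frequency):

frame_interval = 1 / 30
oversleep = 0.002
deadlines = [i * frame_interval for i in range(5)]           # 0, 0.0333, 0.0667, 0.1, 0.1333
captures = [d + oversleep for d in deadlines]                # 0.002, 0.0353, 0.0687, ...
intervals = [b - a for a, b in zip(captures, captures[1:])]
print(intervals)  # every interval is still ~0.0333 s: a constant phase shift, no frequency bias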

@bboudaoud-nv

I'm definitely picking nits at this point, this code is very well written and should be a nice example of how to (actually) screen capture video with pyav as opposed to the weird muxing example they provide in their docs!

@jholveck
Contributor Author

I'm definitely picking nits at this point, this code is very well written and should be a nice example of how to (actually) screen capture video with pyav as opposed to the weird muxing example they provide in their docs!

I'm so glad to hear it! Thank you for your diligent review! At this point, I think I'll review your comments again and see if there are ways I can improve the comments to explain some of the sticking points, and then @BoboTiG can merge it.
