I posted this on Facebook but thought I should post it here as well so it's easier to find/refer to in the future...
There was a question on another thread about the rendering performance differences between OSX, Windows, and Linux. The thread was locked before I could respond, but I thought I'd type up a response anyway. Assuming IDENTICAL hardware, OSX will render most older sequences just as fast as Windows. However, most "modern" sequences that use some of the more advanced techniques (shaders, videos) will likely render significantly faster on OSX than on Windows. There are several reasons:
- HUGE_PAGE/SUPER_PAGE - on Linux and OSX, the large blocks of memory that are used to store the sequence data can be stored using larger system pages (2MB vs 4K). This can significantly reduce pressure on the processor's page table cache (the TLB) and, according to some benchmarks, can help by around 1-2%. Windows requires admin privileges to use large pages, so we cannot use them there. Linux actually "wins" this one, as the kernel will auto-defrag memory to provide more large pages and will promote non-huge pages to huge pages where possible. (There's a rough code sketch of this after the list.)
- Compiler - the Microsoft compiler is NOT known for producing the fastest code, particularly in the area of auto-vectorization (there's a small example of the kind of loop this affects after the list). We recently switched from gcc to the Microsoft compiler on Windows due to issues/bugs in gcc on Windows; the stability/debuggability was more important. On OSX we use clang/llvm (which is what Xcode uses), which keeps improving steadily. Like the above, this is probably only a difference of a percent or two.
- Background rendering - this is a big one. MOST effects on all platforms can be rendered on the background threads and thus can be rendered in parallel across all the cores. However, in some cases we have to move the rendering to the main thread to avoid issues in libraries or other contention. On OSX, we don't have to move anything: every single effect is rendered on the background threads. On Windows (and Linux), the Shader effect is currently a main-thread-only effect. Thus, if you use a lot of shaders, those renders end up being serialized one after the other, whereas on OSX they can be rendered in parallel (there's a sketch of the thread hand-off after the list). This is particularly noticeable with complex shaders and if you have a good video card. We did try to get this working on Windows and it works "most of the time"; the problem is that when it doesn't work, the entire UI gets messed up. We may try to revisit this at some point this year. Linux is the big "loser" in this category as there are other effects (Text, Tendrils) that also need flipping to the main thread.
- Hardware video decoding - this is HUGE. On OSX, there is a very good, standard API for using the hardware to decode video, scale the frames to the needed size, and set the pixel format (RGBA vs whatever the video file uses). It's all done in hardware and only the resulting (usually smaller, non-HD) frame data is transferred from the hardware into xLights. This is very stable and works across ALL the Apple hardware. We tried to get it working on Windows, but it's extremely hit or miss (mostly miss): video driver problems, video format issues, and all kinds of other challenges. The ffmpeg APIs for hardware decoding mostly assume you will be using the result to display onto a screen, so getting the frame data back out for other uses is hard (there's a sketch of that pattern after the list). Anyway, this is the biggest performance gain. If you have lots of video effects, this can have a huge impact. The "Greatest Show" sequence for my show drops from over 4 minutes to render on my machine to roughly 2. I've seen some of Magical Light Shows' sequences have similar 50-60% drops. If you have render cache turned on, it may be a "one time" hit as the video effect would get cached, but the render cache will increase memory usage - see the next point. (Note: you can TRY turning on the hardware scaler on Windows as it is a setting. It may work OK for you. It might crash xLights. It might corrupt the video memory and require a reboot. Don't know.)
- Gigantic sequences/Memory pressure - MOST of you won't need to care about this section at all. OSX's built-in memory manager will do a lot of things, such as transparent compression of memory blocks, to keep things running that fail on the other platforms. I was working with one particular person in Nov/Dec who had a sequence where the raw sequence data was over 60GB in size (point of note: doing 20-30 minute long sequences when you have a HUGE channel count is not xLights' strong point at this point in time). His Windows machine with 64GB of RAM wouldn't even open the sequence; it crashed immediately as it couldn't allocate the needed block of memory. My 32GB OSX laptop opened it up and (slowly) rendered it. This is more pronounced with render cache turned on - the render cache CAN eat up a ton of memory. After rendering, my laptop had over 95GB of "memory" allocated to xLights. We ended up shortening his sequence down to about 16 minutes so he could use it. I did use this experience to change how the memory blocks are allocated internally to xLights (see point 1 above) so we don't need a single 60GB block, which may help in the future (rough idea sketched after the list).
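
For the curious, here's roughly what the huge/super page point looks like in code. This is just an illustrative sketch, not the actual xLights allocator (the helper name is made up), but it shows how a big anonymous mapping can ask the OS for 2MB pages on Linux and OSX and fall back to normal 4K pages when that fails:

```cpp
#include <sys/mman.h>
#include <cstddef>

#ifdef __APPLE__
#include <mach/vm_statistics.h>   // VM_FLAGS_SUPERPAGE_SIZE_2MB
#endif

// Hypothetical helper: grab a big anonymous mapping, asking the OS for
// 2MB pages where the platform supports it, and falling back to normal
// 4K pages if that request fails.
void* AllocSequenceBlock(size_t bytes) {
    void* p = MAP_FAILED;
#if defined(__linux__)
    // Linux: explicit huge pages need MAP_HUGETLB (and reserved hugepages),
    // but even without it the kernel's transparent huge pages can promote
    // the mapping to 2MB pages on its own - the "auto defrag/promote"
    // behavior mentioned above.
    p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
#elif defined(__APPLE__)
    // macOS: for anonymous mappings, the "fd" argument carries the
    // superpage request.
    p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS,
             VM_FLAGS_SUPERPAGE_SIZE_2MB, 0);
#endif
    if (p == MAP_FAILED) {
        // Fall back to ordinary 4K pages. (Windows always ends up on
        // normal pages, since large pages there require admin privileges.)
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p == MAP_FAILED ? nullptr : p;
}
```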
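To give a feel for what "auto-vectorization" means in the compiler point: a simple per-channel loop like the hypothetical one below is the kind of code a good optimizer turns into SIMD instructions that chew through 16-32 bytes per iteration. How reliably that happens is where MSVC, gcc, and clang differ:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical per-channel blend loop. clang/gcc at -O3 (and MSVC at
// /O2 /arch:AVX2) can usually vectorize this, but the quality of the
// generated code varies by compiler - that's the percent-or-two above.
void BlendChannels(uint8_t* dst, const uint8_t* src, size_t count, uint8_t alpha) {
    for (size_t i = 0; i < count; ++i) {
        dst[i] = static_cast<uint8_t>((src[i] * alpha + dst[i] * (255 - alpha)) / 255);
    }
}
```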
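Here's a hand-wavy sketch of the background-vs-main-thread split from the shader point. The helper and flag are invented for illustration (this is NOT the actual xLights render scheduler), but it shows why a main-thread-only effect serializes: the worker has to queue the work onto the UI thread and wait for it. It assumes a wxWidgets-style event loop with CallAfter:

```cpp
#include <wx/app.h>
#include <condition_variable>
#include <functional>
#include <mutex>

// Illustrative only: run an effect's render either directly on the worker
// thread, or - when the effect is "main thread only" (e.g. Shader on
// Windows/Linux) - marshal it to the UI thread and block until it finishes.
static void RenderOnCorrectThread(const std::function<void()>& renderFn,
                                  bool mainThreadOnly) {
    if (!mainThreadOnly) {
        renderFn();   // normal case: runs in parallel on the worker threads
        return;
    }

    // Main-thread-only case: queue the work onto the UI thread and wait,
    // so these effects end up serialized behind each other (and the UI).
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    wxTheApp->CallAfter([&]() {
        renderFn();
        {
            std::lock_guard<std::mutex> lk(m);
            done = true;
        }
        cv.notify_one();
    });

    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return done; });
}
```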
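And a rough sketch of the generic ffmpeg hardware-decode hookup referenced in the video point, mainly to show where the "getting the frames back out" problem comes from. This is not the native OSX path and not the exact xLights code; error handling is trimmed and VideoToolbox is just used as the example device type:

```cpp
extern "C" {
#include <libavcodec/avcodec.h>
#include <libavutil/frame.h>
#include <libavutil/hwcontext.h>
}

// Illustrative: attach a hardware decode device to a decoder context.
// The equivalent DXVA2/D3D11VA setup on Windows is where the driver and
// format problems show up.
static bool EnableHardwareDecode(AVCodecContext* codecCtx) {
    AVBufferRef* hwDevice = nullptr;
    if (av_hwdevice_ctx_create(&hwDevice, AV_HWDEVICE_TYPE_VIDEOTOOLBOX,
                               nullptr, nullptr, 0) < 0) {
        return false;                       // fall back to software decode
    }
    codecCtx->hw_device_ctx = av_buffer_ref(hwDevice);
    av_buffer_unref(&hwDevice);
    return true;
}

// After avcodec_receive_frame() hands back a hardware frame, the pixels
// still live in GPU/decoder memory. They have to be copied down before a
// non-display consumer like xLights can read them - that extra hop is the
// "assumes you're displaying it" problem described above.
static AVFrame* DownloadFrame(AVFrame* hwFrame) {
    AVFrame* swFrame = av_frame_alloc();
    if (av_hwframe_transfer_data(swFrame, hwFrame, 0) < 0) {
        av_frame_free(&swFrame);
        return nullptr;
    }
    return swFrame;
}
```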
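Finally, the "no single 60GB block" change amounts to something like the hypothetical chunking below (the class name and chunk size are made up). Each chunk is a separate allocation, so the OS can place, compress, or page each one independently instead of needing one contiguous multi-GB range:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical illustration: carve the sequence's frame data into
// fixed-size chunks instead of one giant contiguous allocation.
class ChunkedSequenceData {
public:
    ChunkedSequenceData(size_t frameSize, size_t frameCount)
        : frameSize_(frameSize) {
        framesPerChunk_ = std::max<size_t>(1, CHUNK_BYTES / frameSize);
        size_t chunks = (frameCount + framesPerChunk_ - 1) / framesPerChunk_;
        for (size_t c = 0; c < chunks; ++c) {
            // make_unique zero-initializes the chunk, like a fresh sequence
            chunks_.push_back(std::make_unique<uint8_t[]>(framesPerChunk_ * frameSize_));
        }
    }

    // Return a pointer to the start of a frame's channel data.
    uint8_t* FrameData(size_t frame) {
        return chunks_[frame / framesPerChunk_].get()
               + (frame % framesPerChunk_) * frameSize_;
    }

private:
    static constexpr size_t CHUNK_BYTES = 64ull * 1024 * 1024;  // arbitrary 64MB chunks
    size_t frameSize_;
    size_t framesPerChunk_;
    std::vector<std::unique_ptr<uint8_t[]>> chunks_;
};
```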
As I said, this assumes IDENTICAL hardware. For a given amount of money, it's definitely possible that you could get a Windows box that outperforms the Mac due to a better processor with more/faster cores or similar. But then you'd still be stuck with Windows and would not get all the OTHER benefits of OSX, like Dark Mode, the TouchBar, the touch pad, etc...
