There's definitely room for algorithmic improvement in Kaleidoscope. I traced through it, and on my whole house group it loops through every pixel of a 600x400 buffer 400 times for each frame (roughly 96 million pixel visits per frame), which takes a long time. As a simple improvement, I've updated it to use a parallel_for so it will at least use all the cores in your machine. That's a big help, but I still want to look at the algorithm itself more closely.
Another issue with Kaleidoscope is that it requires Canvas mode, which is also expensive. I updated the Canvas mode code to use a parallel_for as well, so the next build should improve a bit (assuming you have a bunch of cores in your machine).
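
For anyone curious what the parallel_for change amounts to, here is a rough sketch of the idea, not the actual code. The `parallel_for` helper and the `mirror_rows` function are hypothetical names made up for illustration, and the real helper may well have a different signature and use a thread pool. The point is just that the rows of the buffer get split across the available cores, and since each row is written by only one thread, no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (an assumption, not the real parallel_for): splits
// [begin, end) into contiguous chunks and runs each chunk on its own thread.
inline void parallel_for(int begin, int end, const std::function<void(int)>& body) {
    if (end <= begin) return;
    unsigned hw = std::thread::hardware_concurrency();  // may return 0 if unknown
    int workers = std::max(1, std::min(static_cast<int>(hw == 0 ? 1 : hw), end - begin));
    int chunk = (end - begin + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        int lo = begin + w * chunk;
        int hi = std::min(end, lo + chunk);
        if (lo >= hi) break;
        pool.emplace_back([lo, hi, &body] {
            for (int i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& t : pool) t.join();  // wait for every chunk to finish
}

// Illustrative per-pixel pass (not the Kaleidoscope algorithm): copies the
// left half of each row onto the right half, mirrored. Rows are independent,
// so they can be processed in parallel without synchronization.
inline void mirror_rows(std::vector<std::uint32_t>& pixels, int width, int height) {
    parallel_for(0, height, [&](int y) {
        std::uint32_t* row = pixels.data() + static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width / 2; ++x) {
            row[width - 1 - x] = row[x];
        }
    });
}
```

This only spreads the same amount of work over more cores. It does nothing about the 400 passes over the buffer, which is why the algorithm itself still needs a closer look.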