Zoom isn't magic, but it probably knows more about the stream than a generic package.
First, check what codec is used before and after your change. I think Zoom uses VP8, but you may have accidentally transcoded it into a less efficient codec. That could make a big difference. I believe Zoom uses Opus for audio, which is very efficient.
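One way to do that check, as a minimal sketch: it assumes FFmpeg's `ffprobe` tool is installed and on your PATH. The helper names and the file names in the usage note are made up for illustration.

```python
import json
import subprocess


def ffprobe_streams(path):
    """Ask ffprobe for the codec and bit rate of each stream in `path`."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "stream=codec_type,codec_name,bit_rate",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def summarise(probe):
    """Map stream type -> (codec name, bit rate in kb/s, or None if not reported).

    If a file has several streams of the same type, the last one wins.
    """
    summary = {}
    for stream in probe.get("streams", []):
        rate = str(stream.get("bit_rate", ""))
        kbps = int(rate) // 1000 if rate.isdigit() else None
        summary[stream["codec_type"]] = (stream["codec_name"], kbps)
    return summary
```

Running `summarise(ffprobe_streams("original.mp4"))` and `summarise(ffprobe_streams("edited.mp4"))` (hypothetical file names) and comparing the two results should show whether the codec or the bit rate changed. Some containers do not report a per-stream bit rate, in which case you get `None`.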
Second, check how long it takes to save the modified file. Video coding is very CPU intensive, and Zoom is probably optimised for a talking head against a static background, so it will exploit the unchanged areas of the picture to save bits (this is just the standard way video codecs work). Note that if you are in a Zoom call for an hour, it can spend an hour of CPU time encoding your picture, while if you save an hour-long video file from a video editor, it would be totally unacceptable for it to take an hour! That means that the original coding is potentially far superior to a re-encoding done later. The software may have some option to spend more time to get better compression.
Zoom probably needs quite a high rate of keyframes, but that depends on several factors. Decoding can only start at a keyframe, as other frames are encoded as differences with respect to some other frames. For offline media (e.g. a file that you can play at any time) a simple decoder can only start at a keyframe, so you put enough of them in to ensure that when you skip forwards or back the decoder can start reasonably close to the target point. A more sophisticated decoder can start at any arbitrary point by looking backwards for the previous keyframe and decoding from there, discarding the result until it gets to the desired point. In principle, offline media can be better compressed than online (where online media is such that you need to decode the data as it arrives), because a frame can be predicted from frames in the future as well as the past. I'm not sure that this is exploited very much, and it's the wrong direction for what you are seeing.
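To make the "look backwards for the previous keyframe" idea concrete, here is a toy sketch. It is not how any real codec stores frames: each "picture" is just a number, a keyframe `("K", value)` carries the whole picture, and a difference frame `("D", delta)` carries a change relative to the previous frame. The function name is my own.

```python
def seek_decode(frames, target):
    """Decode the picture at index `target` from a toy frame list.

    Returns (picture, frames_decoded), where frames_decoded shows how much
    work the seek cost, including frames that were decoded and thrown away.
    """
    # Walk backwards to the nearest keyframe at or before the target.
    start = target
    while start >= 0 and frames[start][0] != "K":
        start -= 1
    if start < 0:
        raise ValueError("no keyframe at or before the target frame")

    # Decode forwards from the keyframe; intermediate pictures are discarded.
    picture = frames[start][1]
    for _kind, delta in frames[start + 1:target + 1]:
        picture += delta
    return picture, target - start + 1


frames = [("K", 10), ("D", 1), ("D", 2), ("K", 50), ("D", -5), ("D", 3)]
print(seek_decode(frames, 5))  # (48, 3): starts at the keyframe at index 3
print(seek_decode(frames, 2))  # (13, 3): starts at the keyframe at index 0
```

The fewer keyframes there are, the further back the decoder has to go and the more frames it decodes only to discard, which is the trade-off against file size.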
Another need for keyframes is to allow resuming decode after packet loss has broken the chain of dependencies. There are three ways to cope with packet loss, and I don't know which Zoom uses:
- Wait for the next keyframe - requires frequent keyframes as the picture is bad or freezes until one arrives
- Request a keyframe - costs a round-trip to the source, which must be able to generate a keyframe on demand
- Use an underlying protocol which guarantees delivery through re-transmission
I believe Zoom used to use TCP, which is the third option, so in principle it only needed a keyframe at the very start of the stream. However, I have read that it now uses WebRTC, so it would use the second option. Obviously a file does not suffer packet loss, so it doesn't need any of them, and in principle a recording could eliminate keyframes that were needed during a call - though that would require re-encoding the stream, which is expensive, so it is very unlikely.
I would be astonished if the audio content was bigger than the video - unless the picture is very small or blocky. Opus might take 80 kb/s for very good quality audio, while 80 kb/s would be a terrible picture in any codec.
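As a rough sanity check on those numbers, here is the arithmetic for an hour-long recording. The 80 kb/s audio figure is the one above; the 1000 kb/s video rate is purely an illustrative assumption of mine, not a measured Zoom figure.

```python
def size_megabytes(kbit_per_s, seconds):
    """Size of a constant-bit-rate stream in megabytes (10^6 bytes)."""
    return kbit_per_s * 1000 * seconds / 8 / 1_000_000


HOUR = 3600  # seconds

audio_mb = size_megabytes(80, HOUR)    # 80 kb/s audio  -> 36.0 MB
video_mb = size_megabytes(1000, HOUR)  # assumed 1000 kb/s video -> 450.0 MB

print(f"audio: {audio_mb:.1f} MB, video: {video_mb:.1f} MB")
```

So an hour of very good audio is about 36 MB, and the video would have to be squeezed down to the same 80 kb/s before the audio caught up with it.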