>This sounds like the fixes we did last quarter to the batch insertion basically hid the problem instead of making it go away.
I think we are mixing things up here. When we had issues with the batching code, we never saw a pattern of "no-events-whatsoever-in-any-table for an hour". We saw events dropped in bursts here and there, but certainly not an hour-long blackout.

Also, no events were dropped during the major backfill in early March, when the db sustained quite a bit of load because we had to insert those events one by one.

So (while I am not saying we could not uncover a code issue on our end) we have not seen this particular error pattern before.

I didn't mean to suggest we'd seen this error before. I was trying to say that, intuitively, the failure mode seems very similar: over time the lag grows, and at some point it's so big that we lose a bunch of data all at once. I bring it up because that change is the first place I'd look. For example, I'd try to reproduce the problem by simulating the same traffic, then revert to the original logic from before batch inserts and run the same test. We've all looked at this code, but we must be missing something big.
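
To make the intuition concrete, here is a toy sketch of the failure mode I have in mind. It is not our inserter code: `simulate`, the rates, and the buffer capacity are all made up for illustration, and the "discard the whole backlog when it overflows" behavior is my assumption about what could be happening, not something we have confirmed.

```python
def simulate(arrival_rate, drain_rate, capacity, ticks):
    """Toy model of a lagging inserter (all names and numbers hypothetical).

    Each tick, `arrival_rate` events arrive and up to `drain_rate` events
    are written. If the backlog ever exceeds `capacity`, the whole backlog
    is discarded at once (assumed behavior, for illustration only).
    Returns the number of events dropped on each tick.
    """
    backlog = 0
    dropped_per_tick = []
    for _ in range(ticks):
        backlog += arrival_rate
        backlog -= min(backlog, drain_rate)  # write what we can this tick
        if backlog > capacity:
            dropped_per_tick.append(backlog)  # everything pending is lost
            backlog = 0
        else:
            dropped_per_tick.append(0)
    return dropped_per_tick


if __name__ == "__main__":
    # The inserter is only slightly slower than the incoming traffic.
    drops = simulate(arrival_rate=10, drain_rate=9, capacity=50, ticks=120)
    for tick, lost in enumerate(drops):
        if lost:
            print(f"tick {tick}: lost {lost} events in one go")
```

The point is that nothing is lost for a long stretch while the lag quietly builds, and then a large chunk disappears at once. If the real system behaves like this, the same simulated traffic against the pre-batching logic should show whether the batch change is what lets the lag build up.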