I might be misunderstanding what is being asked, but could someone explain to me why the span tags with the OCR block information need to be permanent? Would it suffice to keep the span tags while we proofread the OCR'd text until it perfectly matches the scans, feed the result back into the DjVu file, and then remove all the span tags to leave clean wikitext?

I would imagine that once the proofed text becomes the text layer of the DjVu file, that would be the last time we would have to touch the text anyway, so we would not need to make any further modifications to either the DjVu file or the wikitext. At that point we could make the wikitext 100% clean.
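
To illustrate the cleanup step I have in mind, here is a rough Python sketch. It assumes the OCR block information is carried as attributes on ordinary `<span>` tags; the `data-bbox` attribute name and the sample text are made up for illustration, since I don't know the actual markup that would be used. It also strips every span tag, so it would need refining if the page contains spans used for other purposes.

```python
import re

# Matches any opening or closing <span> tag, whatever attributes it carries.
# Assumption: the OCR block data lives only in span attributes, not in the text.
SPAN_RE = re.compile(r"</?span\b[^>]*>", re.IGNORECASE)


def strip_span_tags(wikitext: str) -> str:
    """Remove opening and closing span tags, keeping the text between them."""
    return SPAN_RE.sub("", wikitext)


# Hypothetical proofread wikitext; "data-bbox" is an invented attribute name.
sample = (
    '<span data-bbox="10,20,110,40">Call me</span> '
    '<span data-bbox="120,20,200,40">Ishmael.</span>'
)

clean = strip_span_tags(sample)
print(clean)
```

This would only be run after the proofed text and its block coordinates had been written back to the DjVu text layer, because the coordinates are lost once the tags are gone.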