On Thu, Sep 21, 2017 at 6:10 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
Yes, we could put it into a separate table. But that table would be exactly as tall as the content table, and would be keyed to it. I see no advantage.
The advantage is that MediaWiki would almost never need to use the hash table. It would need to add the hash for a new revision there, but table size is not much of an issue on INSERT; other than that, only slow operations like export and API requests which explicitly ask for the hash would need to join on that table. Or is this primarily a disk space concern?
Also, since content is supposed to be deduplicated (so two revisions with the exact same content will have the same content_address), couldn't that replace content_sha1 for revert detection purposes?
Only if we could detect and track "manual" reverts. And the only reliable way to do this right now is by looking at the sha1.
The content table points to a blob store which is content-addressable and has its own deduplication mechanism, right? So you just send it the content to store and get an address back, and in the case of a manual revert, that address will be one that has already been used in other content rows. Or do you need to detect the revert before saving it?
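To make the idea concrete, here is a rough sketch of what I mean. It is purely illustrative: `BlobStore`, `save_revision` and the `sha256:` address format are made-up names, not the actual MediaWiki blob store interface, and I am assuming the address is derived deterministically from the content.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store (hypothetical, not MediaWiki's API)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address is a function of the content, so identical content
        # always maps to the same address and is stored only once.
        address = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(address, data)
        return address


def save_revision(store, page_history, text):
    """Store the text; report which earlier revision (if any) it matches.

    page_history is a list of content addresses for one page, oldest first
    (a stand-in for the content rows of that page).
    """
    address = store.put(text.encode("utf-8"))
    # A manual revert shows up as an address already used by an older row.
    reverted_to = page_history.index(address) if address in page_history else None
    page_history.append(address)
    return address, reverted_to


store = BlobStore()
history = []
first, _ = save_revision(store, history, "original text")
save_revision(store, history, "vandalized text")
third, reverted_to = save_revision(store, history, "original text")
```

Note that in this sketch the match is only known after the content has been sent to the store, which is why it matters whether the revert has to be detected before saving.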
SHA1 is not that slow.
For the API/Special:Export, definitely not. Might it be significant for generating the official dump files, though? Hashing a single block with SHA-1 on a modern CPU should not take more than a microsecond: there are a few hundred operations in a decently implemented SHA-1, and processors are in the GHz range. PHP benchmarks [1] also give similar values. With the 64-byte block size, that's something like 5 hours/TB (10^12 bytes / 64 bytes per block, at 1 microsecond per block, is a bit over 4 hours). I'm not sure how that compares to the dump process itself (which is also probably running on lots of cores in parallel).
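For anyone who wants to check the back-of-the-envelope number, here is a small sketch of the arithmetic. The one-microsecond-per-block figure is the rough upper bound assumed above, not a measurement, and the function names are just for illustration; `measured_seconds_per_block` is a crude way to get a real figure for your own machine.

```python
import hashlib
import time

BLOCK_SIZE_BYTES = 64        # SHA-1 processes input in 64-byte blocks
SECONDS_PER_BLOCK = 1e-6     # assumed upper bound: one microsecond per block
TERABYTE = 10**12            # decimal terabyte, in bytes


def sha1_hours_per_terabyte(seconds_per_block=SECONDS_PER_BLOCK):
    """Estimated single-core hours needed to SHA-1 one terabyte of data."""
    blocks = TERABYTE / BLOCK_SIZE_BYTES
    return blocks * seconds_per_block / 3600


def measured_seconds_per_block(buffer_size=1 << 20):
    """Crude local measurement: time one SHA-1 pass over a zero-filled buffer."""
    data = bytes(buffer_size)
    start = time.perf_counter()
    hashlib.sha1(data).digest()
    elapsed = time.perf_counter() - start
    return elapsed / (buffer_size / BLOCK_SIZE_BYTES)


estimate = sha1_hours_per_terabyte()
print(f"{estimate:.2f} hours/TB at 1 microsecond per block")
```

Passing the result of `measured_seconds_per_block()` into `sha1_hours_per_terabyte()` gives an estimate for the local hardware instead of the assumed bound.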