Hi Data Folks,

TL;DR: We plan to update the wmf.webrequest on on Monday, November 23rd with this change - Please get in touch on this task if you run hive queries taking advantage of the TABLESAMPLE feature on this table.

Why?
Testing the changes, we have seen:
What changes?
The change visible to users of the table is the increase of the number of buckets by which the table is bucketed, from 64 to 256. This means that for any leaf partition (webrequest_source, year, month, day, hour - actual folders where data files are stored), there will be 256 files instead of 64. The bucketing strategy won't change, meaning that the shuffling of rows between the files will still be done using the (hostname, sequence) fields pair in that order.
Changes invisible to users are improvements in the hive query loading/augmenting the data into the partitions.

How does the change impact users?
We plan to drop the table (the structure, not the data!) and recreate it with the new bucketing number, re-adding existing partitions. This drop-recreate should go unnoticed as it is fast to execute. As new data flows in and old data is deleted, it will take 3 month for the whole table to be converted. During those three month, partitions containing 64 files will still be usable, but the queries taking advantage of buckets through the TABLESAMPLE feature will be broken for those partitions.

Don't hesitate to reach out if you have questions :)

--
Joseph Allemandou (joal) (he / him) on behalf of the Analytics-Engineering team
Staff Data Engineer
Wikimedia Foundation