Hi Michael,

The new data is available, but we found a small formatting bug that we have to fix.  Because of that, we haven't announced it widely yet, and we haven't rolled up the data to the monthly level.

The data: https://dumps.wikimedia.org/other/pageview_complete/
The bug: some rows have 6 columns and others have only 5, with page_id missing.  We are inserting "null" for the missing value and rewriting the files, but the dataset is almost 3 terabytes, so it will take a while.  If you want to download and use the data in the meantime, you're welcome to; just make your parsing robust to the inconsistency.
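
In case it helps, here is a minimal sketch of one way to tolerate both row shapes in Python. It rests on assumptions that are not stated above: that fields are separated by single spaces and that page_id, when present, is the third field. The names normalize_row, PAGE_ID_INDEX and FULL_WIDTH are made up for illustration, so please check them against the actual files before relying on this.

```python
from typing import List, Optional

# Assumption: page_id, when present, is the third field (index 2).
PAGE_ID_INDEX = 2
# A complete row has 6 columns; an affected row has 5.
FULL_WIDTH = 6


def normalize_row(line: str, page_id_index: int = PAGE_ID_INDEX) -> Optional[List[str]]:
    """Return a 6-field row, inserting "null" where page_id is missing.

    Returns None for rows that have neither 5 nor 6 fields, so the
    caller can log or skip them instead of guessing at their layout.
    """
    # Assumption: fields are separated by single spaces.
    fields = line.rstrip("\n").split(" ")
    if len(fields) == FULL_WIDTH:
        return fields
    if len(fields) == FULL_WIDTH - 1:
        # Pad the missing page_id with the same "null" placeholder
        # that the rewritten files will use.
        return fields[:page_id_index] + ["null"] + fields[page_id_index:]
    return None
```

Padding with "null" means rows parsed now should look the same as rows from the rewritten files, so the workaround should not need to change once the fix lands.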

Thanks for your patience.  Once we have this sorted out, we will make a wide announcement explaining the history of this data and how, going forward, there will be a single unified dataset with all the history we have.

Good suggestion to post updates on the pagecounts-ez page.  We will do that.

On Mon, Nov 16, 2020 at 9:46 AM Michael Tartre <michael@predata.com> wrote:
This was brought up in a previous thread (link here), but the aggregated hourly view dumps haven't been published since 2020-09-24 (see here, also mirrored here). Dan's response to the previous thread suggested that the new data would be available in a week, but it's already a month past that expected deadline. Are there any updates on the status of that new dump, or any new estimates of when it will become available? I would also suggest posting information about the pending change and new system to the information page (at https://dumps.wikimedia.org/other/pagecounts-ez/) -- from reading that page, there is no indication that data delivery has stopped or that a new pipeline will be available shortly.

Thanks for any information,


Michael Tartre
Senior Machine Learning Engineer

t: +1 415 857 0967
1 Liberty Plaza
New York, NY 10006
Wikitech-l mailing list