The new data is available, but we found a small formatting bug that we have
to fix. Because of that, we haven't announced it widely yet, and we
haven't rolled up the data to the monthly level.
The data: https://dumps.wikimedia.org/other/pageview_complete/
The bug: some rows have 6 columns and others have only 5, because page_id
is missing. We are inserting "null" for the missing value and rewriting the
files, but the dataset is almost 3 terabytes, so it will take a while. If you
want to download and use the data in the meantime, you're welcome to; just
make your parsing robust to the inconsistency.
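For anyone parsing the files before the rewrite finishes, here is a minimal sketch of one tolerant approach. It rests on assumptions that should be checked against the actual files: that rows are whitespace-separated, that page_id is the third of the six columns, and that the other column names used here (wiki, title, access, total, hourly) are only illustrative labels. The function and class names are hypothetical.

```python
from typing import NamedTuple, Optional


class PageviewRow(NamedTuple):
    # Column names other than page_id are illustrative assumptions.
    wiki: str
    title: str
    page_id: Optional[int]
    access: str
    total: str
    hourly: str


def parse_row(line: str) -> PageviewRow:
    """Parse one row, tolerating a missing or "null" page_id."""
    fields = line.split()
    if len(fields) == 5:
        # Assumption: the missing column is page_id, at index 2.
        fields.insert(2, "null")
    if len(fields) != 6:
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    wiki, title, raw_id, access, total, hourly = fields
    # Treat "null" (or any non-numeric value) as an unknown page_id.
    page_id = int(raw_id) if raw_id.isdigit() else None
    return PageviewRow(wiki, title, page_id, access, total, hourly)
```

Because a 5-column row and a row carrying a literal "null" both end up with page_id set to None, the same code should keep working on the files before and after they are rewritten.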
Thanks for your patience. Once we have this sorted out, we will make a wide
announcement explaining the history of this data and how, going forward,
there will be a single unified dataset with all the history we have.
Good suggestion to post updates on the -ez page. We will do that.
On Mon, Nov 16, 2020 at 9:46 AM Michael Tartre <michael(a)predata.com> wrote:
This was brought up in a previous thread (link here),
but the aggregated hourly view dumps haven't been published since
2020-09-24 (see here).
The response to the previous thread by Dan suggested that the new data
would be available in a week, but it's already a month past that expected
deadline. Are there any updates on the status of that new dump, or any new
estimates of when it will become available? I would also suggest posting
information about the pending change and new system to the information page.
From reading that page, there is no indication that data delivery has
stopped or that a new pipeline will be available shortly.
Thanks for any information,
Senior Machine Learning Engineer
t: +1 415 857 0967
1 Liberty Plaza
New York, NY 10006
Wikitech-l mailing list