On Fri, May 13, 2022 at 11:26 AM Maxim Aparovich <max.aparovich@gmail.com> wrote:

Dear Sir or Madam,


Hi!

I am writing to you with a question about the Pageviews hourly raw data files. First of all, please let me know whether I have chosen the right person to ask. If not, could you please advise to whom I should direct the question? The question is below.


This is the right place to contact the folks at WMF that work on data engineering, analytics, and public datasets.

I am working on a project where we would like to use the Pageviews hourly data. For us, it is crucial to get the data as soon as possible. As I can see on the web page, hourly data is available in Wikimedia's file system approximately 45 min after the hour ends. But for an end user, it only becomes available several hours after that (this is shown in the screenshot).


There may be some confusion here. The timestamps shown on the dumps website are in the UTC timezone. The time on your computer is in your local timezone. I'll answer inline below, but this is an important detail.
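To make the timezone point concrete, here is a minimal Python sketch that reads a listing timestamp as UTC and converts it to another timezone. The `DD-Mon-YYYY HH:MM` format, the function name `utc_listing_to_local`, and the UTC+3 offset are assumptions for illustration only; adjust them to what you actually see on the page and to your own timezone.

```python
from datetime import datetime, timedelta, timezone


def utc_listing_to_local(stamp, local_tz=None):
    """Interpret a listing timestamp as UTC and convert it to local_tz.

    The 'DD-Mon-YYYY HH:MM' format is an assumption about how the
    timestamp is displayed. If local_tz is None, the system's local
    timezone is used.
    """
    utc_time = datetime.strptime(stamp, "%d-%b-%Y %H:%M").replace(tzinfo=timezone.utc)
    return utc_time.astimezone(local_tz)


# Hypothetical example: a viewer in a UTC+3 timezone.
example_tz = timezone(timedelta(hours=3))
local_time = utc_listing_to_local("13-May-2022 12:50", example_tz)
print(local_time.isoformat())
```

A file stamped 12:50 on the website would correspond to 15:50 on a clock set to UTC+3, so a gap that looks like several hours can be partly a timezone offset.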
  1. Is there any way to get data as soon as it is available on the Wikimedia filesystem (~45 min after the hour ends)?
We move the data to the public dump server as soon as possible, but it's a large, slow transfer. It takes ~50 minutes to process the raw data, then some time before the copy job runs, then at least an hour for the copy itself. So this is as fast as we can currently make it without different infrastructure.
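As a rough illustration of how those delays add up, the sketch below computes a lower bound on when an hour's file could appear publicly. It assumes processing starts right when the hour ends, and it treats the wait for the copy job as an unknown that defaults to zero; the function name and constants are hypothetical and only restate the figures above.

```python
from datetime import datetime, timedelta, timezone

PROCESSING = timedelta(minutes=50)  # "~50 minutes to process the raw data"
COPY = timedelta(hours=1)           # "at least an hour for the copy itself"


def earliest_public_time(hour_start_utc, scheduling_wait=timedelta(0)):
    """Lower bound (UTC) on when the file for the given hour is public.

    scheduling_wait stands in for the unspecified time before the copy
    job runs; zero is the most optimistic assumption.
    """
    hour_end = hour_start_utc + timedelta(hours=1)
    return hour_end + PROCESSING + scheduling_wait + COPY


# Data for the hour 10:00-11:00 UTC on 2022-05-13.
start = datetime(2022, 5, 13, 10, 0, tzinfo=timezone.utc)
print(earliest_public_time(start).isoformat())
```

Under these assumptions, data for the 10:00 UTC hour would not be public before 12:50 UTC, roughly two hours after the hour ends, and the real figure is later by however long the copy job waits to run.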
  2. Are there any other, faster ways to get hourly data? For instance, faster access to the raw data files, or access to wmf.pageview_hourly or wmf.pageviews_actor? Unfortunately, the API does not provide data at an hourly level.
We wanted to provide hourly data via the API, but it's very costly in terms of storage space.  There is no other way to access it, for privacy reasons.  The `pageview_hourly` table needs to be sanitized before we can publish it, but we're always improving our pipelines.  Which brings me to a question: what is your use case?  If we can find enough folks who need fresh data for good reasons, we can consider different approaches.