Hello Eric, we don't have any tag or field in the adds/changes dump that differentiates between new pages and pages with only new revisions. However, there is a column in the page table called "page_is_new", and we dump that table twice a month in SQL format. You might want to cross-check the page table dump against the adds/changes dump from the same day.


Please note that the two dumps are not produced at exactly the same time, so the data in the two files may not be fully consistent.
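The cross-check suggested above could be sketched in Go. This is a toy illustration, not production code: the column positions passed in below are assumptions that should be verified against the CREATE TABLE statement at the top of the page.sql dump, and the tuple parser assumes values contain no embedded commas or parentheses, which real page titles can violate.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Matches one (...) value tuple in an INSERT statement.
var tupleRe = regexp.MustCompile(`\(([^)]*)\)`)

// newPageIDs extracts the page IDs whose page_is_new flag is 1 from one
// INSERT line of the page.sql dump. idCol and isNewCol are the zero-based
// tuple positions of page_id and page_is_new; check them against the
// CREATE TABLE statement in the dump, as the positions used here are
// assumptions.
func newPageIDs(insertLine string, idCol, isNewCol int) map[string]bool {
	ids := make(map[string]bool)
	for _, m := range tupleRe.FindAllStringSubmatch(insertLine, -1) {
		fields := strings.Split(m[1], ",")
		if idCol >= len(fields) || isNewCol >= len(fields) {
			continue
		}
		if strings.TrimSpace(fields[isNewCol]) == "1" {
			ids[strings.TrimSpace(fields[idCol])] = true
		}
	}
	return ids
}

func main() {
	// Hypothetical row data; column 3 stands in for page_is_new here.
	line := "INSERT INTO `page` VALUES (10,0,'Foo',0),(11,0,'Bar',1),(12,0,'Baz',1);"
	fmt.Println(newPageIDs(line, 0, 3)) // map[11:true 12:true]
}
```

The resulting ID set could then be intersected with the page IDs appearing in the adds/changes stub file from the same day, bearing in mind the timing caveat above.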


On Tue, Jan 24, 2023 at 4:31 AM Eric Andrew Lewis <eric.andrew.lewis@gmail.com> wrote:
Hi Ariel,

Thank you for the detail! Helpful to understand.

Is it possible to disambiguate completely new pages vs. pages with new revisions in the "Adds/changes dumps"? Looking at the nodes in the XML, I'm not sure there's a way to do this.

In the interim I wrote a golang script that parses a meta-history file as previously described, with various filters – it excludes redirect pages (wow, there are so many), user pages, etc. It worked out rather well. A bit sloppy, but here is the script for reference.

Eric Andrew Lewis
+1 610 715 8560 


On Wed, Jan 18, 2023 at 12:49 PM Ariel Glenn WMF <ariel@wikimedia.org> wrote:
Eric,

We don't produce dumps of the revision table in SQL format because some of those revisions may be hidden from public view, and even metadata about them should not be released. We do, however, publish so-called Adds/Changes dumps once a day for each wiki, providing stubs and content files in XML of just the new pages and revisions since the last such dump. They lag about 12 hours behind to allow vandalism and the like to be filtered out by wiki admins, but hopefully that's good enough for your needs. You can find them here: https://dumps.wikimedia.org/other/incr/

Ariel Glenn

On Tue, Jan 17, 2023 at 6:22 AM Eric Andrew Lewis <eric.andrew.lewis@gmail.com> wrote:

Hi,

I am interested in performing analysis on recently created pages on English Wikipedia. 

One way to find recently created pages is to download a meta-history file for English Wikipedia and filter through the XML, looking for pages whose oldest revision falls within the desired timespan.

Since this requires a library to parse through XML string data, I would imagine it is much slower than a database query. Is page revision data available in one of the SQL dumps that I could query for this use case? Looking at the exported tables list, it does not seem to be. Maybe that is intentional?

Thanks,

Eric Andrew Lewis
+1 610 715 8560 
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-leave@lists.wikimedia.org