I’ve been exploring the enwiki database. I can find the page row for [[Iron]]:
MariaDB [enwiki_p]> select page_title from page where page_id = 14734;
+------------+
| page_title |
+------------+
| Iron       |
+------------+
It looks like it has the right number of revisions:
MariaDB [enwiki_p]> select count(*) from revision where rev_page = 14734;
+----------+
| count(*) |
+----------+
|     5560 |
+----------+
But all of the rev_text_ids are 0:
MariaDB [enwiki_p]> select rev_text_id from revision where rev_page = 14734 and rev_text_id != 0;
Empty set (0.02 sec)
The schema description (https://www.mediawiki.org/wiki/Manual:Revision_table) seems pretty straightforward. What am I not understanding?
On Oct 1, 2017, at 12:10 AM, Alex Monk krenair@gmail.com wrote:

The MediaWiki schema description only applies to the underlying production database, and you do not have access to that as a Labs user; you only get security-sanitised views. The rev_text_id values are not useful to you, because you cannot access revision text through the databases at all. You must go through the API.
On 1 October 2017 at 06:13, Roy Smith roy@panix.com wrote:

Hmmm, interesting. I did read [[Help:Toolforge/Database]], where it says:
Tool and Tools users are granted access to replicas of the production databases. Private user data has been redacted from these replicas (some rows are elided and/or some columns are made NULL depending on the table). For most practical purposes this is identical to the production databases and sharded into clusters in much the same way.
But I didn’t realize that applied to things like rev_text_id. I assumed it was just stuff like users’ passwords and email addresses. If something as basic as rev_text_id is redacted, that really stretches the meaning of “For most practical purposes this is identical to the production databases”.
I assume by “the API” you mean https://www.mediawiki.org/wiki/API:Main_page?
Alex Monk krenair@gmail.com wrote:

Yep, that is the MediaWiki API documentation.
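For concreteness, here is a minimal sketch of what such an API call looks like from Python with the requests library. The page title [[Iron]] is carried over from the queries above; the User-Agent string is a made-up placeholder:

import requests

# Fetch the latest revision's wikitext for [[Iron]] via the action API.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": "Iron",
        "rvprop": "content|ids|timestamp",
        "rvlimit": 1,
    },
    headers={"User-Agent": "revision-fetch-sketch/0.1 (placeholder)"},
)
resp.raise_for_status()

# Pages are keyed by page id in the classic JSON format; the wikitext
# itself lives under the "*" key of each revision.
page = next(iter(resp.json()["query"]["pages"].values()))
rev = page["revisions"][0]
print(rev["revid"], rev["timestamp"])
print(rev["*"][:300])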
On 2 October 2017 at 17:30, Roy Smith roy@panix.com wrote:

I’m not seeing how to access the wikitext for a specific revision via the API. I can get the HTML with /page/html/{title}/{revision}, but I don’t see a way to get the wikitext directly. Do I really need to fetch the HTML and then feed it through /transform/html/to/wikitext? That seems suboptimal, not to mention rate-limited :-(
What I want to do is get the wikitext for every revision of a page. I’m thinking of building something like WikiBlame (http://wikipedia.ramselehof.de/wikiblame.php?user_lang=en&lang=en&project=wikipedia&article=User:RoySmith/American_Bank_Note_Company_Printing_Plant&needle=building&skipversions=0&ignorefirst=0&limit=500&offmon=10&offtag=2&offjahr=2017&searchmethod=int&order=desc&user=), but with a nicer interface: display the page in much the same style as it is normally rendered, with each contiguous piece of text set off visually by the revision that introduced it (perhaps color shading to show age?), and a mouseover giving revision details (user, date, etc.).
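The attribution step at the heart of that idea can be roughed out with difflib from the standard library. This is only a naive, token-level sketch, assuming revisions are supplied oldest first; real blame tools like WikiBlame handle reverts, moved text, and markup far more carefully:

import difflib

def blame(revisions):
    """revisions: iterable of (rev_id, wikitext) pairs, oldest first.
    Returns (token, rev_id) pairs for the newest revision's text."""
    prev_tokens = []
    owner = []  # owner[i] is the revision that introduced prev_tokens[i]
    for rev_id, text in revisions:
        tokens = text.split()
        matcher = difflib.SequenceMatcher(None, prev_tokens, tokens, autojunk=False)
        new_owner = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                new_owner.extend(owner[i1:i2])          # text carried over unchanged
            else:
                new_owner.extend([rev_id] * (j2 - j1))  # text introduced or replaced here
        prev_tokens, owner = tokens, new_owner
    return list(zip(prev_tokens, owner))

# Toy usage with made-up revisions:
print(blame([(1, "Iron is a metal."), (2, "Iron is a common metal.")]))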
Jaime wrote:

On Mon, Oct 2, 2017 at 6:30 PM, Roy Smith roy@panix.com wrote:
I’m not seeing how to access the wikitext for a specific revision via the API.
Something like?
curl 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=rev...'
(You should probably use a higher-level API from your favourite language's library in any case.)
It is not possible to share those through the database protocol: not only is wikitext not stored in a human-readable format, but metadata and content are kept separately, and the only feasible way to share the content while maintaining user privacy/access control is to put an application in between (MediaWiki itself :-P) or to export it (dumps).
Hope that is helpful.
What I want to do is get the wikitext for every revision of a page.
If it is for a single page, you can define multiple revids. But if you plan to do that massively, extracting the dumps will be both faster for you and easier on the servers. There is probably close to 100 TB of plain-text wiki content across all projects.
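Putting those two points together, a sketch of one batched request in Python; the revision ids are made-up placeholders, and the 50-ids-per-request cap is the API's standard limit for non-bot clients:

import requests

revids = [123456789, 123456790, 123456791]  # placeholder revision ids

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "rvprop": "content|ids|timestamp",
        # Up to 50 ids per request, joined with "|".
        "revids": "|".join(str(r) for r in revids),
    },
    headers={"User-Agent": "revids-batch-sketch/0.1 (placeholder)"},
)
resp.raise_for_status()
data = resp.json()

# Invalid ids come back under "badrevids"; valid ones are grouped by page.
for page in data.get("query", {}).get("pages", {}).values():
    for rev in page.get("revisions", []):
        print(rev["revid"], len(rev.get("*", "")), "bytes of wikitext")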
On 2 October 2017 at 17:30, Roy Smith roy@panix.com wrote:
I can get the HTML with /page/html/{title}/{revision}, but I don’t see a way to get the wikitext directly. Do I really need to fetch the HTML and then feed it through /transform/html/to/wikitext?
Those /page/html/{title}/{revision} and /transform/html/to/wikitext paths are part of the separate REST API, not the MediaWiki action API. Please see Jaime's example URL instead.
On 10/02/2017 12:30 PM, Roy Smith wrote:
What I want to do is get the wikitext for every revision of a page.
If you just want to download some revisions of a single page (for development purposes), https://en.wikipedia.org/w/api.php?action=query&prop=revisions&title... should be enough.
You'll have to use rvcontinue to get more than 50 revisions, and you should probably use a library like pywikibot.
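A bare-bones version of that continuation loop, in Python with requests rather than pywikibot; the page title and User-Agent string are placeholders:

import requests

API = "https://en.wikipedia.org/w/api.php"
session = requests.Session()
session.headers["User-Agent"] = "history-fetch-sketch/0.1 (placeholder)"

params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "titles": "Iron",            # placeholder page title
    "rvprop": "content|ids|timestamp|user",
    "rvlimit": 50,               # the cap when revision content is requested
    "rvdir": "newer",            # oldest revision first
}

revisions = []
while True:
    data = session.get(API, params=params).json()
    for page in data["query"]["pages"].values():
        revisions.extend(page.get("revisions", []))
    if "continue" not in data:
        break
    params.update(data["continue"])  # carries rvcontinue into the next request

print(len(revisions), "revisions fetched")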
Later, if you want to do it for more articles, go to https://dumps.wikimedia.org/backup-index-bydb.html and choose a wiki (e.g. enwiki).
You may need to click through "Last dumped on" a couple of times until you find a run whose "All pages with complete edit history" section has download links.
You can then download either a single archive (with all revisions for a subset of pages) or all of them.
Matt Flaschen
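For the dump route, a streaming parser keeps memory flat even on multi-gigabyte history files. A rough standard-library sketch; the filename is a placeholder for one of the many numbered .bz2 parts, and the XML namespace varies with the dump's export version:

import bz2
import xml.etree.ElementTree as ET

# Namespace for the 0.10 export schema; check the <mediawiki> root element
# of your dump, since other versions use a different URI.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Placeholder filename; real parts look like
# enwiki-20171001-pages-meta-history1.xml-*.bz2 on dumps.wikimedia.org.
with bz2.open("enwiki-pages-meta-history-part.xml.bz2", "rb") as f:
    for _, elem in ET.iterparse(f):
        if elem.tag == NS + "revision":
            rev_id = elem.findtext(NS + "id")
            text = elem.findtext(NS + "text") or ""
            print(rev_id, len(text), "bytes of wikitext")
            elem.clear()  # drop the processed subtree to keep memory bounded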