Yup. For context: because of the scale of Wikimedia's MediaWiki instances, we store revision contents in their own cluster rather than in the pertinent field of the MediaWiki database schema; that field instead acts as a pointer to where the content really lives. One consequence is that even the R&D analysts don't have direct access :/. If you're working in Python, I'd thoroughly recommend Aaron's proposed utility; it's probably my favourite way to process the dumps.
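For a feel of what processing a dump involves, here is a rough sketch that streams revision text using only the Python standard library. It is not Aaron's utility and does not reproduce its API; the names `local_name` and `iter_revisions` are made up for illustration, and the code assumes the usual page/title/revision/id/text element layout of the export XML.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield (page_title, revision_id, text) for each revision in a dump.

    `source` may be a file path or a binary file object.
    """
    title = None
    # Only 'end' events are needed: an element is complete when it closes.
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            # <title> closes before any <revision> of the same page.
            title = elem.text
        elif name == "revision":
            rev_id = None
            text = None
            # Look at direct children only, so a contributor's <id>
            # is not mistaken for the revision id.
            for child in elem:
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = int(child.text)
                elif child_name == "text":
                    text = child.text or ""
            yield title, rev_id, text
            elem.clear()  # drop the revision text once it has been handed out
        elif name == "page":
            elem.clear()
```

Since `iter_revisions` accepts a binary file object, a bz2-compressed dump can be passed as `bz2.open(path)` without unpacking it first.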
On 25 January 2015 at 19:18, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Neta,
There are two ways to get revision text.
- Query the API. See https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brevisions and take special note of the "content" value of the rvprop parameter. This strategy is good when you want to process only a few revisions.
- Process the XML dumps. See http://dumps.wikimedia.org/backup-index.html for the files. If you are working in Python, I have some nice utilities for processing the XML dump files. See http://pythonhosted.org/mediawiki-utilities/core/xml_dump.html#mw-xml-dump for details. This strategy is good when you want to process the entire history of a wiki.
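A minimal sketch of the API route, assuming the standard `action=query` / `prop=revisions` parameters and JSON output. The helper names (`build_revision_query`, `fetch_revisions`) and the User-Agent string are hypothetical, chosen only for this example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_revision_query(title, limit=1):
    """Return an API URL asking for the content of a page's latest revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        # "content" is the rvprop value that makes the API return the text.
        "rvprop": "ids|timestamp|content",
        "rvlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_revisions(title, limit=1):
    """Perform the request and return the decoded JSON response."""
    request = urllib.request.Request(
        build_revision_query(title, limit),
        # Hypothetical client name; replace with something identifying you.
        headers={"User-Agent": "revision-text-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_revisions("Main Page")` performs one HTTP request per call, which is why this route suits a handful of revisions rather than a whole wiki's history.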
-Aaron
On Sun, Jan 25, 2015 at 2:24 PM, Neta Livneh neta.livneh@gmail.com wrote:
Hi,
I'm trying to reach the text table (for read-only purposes), but it seems that it is not available to me (it does not appear when I run SHOW TABLES).
Does anybody know why I don't have access, and whether I can get it? It is crucial for my research, as I need to analyse the text.
Thanks, Neta
On Thu, Jan 15, 2015 at 7:36 PM, Neta Livneh neta.livneh@gmail.com wrote:
yeah, I do have access - Thanks! I already used ssh, and also used the quarry tool for smaller quick queries.
Cheers, Neta
On Thu, Jan 15, 2015 at 7:35 PM, Neta Livneh neta.livneh@gmail.com wrote:
On Thu, Jan 15, 2015 at 4:42 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Sorry, old thread, but I wanted to point out that http://quarry.wmflabs.org seems like a good tool for this use case.
On Wednesday, December 24, 2014, Leila Zia leila@wikimedia.org wrote:
Hi Neta,
On Wed, Dec 24, 2014 at 7:19 AM, Neta Livneh neta.livneh@gmail.com wrote:
> Actually, this is a great opportunity to say that I would love to get you guys involved or at least hear insights from the analytics team regarding the project's direction.
Feel free to keep me in the loop for the latter.
Best, Leila
> On Wed, Dec 24, 2014 at 4:39 PM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
>> Here's the instructions that Christian gave with some screenshots and discussion:
>> https://meta.wikimedia.org/wiki/Research:Labs2/Getting_started_with_Tool_Lab...
>>
>> If you're just looking to run a few queries, you might consider http://quarry.wmflabs.org which requires no shell access -- just a Wikimedia sites account.
>>
>> -Aaron
>>
>> On Wed, Dec 24, 2014 at 7:22 AM, Christian Aistleitner christian@quelltextlich.at wrote:
>>> Hi Neta,
>>>
>>> On Wed, Dec 24, 2014 at 11:28:33AM +0200, Neta Livneh wrote:
>>> > For my project, we will need to sql queries on current wikipedia data (mostly revision history table).
>>> >
>>> > I already have a Gerrit account. Can I get SSH access for running such queries?
>>>
>>> It sounds like the redacted labs databases would nicely fit your use case. The easiest way to get access there is to apply for Tool Labs [1].
>>>
>>> To get access, please file a request through
>>>
>>> https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request
>>>
>>> (Many parts around the WMF are currently getting migrated to phabricator.wikimedia.org, so if someone knows a phabricator procedure for that please chime in!)
>>>
>>> Once you've got Tool Labs [1] access you can ssh to
>>>
>>> tools-login.wmflabs.org
>>>
>>> and running
>>>
>>> sql enwiki
>>>
>>> on that host connects you to labsdb's enwiki database and you can run your queries there (similar for other wikis).
>>>
>>> Have fun,
>>> Christian
>>>
>>> [1] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs has more information and links about Tool Labs.
>>>
>>> --
>>> ---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
>>> Companies' registry: 360296y in Linz
>>> Christian Aistleitner
>>> Kefermarkterstrasze 6a/3 Email: christian@quelltextlich.at
>>> 4293 Gutau, Austria Phone: +43 7946 / 20 5 81
>>> Fax: +43 7946 / 20 5 81
>>> Homepage: http://quelltextlich.at/
>>> ---------------------------------------------------------------
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics