Thanks Aaron and Oliver!
Strategy 2 sounds like the right way to go.
By the way, I wrote a document [1] that describes the features to look for when estimating whether a page was translated or originally written. Your comments would be highly appreciated.
[1] How_to_detect_translated_articles https://www.mediawiki.org/w/index.php?title=Wikipedia_article_translation_metrics/How_to_detect_translated_articles&redirect=no
Cheers, Neta
On Mon, Jan 26, 2015 at 5:23 AM, Oliver Keyes okeyes@wikimedia.org wrote:
Yup. For context: because of the scale of Wikimedia's MediaWiki instances, we actually store revision contents in their own cluster, not in the pertinent field within the MediaWiki database schema; that field instead acts as a pointer to where the content really lives. One of the consequences of this is that even the R&D analysts don't have direct access :/. If you're working in Python, I'd thoroughly recommend Aaron's proposed utility; it's probably my favourite way to process the dumps.
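For readers who want a feel for what processing the dumps involves, here is a minimal sketch that uses only the Python standard library. It is not the utility recommended above, just a stand-in built on xml.etree.ElementTree. The function names and the tiny SAMPLE document are made up for illustration, and the element names (page, title, revision, id, text) are assumed to follow the usual MediaWiki export layout.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for a dump file (illustrative only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>10</id>
      <contributor><username>A</username><id>5</id></contributor>
      <text>first version</text>
    </revision>
    <revision>
      <id>11</id>
      <text>second version</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revision_texts(stream):
    """Yield (page_title, revision_id, text) for every revision in the stream.

    The stream is parsed incrementally, so the whole file is never held
    in memory at once.
    """
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id = None
            text = ""
            # Only direct children are inspected, so the contributor's
            # nested <id> is not mistaken for the revision id.
            for child in elem:
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = int(child.text)
                elif child_name == "text":
                    text = child.text or ""
            yield title, rev_id, text
            elem.clear()  # release the revision's children once consumed
        elif name == "page":
            elem.clear()


if __name__ == "__main__":
    for page_title, rev_id, text in iter_revision_texts(io.BytesIO(SAMPLE)):
        print(page_title, rev_id, len(text))
```

The published dumps are compressed, so in practice the stream would likely come from something like bz2.open(path, "rb") rather than an in-memory buffer.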
On 25 January 2015 at 19:18, Aaron Halfaker ahalfaker@wikimedia.org wrote:
Neta,
There are two ways to get revision text.
- Query the API. See https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brevisions Take special note of the "content" value of the rvprop parameter. This strategy is good when you want to process only a few revisions.
- Process the XML dumps. http://dumps.wikimedia.org/backup-index.html If you are working in Python, I have some nice utilities for processing the XML dump files. See http://pythonhosted.org/mediawiki-utilities/core/xml_dump.html#mw-xml-dump This strategy is good when you want to process the entire history of a wiki.
-Aaron
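As a rough illustration of the API strategy, the sketch below builds a query with rvprop=content and pulls the wikitext out of the JSON response. The helper names are hypothetical. The rvslots and formatversion parameters, and the response shape that extract_content expects, are assumptions based on the current API rather than anything stated in this thread, so check them against the API help page linked above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title, api_url=API_URL):
    """Build a URL asking for the content of a page's latest revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",   # the value highlighted in the message above
        "rvslots": "main",     # assumption: slot-based content layout
        "titles": title,
        "format": "json",
        "formatversion": "2",  # assumption: pages returned as a list
    }
    return api_url + "?" + urlencode(params)


def extract_content(response):
    """Return the wikitext of the first page, or None if it has no revisions."""
    page = response["query"]["pages"][0]
    revisions = page.get("revisions")
    if not revisions:
        return None
    return revisions[0]["slots"]["main"]["content"]


def fetch_latest_text(title):
    """Fetch and return the latest wikitext for a title (needs network access)."""
    request = Request(
        build_query_url(title),
        headers={"User-Agent": "revision-text-sketch/0.1 (example)"},
    )
    with urlopen(request) as handle:
        return extract_content(json.load(handle))
```

This makes one request per page, which is fine for a handful of revisions; for the full history of a wiki the dumps remain the better fit.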
On Sun, Jan 25, 2015 at 2:24 PM, Neta Livneh neta.livneh@gmail.com wrote:
Hi,
I'm trying to reach the text table (for read-only purposes), but it seems that it is not available to me (it is not in the list when I run SHOW TABLES).
Does anybody know why I don't have access and whether I can get it? It is crucial for my research as I need to analyse the text.
Thanks, Neta
On Thu, Jan 15, 2015 at 7:36 PM, Neta Livneh neta.livneh@gmail.com wrote:
Yeah, I do have access. Thanks! I have already used ssh, and also used the Quarry tool for smaller quick queries.
Cheers, Neta
On Thu, Jan 15, 2015 at 7:35 PM, Neta Livneh neta.livneh@gmail.com wrote:
On Thu, Jan 15, 2015 at 4:42 PM, Dan Andreescu dandreescu@wikimedia.org wrote:
Sorry, old thread, but I wanted to point out that http://quarry.wmflabs.org seems like a good tool for this use case.
On Wednesday, December 24, 2014, Leila Zia leila@wikimedia.org wrote:
> Hi Neta,
>
> On Wed, Dec 24, 2014 at 7:19 AM, Neta Livneh <neta.livneh@gmail.com> wrote:
>>
>> Actually, this is a great opportunity to say that I would love to get
>> you guys involved or at least hear insights from the analytics team
>> regarding the project's direction.
>
> Feel free to keep me in the loop for the latter.
>
> Best,
> Leila
>
>> On Wed, Dec 24, 2014 at 4:39 PM, Aaron Halfaker
>> ahalfaker@wikimedia.org wrote:
>>>
>>> Here's the instructions that Christian gave with some screenshots
>>> and discussion:
>>> https://meta.wikimedia.org/wiki/Research:Labs2/Getting_started_with_Tool_Lab...
>>>
>>> If you're just looking to run a few queries, you might consider
>>> http://quarry.wmflabs.org which requires no shell access -- just a
>>> Wikimedia sites account.
>>>
>>> -Aaron
>>>
>>> On Wed, Dec 24, 2014 at 7:22 AM, Christian Aistleitner
>>> christian@quelltextlich.at wrote:
>>>>
>>>> Hi Neta,
>>>>
>>>> On Wed, Dec 24, 2014 at 11:28:33AM +0200, Neta Livneh wrote:
>>>> > For my project, we will need to sql queries on current wikipedia
>>>> > data (mostly revision history table).
>>>> >
>>>> > I already have a Gerrit account. Can I get SSH access for running
>>>> > such queries?
>>>>
>>>> It sounds like the redacted labs databases would nicely fit your
>>>> use case. The easiest way to get access there is to apply for Tool
>>>> Labs [1].
>>>>
>>>> To get access, please file a request through
>>>>
>>>> https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request
>>>>
>>>> (Many parts around the WMF are currently getting migrated to
>>>> phabricator.wikimedia.org, so if someone knows a phabricator
>>>> procedure for that please chime in!)
>>>>
>>>> Once you've got Tool Labs [1] access you can ssh to
>>>>
>>>> tools-login.wmflabs.org
>>>>
>>>> and running
>>>>
>>>> sql enwiki
>>>>
>>>> on that host connects you to labsdb's enwiki database and you can
>>>> run your queries there (similar for other wikis).
>>>>
>>>> Have fun,
>>>> Christian
>>>>
>>>> [1] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs
>>>> has more information and links about Tool Labs.
>>>>
>>>> --
>>>> ---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
>>>> Companies' registry: 360296y in Linz
>>>> Christian Aistleitner
>>>> Kefermarkterstrasze 6a/3 Email: christian@quelltextlich.at
>>>> 4293 Gutau, Austria Phone: +43 7946 / 20 5 81
>>>> Fax: +43 7946 / 20 5 81
>>>> Homepage: http://quelltextlich.at/
>>>> ---------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
--
Oliver Keyes
Research Analyst
Wikimedia Foundation