Thanks Aaron and Oliver!

Strategy 2 sounds like the right way to go.
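
As a rough sketch of what strategy 2 involves, the snippet below streams page revisions out of dump-style XML using only Python's standard library. It is a stand-in, not Aaron's utility: the sample XML is invented for illustration, and `iter_revisions` and `local` are hypothetical helper names. Real dump files are far larger and put a versioned namespace on every tag, which is why the tag names are compared without their namespace.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a full-history dump; invented for illustration only.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id><text>First draft.</text></revision>
    <revision><id>2</id><text>Second draft, expanded.</text></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Drop the XML namespace prefix that real dumps put on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(stream):
    """Yield (page_title, revision_id, text) while streaming through the dump."""
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id = text = None
            # Only direct children are inspected, so a nested contributor
            # <id> is never mistaken for the revision ID.
            for child in elem:
                if local(child.tag) == "id":
                    rev_id = int(child.text)
                elif local(child.tag) == "text":
                    text = child.text or ""
            yield title, rev_id, text
            elem.clear()  # release the revision's children once handled
        elif name == "page":
            elem.clear()


revisions = list(iter_revisions(io.BytesIO(SAMPLE_DUMP)))
for title, rev_id, text in revisions:
    print(title, rev_id, len(text))
```

For a real compressed dump, the stream would presumably come from something like `bz2.open(path, "rb")` instead of the in-memory sample.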

By the way, I wrote a document [1] that describes the features to look for when estimating whether a page was translated or originally written.
Your comments would be highly appreciated.

[1] How_to_detect_translated_articles

Cheers,
Neta

On Mon, Jan 26, 2015 at 5:23 AM, Oliver Keyes <okeyes@wikimedia.org> wrote:
Yup. For context: because of the scale of Wikimedia's MediaWiki
instances, we actually store revision contents in their own cluster,
not in the pertinent field within the MediaWiki database schema - that
field instead acts as a pointer to where the content really lives. One
of the consequences of this is that even the R&D analysts don't have
direct access :/. If you're working in Python, I'd thoroughly
recommend Aaron's proposed utility; it's probably my favourite way to
process the dumps.
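
For the occasional one-off lookup, the API route (strategy 1 in Aaron's mail below) can be sketched as follows. This only builds the request URL and does not contact the server. The revision IDs are made up, `revision_content_url` is a hypothetical helper name, and the `revids` and `format` parameters are standard API parameters that were not mentioned in this thread; only `rvprop` with the "content" value comes from Aaron's description.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def revision_content_url(rev_ids):
    """Build a query URL asking for the wikitext of the given revision IDs."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": "|".join(str(r) for r in rev_ids),
        "rvprop": "ids|content",  # "content" is what returns the revision text
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = revision_content_url([12345, 67890])  # made-up revision IDs
print(url)

# Decode the query string again to check what the server would receive.
query = parse_qs(urlsplit(url).query)
print(query["rvprop"], query["revids"])
```

Fetching the URL (for example with `urllib.request.urlopen`) and decoding the JSON response is left out here, since it needs network access.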

On 25 January 2015 at 19:18, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
> Neta,
>
> There are two ways to get revision text.
>
> 1. Query the API.  See
> https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brevisions
> Take special note of the "content" value of the rvprop parameter.  This
> strategy is good when you want to process only a few revisions.
>
> 2. Process the XML dumps.  http://dumps.wikimedia.org/backup-index.html  If
> you are working in Python, I have some nice utilities for processing the XML
> dump files.  See
> http://pythonhosted.org/mediawiki-utilities/core/xml_dump.html#mw-xml-dump
> This strategy is good when you want to process the entire history of a wiki.
>
> -Aaron
>
> On Sun, Jan 25, 2015 at 2:24 PM, Neta Livneh <neta.livneh@gmail.com> wrote:
>>
>> Hi,
>>
>> I'm trying to reach the text table (for read-only purposes), but it seems
>> that it is not available to me (it is not listed when I run SHOW
>> TABLES).
>>
>> Does anybody know why I don't have access and whether I can get it? It is
>> crucial for my research as I need to analyse the text.
>>
>> Thanks,
>> Neta
>>
>> On Thu, Jan 15, 2015 at 7:36 PM, Neta Livneh <neta.livneh@gmail.com>
>> wrote:
>>>
>>> Yeah, I do have access. Thanks!
>>> I already used ssh, and also used the Quarry tool for smaller quick
>>> queries.
>>>
>>> Cheers,
>>> Neta
>>>
>>>
>>> On Thu, Jan 15, 2015 at 7:35 PM, Neta Livneh <neta.livneh@gmail.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Jan 15, 2015 at 4:42 PM, Dan Andreescu
>>>> <dandreescu@wikimedia.org> wrote:
>>>>>
>>>>> Sorry, old thread, but I wanted to point out that
>>>>> http://quarry.wmflabs.org seems like a good tool for this use case.
>>>>>
>>>>>
>>>>> On Wednesday, December 24, 2014, Leila Zia <leila@wikimedia.org> wrote:
>>>>>>
>>>>>> Hi Neta,
>>>>>>
>>>>>> On Wed, Dec 24, 2014 at 7:19 AM, Neta Livneh <neta.livneh@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Actually, this is a great opportunity to say that I would love to get
>>>>>>> you guys involved or at least hear insights from the analytics team
>>>>>>> regarding the project's direction.
>>>>>>
>>>>>>
>>>>>> Feel free to keep me in the loop for the latter.
>>>>>>
>>>>>> Best,
>>>>>> Leila
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Dec 24, 2014 at 4:39 PM, Aaron Halfaker
>>>>>>> <ahalfaker@wikimedia.org> wrote:
>>>>>>>>
>>>>>>>> Here are the instructions that Christian gave, with some screenshots
>>>>>>>> and discussion:
>>>>>>>> https://meta.wikimedia.org/wiki/Research:Labs2/Getting_started_with_Tool_Labs
>>>>>>>>
>>>>>>>> If you're just looking to run a few queries, you might consider
>>>>>>>> http://quarry.wmflabs.org which requires no shell access -- just a Wikimedia
>>>>>>>> sites account.
>>>>>>>>
>>>>>>>> -Aaron
>>>>>>>>
>>>>>>>> On Wed, Dec 24, 2014 at 7:22 AM, Christian Aistleitner
>>>>>>>> <christian@quelltextlich.at> wrote:
>>>>>>>>>
>>>>>>>>> Hi Neta,
>>>>>>>>>
>>>>>>>>> On Wed, Dec 24, 2014 at 11:28:33AM +0200, Neta Livneh wrote:
>>>>>>>>> > For my project, we will need to run SQL queries on current Wikipedia
>>>>>>>>> > data (mostly the revision history table).
>>>>>>>>> >
>>>>>>>>> > I already have a Gerrit account. Can I get SSH access for running
>>>>>>>>> > such queries?
>>>>>>>>>
>>>>>>>>> It sounds like the redacted labs databases would nicely fit your
>>>>>>>>> use case. The easiest way to get access there is to apply for
>>>>>>>>> Tool Labs [1].
>>>>>>>>>
>>>>>>>>> To get access, please file a request through
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request
>>>>>>>>>
>>>>>>>>> (Many parts around the WMF are currently getting migrated to
>>>>>>>>> phabricator.wikimedia.org, so if someone knows a phabricator
>>>>>>>>> procedure for that please chime in!)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Once you've got Tool Labs [1] access you can ssh to
>>>>>>>>>
>>>>>>>>>   tools-login.wmflabs.org
>>>>>>>>>
>>>>>>>>> and running
>>>>>>>>>
>>>>>>>>>   sql enwiki
>>>>>>>>>
>>>>>>>>> on that host connects you to labsdb's enwiki database and you can
>>>>>>>>> run your queries there (similar for other wikis).
>>>>>>>>>
>>>>>>>>> Have fun,
>>>>>>>>> Christian
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs
>>>>>>>>> has more information and links about Tool Labs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
>>>>>>>>>                            Companies' registry: 360296y in Linz
>>>>>>>>> Christian Aistleitner
>>>>>>>>> Kefermarkterstrasze 6a/3     Email:  christian@quelltextlich.at
>>>>>>>>> 4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
>>>>>>>>>                              Fax:            +43 7946 / 20 5 81
>>>>>>>>>                              Homepage: http://quelltextlich.at/
>>>>>>>>> ---------------------------------------------------------------
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Analytics mailing list
>>>>>>>>> Analytics@lists.wikimedia.org
>>>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>



--
Oliver Keyes
Research Analyst
Wikimedia Foundation
