Thanks Aaron and Oliver!
Strategy 2 sounds like the right way to go.
By the way, I wrote a document [1] that describes the features to look
for when trying to estimate whether a page was translated or originally
written.
Your comments are highly appreciated.
[1] How_to_detect_translated_articles
<https://www.mediawiki.org/w/index.php?title=Wikipedia_article_translation_metrics/How_to_detect_translated_articles&redirect=no>
Cheers,
Neta
On Mon, Jan 26, 2015 at 5:23 AM, Oliver Keyes <okeyes(a)wikimedia.org> wrote:
Yup. For context: because of the scale of Wikimedia's MediaWiki
instances, we actually store revision contents in their own cluster,
not in the pertinent field within the MediaWiki database schema - that
field instead acts as a pointer to where the content really lives. One
of the consequences of this is that even the R&D analysts don't have
direct access :/. If you're working in Python, I'd thoroughly
recommend Aaron's proposed utility; it's probably my favourite way to
process the dumps.
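For anyone without Aaron's utility at hand, the dump-processing idea can be roughly sketched with Python's standard library alone. This is a stand-in, not the mediawiki-utilities API: the function names are hypothetical, and the element names (page, title, revision, id, text) are assumed from the usual layout of the XML dump files.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield (page title, revision id, text) for every revision in a dump.

    `source` is a filename or a binary file object. Hypothetical helper;
    the element names are assumptions about the dump layout.
    """
    for _event, page in ET.iterparse(source, events=("end",)):
        if local_name(page.tag) != "page":
            continue
        title = None
        for child in page:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                rev_id, text = None, ""
                # Only direct children are inspected, so the contributor's
                # nested <id> is not mistaken for the revision id.
                for field in child:
                    field_name = local_name(field.tag)
                    if field_name == "id":
                        rev_id = int(field.text)
                    elif field_name == "text":
                        text = field.text or ""
                yield title, rev_id, text
        page.clear()  # release the finished page's revisions
```

The dump files are normally compressed, so in practice something like `bz2.open(path, "rb")` would be passed as `source` instead of a plain filename.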
On 25 January 2015 at 19:18, Aaron Halfaker <ahalfaker(a)wikimedia.org>
wrote:
Neta,
There are two ways to get revision text.
1. Query the API. See
https://en.wikipedia.org/w/api.php?action=help&modules=query%2Brevisions
Take special note of the "content" value of the rvprop parameter. This
strategy is good when you want to process only a few revisions.
2. Process the XML dumps. See
http://dumps.wikimedia.org/backup-index.html
If you are working in Python, I have some nice utilities for processing
the XML dump files. See
http://pythonhosted.org/mediawiki-utilities/core/xml_dump.html#mw-xml-dump
This strategy is good when you want to process the entire history of a
wiki.
-Aaron
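As a rough illustration of strategy 1, the sketch below builds a query URL that asks for revision content through rvprop and pulls the wikitext out of a decoded JSON reply. The helper names, the extra parameters (rvlimit, format), and the assumed response layout (legacy JSON format, with the text under the "*" key) are assumptions for illustration, not something stated in this thread; the API help page linked above is the place to check the current parameters.

```python
from urllib.parse import urlencode

# Endpoint taken from the API help link in the message above.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_revisions_url(title, limit=1):
    """Build a query URL asking for the content of a page's latest revisions.

    Hypothetical helper: only the revisions query and the "content" value
    of rvprop come from the thread; the rest are illustrative choices.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "ids|content",
        "rvlimit": limit,
        "titles": title,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def extract_texts(response):
    """Return {revision id: wikitext} from a decoded JSON response.

    Assumes the legacy JSON layout, where the text sits under the "*" key
    of each revision; revisions without that key are skipped.
    """
    texts = {}
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for rev in page.get("revisions", []):
            if "*" in rev:
                texts[rev["revid"]] = rev["*"]
    return texts
```

Fetching the URL (for example with `urllib.request.urlopen`) and decoding the body with `json.load` would produce the dictionary that `extract_texts` expects.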
On Sun, Jan 25, 2015 at 2:24 PM, Neta Livneh <neta.livneh(a)gmail.com>
wrote:
>
> Hi,
>
> I'm trying to reach the text table (for read-only purposes), but it seems
> that it is not available to me (it is not listed when I run SHOW
> TABLES).
>
> Does anybody know why I don't have access and if I can get one? It is
> crucial for my research as I need to analyse the text.
>
> Thanks,
> Neta
>
> On Thu, Jan 15, 2015 at 7:36 PM, Neta Livneh <neta.livneh(a)gmail.com>
> wrote:
>>
>> yeah, I do have access - Thanks!
>> I already used ssh, and also used the quarry tool for smaller quick
>> queries.
>>
>> Cheers,
>> Neta
>>
>>
>> On Thu, Jan 15, 2015 at 7:35 PM, Neta Livneh <neta.livneh(a)gmail.com>
>> wrote:
>>>
>>>
>>>
>>> On Thu, Jan 15, 2015 at 4:42 PM, Dan Andreescu
>>> <dandreescu(a)wikimedia.org> wrote:
>>>>
>>>> Sorry, old thread, but I wanted to point out that
>>>> http://quarry.wmflabs.org seems like a good tool for this use case.
>>>>
>>>>
>>>> On Wednesday, December 24, 2014, Leila Zia <leila(a)wikimedia.org> wrote:
>>>>
>>>> Hi Neta,
>>>>
>>>> On Wed, Dec 24, 2014 at 7:19 AM, Neta Livneh <neta.livneh(a)gmail.com>
>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Actually, this is a great opportunity to say that I would love to get
>>>>>> you guys involved or at least hear insights from the analytics team
>>>>>> regarding the project's direction.
>>>>>
>>>>>
>>>>> Feel free to keep me in the loop for the latter.
>>>>>
>>>>> Best,
>>>>> Leila
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 24, 2014 at 4:39 PM, Aaron Halfaker
>>>>>> <ahalfaker(a)wikimedia.org> wrote:
>>>>>>>
>>>>>>> Here are the instructions that Christian gave, with some screenshots
>>>>>>> and discussion:
>>>>>>>
>>>>>>> https://meta.wikimedia.org/wiki/Research:Labs2/Getting_started_with_Tool_La…
>>>>>>>
>>>>>>> If you're just looking to run a few queries, you might consider
>>>>>>> http://quarry.wmflabs.org which requires no shell access -- just a
>>>>>>> Wikimedia sites account.
>>>>>>>
>>>>>>> -Aaron
>>>>>>>
>>>>>>> On Wed, Dec 24, 2014 at 7:22 AM, Christian Aistleitner
>>>>>>> <christian(a)quelltextlich.at> wrote:
>>>>>>>>
>>>>>>>> Hi Neta,
>>>>>>>>
>>>>>>>> On Wed, Dec 24, 2014 at 11:28:33AM +0200, Neta Livneh wrote:
>>>>>>>> > For my project, we will need to run SQL queries on current
>>>>>>>> > Wikipedia data (mostly the revision history table).
>>>>>>>> >
>>>>>>>> > I already have a Gerrit account. Can I get SSH access for running
>>>>>>>> > such queries?
>>>>>>>>
>>>>>>>> It sounds like the redacted labs databases would nicely fit your
>>>>>>>> use case. The easiest way to get access there is to apply for Tool
>>>>>>>> Labs [1].
>>>>>>>>
>>>>>>>> To get access, please file a request through
>>>>>>>>
>>>>>>>> https://wikitech.wikimedia.org/wiki/Special:FormEdit/Tools_Access_Request
>>>>>>
>>>>>> (Many parts around the WMF are currently getting migrated to
>>>>>> phabricator.wikimedia.org, so if someone knows a phabricator
>>>>>> procedure for that please chime in!)
>>>>>>
>>>>>>
>>>>>> Once you've got Tool Labs [1] access you can ssh to
>>>>>>
>>>>>> tools-login.wmflabs.org
>>>>>>
>>>>>> and running
>>>>>>
>>>>>> sql enwiki
>>>>>>
>>>>>> on that host connects you to labsdb's enwiki database and you can
>>>>>> run your queries there (similar for other wikis).
>>>>>>
>>>>>> Have fun,
>>>>>> Christian
>>>>>>
>>>>>>
>>>>>>
>>>>>> [1] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs
>>>>>> has more information and links about Tool Labs.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
>>>>>> Companies' registry: 360296y in Linz
>>>>>> Christian Aistleitner
>>>>>> Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at
>>>>>> 4293 Gutau, Austria Phone: +43 7946 / 20 5 81
>>>>>> Fax: +43 7946 / 20 5 81
>>>>>> Homepage: http://quelltextlich.at/
>>>>>> ---------------------------------------------------------------
>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> Analytics(a)lists.wikimedia.org
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
--
Oliver Keyes
Research Analyst
Wikimedia Foundation