On Thu, Jan 5, 2017 at 6:20 AM, Bertel Teilfeldt Hansen <geilfeldt(a)gmail.com> wrote:
> Hi Gabriel,
> Oh yeah, I see now that the REST API doesn't mind parallel requests. I was
> going off of the etiquette section in the documentation for the other API
> (https://www.mediawiki.org/wiki/API:Etiquette), which prefers requests in
> series.
> Ah, ok - that caveat is actually quite relevant for my project. It
> requires all revisions of certain pages along with all revisions of their
> talk pages (along with a bunch of other stuff). So perhaps the REST API is
> not for me. I am not specifically targeting especially frequently edited
> articles; rather, I'm looking at articles related to particular real-world
> conflicts (international and civil wars). I am a postdoc at Copenhagen
> University funded by the Danish government (grant information at the bottom
> of this page:
> http://ufm.dk/en/research-and-innovation/funding-programmes-for-research-and-innovation/who-has-received-funding/2015/postdoc-grants-from-the-danish-council-for-independent-research-social-sciences-february-2015?set_language=en&cl=en).
> Let me know if you want more identification or anything.
My concern was mainly about the overall volume of uncached requests. It
sounds like what you are interested in is a fairly small subset of overall
pages, so I think this should be fine. Perhaps don't max out the
parallelism in this case. In any case, making the same requests to the
action API will result in even more on-demand parses, as only the very
latest revision is cached in that case.
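
As a rough sketch of what modest parallelism could look like in Python (the
endpoint path follows the REST API's /page/html/{title}/{revision} route; the
revision IDs and contact address are placeholders, so treat this as untested):

    import requests
    from concurrent.futures import ThreadPoolExecutor

    BASE = "https://en.wikipedia.org/api/rest_v1/page/html"
    HEADERS = {"User-Agent": "conflict-research/0.1 (you@example.org)"}

    def fetch_revision(title, rev_id):
        # One revision per request, as discussed earlier in the thread.
        resp = requests.get("%s/%s/%s" % (BASE, title, rev_id), headers=HEADERS)
        resp.raise_for_status()
        return resp.text

    # A handful of workers rather than maxing out the allowed rate.
    rev_ids = [123456789, 123456790]  # placeholder revision IDs
    with ThreadPoolExecutor(max_workers=4) as pool:
        pages = list(pool.map(lambda r: fetch_revision("Western_Sahara", r), rev_ids))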
> I actually have another question about the REST API, if that's ok. I'm
> using it to get page views over time for the pages that I'm interested in.
> However, the data don't seem to stretch very far back in time - is that
> correct? And if so, is there a better way of getting page views (short of
> using the raw files at https://dumps.wikimedia.org/other/pagecounts-raw/)?
Yes, the pageview API is relatively new, and only has recent data at this
point. I am not certain if the analytics team plans to back-fill more
historic data over time. I vaguely remember that there might be
difficulties with changes in what is considered a pageview, so the numbers
might not be completely comparable. I cc'ed Nuria and Dan from the
analytics team, who should be able to speak to this.
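
If it helps in the meantime, the per-article endpoint can be queried roughly
like this (article, date range, and contact address are placeholders, and I'm
writing the response fields from memory, so verify them against the docs):

    import requests

    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "en.wikipedia/all-access/all-agents/Western_Sahara/daily/"
           "20160101/20161231")
    resp = requests.get(url, headers={"User-Agent": "conflict-research/0.1 (you@example.org)"})
    resp.raise_for_status()
    # Each item should carry a timestamp and a view count for that day.
    for item in resp.json().get("items", []):
        print(item["timestamp"], item["views"])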
> Thanks for your help so far!
>
> Bertel
>
> 2017-01-03 19:25 GMT+01:00 Gabriel Wicke <gwicke(a)wikimedia.org>:
>
>> Bertel,
>>
>> On Mon, Jan 2, 2017 at 7:40 AM, Bertel Teilfeldt Hansen <geilfeldt(a)gmail.com> wrote:
>>
>>> Hi Gabriel,
>>>
>>> The REST API looks promising - thank you!
>>>
>>> Having played around with it a bit, I seem to only be able to get one
>>> revision per request. Is that correct, or am I doing something wrong?
>>>
>>
>>
>> This is correct. The requests themselves are quite cheap, and can be
>> parallelized up to the rate limit set out in the API documentation.
>>
>>
>>
>>> My project requires every revision and its references from a large
>>> number of articles, so that would make a lot of requests. The regular API
>>> allows for multiple revisions per request (only with action=query, though).
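>>>
>>> (For example, something along these lines fetches up to 50 revisions of
>>> one page in a single call; the title and limit here are placeholders:)
>>>
>>>     import requests
>>>
>>>     params = {
>>>         "action": "query", "format": "json",
>>>         "titles": "Western Sahara",
>>>         "prop": "revisions",
>>>         "rvprop": "ids|timestamp|content",
>>>         "rvlimit": 50,
>>>     }
>>>     resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
>>>     # The response is keyed by page ID; older revisions need continuation.
>>>     page = next(iter(resp.json()["query"]["pages"].values()))
>>>     for rev in page["revisions"]:
>>>         print(rev["revid"], rev["timestamp"])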
>>>
>>
>>
>> There is a caveat here in that we currently don't store all revisions for
>> all articles. This means that requests for really old revisions will
>> trigger a more expensive on-demand parse, just as with the action API. Can
>> you say more about the number of articles you are targeting, and how this
>> list is selected? Regarding the selection, I am mainly wondering if you are
>> targeting especially frequently edited articles.
>>
>> Thanks,
>>
>> Gabriel
>>
>>
>>>
>>> Thanks!
>>>
>>> Bertel
>>>
>>>
>>> 2016-12-21 17:01 GMT+01:00 Gabriel Wicke <gwicke(a)wikimedia.org>:
>>>
>>>> Bertel, another option is to use the REST API:
>>>>
>>>>
>>>>    - HTML for a specific revision:
>>>>      https://en.wikipedia.org/api/rest_v1/#!/Page_content/getFormatRevision
>>>>    - Within this HTML, references are marked up like this:
>>>>      https://www.mediawiki.org/wiki/Specs/HTML/1.3.0/Extensions/Cite
>>>>      Any HTML or XML DOM parser can be used to extract this information
>>>>      (see the sketch below).
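>>>>
>>>> As a rough illustration with lxml (the typeof values are my reading of
>>>> the Cite spec above, and the title/revision in the URL are placeholders):
>>>>
>>>>     import requests
>>>>     from lxml import html
>>>>
>>>>     url = "https://en.wikipedia.org/api/rest_v1/page/html/Western_Sahara/123456789"
>>>>     doc = html.fromstring(requests.get(url).content)
>>>>
>>>>     # typeof can hold several space-separated values, so match tokens
>>>>     # exactly rather than with substring containment.
>>>>     def by_typeof(token):
>>>>         return [el for el in doc.xpath("//*[@typeof]")
>>>>                 if token in el.get("typeof").split()]
>>>>
>>>>     refs = by_typeof("mw:Extension/ref")             # inline reference markers
>>>>     ref_lists = by_typeof("mw:Extension/references") # rendered reference lists
>>>>     print(len(refs), "refs,", len(ref_lists), "reference lists")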
>>>>
>>>> Hope this helps,
>>>>
>>>> Gabriel
>>>>
>>>> On Wed, Dec 21, 2016 at 3:20 AM, Bertel Teilfeldt Hansen <geilfeldt(a)gmail.com> wrote:
>>>>
>>>>> Hi Brad and Gergo,
>>>>>
>>>>> Thanks for your responses!
>>>>>
>>>>> @Brad: Yeah, that was also my impression, but I wasn't sure. Seemed
>>>>> strange that the example in the official docs would point to a place
>>>>> where the feature was disabled. Thank you for clearing that up!
>>>>>
>>>>> @Gergo: I've been looking at action=parse, but as far as I understand
>>>>> it, it is limited to one revision per API request, which makes it quite
>>>>> slow to get a bunch of older revisions from a large number of articles.
>>>>> action=query&prop=revisions&rvprop=content omits the references from
>>>>> the output (just gives the string "{{reflist}}" after "References").
>>>>> "mwrefs" sounds very promising, though! I will definitely check that
>>>>> out - thank you!
>>>>>
>>>>> Best,
>>>>>
>>>>> Bertel
>>>>>
>>>>> 2016-12-20 19:51 GMT+01:00 Gergo Tisza <gtisza(a)wikimedia.org>:
>>>>>
>>>>>> On Tue, Dec 20, 2016 at 10:18 AM, Bertel Teilfeldt Hansen <geilfeldt(a)gmail.com> wrote:
>>>>>>
>>>>>>> And is there no way of getting references through the API?
>>>>>>>
>>>>>>
>>>>>> There is no nice way, but you can always get the HTML (or the parse
>>>>>> tree, depending on whether you want parsed or raw refs) and process
>>>>>> it; references are not hard to extract. For the wikitext version,
>>>>>> there is a python tool:
>>>>>> https://github.com/mediawiki-utilities/python-mwrefs
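>>>>>>
>>>>>> (Not the mwrefs API itself, just a simplified stand-in showing the idea
>>>>>> of pulling raw refs out of wikitext; nested and self-closing <ref> tags
>>>>>> need more care than this:)
>>>>>>
>>>>>>     import re
>>>>>>
>>>>>>     # Capture the contents of simple paired <ref ...>...</ref> tags.
>>>>>>     REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
>>>>>>
>>>>>>     def extract_refs(wikitext):
>>>>>>         return REF_RE.findall(wikitext)
>>>>>>
>>>>>>     print(extract_refs("Fact.<ref>{{cite web|url=http://example.org}}</ref>"))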
>>>>>>
--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation