So DataSift is doing something like this already: they have a stream
of edits to the English edition of Wikipedia that carries content in near
real-time [1]. I'm not saying we should use them, but it might be instructive
if we can figure out (or if the API folks know) what they are doing.
[1]
http://datasift.com/source/44/wikipedia
- Scott
On Sat, Dec 13, 2014 at 8:54 AM, Toby Negrin <tnegrin(a)wikimedia.org> wrote:
Hi Max -- let me ping the API folks. I don't think we researchers can make
the final call on this.
-Toby
On Fri, Dec 12, 2014 at 2:53 PM, Maximilian Klein <isalix(a)gmail.com>
wrote:
Hello Researchers,
I've been playing with Recent Changes Stream Interface
<https://wikitech.wikimedia.org/wiki/RCStream> recently, and have
started trying to use the API's "*action=compare*" to look at every diff
of every wiki in real time. The goal is to produce real-time analytics on
the content that's being added or deleted. The only problem is that it will
really hammer the API with lots of reads, since it doesn't have a batch
interface. Can I spawn multiple network threads and do 10+ reads per second
forever without the API complaining? Can I warn someone about this and get
a special exemption for research purposes?
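For concreteness, here is a minimal sketch of that per-diff approach: for each event coming off RCStream, issue one "action=compare" call against the wiki's API. The endpoint URL and the shape of the stream event (old/new revision ids) are assumptions on my part, not something RCStream guarantees:

```python
import json
import urllib.parse
import urllib.request

# Assumed API endpoint; each wiki in the stream has its own.
API = "https://en.wikipedia.org/w/api.php"

def compare_params(old_rev, new_rev):
    """Build the query string for a single action=compare call."""
    return {
        "action": "compare",
        "fromrev": str(old_rev),
        "torev": str(new_rev),
        "format": "json",
    }

def fetch_diff(old_rev, new_rev):
    """Fetch the site-rendered diff for one revision pair (one API read)."""
    url = API + "?" + urllib.parse.urlencode(compare_params(old_rev, new_rev))
    with urllib.request.urlopen(url) as resp:
        # The diff comes back as HTML table rows under compare["*"].
        return json.load(resp)["compare"]["*"]
```

Note this is exactly one HTTP round trip per revision pair, which is where the "10+ reads per second forever" load comes from.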
The other thing to do would be to use "*action=query*" to get the
revisions in batches and do the diffing myself, but then I'm not guaranteed
to be diffing in the same way that the site does.
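A sketch of that batch alternative, assuming "action=query" with "prop=revisions" and a pipe-joined "revids" list to pull many revisions per request, then difflib for local diffing (which, as noted, will not match MediaWiki's own diff algorithm):

```python
import difflib

def batch_revision_params(rev_ids):
    """One action=query request covering several revisions by id."""
    return {
        "action": "query",
        "prop": "revisions",
        "revids": "|".join(str(r) for r in rev_ids),
        "rvprop": "ids|content",
        "format": "json",
    }

def local_diff(old_text, new_text):
    """Unified diff of two revision texts; NOT identical to the site's diff."""
    return "\n".join(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""))
```

The trade-off is one request per batch instead of one per diff, at the cost of shipping full revision text over the wire and accepting a different diff algorithm.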
What techniques would you recommend?
Make a great day,
Max Klein ‽
http://notconfusing.com/
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Scott Hale
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.hale(a)oii.ox.ac.uk