Hello,
I am trying to get the number of revisions for some articles, but I can't find any query that offers this over the API. I only found this answer on Stack Overflow: http://stackoverflow.com/questions/7136343/wikipedia-api-how-to-get-the-numb...
Is this still unsolved? It would save me a lot of time, and I think this is one of the most important pieces of metadata about an article. I will use it to download only articles with between 500 and 5000 revisions, because fewer is useless for our research and more is too expensive to compute.
Thanks for your answer.
Cheers, Stefan
This type of data is very expensive to generate. If you can provide some more context about what you are trying to do, I might be able to help.
The Stack Overflow answer is suboptimal in that it retrieves all the revision contents only to discard them. Just listing revision ids would be enough for counting (the most expensive field is the content; the other fields come from the same row).
It is also possible to get such a list from a SQL query, which should be a bit more efficient. Nonetheless, it will be expensive.
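A minimal sketch of counting via the API by paging through revision ids only (assuming the English Wikipedia endpoint and the requests library; the title is just an example):

    # Count revisions for one page without fetching any revision text.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def count_revisions(title):
        count = 0
        params = {
            "action": "query",
            "format": "json",
            "prop": "revisions",
            "titles": title,
            "rvprop": "ids",     # ids only, no content
            "rvlimit": "max",
        }
        while True:
            data = requests.get(API, params=params).json()
            page = next(iter(data["query"]["pages"].values()))
            count += len(page.get("revisions", []))
            if "continue" not in data:
                return count
            params.update(data["continue"])  # follow the continuation

    print(count_revisions("Barack Obama"))

This still makes one request per batch of revisions, so for pages with thousands of revisions it remains several round trips, but no content is transferred.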
I am looking for a dataset with some specific characteristics. The revision count is one of them, because articles with few revisions don't produce enough data for our metrics, and ones with too many take a very long time (network effects). So it would be helpful to avoid having to download a lot of XML, compute the metrics, and select the articles locally.
Anyway, I would suggest generating this metadata when a new revision is created: just one counter variable, and much easier to offer afterwards.
The main point I want to make is that this is central metadata of an article, like its size, number of characters, creation date, URLs, page ids, and the human-readable and computable titles of both the article and its talk page.
Another point I am having trouble with: when a page is output in a query, the human-readable title is used as the identifier for the article. The page id or the computable title (I don't know what to call it; the one used in the URL, i.e. Barack_Obama rather than Barack Obama) would be a better key. For example, I ran into a problem creating files named after that variable (with the HIV/AIDS page, Python looked for a folder called HIV in which to create a file AIDS), and addressing other APIs or services is also more direct with a machine-readable key. I use the API, for example, to select my data and then fetch it from the Special:Export page.
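A minimal sketch of how I work around this for now, keying files by the numeric page id instead of the title (the title-to-underscore conversion is the usual MediaWiki convention; the endpoint and title are just examples):

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def page_key(title):
        """Return (pageid, db_key) for a title."""
        r = requests.get(API, params={
            "action": "query",
            "format": "json",
            "titles": title,
        }).json()
        page = next(iter(r["query"]["pages"].values()))
        db_key = page["title"].replace(" ", "_")
        return page["pageid"], db_key

    pageid, db_key = page_key("HIV/AIDS")
    # The numeric page id is filesystem-safe; slashes in titles like
    # "HIV/AIDS" would otherwise be interpreted as directories.
    filename = "%d.xml" % pageid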
thanks for your answers!
cheers, stefan
It sounds like you might be better off working from a database dump (https://dumps.wikimedia.org/) than from the API.
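For bulk selection, the stub-meta-history dumps contain the revision metadata without the text, so revision counts can be computed locally from a much smaller file. A minimal sketch (the dump filename and the 500-5000 range are just examples from this thread):

    # Yield (title, revision_count) per page from a stub-meta-history dump.
    import bz2
    import xml.etree.ElementTree as ET

    def revision_counts(dump_path):
        with bz2.open(dump_path, "rb") as f:
            title, count = None, 0
            for _, elem in ET.iterparse(f):          # "end" events only
                tag = elem.tag.rsplit("}", 1)[-1]    # strip the export namespace
                if tag == "title":
                    title = elem.text
                elif tag == "revision":
                    count += 1
                    elem.clear()                      # keep memory usage flat
                elif tag == "page":
                    yield title, count
                    title, count = None, 0
                    elem.clear()

    for title, n in revision_counts("enwiki-latest-stub-meta-history.xml.bz2"):
        if 500 <= n <= 5000:
            print(title, n)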