Action=parse can take multiple titles, and you can get other page metadata in addition to just HTML output. Not to mention you can bundle it into one request with action=parse|query.
-Chad
On Feb 23, 2009 3:14 PM, "Michael Dale" mdale@wikimedia.org wrote:
it would be really nice if we could get HTML output from the API query ... this would avoid issuing dozens of action=parse requests separately.
It appears to be mentioned pretty regularly... does anyone know if a bug to that end has been filed? I will plop one in there if none exists (did not find any in my quick search).
--michael
Bryan Tong Minh wrote:
On Fri, Feb 20, 2009 at 6:45 PM, marco tanzi tanzi.marco@gmail.com wrote:
I received a correct JSON object, but the content of the revision is full of data I do not need, like {{....}}, [[...]], etc. I would like to get only the clean description, only text (like the one visible on the wiki website).
Run the parsed text (action=parse) through an HTML parser that strips all the tags.
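A minimal sketch of that approach in Python (the endpoint URL, the example function, and the crude regex-based tag stripping are illustrative assumptions, not anything prescribed in the thread):

import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for the example

def plain_text(page_title):
    # Fetch the rendered HTML for one page via action=parse.
    params = urlencode({"action": "parse", "page": page_title, "format": "json"})
    with urlopen(API + "?" + params) as resp:
        html = json.load(resp)["parse"]["text"]["*"]
    # Crude tag stripping; a real HTML parser handles entities and edge cases better.
    return re.sub(r"<[^>]+>", "", html)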
Bryan
Chad schreef:
Action=parse can take multiple titles,
No it can't.
and you can get other page metadata in addition to just HTML output. Not to mention you can bundle it into one request with action=parse|query.
That's also not possible.
-Chad
On Feb 23, 2009 3:14 PM, "Michael Dale" mdale@wikimedia.org wrote:
it would be really nice if we could get HTML output from the API query ... this would avoid issuing dozens of action=parse requests separately.
It appears to be mentioned pretty regularly... does anyone know if a bug to that end has been filed? I will plop one in there if none exists (did not find any in my quick search).
Yes, it's been filed before and WONTFIXed because parsing dozens or hundreds of pages in one request is kind of scary performance-wise.
Roan Kattouw (Catrope)
On Tue, Feb 24, 2009 at 10:30 AM, Roan Kattouw roan.kattouw@home.nl wrote:
*snip*
Could've sworn it was possible. Apologies.
-Chad
*snip*
Yes, it's been filed before and WONTFIXed because parsing dozens or hundreds of pages in one request is kind of scary performance-wise
but clearly it would be more resource-efficient than issuing 30 separate additional requests... maybe we could enable it with a low row-return count, say 30? It should be able to grab the output from the parser cache, no?
With my use case of returning search result descriptions... it does not really need HTML; it just needs stripped wikitext, or even a stripped segment of wikitext.
So here are a few possible ways forward:
* I can switch on 30 extra requests if we need to highlight the problem...
* I could try to use one of the JavaScript wikitext -> HTML converters
* Maybe we could support outputting stripped wikitext (really what we want for search results) ...
It appears Lucene and the internal MySQL search store the index in stripped form; if we could add access to that from the API, that would be the ideal way forward, I think.
--michael
Michael Dale schreef:
*snip*
Yes, it's been filed before and WONTFIXed because parsing dozens or hundreds of pages in one request is kind of scary performance-wise
but clearly it would be more resource-efficient than issuing 30 separate additional requests... maybe we could enable it with a low row-return count, say 30?
For queries that's true, but for stuff like parsing there wouldn't be much of a difference in performance.
It should be able to grab the output from the parser cache, no?
It does that already, *if* the page you're parsing is in the parser cache.
With my use case of returning search result descriptions... it does not really need HTML; it just needs stripped wikitext, or even a stripped segment of wikitext.
You'd be way better off stripping wikitext yourself then. Shouldn't be too hard.
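A rough Python sketch of what such stripping could look like (the few patterns here are assumptions for illustration and nowhere near complete; nested templates, tables, and language-specific markup are not handled):

import re

def strip_wikitext(text):
    # Remove simple, non-nested {{templates}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Reduce [[target|label]] and [[target]] links to their visible text.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop ''italic'' and '''bold''' quote markup.
    text = re.sub(r"'{2,}", "", text)
    # Drop inline HTML/XML tags.
    text = re.sub(r"<[^>]+>", "", text)
    return text

# strip_wikitext("The '''[[Sun|sun]]''' is [[bright]].") -> "The sun is bright."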
* Maybe we could support outputting stripped wikitext (really what we want for search results) ...
It appears Lucene and the internal MySQL search store the index in stripped form; if we could add access to that from the API, that would be the ideal way forward, I think.
That'd be good, yes. I'll look into this.
Roan Kattouw (Catrope)
Michael Dale wrote:
*snip*
Yes, it's been filed before and WONTFIXed because parsing dozens or hundreds of pages in one request is kind of scary performance-wise
but clearly it would be more resource-efficient than issuing 30 separate additional requests...
The API could process up to X work (time, includes, preprocessor nodes...) and then issue some continue parameter.
Shouldn't be so bad if using the parser cache.
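Purely as a sketch of what Platonides describes, a client loop for a hypothetical continuation protocol in Python. Both the 'titles' and 'parsecontinue' parameters and the response shape are invented here for illustration; no such parameters exist (the thread itself notes action=parse takes only one page):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for the example

def parse_all(titles):
    # Hypothetical protocol: the server parses as much as its work
    # budget allows, then returns a marker telling us where to resume.
    results, cont = {}, None
    while True:
        params = {"action": "parse", "format": "json", "titles": "|".join(titles)}
        if cont:
            params["parsecontinue"] = cont  # invented parameter, illustration only
        with urlopen(API + "?" + urlencode(params)) as resp:
            data = json.load(resp)
        results.update(data.get("pages", {}))
        cont = data.get("parsecontinue")
        if not cont:
            return results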
Platonides schreef:
*snip*
The API could process up to X work (time, includes, preprocessor nodes...)
That's not very feasible. Doing this with time isn't even possible.
Roan Kattouw (Catrope)
Ben Ritter wrote:
As you only want to display a small amount of text from each page, you could get just the text you need from each page and send it all together, with some sort of separator, to http://en.wikipedia.org/w/api.php?action=parse&format=xml&text=This is some [[text]] to parse
Of course this turns "[[text]]" into an HTML anchor tag and expands templates. If this is not what you want, stripping the text yourself would probably be best.
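A quick Python sketch of that batching idea (the separator string is a made-up placeholder and would need to be something the parser is guaranteed to leave untouched):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for the example
SEP = "@@SNIPPET@@"  # hypothetical separator; must survive parsing intact

def parse_many(snippets):
    # One action=parse call for all snippets, joined by the separator.
    body = urlencode({"action": "parse", "format": "json",
                      "text": SEP.join(snippets)}).encode()
    with urlopen(API, data=body) as resp:
        html = json.load(resp)["parse"]["text"]["*"]
    # Split the rendered HTML back into per-snippet chunks.
    return html.split(SEP)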
I don't know, that won't work so well ... since you never know what part of a template, table, or some larger wikitext structure you're at when you match some segment of text. Stripping the wikitext in JS is not so fun... since it has to deal with multiple languages and duplicates code that already exists in the PHP ... see SearchUpdate::doUpdate() ... better to have all those regexes in one place ... although we could do that in JS as a hack in the meantime ...
But in the end I think serving the (more) human-readable text that's used for full-text searches directly from the API would be ideal...
--michael