Action=parse can take multiple titles, and you can get other page metadata in addition to just HTML output. Not to mention you can bundle it into one request with action=parse|query.
-Chad
On Feb 23, 2009 3:14 PM, "Michael Dale" mdale@wikimedia.org wrote:
it would be really nice if we could get HTML output from the API query ... this would avoid issuing dozens of action=parse requests separately.
It appears to be mentioned pretty regularly... does anyone know if a bug to that end has been filed? I will plop one in there if none exists (did not find any in my quick search).
--michael
Bryan Tong Minh wrote:
On Fri, Feb 20, 2009 at 6:45 PM, marco tanzi tanzi.marco@gmail.com wrote:
I received a correct JSON object, but the content of the revision is full of data I do not need, like {{....}}, [[...]], etc. I would like to get only the clean description, only text (like the one visible on the wiki website).
Run the parsed text (action=parse) through an HTML parser that strips all the tags.
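A minimal sketch of that approach in Python (the endpoint URL, the example function, and the crude regex-based tag stripping are illustrative assumptions, not anything prescribed in the thread):

import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for the example

def plain_text(page_title):
    # Fetch the rendered HTML for one page via action=parse.
    params = urlencode({"action": "parse", "page": page_title, "format": "json"})
    with urlopen(API + "?" + params) as resp:
        html = json.load(resp)["parse"]["text"]["*"]
    # Crude tag stripping; a real HTML parser handles entities and edge cases better.
    return re.sub(r"<[^>]+>", "", html)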
Bryan
Chad schreef:
Action=parse can take multiple titles,
No it can't.
and you can get other page metadata in addition to just HTML output. Not to mention you can bundle it into one request with action=parse|query.
That's also not possible.
-Chad
On Feb 23, 2009 3:14 PM, "Michael Dale" mdale@wikimedia.org wrote:
it would be really nice if we could get HTML output from the API query ... this would avoid issuing dozens of action=parse requests separately.
It appears to be mentioned pretty regularly... does anyone know if a bug to that end has been filed? I will plop one in there if none exists (did not find any in my quick search).
Yes, it's been filed before and WONTFIXed because parsing dozens or hundreds of pages in one request is kind of scary performance-wise.
Roan Kattouw (Catrope)
On Tue, Feb 24, 2009 at 10:30 AM, Roan Kattouw roan.kattouw@home.nl wrote:
*snip*
Could've sworn it was possible. Apologies.
-Chad
*snip*
Yes, it's been filed before and WONTFIXed because parsing dozens or hundreds of pages in one request is kind of scary performance-wise
but clearly it would be more resource-efficient than issuing 30 separate additional requests... maybe we could enable it with a low row-return count, say 30? It should be able to grab the output from the parser cache, no?
With my use case of returning search result descriptions... it does not really need HTML; it just needs stripped wikitext, or even a stripped segment of wikitext.
So here are a few possible ways forward:
* I can switch on 30 extra requests if we need to highlight the problem...
* I could try to use one of the JavaScript wikitext -> HTML converters
* Maybe we could support outputting stripped wikitext (really what we want for search results) ...
It appears Lucene and the internal MySQL search store the index in stripped form; if we could add access to that from the API, that would be the ideal way forward, I think.
--michael
Michael Dale schreef:
*snip*
Yes, it's been filed before and WONTFIXed because parsing dozens or hundreds of pages in one request is kind of scary performance-wise
but clearly it would be more resource-efficient than issuing 30 separate additional requests... maybe we could enable it with a low row-return count, say 30?
For queries that's true, but for stuff like parsing there wouldn't be much of a difference in performance.
It should be able to grab the output from the parser cache, no?
It does that already, *if* the page you're parsing is in the parser cache.
With my use case of returning search result descriptions... it does not really need HTML; it just needs stripped wikitext, or even a stripped segment of wikitext.
You'd be way better off stripping wikitext yourself then. Shouldn't be too hard.
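A rough Python sketch of what such stripping could look like (the few patterns here are assumptions for illustration and nowhere near complete; nested templates, tables, and language-specific markup are not handled):

import re

def strip_wikitext(text):
    # Remove simple, non-nested {{templates}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Reduce [[target|label]] and [[target]] links to their visible text.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Drop ''italic'' and '''bold''' quote markup.
    text = re.sub(r"'{2,}", "", text)
    # Drop inline HTML/XML tags.
    text = re.sub(r"<[^>]+>", "", text)
    return text

# strip_wikitext("The '''[[Sun|sun]]''' is [[bright]].") -> "The sun is bright."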
* Maybe we could support outputting stripped wikitext (really what we want for search results) ...
It appears Lucene and the internal MySQL search store the index in stripped form; if we could add access to that from the API, that would be the ideal way forward, I think.
That'd be good, yes. I'll look into this.
Roan Kattouw (Catrope)
Michael Dale wrote:
*snip*
Yes, it's been filed before and WONTFIXed because parsing dozens or hundreds of pages in one request is kind of scary performance-wise
but clearly it would be more resource-efficient than issuing 30 separate additional requests...
The API could process up to X work (time, includes, preprocessor nodes...) and then issue some continue parameter.
Shouldn't be so bad if using the parser cache.
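Purely as a sketch of what Platonides describes, a client loop for a hypothetical continuation protocol in Python. Both the 'titles' and 'parsecontinue' parameters and the response shape are invented here for illustration; no such parameters exist (the thread itself notes action=parse takes only one page):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for the example

def parse_all(titles):
    # Hypothetical protocol: the server parses as much as its work
    # budget allows, then returns a marker telling us where to resume.
    results, cont = {}, None
    while True:
        params = {"action": "parse", "format": "json", "titles": "|".join(titles)}
        if cont:
            params["parsecontinue"] = cont  # invented parameter, illustration only
        with urlopen(API + "?" + urlencode(params)) as resp:
            data = json.load(resp)
        results.update(data.get("pages", {}))
        cont = data.get("parsecontinue")
        if not cont:
            return results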
Platonides schreef:
*snip*
The API could process up to X work (time, includes, preprocessor nodes...)
That's not very feasible. Doing this with time isn't even possible.
Roan Kattouw (Catrope)
Ben Ritter wrote:
As you only want to display a small amount of text from each page, you could get just the text you need from each page and send it all together, with some sort of separator, to http://en.wikipedia.org/w/api.php?action=parse&format=xml&text=This is some [[text]] to parse
Of course this turns "[[text]]" into an HTML anchor tag and expands templates. If this is not what you want, stripping the text yourself would probably be best.
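A quick Python sketch of that batching idea (the separator string is a made-up placeholder and would need to be something the parser is guaranteed to leave untouched):

import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for the example
SEP = "@@SNIPPET@@"  # hypothetical separator; must survive parsing intact

def parse_many(snippets):
    # One action=parse call for all snippets, joined by the separator.
    body = urlencode({"action": "parse", "format": "json",
                      "text": SEP.join(snippets)}).encode()
    with urlopen(API, data=body) as resp:
        html = json.load(resp)["parse"]["text"]["*"]
    # Split the rendered HTML back into per-snippet chunks.
    return html.split(SEP)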
I don't know, that won't work so well ... since you never know what part of a template, table, or some larger wikitext structure you're at when you match some segment of text. Stripping the wikitext in JS is not so fun... since it has to deal with multiple languages and duplicates code that already exists in the PHP ... see SearchUpdate::doUpdate() ... better to have all those regexes in one place ... although we could do that in JS as a hack in the meantime ...
But in the end I think serving the (more) human-readable text that's used for full-text searches directly from the API would be ideal...
--michael