What if I need to get all revisions (~2000) of a page in Parsoid HTML5? The prop=revisions API (in batches of 50) combined with mwparserfromhell is much quicker than fetching each revision's HTML individually. And what about ~400 revisions from a wiki without Parsoid/RESTBase? I would use /transform/wikitext/to/html then. Thanks in advance.
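(For reference, a minimal sketch of the batched approach described above: fetch wikitext revisions 50 at a time via prop=revisions and parse them locally with mwparserfromhell. The wiki URL and page title are illustrative.)

```python
import mwparserfromhell
import requests

API = "https://en.wikipedia.org/w/api.php"  # illustrative wiki; adjust as needed

def iter_revisions(title):
    """Yield (revid, wikitext) for every revision of a page, 50 per API request."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|content",
        "rvlimit": 50,
        "format": "json",
        "continue": "",
    }
    while True:
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        for rev in page.get("revisions", []):
            yield rev["revid"], rev.get("*", "")
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow API continuation

for revid, wikitext in iter_revisions("Example"):
    templates = mwparserfromhell.parse(wikitext).filter_templates()
    print(revid, len(templates))
```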
Parsoid is simply a wikitext -> HTML and an HTML -> wikitext conversion service. Everything else would be tools and libs built on top of it.
Subbu.
I mean, can't RESTBase access more than one revision at once?
We don't currently store the full history of each page in RESTBase, so your first access will trigger an on-demand parse of older revisions not yet in storage, which is relatively slow. Repeat accesses will load those revisions from disk (SSD), which will be a lot faster.
With a majority of clients now supporting HTTP/2 or SPDY, use cases that benefit from manual batching are becoming relatively rare. For a use case like revision retrieval, HTTP/2 with a decent amount of parallelism should be plenty fast.
Gabriel
On Fri, Nov 6, 2015 at 2:24 PM, C. Scott Ananian cananian@wikimedia.org wrote:
I think your subject line should have been "RESTBase doesn't love me"? --scott
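(A minimal sketch of the parallel, per-revision retrieval Gabriel describes above, assuming the REST API's /page/html/{title}/{revision} endpoint and using httpx as an HTTP/2-capable client. The wiki URL, page title, and revision ids are illustrative; the ids would normally come from a prop=revisions query.)

```python
import asyncio
import urllib.parse

import httpx  # HTTP/2-capable client; install with the 'h2' extra

REST = "https://en.wikipedia.org/api/rest_v1"  # illustrative wiki; adjust as needed

async def fetch_parsoid_html(title, revids, concurrency=10):
    """Fetch Parsoid HTML for each revision id, multiplexed over one HTTP/2 connection."""
    sem = asyncio.Semaphore(concurrency)
    quoted_title = urllib.parse.quote(title, safe="")

    async def fetch(client, revid):
        async with sem:  # cap in-flight requests to stay polite
            resp = await client.get(f"{REST}/page/html/{quoted_title}/{revid}")
            resp.raise_for_status()
            return revid, resp.text

    async with httpx.AsyncClient(http2=True, timeout=60) as client:
        return dict(await asyncio.gather(*(fetch(client, r) for r in revids)))

# Revision ids below are placeholders; get real ones from prop=revisions (rvprop=ids).
html_by_rev = asyncio.run(fetch_parsoid_html("Example", [1001, 1002, 1003]))
```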
Do you really want to say that reading from disk is faster than processing the text using the CPU? I don't know how complex the syntax of MediaWiki actually is, but if that's true, C++ compilers are probably much faster than Parsoid, and those are very slow.
What takes so much CPU time in turning wikitext into HTML? It sounds like JS wasn't the best choice here.
The problem is not turning wikitext into HTML, but turning it into HTML in such a way that it can be turned back into wikitext when it is edited, without introducing dirty diffs.
That requires keeping state around, tracking the wikitext closely, and doing a lot more analysis.
That means detecting markup errors, and retaining error-recovery information so that you can account for it during analysis and reintroduce the markup errors when you convert the HTML back to wikitext. This is why we proposed https://phabricator.wikimedia.org/T48705, since we already have all the information about broken wikitext usage.
If you are interested in more details, either show up on #mediawiki-parsoid, or look at this April 2014 tech talk: A preliminary look at Parsoid internals [ Slides https://commons.wikimedia.org/wiki/File:Parsoid.techtalk.apr15.2014.pdf, Video https://www.youtube.com/watch?v=Eb5Ri0xqEzk ]. It has some details.
So, the TL;DR is: Parsoid is a *bi-directional* wikitext <-> HTML bridge, and doing that is non-trivial.
Subbu.
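(A minimal sketch of that bi-directional bridge in action, using the stateless transform endpoints. Endpoint paths and parameter names follow the REST API; the wiki, title, and sample wikitext are illustrative, and how faithfully the output matches the input depends on the markup and the round-trip metadata available to the serializer.)

```python
import requests

REST = "https://en.wikipedia.org/api/rest_v1"  # illustrative wiki; adjust as needed
TITLE = "Example"  # title used as context for the transforms

original = "Some '''bold''' text and a [[Link|piped link]].\n"

# wikitext -> Parsoid HTML
html = requests.post(
    f"{REST}/transform/wikitext/to/html/{TITLE}",
    data={"wikitext": original},
).text

# Parsoid HTML -> wikitext
roundtripped = requests.post(
    f"{REST}/transform/html/to/wikitext/{TITLE}",
    data={"html": html},
).text

# Whether this is byte-identical depends on the markup and on how much
# round-trip metadata the serializer has to work with.
print(roundtripped == original)
```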
More fundamentally, the parsing task involves recursive expansion of templates and image-information queries, and popular Wikipedia articles can involve hundreds of templates and image queries. Caching the result of parsing allows us to avoid repeating these nested queries, which are a major contributor to parse time.
One of the benefits of the Parsoid DOM representation[*] is that it will allow in-place update of templates and image information, so that updating pages after a change can be done by simple substitution, *without* repeating the actual "parse wikitext" step. --scott
[*] This actually requires some tweaks to the wikitext of some popular templates; https://phabricator.wikimedia.org/T114445 is a decent summary of the work (although be sure to read to the end of the comments; there's significant stuff there which I haven't edited into the top-level task description yet).
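(A rough illustration of why the DOM representation enables this: Parsoid marks template output in the HTML with typeof="mw:Transclusion" and records the transclusion's source in data-mw, so a tool can locate and replace just that output. The wiki and title are illustrative; this sketch only lists the templates found on a page.)

```python
import json
from html.parser import HTMLParser

import requests

REST = "https://en.wikipedia.org/api/rest_v1"  # illustrative wiki; adjust as needed

class TransclusionLister(HTMLParser):
    """Collect template names that Parsoid marks with typeof="mw:Transclusion"."""
    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "mw:Transclusion" in attrs.get("typeof", "") and "data-mw" in attrs:
            data_mw = json.loads(attrs["data-mw"])
            for part in data_mw.get("parts", []):
                if not isinstance(part, dict):  # plain-text parts are skipped
                    continue
                target = part.get("template", {}).get("target", {})
                self.templates.append(target.get("wt"))

html = requests.get(f"{REST}/page/html/Example").text
lister = TransclusionLister()
lister.feed(html)
print(lister.templates)
```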
On Fri, Nov 6, 2015 at 3:29 PM, Ricordisamoa ricordisamoa@openmailbox.org wrote:
What if I need to get all revisions (~2000) of a page in Parsoid HTML5? The prop=revisions API (in batches of 50) with mwparserfromhell is much quicker.
That's a tradeoff you get with a highly-cacheable REST API: you generally have to fetch each 'thing' individually rather than being able to batch queries.
If you already know how to individually address each 'thing' (e.g. by fetching the list of revisions for the page first) and the REST API's ToS allow it, multiplexing requests might be possible to reduce the impact of the limitation. If you have to rely on "next" and "previous" links in the content to address adjacent 'things' HATEOAS-style, you're probably out of luck.
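(The "fetch the list of revisions first" step might look like the sketch below, assuming the action API's prop=revisions with rvprop=ids; the resulting ids can then be fed to individually addressed /page/html/{title}/{revid} requests as sketched earlier. Wiki URL and title are illustrative.)

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # illustrative wiki; adjust as needed

def revision_ids(title):
    """Return all revision ids for a page, oldest first."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids",
        "rvdir": "newer",
        "rvlimit": 500,
        "format": "json",
        "continue": "",
    }
    revids = []
    while True:
        data = requests.get(API, params=params).json()
        page = next(iter(data["query"]["pages"].values()))
        revids.extend(rev["revid"] for rev in page.get("revisions", []))
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow API continuation
    return revids

print(len(revision_ids("Example")))
```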
All of this seems overkill in the first place...