Correct method of pre-processing article text?


Mark Clements (HappyDog)

20 Nov 2008

3:16 p.m.

I have an extension that parses the contents of a page and stores the content of certain embedded tags in the database. I want this parsing to take place after the pre-processing (comment removal, template expansion, etc.). I also need the code to be compatible with MW1.6, as I am currently unable to upgrade to PHP5 (hopefully soon...).

Here is the code I was using until recently (where $Text is the unmodified article text):

// Create new Parser object to deal with some transformations that are
// required before saving.
$Parser = new Parser();

// Use the Parser object to strip out html comments, nowiki and pre tags
// and whatever other bits shouldn't make it through when rendering (so
// they don't affect saving).
$ParserOptions = new ParserOptions();
$StripState =& $Parser->mStripState;
$Parser->mOptions = $ParserOptions;
$TidyText = $Parser->strip($Text, $StripState, true);

// Then replace any variables, parser functions etc. so that 'hidden' tags
// (e.g. tags that are created by code, such as using the ExpandAfter
// extension) are expanded properly for saving.
$Parser->mFunctionHooks = $wgParser->mFunctionHooks;
$Parser->mTitle =& $wgParser->mTitle;
$TidyText = $Parser->replaceVariables($TidyText);

However, I was recently testing this on MW1.12, and this gives the following error:

Fatal error: Call to a member function matchStartToEnd() on a non-object in Parser.php on line 2771

I fixed this by inserting the following two lines just before the second $TidyText = ... assignment:

$Parser->mVariables =& $wgParser->mVariables;
$Parser->mOutput =& $wgParser->mOutput;

Now, it is clear to me that this is the wrong way of going about this - I shouldn't be having to mess with the internals of the parser object in order to just pre-process the text, as it will clearly break whenever the parser object is updated!

Can someone tell me the correct forward-compatible way to pre-process article text in this manner?

- Mark Clements (HappyDog).


Roan Kattouw

20 Nov

4:52 p.m.

Mark Clements (HappyDog) wrote:

> ...
>
> $Parser->mVariables =& $wgParser->mVariables;
> $Parser->mOutput =& $wgParser->mOutput;
>
> Now, it is clear to me that this is the wrong way of going about this - I shouldn't be having to mess with the internals of the parser object in order to just pre-process the text, as it will clearly break whenever the parser object is updated!
>
> Can someone tell me the correct forward-compatible way to pre-process article text in this manner?

My guess would be something like: $parser = clone $wgParser; Dunno if that works (or if clone is even available in PHP 4), but you could try.

Roan Kattouw (Catrope)

Mark Clements (HappyDog)

6:30 p.m.

"Roan Kattouw" roan.kattouw@home.nl wrote in message news:492595E6.8060802@home.nl...

> ...
> Mark Clements (HappyDog) wrote:
>
>> ...
>> $Parser->mVariables =& $wgParser->mVariables;
>> $Parser->mOutput =& $wgParser->mOutput;
>>
>> Now, it is clear to me that this is the wrong way of going about this - I shouldn't be having to mess with the internals of the parser object in order to just pre-process the text, as it will clearly break whenever the parser object is updated!
>>
>> Can someone tell me the correct forward-compatible way to pre-process article text in this manner?
>
> My guess would be something like: $parser = clone $wgParser; Dunno if that works (or if clone is even available in PHP 4), but you could try.

It doesn't, and it isn't... :-(

That said, the PHP_Compat PEAR module contains a PHP4 version of the clone function (providing you use clone($wgParser)) which I might try.

Alternatively, is there any harm in just using $wgParser directly? Do we have to make a copy?

- Mark Clements (HappyDog)

Roan Kattouw

10:07 p.m.

Mark Clements (HappyDog) wrote:

> ...
>
> It doesn't, and it isn't... :-(
>
> That said, the PHP_Compat PEAR module contains a PHP4 version of the clone function (providing you use clone($wgParser)) which I might try.
>
> Alternatively, is there any harm in just using $wgParser directly? Do we have to make a copy?

I don't know. The only thing I know is that you have to be careful when calling Parser members inside a parser hook, because the parser kind of goes crazy when code called from Parser::parse() calls Parser::parse() with different arguments; Parser::recursiveTagParse() is the function you need in that case. I don't know what harm could be caused by using $wgParser directly in other cases; I guess you could try.

Roan Kattouw (Catrope)

Mark Clements (HappyDog)

27 Nov

4:51 p.m.

"Roan Kattouw" roan.kattouw@home.nl wrote in message news:4925DF92.50902@home.nl...

> ...
> Mark Clements (HappyDog) wrote:
>
>> ...
>> It doesn't, and it isn't... :-(
>>
>> That said, the PHP_Compat PEAR module contains a PHP4 version of the clone function (providing you use clone($wgParser)) which I might try.
>>
>> Alternatively, is there any harm in just using $wgParser directly? Do we have to make a copy?
>
> I don't know. The only thing I know is that you have to be careful when calling Parser members inside a parser hook, because the parser kind of goes crazy when code called from Parser::parse() calls Parser::parse() with different arguments; Parser::recursiveTagParse() is the function you need in that case. I don't know what harm could be caused by using $wgParser directly in other cases; I guess you could try.

I tried using clone() and it didn't work :-(

I think it might be because the tag that I am manually parsing for in order to add its data to the DB is also used when rendering, therefore it is being stripped out of the page by the parser before it comes back to me, which is not what I want.

Perhaps it would be better if I explained the problem, as I think the issue is perhaps a little more complex than I first thought.

Here is the tag I am using:

<data> name=Jim age=20 </data>

When displaying the page, the data tag has a hook attached, which replaces the name/value pairs with a nice table. When saving, or otherwise changing the page in any way, the page is parsed to extract the contents of the data tags, which are written to the DB for later querying (via other means, not relevant to this discussion).
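
For illustration only, the extraction step described above might look roughly like the sketch below. It is plain PHP with no MediaWiki calls; the function name extractDataTags() is hypothetical, and it assumes the pairs inside a data tag are whitespace-separated name=value tokens whose values contain no spaces, which the thread does not confirm. It does nothing about comments, nowiki blocks or templates, so it only makes sense on text that has already been pre-processed.

```php
<?php
// Hypothetical helper (not part of MediaWiki): collects the name=value pairs
// from every <data>...</data> block found in already pre-processed text.
function extractDataTags($text) {
    $blocks = array();
    // Non-greedy match with the "s" flag so each block is captured separately,
    // even when its contents span several lines.
    preg_match_all('/<data>(.*?)<\/data>/s', $text, $matches);
    foreach ($matches[1] as $body) {
        $pairs = array();
        // ASSUMPTION: pairs are whitespace-separated and values have no spaces.
        foreach (preg_split('/\s+/', trim($body)) as $token) {
            $parts = explode('=', $token, 2);
            if (count($parts) == 2 && $parts[0] !== '') {
                $pairs[$parts[0]] = $parts[1];
            }
        }
        $blocks[] = $pairs;
    }
    return $blocks;
}

// One array of pairs is returned per <data> block found in the text.
print_r(extractDataTags("<data> name=Jim age=20 </data>"));
```

Each returned array could then be written to the database by whatever storage code the extension already has.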

I don't do this on page view (in the <data> tag handler), as it is a relatively expensive operation to clear out the old entries and reparse the data tags, and the data will not have changed, so it is not necessary.

So instead I re-parse it whenever the data changes (save/undelete/etc.). In order to parse it correctly I first need to run it through the pre-parser to remove comments/nowiki blocks, expand templates, etc.; otherwise the page is processed differently from how it is rendered (e.g. data will be added to the DB even though it is in a nowiki block). However, an issue arises when the code is embedded in some other tag.

For example:

<somecustomtag> <data> name=Jim age=20 </data> </somecustomtag>

If somecustomtag parses its contents then the data tag should be treated as normal, and if the tag does something else which doesn't result in a parse (e.g. syntax highlighting) then we should ignore this block.

Currently my code uses a new parser object, which doesn't have any hooks defined so all tags are treated in the second way. Changing it to use clone() as described above means all tags are treated in the first way. This of course includes my own <data> handler, which results in the tag being replaced by the output table, so if I use this method then no data is found by the save parser at all! Neither of these is the correct behaviour, as described in the previous paragraph.

I hope I've done a decent job of describing the problem! Does anyone have a suggestion that would allow me to do what I want?

- Mark Clements (HappyDog)

Platonides

11:19 p.m.

Mark Clements (HappyDog) wrote:

> ...
>
> I don't do this on page view (in the <data> tag handler), as it is a relatively expensive operation to clear out the old entries and reparse the data tags, and the data will not have changed, so it is not necessary.

Why not rely on default MediaWiki caching to avoid it? Or check the page_touched field.

> ...
>
> So instead I re-parse it whenever the data changes (save/undelete/etc.).

What about <data> name={{username|Jim}} age=20 </data>

Are you reparsing when Template:username changes?

You're reimplementing too much MediaWiki behavior. There must be a better way :-)

Mark Clements (HappyDog)

28 Nov

3:56 a.m.

"Platonides" Platonides@gmail.com wrote in message news:ggn9tq$fuo$1@ger.gmane.org...

> ...
> Mark Clements (HappyDog) wrote:
>
>> ...
>> I don't do this on page view (in the <data> tag handler), as it is a relatively expensive operation to clear out the old entries and reparse the data tags, and the data will not have changed, so it is not necessary.
>
> Why not rely on default MediaWiki caching to avoid it? Or check the page_touched field.
>
>> ...
>> So instead I re-parse it whenever the data changes (save/undelete/etc.).
>
> What about
>
> <data> name={{username|Jim}} age=20 </data>
>
> Are you reparsing when Template:username changes?
>
> You're reimplementing too much MediaWiki behavior. There must be a better way :-)

I agree with that statement entirely... that's why I'm posting here! ;-)

- Mark Clements (HappyDog)

Yaron Koren

6:45 p.m.

Out of curiosity, are you aware of the Semantic MediaWiki extension? When used with templates, it seems to approximate the functionality you're talking about creating.

-Yaron

Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
