I just wanted to announce that
* my PHP-based wiki-to-xml converter now supports the whole syntax
* is now in the "php" directory of the CVS module "wiki2xml"
* can be tested at http://magnusmanske.de/wiki2xml/w2x.php
You can either enter raw wikitext, or a list of article titles. Templates can be automatically resolved (which is necessary for some pages, as otherwise the wiki syntax is invalid and rendered as plain text). Article and template texts are fetched from the given MediaWiki site.
Please report any bugs you find. I will now start and try (again) to write a converter to OpenDocument format. Any help would be appreciated.
Magnus
Magnus Manske wrote:
> I just wanted to announce that
> - my PHP-based wiki-to-xml converter now supports the whole syntax
> - is now in the "php" directory of the CVS module "wiki2xml"
> - can be tested at http://magnusmanske.de/wiki2xml/w2x.php
My first (two-character) test input:
{|
yields invalid XML as output. ;-) In general, it does so whenever the close-table markup (|}) is missing.
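One way a converter can keep its output well-formed when the close-table markup is missing is to track how many tables are open and close the leftovers at end of input. The following is a minimal Python sketch of that idea, not the wiki2xml code; the function name and the `article`/`table` element names are assumptions made for illustration.

```python
from xml.sax.saxutils import escape

def tables_to_xml(wikitext):
    """Emit <table> for {| ... |}, closing any table left open at end of input."""
    out = []
    open_tables = 0  # hypothetical counter standing in for a full element stack
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped.startswith("{|"):
            out.append("<table>")
            open_tables += 1
        elif stripped == "|}" and open_tables > 0:
            out.append("</table>")
            open_tables -= 1
        else:
            # Everything else is treated as plain text in this sketch.
            out.append(escape(line))
    # Close whatever is still open so the result stays well-formed.
    out.extend("</table>" for _ in range(open_tables))
    return "<article>" + "\n".join(out) + "</article>"
```

With the two-character input `{|`, this produces an opened and then immediately closed `table` element instead of a dangling start tag.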
Also, you seem to be ignoring all whitespace at the beginning of the input, which makes it output a <paragraph> when the first line should have been a <pre> because it starts with a space.
Otherwise: Very impressive!!
Timwi
Timwi wrote:
> My first (two-character) test input:
> {|
> yields invalid XML as output. ;-) In general, it does so whenever the close-table markup (|}) is missing.
I hacked in a fix for nested tables minutes before announcing here, so it's probably a side effect of that. I'll have a look, thanks for noticing.
> Also, you seem to be ignoring all whitespace at the beginning of the input, which makes it output a <paragraph> when the first line should have been a <pre> because it starts with a space.
Yep. Already fixed by changing "trim" to "rtrim" :-)
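To illustrate why that one-word change matters: stripping both ends removes the leading space that marks a preformatted line, while stripping only the right end keeps it. This is a small Python sketch using `str.strip` and `str.rstrip` as rough stand-ins for PHP's `trim` and `rtrim`; the function and its `keep_leading` parameter are hypothetical, and only the `pre` and `paragraph` names come from the thread.

```python
def first_block_element(wikitext, keep_leading=True):
    """Guess the element for the first line: 'pre' if it starts with a space."""
    # rstrip() keeps leading whitespace, strip() throws it away.
    text = wikitext.rstrip() if keep_leading else wikitext.strip()
    first_line = text.split("\n", 1)[0]
    return "pre" if first_line.startswith(" ") else "paragraph"
```

Calling it on `" code\n"` gives `pre` with the default, and `paragraph` when `keep_leading=False`, which mirrors the reported bug.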
> Otherwise: Very impressive!!
Thanks! I hope with added OpenDocument export, this will become useful one day.
Magnus
On 11/9/05, Magnus Manske <magnus.manske@web.de> wrote:
> Timwi wrote:
> > My first (two-character) test input:
> > {|
> > yields invalid XML as output. ;-) In general, it does so whenever the close-table markup (|}) is missing.
> I hacked in a fix for nested tables minutes before announcing here, so it's probably a side effect of that. I'll have a look, thanks for noticing.
> > Also, you seem to be ignoring all whitespace at the beginning of the input, which makes it output a <paragraph> when the first line should have been a <pre> because it starts with a space.
> Yep. Already fixed by changing "trim" to "rtrim" :-)
> > Otherwise: Very impressive!!
> Thanks! I hope with added OpenDocument export, this will become useful one day.
It's useful already... The complexity of the wikitext syntax (from a programmer's perspective) is quite high, and this adds substantial friction to creating tools which can look for content in pages. Even doing something as simple as extracting all the text of an article and excluding content in images can be a pain. The XML representation is much easier to work with.
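As an illustration of the "extract all text but skip images" case, here is a short Python sketch over an XML tree. The element names (`article`, `paragraph`, `image`, `caption`) and the sample document are assumptions made for this example, not the actual wiki2xml output schema.

```python
import xml.etree.ElementTree as ET

def text_without(element, skip_tags):
    """Collect text from element, skipping subtrees whose tag is in skip_tags."""
    if element.tag in skip_tags:
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_without(child, skip_tags))
        # Text that follows a child still belongs to the parent, so keep it.
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical sample; the real converter output may look different.
sample = (
    "<article><paragraph>A cat "
    "<image>Cat.png<caption>a sleeping cat</caption></image>"
    "sat on the mat.</paragraph></article>"
)
print(text_without(ET.fromstring(sample), {"image"}))  # prints: A cat sat on the mat.
```

Doing the same on raw wikitext would mean handling nested brackets and pipes inside the image link by hand.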
On 11/9/05, Magnus Manske <magnus.manske@web.de> wrote:
> You can either enter raw wikitext, or a list of article titles. Templates can be automatically resolved (which is necessary for some pages, as otherwise the wiki syntax is invalid and rendered as plain text). Article and template texts are fetched from the given MediaWiki site.
There has been some discussion on IRC about potentially changing the syntax so that this doesn't happen. (It horribly breaks WYSIWYG editing, it muddles up history, etc.)
I'd like to see a longer and more thought-out discussion on the list.
My thought is that, if we consider the parse tree of wikitext, templates should only be able to affect the subtree under the node where they are included, not make changes to the syntax at their level or above. I.e. you should be able to completely parse the wikitext, then go in and insert subtrees at the templates and not change anything else.
This *will* break a few things people have done on enwiki (I know this because I ran into pages that broke my parser), but I don't think breaking them prevents anything useful from being accomplished.
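The "parse first, then insert subtrees" idea can be sketched in a few lines of Python. The `Node` class, the node kinds and the `templates` mapping are all hypothetical names chosen for illustration; no existing parser is implied.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                 # e.g. "article", "paragraph", "text", "template"
    value: str = ""
    children: list = field(default_factory=list)

def expand_templates(node, templates):
    """Attach each template's parse tree beneath its own node, in place.

    `templates` maps a template name to an already-parsed Node (assumed).
    Only the children of "template" nodes change; siblings and ancestors
    are never rewritten. Templates nested inside an inserted subtree are
    left unexpanded to keep the sketch short.
    """
    if node.kind == "template":
        subtree = templates.get(node.value)
        if subtree is not None:
            node.children = [subtree]
        return
    for child in node.children:
        expand_templates(child, templates)

tree = Node("article", children=[
    Node("paragraph", children=[Node("text", "Hello "), Node("template", "Name")]),
])
expand_templates(tree, {"Name": Node("text", "world")})
```

After the call, the paragraph still has exactly the two children it was parsed with; only the template node has gained a subtree.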
Gregory Maxwell:
> My thought is that, if we consider the parse tree of wikitext, templates should only be able to affect the subtree under the node where they are included, not make changes to the syntax at their level or above.
Fully agreed. I found some pages on dewiki a while ago and corrected them.
> I.e. you should be able to completely parse the wikitext, then go in and insert subtrees at the templates and not change anything else.
There are at least two more problems complicating the building of a sane parse tree:
- templates may be nested inside tags, e.g.
  <table {{Prettytable}}> {| {{Prettytable}}
- variables may be used inside tags, see [1]
  [[Image:Chs2_{{{2}}}d40.png|{{{65}}}px]]
Erwin Jurschitza (de:Benutzer:Vlado)
Erwin Jurschitza wrote:
> Gregory Maxwell:
> > My thought is that, if we consider the parse tree of wikitext, templates should only be able to affect the subtree under the node where they are included, not make changes to the syntax at their level or above.
> Fully agreed. I found some pages on dewiki a while ago and corrected them.
That will break quite a few things on en, for example succession boxes, where they do something like this:
{{table start}}
{{succession|some position}}
{{succession|some other position}}
{{succession|yet some other position}}
{{table end}}
> > I.e. you should be able to completely parse the wikitext, then go in and insert subtrees at the templates and not change anything else.
> There are at least two more problems complicating the building of a sane parse tree:
> - templates may be nested inside tags, e.g.
>   <table {{Prettytable}}> {| {{Prettytable}}
> - variables may be used inside tags, see [1]
>   [[Image:Chs2_{{{2}}}d40.png|{{{65}}}px]]
I do support #2, but not #1. It should not be overly complicated.
Basically, I agree with you though; templates everywhere get awfully messy.
Magnus
On 11/9/05, Magnus Manske <magnus.manske@web.de> wrote:
> That will break quite a few things on en, for example succession boxes, where they do something like this:
> {{table start}}
> {{succession|some position}}
> {{succession|some other position}}
> {{succession|yet some other position}}
> {{table end}}
A small syntax change could fix that, something like:
{{table start|| {{somesomething|data}} ||}}
The parser might not know what the heck "table start" is, but it would know that all of its effects are contained inside the table start tag itself.
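A rough way to find templates that violate this containment today, such as a "table start" that opens a table and leaves closing it to a separate "table end", is to check whether the table markup in a template's own text is balanced. This Python sketch is only a heuristic under stated assumptions: it looks at `{|` and `|}` at the start of a line and ignores every other construct; the function name is hypothetical.

```python
def is_self_contained(template_text):
    """Return True if every table the template opens is also closed by it."""
    depth = 0
    for line in template_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("{|"):
            depth += 1
        elif stripped.startswith("|}"):
            depth -= 1
            if depth < 0:  # closes a table it did not open
                return False
    return depth == 0
```

A template whose text is just `{| class="x"` would be flagged, as would one consisting only of `|}`, while a template holding a complete table passes.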
On 11/9/05, Magnus Manske <magnus.manske@web.de> wrote:
> > - templates may be nested inside tags, e.g.
> >   <table {{Prettytable}}> {| {{Prettytable}}
> > - variables may be used inside tags, see [1]
> >   [[Image:Chs2_{{{2}}}d40.png|{{{65}}}px]]
> I do support #2, but not #1. It should not be overly complicated.
> Basically, I agree with you though; templates everywhere get awfully messy.
Ah, missed this in my first reply.
In #2's case we should separate objects and attributes. I don't think we should allow object names to be filled in via variables, only their attributes. This means that variables will not affect syntax but could still be used like the #2 above.
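A small Python sketch of that separation: the node kind is fixed when the wikitext is parsed, and `{{{name}}}` variables are substituted only inside attribute values. The dictionary layout, the attribute names `target` and `width`, and the argument values are assumptions for illustration; treating the image target as an attribute is one reading of the proposal. Defaults such as `{{{1|fallback}}}` are not handled.

```python
import re

PARAM = re.compile(r"\{\{\{(\w+)\}\}\}")

def fill_attributes(node, args):
    """Substitute {{{name}}} in attribute values only; the node kind is untouched."""
    filled = {
        # Unknown variables are left as written instead of being dropped.
        key: PARAM.sub(lambda m: args.get(m.group(1), m.group(0)), value)
        for key, value in node["attributes"].items()
    }
    return {"kind": node["kind"], "attributes": filled}

image = {
    "kind": "image",
    "attributes": {"target": "Chs2_{{{2}}}d40.png", "width": "{{{65}}}px"},
}
print(fill_attributes(image, {"2": "b", "65": "40"}))
```

Because substitution never touches `kind`, a variable cannot turn an image link into some other construct after parsing.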
The website seems to be completely broken:
Warning: set_time_limit(): Cannot set time limit in safe mode in /home/www/ww4553/html/wiki2xml/w2x.php on line 8
Warning: Cannot modify header information - headers already sent by (output started at /home/www/ww4553/html/wiki2xml/w2x.php:8) in /home/www/ww4553/html/wiki2xml/w2x.php on line 64
REDIRECTMetasyntactic variable
Ævar Arnfjörð Bjarmason wrote:
> The website seems to be completely broken:
> Warning: set_time_limit(): Cannot set time limit in safe mode in /home/www/ww4553/html/wiki2xml/w2x.php on line 8
> Warning: Cannot modify header information - headers already sent by (output started at /home/www/ww4553/html/wiki2xml/w2x.php:8) in /home/www/ww4553/html/wiki2xml/w2x.php on line 64
> REDIRECTMetasyntactic variable
Sorry, I forgot my hoster insists on safe mode :-(
Should be fixed now.
Hello, while testing the latest CVS version locally, I got two "only variables can be passed by reference" errors in lines 231 and 232 of wiki2xml.php (I'm using PHP 5.0.5).
This is a well-known problem with scripts written for older PHP versions and run under PHP 5; the solution is quite simple.
Lines 231 and 232 of the recent CVS version of wiki2xml.php:
    $target = array_pop ( explode ( ">" , $target , 2 ) ) ;
    $target = array_shift ( explode ( "<" , $target , 2 ) ) ;
and my modified version that runs under PHP 5:
    $target = array_pop ( @explode ( ">" , $target , 2 ) ) ;
    $target = array_shift ( @explode ( "<" , $target , 2 ) ) ;
Apart from this, the produced results seem to be quite good. I'll do some further testing in the next few days.
best regards, Frando
Frando wrote:
> Hello, while testing the latest CVS version locally, I got two "only variables can be passed by reference" errors in lines 231 and 232 of wiki2xml.php (I'm using PHP 5.0.5).
> This is a well-known problem with scripts written for older PHP versions and run under PHP 5; the solution is quite simple.
> Lines 231 and 232 of the recent CVS version of wiki2xml.php:
>     $target = array_pop ( explode ( ">" , $target , 2 ) ) ;
>     $target = array_shift ( explode ( "<" , $target , 2 ) ) ;
> and my modified version that runs under PHP 5:
>     $target = array_pop ( @explode ( ">" , $target , 2 ) ) ;
>     $target = array_shift ( @explode ( "<" , $target , 2 ) ) ;
> Apart from this, the produced results seem to be quite good. I'll do some further testing in the next few days.
Applied, thanks!
Magnus
wikitech-l@lists.wikimedia.org