Hello all,
after so much talk about an alternate parser, here it is! Well, half of
it, anyway.
Attached (only 8KB), you'll find my hand-written C++ source (GPL) to
convert wiki markup to XML.
Remarks:
* This was hacked in a few days, so don't expect it to work perfectly,
although it does surprisingly well so far.
* It doesn't check for invalid user-supplied XML constructs (this check
is only partially implemented so far)
* It doesn't recognize nowiki and pre tags
* Some minor wiki markup is not supported (ISBNs, for example)
All of these problems are due to the early stage of this project, not
because of "can't do".
* At the moment, it reads the source from the file "test.txt". Switching
this to piped command line input is dead easy, though.
* Because of strangeness in the Firebird XML display, I have to replace
"—" with something else ("!mdash;" in this case...), otherwise it
doesn't render as valid XML.
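To make the idea concrete, here is a minimal PHP sketch of the kind of markup-to-element mapping such a converter performs; the element names and patterns below are purely illustrative and are not taken from the attached C++ source.

<?php
# Illustration only: hypothetical element names; the attached converter uses
# its own scheme. Bold markup and internal links become XML elements.
$wikitext = "'''Bold''' text with a [[Sample link]].";
$xml = htmlspecialchars( $wikitext );   # escape &, <, > first
$xml = preg_replace( "/'''(.*?)'''/", '<bold>$1</bold>', $xml );
$xml = preg_replace( '/\[\[(.*?)\]\]/', '<link target="$1">$1</link>', $xml );
echo "<wikipage>$xml</wikipage>\n";
# Prints: <wikipage><bold>Bold</bold> text with a
#         <link target="Sample link">Sample link</link>.</wikipage>
?>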
I have tested it against several "live" wikipedia articles, and
benchmarked a few on my laptop:
* List_of_state_leaders_in_1839 : 200 pages per second (link-heavy page)
* Operation_Downfall : 380 pages per second ("normal" page of
above-average length)
* List_of_Baronies : 3 pages per second (source is 233KB...)
* Sian_Lloyd : 2300 pages per second (stub)
For an average page (in number of links, tables, overall length etc.), I
estimate above 500 conversions per second on an off-the-shelf laptop
(with stuff running in the background). This ain't bad...
I am certain that further tweaking could enhance this some more.
I have not yet found a page that doesn't output correct XML; however, it
crashes on [[Results of the Canadian federal election, 2004]], our
longest page, for unknown reasons. I assume it voted for the other party ;-)
I would consider this a promising start. Please give it a try, and
maybe look into the PHP XML parser.
Magnus
Hai,
On the English wiktionary there has been a longstanding issue with
people not wanting to change en:wiktionary so that articles can start
with a lowercase letter. As nl:wiktionary has more words in other
languages than in Dutch, we found that the automatic capitalisation
applies not only to Roman characters but also to Cyrillic characters and
maybe others.
This makes the workarounds we have tried in order to keep the correct
capitalisation really awkward. It is a good enough reason for us not to
want the automatic capitalisation anymore. Could someone please turn
this function off for us?
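For reference, a sketch of the kind of per-wiki configuration change being requested; the setting name $wgCapitalLinks and the database name below are assumptions about the installation, not a tested change.

<?php
# Sketch only, not a tested patch. Assumes the relevant switch is
# $wgCapitalLinks and that nl.wiktionary's database is named 'nlwiktionary'.
if ( $wgDBname == 'nlwiktionary' ) {
    # Stop forcing the first letter of page titles to upper case.
    $wgCapitalLinks = false;
}
?>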
Thanks,
GerardM
Hoi,
There has been much talk about starting a repository for all kinds of
media: pictures and so on that have a license compatible with the ideas
of Wikimedia. The idea is to have one repository for all pictures, so
that we need to store each one only once.
There is also the idea of making Wikicommons (?) the location where the
user IDs for the Wikimedia projects may live.
Anyway, Wikicommons does not exist yet, and we have all these hard drives
full of our digital snapshots that may eventually end up on wikisomewhere.
Wouldn't it be nice to start Wikicommons and have a place to dump these
pictures now for people to find?
Yes, we cannot use Wikicommons pictures in the various projects yet, but
right now we do not have the pictures at all. What is stopping us from
creating this space and getting something going?
Personally I have nice pictures from within the Vatican, a skull of a
Homo erectus, etc. I have been to many museums and I have pictures
aplenty from Berlin, Prague, Budapest, Brussels, Oostvaardersplassen. I
am sure that I am not the only one with interesting pictures that are
useful to others.
Please let us start Wikicommons and add functionality as we think of it
and as we program it.
Thanks,
GerardM
On Monday 16 August 2004 21:33, Jimmy (Jimbo) Wales wrote:
> Voting can be a reasonable way to resolve issues for which absolutely
> no other solution can be found. But I very much doubt that
> introducing a system where people vote on article content is a good
> idea.
Thanks for taking the time to respond. You make a good argument, and I agree
that consensus is always the ideal we'd like to strive for and maximally
encourage in the incentive structures - but like Andre Engels, I'm skeptical
that consensus always works. In practice, voting is already used informally
on Talk pages all over the place on Wikipedia to resolve contentious issues
in which consensus breaks down; voting is even somewhat systematized already
via the whole "Votes for deletion" mechanism.
I have no desire to replace "normal-case" consensus-based Wiki collaboration
with "tyranny of the majority"; my intent is only to come up with a better
alternative to the existing, essentially authoritarian fallback mechanism of
sysop-controlled page protection for those "extreme cases" when normal
consensus fails.
> Let's imagine that 90% of the people editing a particular page are of
> one opinion, and 10% are of a different opinion. So long as the
> participants are all reasonable and friendly, this poses no obstacle
> to the production of an article that is satisfactory to all -- this is
> the normal wiki way of producing quality encyclopedic content.
>
> The majority has no choice but to respect the viewpoint of the
> minority, because the minority can always just _edit the article_.
Is this really the case? Suppose I'm the only one who votes to "Keep" a
particular page on VfD, and nine other users vote "Delete". Should the page
be deleted? If so, my rights as one of those 10% minorities are being
trampled! If someone does delete it, am I then justified in just re-creating
it? That's what your statement above seems to suggest - yet I suspect that
if I did that repeatedly, sooner or later either the page would be protected,
or I'd be kicked off for "uncivil behavior" by an annoyed sysop. Is this not
the case?
My point is that even in the system as it stands, voting is already used
frequently as a fall-back path when consensus breaks down - and when that
happens, the minority is _not_ really free to disregard the results of the
vote if the majority decides against them, because if they do so persistently
then the essentially authoritarian power structure defined by the
sysadmin->bureaucrat->sysop "chain of command" will sooner or later crack
down on them. So instead of tyranny of the majority, we have tyranny of the
sysops. :)
I have every desire to preserve existing incentives toward consensus-based
collaboration; I think that goal is already substantially reflected in my
proposal, and if deficiencies remain in that regard then I want to fix them.
But as far as I can tell no one has actually read the proposal yet (somewhat
understandable given its length).
Even supposing that voting is never appropriate on "main-line" working Wiki
content, what about "stable" revisions or branches? Someone just pointed me
to the slashdot interview you did a while ago (nice, BTW!) - particularly
question 4, where you mentioned a "1.0 stable" release. In such a snapshot,
presumably there will have to be _some_ way in which a "definitive" version
of each Wikipedia page is to be selected, and I'm very curious to see how
such a selection might be performed on controversial pages, without
eventually devolving to something that amounts to a vote, whether formal or
informal. There can only be a single "1.0 stable" version of a given page -
_someone_ has to win, either the majority or the minority - and after that
decision is made no one gets to go back and re-edit the 1.0 version of the
page to suit their fancy. (At least I hope not!)
BTW, you mentioned in that interview a forthcoming draft proposal - has
anything along these lines been released yet?
Thanks for your time,
Bryan
As we are now progressing towards XML generation, what do we do with it?
I adapted a little code from the PHP documentation page on the XML parser
to arrange the XML in a tree-like structure. I don't know where to put
it, so I'll just dump it below ;-)
<?
# Three global functions, sorry guys
# $wgXMLobject->tree holds a PHP expression (a path into $wgXMLobject->xml)
# that the handlers build up and eval to create the tree nodes.
function wgXMLstartElement($parser, $name, $attrs) {
    global $wgXMLobject;
    // If var already defined, make array
    eval('$test=isset('.$wgXMLobject->tree.'->'.$name.');');
    if ($test) {
        eval('$tmp='.$wgXMLobject->tree.'->'.$name.';');
        eval('$arr=is_array('.$wgXMLobject->tree.'->'.$name.');');
        if (!$arr) {
            eval('unset('.$wgXMLobject->tree.'->'.$name.');');
            eval($wgXMLobject->tree.'->'.$name.'[0]=$tmp;');
            $cnt = 1;
        } else {
            eval('$cnt=count('.$wgXMLobject->tree.'->'.$name.');');
        }
        $wgXMLobject->tree .= '->'.$name."[$cnt]";
    } else {
        $wgXMLobject->tree .= '->'.$name;
    }
    if (count($attrs)) {
        eval($wgXMLobject->tree.'->attr=$attrs;');
    }
}

function wgXMLendElement($parser, $name) {
    global $wgXMLobject;
    // Strip off last ->
    for ($a = strlen($wgXMLobject->tree); $a > 0; $a--) {
        if (substr($wgXMLobject->tree, $a, 2) == '->') {
            $wgXMLobject->tree = substr($wgXMLobject->tree, 0, $a);
            break;
        }
    }
}

function wgXMLcharacterData($parser, $data) {
    global $wgXMLobject;
    eval($wgXMLobject->tree.'->data=\''.$data.'\';');
}

# Here's the class that generates a nice tree
class wikiXML {
    function init_object() {
        global $wgXMLobject;
        $wgXMLobject->tree = '$wgXMLobject->xml';
        $wgXMLobject->xml = '';
    }

    function scanFile($filename) {
        $this->init_object();
        // Note: default case folding means element names arrive in upper case
        $xml_parser = xml_parser_create();
        xml_set_element_handler($xml_parser, "wgXMLstartElement", "wgXMLendElement");
        xml_set_character_data_handler($xml_parser, "wgXMLcharacterData");
        if (!($fp = fopen($filename, "r"))) {
            die("could not open XML input");
        }
        while ($data = fread($fp, 4096)) {
            if (!xml_parse($xml_parser, $data, feof($fp))) {
                die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser)));
            }
        }
        xml_parser_free($xml_parser);
    }

    function scanString(&$data) {
        $this->init_object();
        $xml_parser = xml_parser_create();
        xml_set_element_handler($xml_parser, "wgXMLstartElement", "wgXMLendElement");
        xml_set_character_data_handler($xml_parser, "wgXMLcharacterData");
        if (!xml_parse($xml_parser, $data, true)) {
            die(sprintf("XML error: %s at line %d",
                xml_error_string(xml_get_error_code($xml_parser)),
                xml_get_current_line_number($xml_parser)));
        }
        xml_parser_free($xml_parser);
    }

    function is_list(&$xml) {
        // Not implemented yet
    }

    function parseit(&$xml) {
        print_r($xml);
    }
}

$w = new wikiXML;
$filename = 'sample.xml';
$w->scanFile($filename);
$w->parseit($wgXMLobject->xml);
return 0;
?>
Christian Semrau wrote:
> Hello wikipedians all around the world!
>
> I am Christian aka SirJective from the german WP, and want to promote some
> knowledge about a database flaw that can be handled by sysops.
>
> Did you ever ask yourselves why the "What links here" page of some articles
> lists articles that don't link there?
> Or found articles listed on Special:Shortpages that are not short (at least not
> as short as listed on that page)?
> Or articles listed on Special:Lonelypages that are not orphaned?
I've been working on these kinds of problems recently. The cause usually
seems to be a SELECT performed without a FOR UPDATE option.
Unfortunately using more locking will cause more deadlocks and lock wait
timeout errors, but hopefully with appropriate use of retry loops and
other measures, these problems can be kept to a minimum.
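As an illustration of that pattern (not the actual MediaWiki code paths), here is a minimal sketch of a locking read wrapped in a retry loop; the table, column, and function names are assumptions.

<?php
# Sketch only: retry a few times on deadlock (errno 1213) or lock wait
# timeout (errno 1205); give up immediately on any other error.
function refreshDerivedData( $conn, $pageId ) {
    for ( $attempt = 0; $attempt < 3; $attempt++ ) {
        mysql_query( "BEGIN", $conn );
        # FOR UPDATE locks the row so a concurrent writer cannot change it
        # between our read and our write.
        $res = mysql_query( "SELECT cur_id FROM cur WHERE cur_id=$pageId FOR UPDATE", $conn );
        if ( $res !== false ) {
            # ... recompute link counts, page length, etc. and write them here ...
            if ( mysql_query( "COMMIT", $conn ) ) {
                return true;
            }
        }
        $errno = mysql_errno( $conn );
        mysql_query( "ROLLBACK", $conn );
        if ( $errno != 1213 && $errno != 1205 ) {
            return false;
        }
    }
    return false;
}
?>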
-- Tim Starling
One of SourceForge's anonymous CVS mirrors has been down since August
12, naturally the one that serves us. This has broken:
* anonymous CVS checkouts
* ViewCVS
* nightly backups of the CVS repository
Developer CVS is still working; the anonymous servers just mirror that,
so we're still working on things ok but it's a pretty annoying service
outage. SF's status page says they expect it to be fixed sometime on
August 13 *cough*. We'll see... :P
For those of you who want to work on the current source but aren't on
the project on SF, here's a fresh checkout of CVS head branch with
phase3, extensions, and wiki2xml modules:
http://download.wikimedia.org/mediawiki/mediawiki-cvs-2004-08-17.tar.bz2
-- brion vibber (brion @ pobox.com)
>
> Message: 6
> Date: Tue, 17 Aug 2004 10:49:14 +0100
> From: Timwi <timwi(a)gmx.net>
>
> (...)
>
> I love parsing. Does it show? :-)
>
> Timwi
>
Say, you wouldn't have worked with sendmail in the past? :)
Thanks and regards,
Jens Ropers
There are two types of IT techs: The ones who watch soap operas and the
ones who watch progress bars.
http://www.ropersonline.com/elmo/#108681741955837683
> Date: Tue, 17 Aug 2004 09:23:20 +0100
> From: Timwi <timwi(a)gmx.net>
> To: wikitech-l(a)wikimedia.org
>
> Jens Ropers wrote:
>
>> We don't actually have to move anything
>
> If you want to introduce a new namespace (or namespace alias) called
> "X", for example, you *do* have to move all the pages whose title
> begins
> with "X:", or else they will become inaccessible.
Err... no, actually. It's just a matter of additional checks and
PRECEDENCE when resolving. (See my previous mail and below.)
> This is because an article "WP:SB" is actually an article with the
> title
> "WP:SB" in the article namespace, while an article "Wikipedia:SB" is
> actually an article with the title "SB" in the Wikipedia namespace. If
> you introduce a new namespace (or namespace alias) called "WP", the
> article [[WP:SB]] will still be in the article namespace, but the
> server
> will start looking for it in the new namespace and not find it.
>
> Timwi
Things COULD be implemented as follows:
Let's assume we're looking for "Wikipedia:Sandbox".
Now let's assume we're entering "WP:Sandbox".
Assuming name resolution as I am proposing were implemented, the
following would happen:
FIRST the system would do name resolution as usual: this means that if an
article named "WP:Sandbox" existed in the article namespace, the lookup
would resolve to that article.
ONLY IF NO article named "WP:Sandbox" exists AND an article named
"Wikipedia:Sandbox" exists, THEN "WP:Sandbox" would redirect to
"Wikipedia:Sandbox".
IF no article named "Wikipedia:Sandbox" existed either, THEN it would show
the standard "do you want to create an article?" screen -- but for
"WP:Sandbox" (!)
As I said in my previous post, it's strictly a matter of precedence, a
question of WHEN you check what.
The option of whether to expand "WP" to "Wikipedia" will ONLY get
checked AFTER ordinary article namespace resolution for WP:xyz has
failed, but BEFORE the "do you want to create this article"-screen is
shown.
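A sketch of that lookup order in PHP; the function and variable names here are made up for illustration and are not existing MediaWiki code.

<?php
# Hypothetical resolution order for a title like "WP:Sandbox".
# articleExists() stands in for whatever existence check the software uses.
$wgNamespaceShortcuts = array( 'WP' => 'Wikipedia' );

function resolveTitle( $title ) {
    global $wgNamespaceShortcuts;
    # 1. Ordinary resolution first: an article literally named "WP:Sandbox"
    #    in the article namespace always wins.
    if ( articleExists( $title ) ) {
        return $title;
    }
    # 2. Only if that fails, try expanding a known shortcut prefix.
    $parts = explode( ':', $title, 2 );
    if ( count( $parts ) == 2 && isset( $wgNamespaceShortcuts[$parts[0]] ) ) {
        $expanded = $wgNamespaceShortcuts[$parts[0]] . ':' . $parts[1];
        if ( articleExists( $expanded ) ) {
            return $expanded;
        }
    }
    # 3. Otherwise fall through to the usual "do you want to create this
    #    article?" screen for the title exactly as it was typed.
    return null;
}
?>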
All that said, I could understand why you might be wary of
implementing such a solution -- it might be unnecessarily confusing to
future maintainers of the system.
An alternative would be to create a bot that will sift through the
Wikipedia database and automatically create ''individual'' redirects
following the above rationale and order of checks. This might be a
preferred option.
Thanks and regards,
Jens Ropers
There are two types of IT techs: The ones who watch soap operas and the
ones who watch progress bars.
http://www.ropersonline.com/elmo/#108681741955837683
I can help with the CFG (Context-Free Grammar) too. I don't have as much experience as you (I've only taken a compiler course, not taught one), but I definitely could help out. I think that having a CFG for our wiki markup is invaluable, because it will give us the sort of flexibility that we don't have now. That, and the parsers that lex/yacc give you are going to be much quicker and more efficient than what we have now. I also think we should leave it in C, not PHP, since it will be more efficient, and it sounds like we could use all the efficiency we can get.
On Sat, 14 Aug 2004 00:42:10 +0200, Jan Hidders <jan.hidders(a)pandora.be> wrote:
>
>
> On Friday 13 August 2004 20:59, Brion Vibber wrote:
> > Magnus Manske wrote:
> > > I therefore suggest a new structure:
> > > 1. Preprocessor
> > > 2. Wiki markup to XML
> > > 3. XML to (X)HTML
> >
> > This doesn't actually solve any of the issues with the current parser,
> > since it merely has it produce a different output format.
> >
> > The main problems are that we have a mess of regexps that stomp on each
> > other all the time.
>
> Are you kidding? That is exactly what it would solve! If you would let the
> preprocessor be generated with a lex/yacc type of tool then you would for the
> first time have a decent formal documentation of the wiki-syntax in the form
> of a context-free grammar. That not only would give you a better idea of what
> the wiki-syntax exactly is and tell you exactly whether any new mark-up
> interferes with old mark-up, but you could also more easily add
> context-sensitive rules (like replacing 2 dashes with &mdash; but only in
> normal text). Moreover it would give you the power to make small changes to
> the mark-up language because you could easily generate a parser that
> translates all old texts to the new mark-up. Finally, having an explicit
> grammar also makes it easier to make sure that you actually generate
> well-formed and valid XHTML, or anything else that you would like to generate
> from it and that needs somehow to satisfy a certain syntax.
>
> It's simply a brilliant idea, and frankly I think it is in the long run as
> unavoidable as the step to a database backend. If there is a performance
> problem you could even consider storing the XML in the database so you only
> need to do the raw parse at write time and the XML parse at read time.
>
> The hard part is of course to come up with the context-free grammar (it should
> probably be LALR(1) at that). Since I used to teach compiler theory I might
> be of some help there.
>
> -- Jan Hidders
>
> PS. You could even get rid of the OCaml code since the Latex parsing could be
> integrated in the general parser.
>
--
Michael Becker