Hello all,
after so much talk about an alternate parser, here it is! Well, half of
it, anyway.
Attached (only 8KB), you'll find my hand-written C++ source (GPL) to
convert wiki markup to XML.
Remarks:
* This was hacked in a few days, so don't expect it to work perfectly,
although it does surprisingly well so far.
* It doesn't check for invalid user-supplied XML constructs (this check
is only partially implemented so far)
* It doesn't recognize nowiki and pre tags
* Some minor wiki markup is not supported (ISBNs, for example)
All of these problems are due to the early stage of this project, not
because of "can't do".
* At the moment, it reads the source from the file "test.txt". Switching
this to piped command line input is dead easy, though.
* Because of strangeness in the Firebird XML display, I have to replace
"—" with something else ("!mdash;" in this case...), otherwise it
doesn't render as valid XML.
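To make the idea concrete, here is a minimal PHP sketch of the kind of markup-to-element mapping such a converter performs; the element names and patterns below are purely illustrative and are not taken from the attached C++ source.

<?php
# Illustration only: hypothetical element names; the attached converter uses
# its own scheme. Bold markup and internal links become XML elements.
$wikitext = "'''Bold''' text with a [[Sample link]].";
$xml = htmlspecialchars( $wikitext );   # escape &, <, > first
$xml = preg_replace( "/'''(.*?)'''/", '<bold>$1</bold>', $xml );
$xml = preg_replace( '/\[\[(.*?)\]\]/', '<link target="$1">$1</link>', $xml );
echo "<wikipage>$xml</wikipage>\n";
# Prints: <wikipage><bold>Bold</bold> text with a
#         <link target="Sample link">Sample link</link>.</wikipage>
?>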
I have tested it against several "live" wikipedia articles, and
benchmarked a few on my laptop:
* List_of_state_leaders_in_1839 : 200 pages per second (link-heavy page)
* Operation_Downfall : 380 pages per second ("normal" page of
above-average length)
* List_of_Baronies : 3 pages per second (source is 233KB...)
* Sian_Lloyd : 2300 pages per second (stub)
For an average page (in number of links, tables, overall length etc.), I
estimate above 500 conversions per second on an off-the-shelf laptop
(with stuff running in the background). This ain't bad...
I am certain that further tweaking could enhance this some more.
I have not yet found a page that doesn't output correct XML; however, it
crashes on [[Results of the Canadian federal election, 2004]], our
longest page, for unknown reasons. I assume it voted for the other party ;-)
I would consider this a promising start. Please give it a try, and
maybe look into the PHP XML parser.
Magnus
Hai,
On the English wiktionary there has been a longstanding issue with
people not wanting to change en:wiktionary so that articles can start
with a lowercase letter. As nl:wiktionary has more words in other
languages than in Dutch, we found that the automatic capitalisation
applies not only to Roman characters but also to Cyrillic characters and
maybe others.
This makes the workarounds we have tried in order to keep the correct
capitalisation really awkward. It is a good enough reason for us not to
want the automatic capitalisation anymore. Could someone please turn
this function off for us?
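For reference, a sketch of the kind of per-wiki configuration change being requested; the setting name $wgCapitalLinks and the database name below are assumptions about the installation, not a tested change.

<?php
# Sketch only, not a tested patch. Assumes the relevant switch is
# $wgCapitalLinks and that nl.wiktionary's database is named 'nlwiktionary'.
if ( $wgDBname == 'nlwiktionary' ) {
    # Stop forcing the first letter of page titles to upper case.
    $wgCapitalLinks = false;
}
?>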
Thanks,
GerardM
Hoi,
There has been much talk about starting a repository for all kinds of
media: pictures and so on that have a license compatible with the ideas
of Wikimedia. The idea is to have one repository for all pictures, so
that we need to store each one only once.
There is also the idea of making Wikicommons (?) the location where the
user IDs for the Wikimedia projects may live.
Anyway, Wikicommons does not exist yet, and we have all these hard drives
full of our digital snapshots that may eventually end up on wikisomewhere.
Wouldn't it be nice to start Wikicommons and have a place to dump these
pictures now for people to find?
Yes, we cannot use Wikicommons pictures in the various projects yet, but
right now we do not have the pictures at all. What is stopping us from
creating this space and getting something going?
Personally I have nice pictures from within the Vatican, a skull of a
Homo erectus, etc. I have been to many museums and I have pictures
aplenty from Berlin, Prague, Budapest, Brussels, Oostvaardersplassen. I
am sure that I am not the only one with interesting pictures that are
useful to others.
Please let us start Wikicommons and add functionality as we think of it
and as we program it.
Thanks,
GerardM
On Monday 16 August 2004 21:33, Jimmy (Jimbo) Wales wrote:
> Voting can be a reasonable way to resolve issues for which absolutely
> no other solution can be found. But I very much doubt that
> introducing a system where people vote on article content is a good
> idea.
Thanks for taking the time to respond. You make a good argument, and I agree
that consensus is always the ideal we'd like to strive for and maximally
encourage in the incentive structures - but like Andre Engels, I'm skeptical
that consensus always works. In practice, voting is already used informally
on Talk pages all over the place on Wikipedia to resolve contentious issues
in which consensus breaks down; voting is even somewhat systematized already
via the whole "Votes for deletion" mechanism.
I have no desire to replace "normal-case" consensus-based Wiki collaboration
with "tyranny of the majority"; my intent is only to come up with a better
alternative to the existing, essentially authoritarian fallback mechanism of
sysop-controlled page protection for those "extreme cases" when normal
consensus fails.
> Let's imagine that 90% of the people editing a particular page are of
> one opinion, and 10% are of a different opinion. So long as the
> participants are all reasonable and friendly, this poses no obstacle
> to the production of an article that is satisfactory to all -- this is
> the normal wiki way of producing quality encyclopedic content.
>
> The majority has no choice but to respect the viewpoint of the
> minority, because the minority can always just _edit the article_.
Is this really the case? Suppose I'm the only one who votes to "Keep" a
particular page on VfD, and nine other users vote "Delete". Should the page
be deleted? If so, my rights as one of those 10% minorities are being
trampled! If someone does delete it, am I then justified in just re-creating
it? That's what your statement above seems to suggest - yet I suspect that
if I did that repeatedly, sooner or later either the page would be protected,
or I'd be kicked off for "uncivil behavior" by an annoyed sysop. Is this not
the case?
My point is that even in the system as it stands, voting is already used
frequently as a fall-back path when consensus breaks down - and when that
happens, the minority is _not_ really free to disregard the results of the
vote if the majority decides against them, because if they do so persistently
then the essentially authoritarian power structure defined by the
sysadmin->bureaucrat->sysop "chain of command" will sooner or later crack
down on them. So instead of tyranny of the majority, we have tyranny of the
sysops. :)
I have every desire to preserve existing incentives toward consensus-based
collaboration; I think that goal is already substantially reflected in my
proposal, and if deficiencies remain in that regard then I want to fix them.
But as far as I can tell no one has actually read the proposal yet (somewhat
understandable given its length).
Even supposing that voting is never appropriate on "main-line" working Wiki
content, what about "stable" revisions or branches? Someone just pointed me
to the slashdot interview you did a while ago (nice, BTW!) - particularly
question 4, where you mentioned a "1.0 stable" release. In such a snapshot,
presumably there will have to be _some_ way in which a "definitive" version
of each Wikipedia page is to be selected, and I'm very curious to see how
such a selection might be performed on controversial pages, without
eventually devolving to something that amounts to a vote, whether formal or
informal. There can only be a single "1.0 stable" version of a given page -
_someone_ has to win, either the majority or the minority - and after that
decision is made no one gets to go back and re-edit the 1.0 version of the
page to suit their fancy. (At least I hope not!)
BTW, you mentioned in that interview a forthcoming draft proposal - has
anything along these lines been released yet?
Thanks for your time,
Bryan
As we are now progressing towards XML generation, what do we do with it?
I adapted a little code from the PHP documentation page on the XML parser
to arrange the XML in a tree-like structure. I don't know where to put
it, so I'll just dump it below ;-)
<?
# Three global functions, sorry guys
# $wgXMLobject->tree holds a PHP expression (a path into $wgXMLobject->xml)
# that the handlers build up and eval to create the tree nodes.
function wgXMLstartElement($parser, $name, $attrs) {
    global $wgXMLobject;
    // If var already defined, make array
    eval('$test=isset('.$wgXMLobject->tree.'->'.$name.');');
    if ($test) {
        eval('$tmp='.$wgXMLobject->tree.'->'.$name.';');
        eval('$arr=is_array('.$wgXMLobject->tree.'->'.$name.');');
        if (!$arr) {
            eval('unset('.$wgXMLobject->tree.'->'.$name.');');
            eval($wgXMLobject->tree.'->'.$name.'[0]=$tmp;');
            $cnt = 1;
        } else {
            eval('$cnt=count('.$wgXMLobject->tree.'->'.$name.');');
        }
        $wgXMLobject->tree .= '->'.$name."[$cnt]";
    } else {
        $wgXMLobject->tree .= '->'.$name;
    }
    if (count($attrs)) {
        eval($wgXMLobject->tree.'->attr=$attrs;');
    }
}

function wgXMLendElement($parser, $name) {
    global $wgXMLobject;
    // Strip off last ->
    for ($a = strlen($wgXMLobject->tree); $a > 0; $a--) {
        if (substr($wgXMLobject->tree, $a, 2) == '->') {
            $wgXMLobject->tree = substr($wgXMLobject->tree, 0, $a);
            break;
        }
    }
}

function wgXMLcharacterData($parser, $data) {
    global $wgXMLobject;
    eval($wgXMLobject->tree.'->data=\''.$data.'\';');
}

# Here's the class that generates a nice tree
class wikiXML {
    function init_object() {
        global $wgXMLobject;
        $wgXMLobject->tree = '$wgXMLobject->xml';
        $wgXMLobject->xml = '';
    }

    function scanFile($filename) {
        $this->init_object();
        // Note: default case folding means element names arrive in upper case
        $xml_parser = xml_parser_create();
        xml_set_element_handler($xml_parser, "wgXMLstartElement", "wgXMLendElement");
        xml_set_character_data_handler($xml_parser, "wgXMLcharacterData");
        if (!($fp = fopen($filename, "r"))) {
            die("could not open XML input");
        }
        while ($data = fread($fp, 4096)) {
            if (!xml_parse($xml_parser, $data, feof($fp))) {
                die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser)));
            }
        }
        xml_parser_free($xml_parser);
    }

    function scanString(&$data) {
        $this->init_object();
        $xml_parser = xml_parser_create();
        xml_set_element_handler($xml_parser, "wgXMLstartElement", "wgXMLendElement");
        xml_set_character_data_handler($xml_parser, "wgXMLcharacterData");
        if (!xml_parse($xml_parser, $data, true)) {
            die(sprintf("XML error: %s at line %d",
                xml_error_string(xml_get_error_code($xml_parser)),
                xml_get_current_line_number($xml_parser)));
        }
        xml_parser_free($xml_parser);
    }

    function is_list(&$xml) {
        // Not implemented yet
    }

    function parseit(&$xml) {
        print_r($xml);
    }
}

$w = new wikiXML;
$filename = 'sample.xml';
$w->scanFile($filename);
$w->parseit($wgXMLobject->xml);
return 0;
?>
Christian Semrau wrote:
> Hello wikipedians all around the world!
>
> I am Christian aka SirJective from the german WP, and want to promote some
> knowledge about a database flaw that can be handled by sysops.
>
> Did you ever ask yourselves why the "What links here" page of some articles
> lists articles that don't link there?
> Or found articles listed on Special:Shortpages that are not short (at least not
> as short as listed on that page)?
> Or articles listed on Special:Lonelypages that are not orphaned?
I've been working on these kinds of problems recently. The cause usually
seems to be a SELECT performed without a FOR UPDATE option.
Unfortunately using more locking will cause more deadlocks and lock wait
timeout errors, but hopefully with appropriate use of retry loops and
other measures, these problems can be kept to a minimum.
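As an illustration of that pattern (not the actual MediaWiki code paths), here is a minimal sketch of a locking read wrapped in a retry loop; the table, column, and function names are assumptions.

<?php
# Sketch only: retry a few times on deadlock (errno 1213) or lock wait
# timeout (errno 1205); give up immediately on any other error.
function refreshDerivedData( $conn, $pageId ) {
    for ( $attempt = 0; $attempt < 3; $attempt++ ) {
        mysql_query( "BEGIN", $conn );
        # FOR UPDATE locks the row so a concurrent writer cannot change it
        # between our read and our write.
        $res = mysql_query( "SELECT cur_id FROM cur WHERE cur_id=$pageId FOR UPDATE", $conn );
        if ( $res !== false ) {
            # ... recompute link counts, page length, etc. and write them here ...
            if ( mysql_query( "COMMIT", $conn ) ) {
                return true;
            }
        }
        $errno = mysql_errno( $conn );
        mysql_query( "ROLLBACK", $conn );
        if ( $errno != 1213 && $errno != 1205 ) {
            return false;
        }
    }
    return false;
}
?>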
-- Tim Starling
One of SourceForge's anonymous CVS mirrors has been down since August
12, naturally the one that serves us. This has broken:
* anonymous CVS checkouts
* ViewCVS
* nightly backups of the CVS repository
Developer CVS is still working; the anonymous servers just mirror that,
so we're still working on things ok but it's a pretty annoying service
outage. SF's status page says they expect it to be fixed sometime on
August 13 *cough*. We'll see... :P
For those of you who want to work on the current source but aren't on
the project on SF, here's a fresh checkout of CVS head branch with
phase3, extensions, and wiki2xml modules:
http://download.wikimedia.org/mediawiki/mediawiki-cvs-2004-08-17.tar.bz2
-- brion vibber (brion @ pobox.com)
>
> Message: 6
> Date: Tue, 17 Aug 2004 10:49:14 +0100
> From: Timwi <timwi(a)gmx.net>
>
> (...)
>
> I love parsing. Does it show? :-)
>
> Timwi
>
Say, you wouldn't have worked with sendmail in the past? :)
Thanks and regards,
Jens Ropers
There are two types of IT techs: The ones who watch soap operas and the
ones who watch progress bars.
http://www.ropersonline.com/elmo/#108681741955837683
> Date: Tue, 17 Aug 2004 09:23:20 +0100
> From: Timwi <timwi(a)gmx.net>
> To: wikitech-l(a)wikimedia.org
>
> Jens Ropers wrote:
>
>> We don't actually have to move anything
>
> If you want to introduce a new namespace (or namespace alias) called
> "X", for example, you *do* have to move all the pages whose title
> begins
> with "X:", or else they will become inaccessible.
Err... no, actually. It's just a matter of additional checks and
PRECEDENCE when resolving. (See my previous mail and below.)
> This is because an article "WP:SB" is actually an article with the
> title
> "WP:SB" in the article namespace, while an article "Wikipedia:SB" is
> actually an article with the title "SB" in the Wikipedia namespace. If
> you introduce a new namespace (or namespace alias) called "WP", the
> article [[WP:SB]] will still be in the article namespace, but the
> server
> will start looking for it in the new namespace and not find it.
>
> Timwi
Things COULD be implemented as follows:
Let's assume we're looking for "Wikipedia:Sandbox".
Now let's assume we're entering "WP:Sandbox".
Assuming name resolution as I am proposing were implemented, the
following would happen:
FIRST the system would do name resolution as usual: this means that if an
article named "WP:Sandbox" existed in the article namespace, the lookup
would resolve to that article.
ONLY IF NO article named "WP:Sandbox" exists AND an article named
"Wikipedia:Sandbox" exists, THEN "WP:Sandbox" would redirect to
"Wikipedia:Sandbox".
IF no article named "Wikipedia:Sandbox" existed either, THEN it would show
the standard "do you want to create an article?" screen -- but for
"WP:Sandbox" (!)
As I said in my previous post, it's strictly a matter of precedence, a
question of WHEN you check what.
The option of whether to expand "WP" to "Wikipedia" will ONLY get
checked AFTER ordinary article namespace resolution for WP:xyz has
failed, but BEFORE the "do you want to create this article"-screen is
shown.
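A sketch of that lookup order in PHP; the function and variable names here are made up for illustration and are not existing MediaWiki code.

<?php
# Hypothetical resolution order for a title like "WP:Sandbox".
# articleExists() stands in for whatever existence check the software uses.
$wgNamespaceShortcuts = array( 'WP' => 'Wikipedia' );

function resolveTitle( $title ) {
    global $wgNamespaceShortcuts;
    # 1. Ordinary resolution first: an article literally named "WP:Sandbox"
    #    in the article namespace always wins.
    if ( articleExists( $title ) ) {
        return $title;
    }
    # 2. Only if that fails, try expanding a known shortcut prefix.
    $parts = explode( ':', $title, 2 );
    if ( count( $parts ) == 2 && isset( $wgNamespaceShortcuts[$parts[0]] ) ) {
        $expanded = $wgNamespaceShortcuts[$parts[0]] . ':' . $parts[1];
        if ( articleExists( $expanded ) ) {
            return $expanded;
        }
    }
    # 3. Otherwise fall through to the usual "do you want to create this
    #    article?" screen for the title exactly as it was typed.
    return null;
}
?>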
All that said, I could understand why you might be wary of
implementing such a solution -- it might be unnecessarily confusing to
future maintainers of the system.
An alternative would be to create a bot that will sift through the
Wikipedia database and automatically create ''individual'' redirects
following the above rationale and order of checks. This might be a
preferred option.
Thanks and regards,
Jens Ropers
There are two types of IT techs: The ones who watch soap operas and the
ones who watch progress bars.
http://www.ropersonline.com/elmo/#108681741955837683
I can help with the CFG (Context-Free Grammar) too. I don't have as much experience as you (I've only taken a compiler course, not taught one), but I definitely could help out. I think that having a CFG for our wiki markup is invaluable, because it will give us the sort of flexibility that we don't have now. That, and the parsers that lex/yacc give you are going to be much quicker and more efficient than what we have now. I also think we should leave it in C, not PHP, since it will be more efficient, and it sounds like we could use all the efficiency we can get.
On Sat, 14 Aug 2004 00:42:10 +0200, Jan Hidders <jan.hidders(a)pandora.be> wrote:
>
>
> On Friday 13 August 2004 20:59, Brion Vibber wrote:
> > Magnus Manske wrote:
> > > I therefore suggest a new structure:
> > > 1. Preprocessor
> > > 2. Wiki markup to XML
> > > 3. XML to (X)HTML
> >
> > This doesn't actually solve any of the issues with the current parser,
> > since it merely has it produce a different output format.
> >
> > The main problems are that we have a mess of regexps that stomp on each
> > other all the time.
>
> Are you kidding? That is exactly what it would solve! If you would let the
> preprocessor be generated with a lex/yacc type of tool then you would for the
> first time have a decent formal documentation of the wiki-syntax in the form
> of a context-free grammar. That not only would give you a better idea of what
> the wiki-syntax exactly is and tell you exactly whether any new mark-up
> interferes with old mark-up, but you could also more easily add
> context-sensitive rules (like replacing 2 dashes with &mdash; but only in
> normal text). Moreover it would give you the power to make small changes to
> the mark-up language because you could easily generate a parser that
> translates all old texts to the new mark-up. Finally, having an explicit
> grammar also makes it easier to make sure that you actually generate
> well-formed and valid XHTML, or anything else that you would like to generate
> from it and that needs somehow to satisfy a certain syntax.
>
> It's simply a brilliant idea, and frankly I think it is in the long run as
> unavoidable as the step to a database backend. If there is a performance
> problem you could even consider storing the XML in the database so you only
> need to do the raw parse at write time and the XML parse at read time.
>
> The hard part is of course to come up with the context-free grammar (it should
> probably be LALR(1) at that). Since I used to teach compiler theory I might
> be of some help there.
>
> -- Jan Hidders
>
> PS. You could even get rid of the OCaml code since the Latex parsing could be
> integrated in the general parser.
>
--
Michael Becker