Wikipedia-l February 2002

wikipedia-l@lists.wikimedia.org

40 participants
121 discussions

Optimizing the Wiki parser
by Uri Yanover 08 Feb '02

08 Feb '02

> Actually that is what I also first had in mind, but now I > think that this would be a bad idea. It makes the code > more complicated and therefore harder to debug and > harder to introduce new and/or changed mark-up > features. It is important that the code is kept so simple > that newcomers who find a bug should in principle, if > they know PHP and MySQL, be able to find the > problem and send a patch. Well, while easiness of debugging is important, it is still possible to write a good parser that would be easy to debug; keeping it simple for as long as possible might lead to very hasty and ugly attempts to optimize when it's too late. > Moreover, I don't think that doing it with > regular expresssions is that inefficient. The regular > expressions, especially the Perl compatible ones, are > very well implemented in PHP and most wiki systems > do it that way. If you are clever with perl regular expressions > there is still a lot of optimization that you can do. If you > want inspiration look at the implementation of PhpWiki. Looking at PhpWiki was exactly what I did. And I know that PHP regexps are good, but they still take time to be evaluated. As of now (v. 1.51), the code uses regexps (and str_replaces) for the following: 1. Image linking 2. External links 3. Wiki variable substitution 4. Wiki-style text formatting 5. Removal of "forbidden tags" 6. Auto-number headings 7. ISBNs In addition to that, the internal links check explodes the array which is expensive both in terms of passing again over the text and the added use of memory. Therefore, the code discussed here passes over the text at least 8 times, often doing the checking against complicated regular expressions. Be the PHP regexp performance as good as it can be, converting from one complicated markup (Wiki) to another (HTML) is simply not the task they were intended for. > My advice would be: first try it in PHP with the PCRE's. > If that doesn't work, write a real one-pass parser in PHP. If that > also doesn't work, then you can start thinking about C and, for example, > yacc/bison. As for Wiki syntax changes, IMO we should first get the > current syntax efficiently and error-free working, and only then start > thinking about changes and extensions. For reasons stated above, I don't think PCRE's will do much better, all regexps will have the same problems. Having thought a bit about the subject, a one-pass PHP parser is certainly the easiest solution to write and maintain; however I'd feel much better if I knew for sure that PHP does not create a penalty for fetching characters out of a string one at a time. Sincerely yours, Uri Yanover

4 5

Re: Documentation and user aids
by Julie Kemp 08 Feb '02

08 Feb '02

HI all -- Magnus, it wasn't really aimed all at you, because frankly I'm not sure who does what! Although I truly appreciate the documentation = ironing thing, there needs to be something. And unfortunately, your answer waas sooooo programmer! Most people aren't. I'm lucky because I've actually had to translate geekspeak to sales/customer support in a couple of jobs, so I had an inkling, but frankly the discussion bored the hell out of me because it never got to the user level. Yes, you guys let interested parties know what was going on -- i.e., that changes were being made -- but not how those changes would affect the site and its use in user-speak. Here are some very fundamental issues that need addressing: 1) What happened to sub-pages and why 2) What the HELL is a namespace (and a ton of related questions -- like if there are no subpages, why is there a talk link? 3)What changes have been made to the use of certain punctuation marks in titles? 4)Where subpages once existed, how did the conversion affect them? 5) HOw do I redirect? (If it's the same as before, it doesn't seem to work with articles entitled with people's names) Lots of questions! NO answers. I suggest a New Features special page that has FAQs and a kind of tour for newbies and those of us who have been around a while but just want to use the site -- and don't really care how it works, as long as it does!! ;-) I will help in any way I can, but I really think we need something! [[JHK]] __________________________________________________ Do You Yahoo!? Send FREE Valentine eCards with Yahoo! Greetings! http://greetings.yahoo.com

2 1

Suing over copyright violations
by Axel Boldt 08 Feb '02

08 Feb '02

I have a couple of points regarding the "inability-to-sue" concerns: 1) Linux has precisely the same issue, and is clearly commercially a lot more relevant than Wikipedia. In all the years, nobody has ever had to file a suit. People occasionly violate the GPL inadvertently and back down once it is explained to them. 2) Bomis has standing to sue: Larry is an employee of Bomis, and as such all his contributions are automatically copyrighted by Bomis. He has edited a significant proportion of the articles on the site. 3) Imagine you're big bad Microsoft, about to willfully violate the GFDL. Which scenario is scarier: being sued by Bomis.com, or the prospect of being sued by hundreds of volunteers from around the country, even around the world, with lots of sympathetic press coverage? 4) Money for such suits wouldn't be an issue. FSF, EFF etc. would all be happy to provide pro bono lawyers. By contrast, Bomis wouldn't get nearly as much support. "Just two companies with some copyright disagreement -- who cares?" Axel

1 0

Tech update; programmers only! ;)
by Magnus Manske 08 Feb '02

08 Feb '02

I just updated the CVS with the caching mechanism I proposed on the tech talk page. It is deactivated by default, because you need to alter your database prior to use. To activate: 1. In mysql, add the field cur_cache: ALTER TABLE cur ADD cur_cache MEDIUMTEXT 2. Edit wikiText.php and remove the # from the last coding line, so that it reads $useCachedPages = true ; The one thing I didn't implement is the regular "forced update" of the cache. This could be done by counting the views and flushing the cache after, say, 50 views. It could also be time-based, so it is refereshed if the cache is more than two weeks old. (just example values) Note that pages with variables (like the Main Page with {{NUMBEROFARTICLES}}) are *not* cached. I noticed a slight performance improvement at my machine (~15%) when reading a cached page, and no noticable delay on saving the cache. "The Need For Speed" continues on this channel ;) Magnus

3 6

wikitech-l
by Jimmy Wales 08 Feb '02

08 Feb '02

I've set up a new mailing list for technical matters. Some of you have gotten notices that you're subscribed -- I took the small liberty of adding a few people by hand. I'm sure that by no means did I get everyone who wants to join in. :-( http://www.nupedia.com/mailman/listinfo/wikitech-l Is the place to go to add yourself, or to set your subscription options.

1 0

08 Feb '02

Brion, Login has been working again for some time now... I thought you knew it was fixed... maybe it was magic. ;-) Thanks, Chuck ===== Venu al la senpaga, libera enciklopedio esperanta reta! http://eo.wikipedia.com/ ==== Junuloj! Filadelfio, Usono 15an-17an de Februaro http://unumondo.com/cgi-bin/wiki.pl?Filadelfia_JES _________________________________________________________ Do You Yahoo!? Información de Estados Unidos y América Latina, en Yahoo! Noticias. Visítanos en http://noticias.espanol.yahoo.com

1 0

Optimizing the Wiki parser
by Uri Yanover 08 Feb '02

08 Feb '02

Hello, During the course of our recent discussion, Jan Hidders said: > Have you seen the parsing code? There is nothing _very light_ about it, at > the moment. As soon as I had the opportunity, I located the current Wiki parsing code (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/wikipedia/phpwiki/fpw/wikiPa ge.php?rev=1.44&content-type=text/vnd.viewcvs-markup) It occured to me that while the current code is fine as an initial version, it should be optimized if we expect the Wikipedia traffic to grow significantly, since using PHP regular expressions and string management functions to process the markup is inefficient. The obvious solution would be to write a "real" one-pass parser. Doing this stand-alone (or as CGI) would be quickest using C; I do not know whether an efficient solution exsits using PHP considering the way strings are implemented in it. However, executing C code from PHP is complicated in its own right (UNIX pipes being the simplest solution, in my opinion). I could try to re-write the parser in either PHP or C, but I wanted to ask first the members of what do they think of the subject (the extent to which the code can be optimized, which language to choose, in what technique should the parser be written, which Wiki syntax changes should be made, etc.). Sincerely yours, Uri Yanover

2 1

trolls
by koyaanisqatsi＠nupedia.com 08 Feb '02

08 Feb '02

actually--sorry, there's a lot on my plate--i just remembered that I never "minor edit" from work, because that option isn't available to me since I'm not logged in. I like that. Recently I mentioned that I sometimes un-vandalize while I'm at work, and wished for the option of banning that IP for a limited time. But I think it would--if that option is eventually offered to people--be a good idea to make it usable only when logged in. And what's the status with the passwords? Is it still the case that I can log in as Larry Sanger and Larry can log in as HJ? <g> best, kq 0

1 0

locked front page / sysop status
by Chuck Smith 08 Feb '02

08 Feb '02

Can we lock down the Main Page again? If nothing else, could we make it so that it cannot be edited by anonymous users? Thanks, Chuck ===== Venu al la senpaga, libera enciklopedio esperanta reta! http://eo.wikipedia.com/ ==== Junuloj! Filadelfio, Usono 15an-17an de Februaro http://unumondo.com/cgi-bin/wiki.pl?Filadelfia_JES _________________________________________________________ Do You Yahoo!? Información de Estados Unidos y América Latina, en Yahoo! Noticias. Visítanos en http://noticias.espanol.yahoo.com

2 1

Summary of pseudo-subpage discussion
by Magnus Manske 08 Feb '02

08 Feb '02

AFAIK, we all agree on now: * No magic character at all in page titles (thus, no real subpages) * Avoid "/" character in page titles * Some topics need an easy method of grouping related subtopics * The obvious soultion is to use "Y (X)" instead of "X/Y" * Linking within related pages should be simplyfied Two basic ways to do that: 1. Use the parser to display links differently 2. Change certain "shortcuts" on save ---- Now, for my own opinion: Personally, I am much in favor of #2, as there will be no special behaviour added that complicates both the software and the understanding of rules for newcomers. "Old hands" will know how to use this, other won't bother and will, eventually, have to type the long title. IF we could agree on #2, all we need to agree on otherwise is what kind of shortcut should be used. It should be * Simple * Fast to type * Unusual enough to not be used in a real page title So, how about * [[Elves()]] on [[Middle Earth]] becomes [[Elves (Middle Earth)]] I guess no real article would ever end on "()", except you try to write articles about functions with no parameter list ;) Magnus

3 2

← Newer
1
...
4
5
6
7
8
9
10
...
13
Older →

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

2001

Wikipedia-l February 2002