> Actually that is what I also first had in mind, but now I
> think that this would be a bad idea. It makes the code
> more complicated and therefore harder to debug and
> harder to introduce new and/or changed mark-up
> features. It is important that the code is kept so simple
> that newcomers who find a bug should in principle, if
> they know PHP and MySQL, be able to find the
> problem and send a patch.
Well, while ease of debugging is important, it
is still possible to write a good parser that would
be easy to debug; keeping it simple for as long as
possible might lead to very hasty and ugly attempts
to optimize when it's too late.
> Moreover, I don't think that doing it with
> regular expressions is that inefficient. The regular
> expressions, especially the Perl compatible ones, are
> very well implemented in PHP and most wiki systems
> do it that way. If you are clever with perl regular expressions
> there is still a lot of optimization that you can do. If you
> want inspiration look at the implementation of PhpWiki.
Looking at PhpWiki was exactly what I did. And I
know that PHP regexps are good, but they still take
time to be evaluated. As of now (v. 1.51), the code
uses regexps (and str_replaces) for the following:
1. Image linking
2. External links
3. Wiki variable substitution
4. Wiki-style text formatting
5. Removal of "forbidden tags"
6. Auto-number headings
7. ISBNs
In addition to that, the internal-link check explodes the
text into an array, which is expensive both in terms of the
extra pass over the text and the added use of memory.
Therefore, the code discussed here passes over the text
at least 8 times, often doing the checking against
complicated regular expressions. However good PHP regexp
performance may be, converting from one
complicated markup (Wiki) to another (HTML) is
simply not the task regular expressions were intended for.
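To make the cost concrete, here is a minimal sketch (in Python, purely for illustration; the rules are simplified stand-ins for the real ones) of the multi-pass substitution approach described above -- each rule is a separate pass over the whole text:

```python
import re

def render(text):
    """Apply wiki-to-HTML rules as separate passes over the whole
    text, mirroring the multi-pass structure described above.
    The rules here are simplified stand-ins, not the real ones."""
    # Pass 1: external links  [http://example.com label]
    text = re.sub(r"\[(https?://\S+) ([^\]]+)\]",
                  r'<a href="\1">\2</a>', text)
    # Pass 2: bold  '''text'''
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    # Pass 3: italics  ''text''
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)
    return text
```

With the real rule set, every one of those `re.sub` calls walks the entire page again, which is exactly the repeated-pass cost at issue.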
> My advice would be: first try it in PHP with the PCRE's.
> If that doesn't work, write a real one-pass parser in PHP. If that
> also doesn't work, then you can start thinking about C and, for example,
> yacc/bison. As for Wiki syntax changes, IMO we should first get the
> current syntax efficiently and error-free working, and only then start
> thinking about changes and extensions.
For reasons stated above, I don't think PCREs will do
much better; all regexps will have the same problems.
Having thought a bit about the subject, a one-pass PHP
parser is certainly the easiest solution to write and
maintain; however I'd feel much better if I knew for sure
that PHP does not create a penalty for fetching
characters out of a string one at a time.
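For comparison, a one-pass scanner looks like this (again a Python sketch with a deliberately tiny rule set; a real parser would dispatch on many more tokens, but the single cursor over the text is the point):

```python
def parse_bold(text):
    """Minimal one-pass scanner: converts '''bold''' by walking the
    string once with an explicit cursor, in the style of the parser
    discussed above."""
    out = []
    i = 0
    in_bold = False
    n = len(text)
    while i < n:
        if text.startswith("'''", i):
            out.append("</b>" if in_bold else "<b>")
            in_bold = not in_bold
            i += 3
        else:
            out.append(text[i])
            i += 1
    if in_bold:  # unterminated markup: emit the closing tag anyway
        out.append("</b>")
    return "".join(out)
```

Whether the per-character fetch (`text[i]` here, or its PHP equivalent) is cheap enough is precisely the open question about PHP's string implementation.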
Sincerely yours,
Uri Yanover
Hi all --
Magnus, it wasn't really aimed all at you, because
frankly I'm not sure who does what!
Although I truly appreciate the documentation =
ironing thing, there needs to be something. And
unfortunately, your answer was sooooo programmer!
Most people aren't. I'm lucky because I've actually
had to translate geekspeak to sales/customer support
in a couple of jobs, so I had an inkling, but frankly
the discussion bored the hell out of me because it
never got to the user level.
Yes, you guys let interested parties know what was
going on -- i.e., that changes were being made -- but
not how those changes would affect the site and its
use in user-speak.
Here are some very fundamental issues that need
addressing:
1) What happened to sub-pages and why
2) What the HELL is a namespace? (And a ton of related
questions -- like, if there are no subpages, why is
there a talk link?)
3) What changes have been made to the use of certain
punctuation marks in titles?
4) Where subpages once existed, how did the conversion
affect them?
5) How do I redirect? (If it's the same as before, it
doesn't seem to work with articles titled with
people's names.)
Lots of questions! NO answers. I suggest a New
Features special page that has FAQs and a kind of tour
for newbies and those of us who have been around a
while but just want to use the site -- and don't
really care how it works, as long as it does!! ;-)
I will help in any way I can, but I really think we
need something! [[JHK]]
__________________________________________________
Do You Yahoo!?
Send FREE Valentine eCards with Yahoo! Greetings!
http://greetings.yahoo.com
I have a couple of points regarding the "inability-to-sue" concerns:
1) Linux has precisely the same issue, and is clearly commercially a
lot more relevant than Wikipedia. In all the years, nobody has ever
had to file a suit. People occasionally violate the GPL inadvertently
and back down once it is explained to them.
2) Bomis has standing to sue: Larry is an employee of Bomis, and as
such all his contributions are automatically copyrighted by Bomis.
He has edited a significant proportion of the articles on the site.
3) Imagine you're big bad Microsoft, about to willfully violate the
GFDL. Which scenario is scarier: being sued by Bomis.com, or the
prospect of being sued by hundreds of volunteers from around the
country, even around the world, with lots of sympathetic press
coverage?
4) Money for such suits wouldn't be an issue. FSF, EFF etc. would all
be happy to provide pro bono lawyers. By contrast, Bomis wouldn't
get nearly as much support. "Just two companies with some copyright
disagreement -- who cares?"
Axel
I just updated the CVS with the caching mechanism I proposed on the tech
talk page.
It is deactivated by default, because you need to alter your database prior
to use.
To activate:
1. In mysql, add the field cur_cache:
ALTER TABLE cur ADD cur_cache MEDIUMTEXT
2. Edit wikiText.php and remove the # from the last coding line, so that it
reads
$useCachedPages = true;
The one thing I didn't implement is the regular "forced update" of the
cache. This could be done by counting the views and flushing the cache
after, say, 50 views. It could also be time-based, so it is refreshed if
the cache is more than two weeks old. (Just example values.)
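A sketch of how the two forced-update policies could combine (illustrative only; the thresholds are the example values above, and the function name is made up):

```python
import time

VIEW_LIMIT = 50            # flush after this many views (example value)
MAX_AGE = 14 * 24 * 3600   # two weeks in seconds (example value)

def cache_is_stale(views_since_cached, cached_at, now=None):
    """Combine the two forced-update policies mentioned above:
    flush after N views or after a maximum age, whichever comes
    first. Hypothetical helper, not part of the committed code."""
    if now is None:
        now = time.time()
    return views_since_cached >= VIEW_LIMIT or (now - cached_at) > MAX_AGE
```

On a save, the row's `cur_cache` would simply be cleared; this check only governs the periodic refresh of pages that are read but rarely edited.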
Note that pages with variables (like the Main Page with
{{NUMBEROFARTICLES}}) are *not* cached.
I noticed a slight performance improvement on my machine (~15%) when reading
a cached page, and no noticeable delay on saving the cache.
"The Need For Speed" continues on this channel ;)
Magnus
I've set up a new mailing list for technical matters.
Some of you have gotten notices that you're subscribed -- I took the small
liberty of adding a few people by hand. I'm sure that by no means did I get
everyone who wants to join in. :-(
http://www.nupedia.com/mailman/listinfo/wikitech-l
Is the place to go to add yourself, or to set your subscription options.
Brion,
Login has been working again for some time now... I
thought you knew it was fixed... maybe it was magic.
;-)
Thanks,
Chuck
=====
Come to the free, libre Esperanto
encyclopedia online! http://eo.wikipedia.com/
====
Young people! Philadelphia, USA, February 15th-17th
http://unumondo.com/cgi-bin/wiki.pl?Filadelfia_JES
_________________________________________________________
Do You Yahoo!?
News from the United States and Latin America, on Yahoo! Noticias.
Visit us at http://noticias.espanol.yahoo.com
Hello,
During the course of our recent discussion, Jan
Hidders said:
> Have you seen the parsing code? There is nothing _very light_ about it, at
> the moment.
As soon as I had the opportunity, I located the
current Wiki parsing code
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/wikipedia/phpwiki/fpw/wikiPage.php?rev=1.44&content-type=text/vnd.viewcvs-markup)
It occurred to me that while the current code is fine
as an initial version, it should be optimized if we
expect the Wikipedia traffic to grow significantly,
since using PHP regular expressions and string
management functions to process the markup is
inefficient.
The obvious solution would be to write a "real"
one-pass parser. Doing this stand-alone (or as CGI)
would be quickest using C; I do not know whether an
efficient solution exists using PHP, considering the
way strings are implemented in it. However,
executing C code from PHP is complicated in its
own right (UNIX pipes being the simplest solution,
in my opinion).
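For illustration, the pipe approach could look like this (sketched in Python rather than PHP; `wikiparse` is a hypothetical binary that reads wiki markup on stdin and writes HTML on stdout):

```python
import subprocess

def render_via_pipe(wikitext, parser_path="./wikiparse"):
    """Hand the raw wiki text to an external parser binary over a
    pipe and read the rendered HTML back -- the simplest way to
    call C code from a scripting language, as discussed above.
    'wikiparse' is a hypothetical stdin-to-stdout filter."""
    result = subprocess.run([parser_path], input=wikitext,
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The cost of this approach is one process spawn per rendered page, which is why it only pays off if the parser itself is the bottleneck.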
I could try to re-write the parser in either PHP
or C, but I wanted to ask the list members first
what they think of the subject (the extent
to which the code can be optimized, which
language to choose, what technique the
parser should be written in, which Wiki syntax
changes should be made, etc.).
Sincerely yours,
Uri Yanover
actually--sorry, there's a lot on my plate--i just remembered that I never "minor edit" from work, because that option isn't available to me since I'm not logged in. I like that. Recently I mentioned that I sometimes un-vandalize while I'm at work, and wished for the option of banning that IP for a limited time. But I think it would--if that option is eventually offered to people--be a good idea to make it usable only when logged in.
And what's the status with the passwords? Is it still the case that I can log in as Larry Sanger and Larry can log in as HJ? <g>
best,
kq
Can we lock down the Main Page again? If nothing
else, could we make it so that it cannot be edited by
anonymous users?
Thanks,
Chuck
AFAIK, we all agree on the following by now:
* No magic character at all in page titles (thus, no real subpages)
* Avoid "/" character in page titles
* Some topics need an easy method of grouping related subtopics
* The obvious solution is to use "Y (X)" instead of "X/Y"
* Linking within related pages should be simplified
Two basic ways to do that:
1. Use the parser to display links differently
2. Change certain "shortcuts" on save
---- Now, for my own opinion:
Personally, I am much in favor of #2, as there will be no special behaviour
added that complicates both the software and the understanding of rules for
newcomers. "Old hands" will know how to use this, others won't bother and
will, eventually, have to type the long title.
IF we could agree on #2, all we need to agree on otherwise is what kind of
shortcut should be used. It should be
* Simple
* Fast to type
* Unusual enough to not be used in a real page title
So, how about
* [[Elves()]] on [[Middle Earth]] becomes [[Elves (Middle Earth)]]
I guess no real article would ever end on "()", unless you try to write
articles about functions with no parameter list ;)
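A sketch of what the on-save rewriting in option #2 could look like (Python for illustration; the "()" marker is the proposal above, not settled syntax):

```python
import re

def expand_shortcuts(wikitext, page_title):
    """On save, rewrite the proposed [[X()]] shortcut into the full
    [[X (page_title)]] form, as suggested above. Illustrative only:
    the marker and the rewriting rule are still under discussion."""
    return re.sub(r"\[\[([^\[\]|]+)\(\)\]\]",
                  lambda m: "[[%s (%s)]]" % (m.group(1).strip(), page_title),
                  wikitext)
```

Since the expansion happens once at save time, the parser and the rendering path stay exactly as they are -- which is the whole appeal of option #2.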
Magnus