Hello.
As Med mentioned in a mail in reply to Magnus, he & I are working on a project called 'wikirover' (exists on sourceforge, empty for now). Not very advanced yet, but you never know :p
I think some people may be interested, so I'll try to explain what we have in mind.
The aim is, ultimately, to make an offline Wikipedia that could be distributed on CD/DVD/... Not necessarily the whole encyclopedia; a subset would do (sports? movies? you name it).
The core will be a C++ library. It'll have tools to store raw articles (and their history) in an sqlite database and parse them into different formats (HTML first, then maybe others), plus search capabilities. Maybe also classes to access articles from the live Wikipedia, a mysql server, and/or a database dump.
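To make the storage part a bit more concrete, here is a very rough sketch of what the library's storage layer might look like using the sqlite3 C API. The class, table, and column names are just invented for illustration; nothing is decided yet.

// Hypothetical sketch of the WikiRover storage layer; class, table and
// column names are illustrative only, nothing here is decided yet.
#include <sqlite3.h>
#include <stdexcept>
#include <string>

class ArticleStore {
public:
    explicit ArticleStore(const std::string& path) {
        if (sqlite3_open(path.c_str(), &db_) != SQLITE_OK)
            throw std::runtime_error(sqlite3_errmsg(db_));
        const char* schema =
            "CREATE TABLE IF NOT EXISTS article ("
            "  title TEXT PRIMARY KEY,"
            "  wikitext TEXT NOT NULL,"
            "  touched TEXT)";          // last-modified timestamp
        sqlite3_exec(db_, schema, nullptr, nullptr, nullptr);
    }

    ~ArticleStore() { sqlite3_close(db_); }

    // Store (or replace) one raw article.
    void put(const std::string& title, const std::string& wikitext) {
        sqlite3_stmt* stmt = nullptr;
        sqlite3_prepare_v2(db_,
            "INSERT OR REPLACE INTO article (title, wikitext, touched) "
            "VALUES (?, ?, datetime('now'))", -1, &stmt, nullptr);
        sqlite3_bind_text(stmt, 1, title.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, wikitext.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_finalize(stmt);
    }

private:
    sqlite3* db_ = nullptr;
};

Search and the parsing to HTML would then sit on top of a table like this; that part is still completely open.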
Then an application that will actually display articles from the local database. It'll use different raw-article sources and merge them (e.g. first download the 'movies' database, then 'actors/actresses'; it will merge them and make the correct links between topics). It will also have update capability (from the live Wikipedia? a dump? why not :p), and things like that (basically, what you'd expect from a regular encyclopedia).
And, potentially, a tool to manage those theme-related bases.
We'll try to make it cross-platform (we can test on Linux & Windows). The library will use platform-independent code; the applications will probably use wxWidgets (it depends, nothing is sure yet :p).
Nicolas
I like the WikiRover, but something IMO it needs is a formal parser for wikitext. I haven't looked at MediaWiki 1.3, but last time I looked at the parser (1.1 maybe?) it wasn't actually a parser, but a bunch of regular expressions applied to the flat Wiki file, with some hacks like replacing math sections with a unique text string to avoid them getting clobbered. All that makes repurposing it for other things a bit difficult. As a side note, it also seems to make extending it difficult---it seems to be the reason (unless I'm missing something else) for limitations like "you can't have links inside of image captions", because regular expressions have a more limited expressiveness than context-free grammars do, so can't distinguish a ]] closing an internal link from the ]] closing the image.
I've been thinking of doing it for a while, but my main hang-up, apart from lack of time, is the lack of a good parser generator for the full class of context-free grammars. Most require you to have LALR(1) grammars, and maintaining the wikimarkup specification in such a form, not to mention getting it there in the first place, would be a nightmare, since wikitext isn't particularly designed with it in mind the way many programming languages are (it's hard even to scan wikitext with a lexer to unambiguously find terminals). One technique that both takes the full class of grammars and needs no separate lexer is "packrat parsing", which was someone's recent master's thesis and has been implemented so far in Haskell and Java. It's very fast, O(n), but also takes O(n) space, where n is the size of the document being parsed, which isn't so good (an LALR(1) parser takes O(k) space, where k is the maximum nesting depth). A packrat parser on one of the larger Wikipedia articles (say, 100 KB) would take around 0.5-1 second and 4 MB of RAM to parse. That's fine for offline generation, but would be impossible to use on wikipedia.org, and it'd be ideal if eventually we could have one grammar that is used for everything, instead of keeping differently-specified things in approximate sync.
The other possibilities I've found are:

1. Bite the bullet and try to shove wikitext into LALR(1). Not very fun, and it might not even be possible.

2. Write a hand-coded pseudo-recursive-descent parser (but the nature of wikitext means this requires unbounded lookahead to resolve ambiguities).

3. Use a GLR parser generator like Berkeley's Elkhound. This might be doable, but Elkhound is a bit hard to use. Or I may just not have looked enough.

4. Use packrat parsing, but change things up so articles get parsed on edit instead of on view, which happens only on the order of a few tens of thousands of times per day, and have views generated from pre-parsed abstract representations (or perhaps even already-generated HTML-with-blanks that just needs link-coloring and date-format preferences filled in). A rough sketch of the memoization idea behind this option follows below.
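To make the packrat idea concrete, here is a toy C++ sketch of a memoizing recursive-descent parser for a tiny invented subset of wikitext (just nested [[...]] links). It is not MediaWiki or WikiRover code, just an illustration of how a memo table keyed by (rule, position) buys linear time at the cost of linear space.

// Illustrative packrat-style parser for a toy subset of wikitext:
//   Inline <- (Link / any character except "]]")*
//   Link   <- "[[" Inline "]]"
// The grammar and names are invented for illustration only. The memo
// table keyed by (rule, position) is what gives linear time at the
// cost of O(n) space.
#include <map>
#include <string>
#include <utility>

class ToyParser {
public:
    explicit ToyParser(std::string text) : src_(std::move(text)) {}

    // Returns true if the whole input is a well-formed Inline sequence.
    bool parse() { return inlineSeq(0) == src_.size(); }

private:
    enum Rule { INLINE, LINK };
    static constexpr size_t FAIL = static_cast<size_t>(-1);

    // Inline: consume links and plain characters until "]]" or end of input.
    size_t inlineSeq(size_t pos) {
        if (size_t* m = lookup(INLINE, pos)) return *m;
        size_t p = pos;
        while (p < src_.size() && src_.compare(p, 2, "]]") != 0) {
            size_t after = link(p);
            p = (after != FAIL) ? after : p + 1;   // a link, or one plain char
        }
        return remember(INLINE, pos, p);
    }

    // Link: "[[" Inline "]]"
    size_t link(size_t pos) {
        if (size_t* m = lookup(LINK, pos)) return *m;
        if (src_.compare(pos, 2, "[[") != 0) return remember(LINK, pos, FAIL);
        size_t inner = inlineSeq(pos + 2);
        if (src_.compare(inner, 2, "]]") != 0) return remember(LINK, pos, FAIL);
        return remember(LINK, pos, inner + 2);
    }

    size_t* lookup(Rule r, size_t pos) {
        auto it = memo_.find({r, pos});
        return it == memo_.end() ? nullptr : &it->second;
    }
    size_t remember(Rule r, size_t pos, size_t result) {
        memo_[{r, pos}] = result;
        return result;
    }

    std::string src_;
    std::map<std::pair<Rule, size_t>, size_t> memo_;
};

The recursion is what lets it tell a ]] that closes an inner link apart from the ]] that closes an enclosing construct; the memo table is what keeps the backtracking from going exponential.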
Anyone have any thoughts in this direction, or suggestions? Is this even worth doing at all? It seems like having wikitext formally specified would be nice, because it would allow for easy extensions, like the "links inside of image captions" example mentioned above, and easy retargeting to any other sort of output format. But doing it for wikitext seems difficult: most programming languages are specifically designed with clean lexing followed by LALR(1) parsing in mind. That's not meant as a criticism of wikitext, by the way; it's clearly supposed to be person-readable, with machine readability a distant second, but that does make it rather difficult to deal with given the current state of parsing technology.
-Mark
On Mon, May 24, 2004 at 10:30:50PM -0500, Delirium wrote:
I like the WikiRover, but something IMO it needs is a formal parser for wikitext. I haven't looked at MediaWiki 1.3, but last time I looked at the parser (1.1 maybe?) it wasn't actually a parser, but a bunch of regular expressions applied to the flat Wiki file, with some hacks like replacing math sections with a unique text string to avoid them getting clobbered.
You should perhaps have a look at 1.3 first. Parts of the Parser are already a real parser, reading the wikitext in one pass, character by character. See Tokenizer.php and its use in Parser.php. This work is not yet completed, so the regexes still exist for some parts of the markup.
Regards,
JeLuF
Jens Frank wrote:
You should perhaps have a look at 1.3 first. Parts of the Parser are already a real parser, reading the wikitext in one pass, character by character. See Tokenizer.php and its use in Parser.php. This work is not yet completed, so the regexes still exist for some parts of the markup.
I hadn't seen this bit of the parser. Last time I looked at it, it was still splitting the string using regexes. When I saw the way you do it currently, I have to admit I went into a bit of a panic. In my experience reading a large string character by character in a high level language is a very bad idea. Indeed, our "parser" to date has gone to some lengths to avoid this, using regexes in all sorts of contrived ways to avoid executing a number of PHP lines proportional to the number of characters.
After I calmed down, I fixed up the profiler and did a couple of runs. Gabriel Wicke did some too, using ab. They're at:
http://meta.wikipedia.org/wiki/Profiling
They show that the page view time for the current CVS HEAD is double what it was in 1.2.5. The parser itself was roughly 2.4 times slower.
This is completely unacceptable considering the current state of our web serving hardware. The latest batch of 1U servers won't cover the penalty from upgrading to 1.3. Our web servers are not keeping up with demand as it is; during peak times their queues all overflow, giving users random error messages.
The current plan is to revert the tokenizer sections of the parser back to something similar to 1.2. Hopefully we'll get it working soon, since the Board vote feature I've written is a 1.3 extension and voting is meant to start in 4 days.
-- Tim Starling
Nicolas-
The aim will be, ultimately, to make an offline Wikipedia, that could be distributed on CD/DVD/...
This is a great idea, and several people have been working on similar ideas.
What I don't get is why you want to parse the raw wikitext. This will be quite a PITA unless you also bundle PHP, texvc etc. It would be much easier to use the existing parser code to create a static HTML dump from the wikisource, and use something like swish-E ( www.swish-e.org ) to index that HTML dump. This would give you more time to focus on what actually matters, i.e. the user interface.
As I recall, Tim is working on something like that. Maybe, while none of these projects is very far along yet, these efforts could be merged into one?
Besides Tim, there's Magnus, who has hacked together his stand-alone webserver/wiki engine for Windows, and there's an existing alpha-quality static HTML dumper called terodump by Tero Karvinen: http://www.hut.fi/~tkarvine/tero-dump/
So how about it - Magnus, Tim, Tero, Nicolas, do you think you could work together on one solution?
Regards,
Erik
Hi,
There's also Alfio Puglisi who wrote a script to generate static HTML.
He has archives of multiple languages for download:
http://www.tommasoconforti.com/wiki/
en is 800 MB (bz2-compressed); it should be no problem to put it on a DVD uncompressed.
Cheers, Stephan
On Tue, 25 May 2004, Stephan Walter wrote:
Hi,
There's also Alfio Puglisi who wrote a script to generate static HTML.
He has archives of multiple languages for download:
http://www.tommasoconforti.com/wiki/
en is 800 MB (bz2-compressed); it should be no problem to put it on a DVD uncompressed.
Yes, but that's quite an old dump, and it's growing fast. The last HTML version I did was well past 1 GB still bzip2-compressed, and uncompressed it took up a significant fraction of a DVD :-) If the en wiki continues to grow this fast, it will overflow even a DVD in the not too distant future.
Alfio
What is the size of the W in html uncompressed?
Could that be loaded onto a PC from multiple CDs, or even fetched via FTP, a la apt...
I can imagine an offsite versioning system that checks for the latest version of all articles and self-updates accordingly. Also, drop the histories; they're on the website.
If I had a 4 GB offline copy that was self-updateable (when on the network) or updateable via new-article CDs (using Python, of course, for the updating), that might be a good way to keep the W around for offline browsing.
Thoughts?
=====
Christopher Mahan
chris_mahan@yahoo.com
818.943.1850 cell
http://www.christophermahan.com/
Christopher Mahan wrote:
I can imagine an offsite versioning system that checks for the latest version of all articles and self-updates accordingly. Also, drop the histories; they're on the website.
If I had a 4 GB offline copy that was self-updateable (when on the network) or updateable via new-article CDs (using Python, of course, for the updating), that might be a good way to keep the W around for offline browsing.
It is our intention to implement live update when there is an internet connection available. The idea was to keep only cur for offline browsing, and in fact not all of cur, just namespace=0, to reduce the size.
Cheers,
Med
On Tue, 25 May 2004, Christopher Mahan wrote:
What is the size of the W in html uncompressed?
The last full en: uncompressed dump I have is from February 12 (so it's quite old). Approximate figures:
- 250,000 HTML files, total 1.7 GB on a Linux ext3 disk
- 44,000 image and media files, total 1.5 GB
Some of the size is due to filesystem overhead, but I don't remember the block structure of ext3, so I cannot quantify it. Also, since each article is a separate HTML file, each one carries its own header/footer.
I'll see if I can make a new html version with more recent data.
Alfio
Quoting Erik Moeller erik_moeller@gmx.de:
What I don't get is why you want to parse the raw wikitext. This will be quite a PITA unless you also bundle PHP, texvc etc. It would be much easier to use the existing parser code to create a static HTML dump from the wikisource, and use something like swish-E ( www.swish-e.org ) to index that HTML dump. This would give you more time to focus on what actually matters, i.e. the user interface.
This seems like a pretty hackish solution long-term. The HTML dump has some semantic information, but it also has a lot of HTML-ish cruft in it. The wikitext doesn't have all the semantic information anyone might want, but it's much better than the HTML version. If you're going to do anything reasonably intelligent with the output (other than just display the rendered HTML), or output it to some different format (TeX is the one I've been working on), and want it automated, a lot of that information will be useful.
So, basically: Wikitext --> abstract syntax --> a presentation format (HTML, TeX, etc.)
instead of: Wikitext --> one presentation format (HTML) --> another presentation format
...seems better to me.
The latter version is sort of like compiling a C++ program into x86 assembly and then transforming it into PowerPC assembly from that, rather than doing what gcc does: compiling C++ into an abstract intermediate representation, which can then be output to x86 assembly or PowerPC assembly or whatever you might like.
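To illustrate the kind of thing I mean by an abstract intermediate representation, here is a rough C++ sketch, with invented node kinds, not a proposal for the actual format: one tree, several back ends reading it.

// Rough sketch of the "abstract syntax in the middle" idea: one tree,
// several back ends. Node kinds and emitters are invented for illustration.
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Node {
    enum Kind { TEXT, ITALIC, MATH, LINK } kind;
    std::string value;                            // text content or link target
    std::vector<std::unique_ptr<Node>> children;  // nested inline content
};

// HTML back end.
void emitHtml(const Node& n, std::ostream& out) {
    switch (n.kind) {
    case Node::TEXT:   out << n.value; break;
    case Node::ITALIC: out << "<i>";
                       for (auto& c : n.children) emitHtml(*c, out);
                       out << "</i>"; break;
    case Node::MATH:   out << "<span class=\"math\">" << n.value << "</span>"; break;
    case Node::LINK:   out << "<a href=\"" << n.value << "\">";
                       for (auto& c : n.children) emitHtml(*c, out);
                       out << "</a>"; break;
    }
}

// TeX back end working from the same tree.
void emitTex(const Node& n, std::ostream& out) {
    switch (n.kind) {
    case Node::TEXT:   out << n.value; break;
    case Node::ITALIC: out << "\\emph{";
                       for (auto& c : n.children) emitTex(*c, out);
                       out << "}"; break;
    case Node::MATH:   out << "$" << n.value << "$"; break;
    case Node::LINK:   for (auto& c : n.children) emitTex(*c, out); break;  // plain text in print
    }
}

The point is just that the HTML and TeX emitters read the same structure, instead of one output format being scraped out of the other.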
It does bring up another point, though: even in the wikitext there isn't as much semantic information as might be nice. Some of it is hard to come up with good markup for, but some is fairly easy. For example, encouraging people to use <math> tags for everything that's logically math, even short things like "the variable <math>g</math> is...", instead of using manual, non-logical formatting commands like "italicize". Or, even worse, using HTML-specific stuff like fancy divs.
-Mark
delirium-
This seems like a pretty hackish solution long-term.
HTML is not hackish. It can be printed and viewed on all platforms using free libraries, it can be easily indexed, and it can be converted to literally everything else using existing code. Semantic wikimarkup can be replaced with CSS classes for the appropriate tags. We are already producing well-formed xhtml pages with virtually every aspect of layout defined in CSS. This then puts us on the road to producing pure XML when browsers support it. A CSS wizard like Gabriel might even be able to hack something like alternative MathML/PNG display for formulas, rendered based on the user's CSS file.
To use our own syntax, which is indeed hackish (a mix of UseMod and our own extensions, with ugly constructs like "#REDIRECT", pseudo-HTML tags, tags with undefined scope etc.), or to define a new "abstract syntax" would be a waste of valuable developer resources. Any semantic information that may still be missing from the output should instead be included in it.
Wikitext is meant to be read by humans, not by computers. The only program that should read it is our own parser. Keeping a separate parser in sync with ours (or, in fact, a completely new "abstract" syntax) would be a maintenance nightmare. Our syntax may look simple, but it is amazingly complex - features like transclusion, parametrized templates, image scaling, wiki table conversion, and various extensions are not to be scoffed at. And new ones are being added all the time. That is good! The fact that we support stuff like hieroglyphs makes Wikipedia much more interesting for academics.
Now, rewriting the parser in another language to create a fast cross-platform library might make sense, but for completely different reasons. The only reason I can see to include the wikitext on a CD-ROM would be for offline editing, but for that you need a fresh copy of the source to avoid edit conflicts, so you need to fetch it from the site anyway.
Erik
Quoting Erik Moeller erik_moeller@gmx.de:
delirium-
This seems like a pretty hackish solution long-term.
HTML is not hackish.
On the contrary, HTML is incredibly hackish. It is not a semantic markup language, and attempts to make it so (CSS classes and whatnot) are hacks to try to shove some semantic information into what is a layout language. Your additional comments aren't even remotely related to the matter: yes, there are free libraries to display HTML, but that only indicates it's a usable display language, not a language for passing information in an abstract, non-layout form. Are you seriously proposing that, say, an HTML table is a good abstract format for information? That writing a horribly ugly parser to convert HTML to LaTeX is the best way of typesetting Wikipedia articles in LaTeX?
I agree that wikitext should be parsed once, but it should be parsed once to a semantic format. If you must take something from the cesspool of HTML-related technologies, XML would probably be the least offensive choice, but given that there exist 40 years of mature technologies for parsing concrete syntax (like wikitext) into abstract syntax trees, I don't see why we should use such a hackish dot-com-style solution.
-Mark
On Tue, May 25, 2004 at 09:00:29PM -0400, delirium@hackish.org wrote:
This seems like a pretty hackish solution long-term. The HTML dump has some semantic information, but it also has a lot of HTML-ish cruft in it. The wikitext doesn't have all the semantic information anyone might want, but it's much better than the HTML version. If you're going to do anything reasonably intelligent with the output (other than just display the rendered HTML), or output it to some different format (TeX is the one I've been working on), and want it automated, a lot of that information will be useful.
But that's not the purpose. The CD is just meant for reading the articles and perhaps printing them. If someone wants to do data mining, or convert to a different format, or whatever else they want to do with the semantic markup, they should get the database dump.
So, basically: Wikitext --> abstract syntax --> a presentation format (HTML, TeX, etc.)
instead of: Wikitext --> one presentation format (HTML) --> another presentation format
...seems better to me.
The latter version is sort of like compiling a C++ program into x86 assembly and then transforming it into PowerPC assembly from that, rather than doing what gcc does: compiling C++ into an abstract intermediate representation, which can then be output to x86 assembly or PowerPC assembly or whatever you might like.
It does bring up another point, though: even in the wikitext there isn't as much semantic information as might be nice. Some of it is hard to come up with good markup for, but some is fairly easy. For example, encouraging people to use <math> tags for everything that's logically math, even short things like "the variable <math>g</math> is...", instead of using manual, non-logical formatting commands like "italicize". Or, even worse, using HTML-specific stuff like fancy divs.
That'd be nice, but unfortunately it's not enforceable. The web was originally intended to have semantic tags, but guess what happened? There are two reasons why it won't work on Wikipedia. First, people generally think in terms of presentational rather than structural markup; that might be a result of WYSIWYG word processors, I don't know. Second, even those who feel semantic markup is important (and I offer myself as an example) aren't going to be bothered to write <math>x</math> instead of ''x''. The primary goal of Wikipedia is parsability by humans rather than computers, and the former application is currently so predominant over the latter that I'm not willing to inconvenience myself for the sake of semantic markup.
Arvind
Quoting Arvind Narayanan arvindn@meenakshi.cs.iitm.ernet.in:
Second, even those who feel semantic markup is important (and I offer myself as an example) aren't going to be bothered to write <math>x</math> instead of ''x''. The primary goal of Wikipedia is parsability by humans rather than computers, and the former application is currently so predominant over the latter that I'm not willing to inconvenience myself for the sake of semantic markup.
But that's the point: <math>x</math> is parsable by humans. It means x is a variable, and is instantly understandable. Saying "x is typeset in italics" is a typography detail that means nothing abstractly to humans (except perhaps humans overly familiar with the details of typesetting math textbooks). It also makes Wikipedia basically a web encyclopedia, or else an ugly-looking print encyclopedia.
-Mark