On Tue, Apr 21, 2009 at 21:15, Daniel Kinzler <daniel@brightbyte.de> wrote:
More or less - the parser parses the text and hands the part that is RDF (Turtle) to the RDF extension for analysis. The extension analyzes the statements and would save them to the database (this is not yet implemented).
There is a preprocessor that expands all templates recursively. After that, the real "parser" (read: munger) is invoked to turn wiki text into HTML.
In the case of a "semantified" infobox, the substitution process would generate RDF/Turtle statements using the template parameters. These would in turn be handed to the RDF extension, which would write the resulting triples to the database.
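As a rough illustration of that substitution step, here is a minimal Python sketch that turns template parameters into Turtle statements. Everything in it is an assumption made for illustration: the function names, the `ex:` prefix, and the parameter-to-property mapping are hypothetical and not part of any existing extension. Prefix declarations are left out for brevity.

```python
# Hypothetical sketch: none of these names come from MediaWiki or the RDF extension.

def turtle_literal(value: str) -> str:
    """Quote a string as a Turtle literal, escaping backslashes, quotes and newlines."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def infobox_to_turtle(subject: str, params: dict, property_map: dict) -> str:
    """Emit one Turtle statement per template parameter that has a mapping."""
    lines = []
    for name, value in params.items():
        prop = property_map.get(name)
        if prop is None:
            continue  # no semantic mapping: the parameter only affects the HTML
        lines.append(f"{subject} {prop} {turtle_literal(value.strip())} .")
    return "\n".join(lines)


# Made-up infobox parameters and mapping, for illustration only.
params = {"name": "Example City", "population": "12345", "image": "Example.jpg"}
property_map = {"name": "rdfs:label", "population": "ex:population"}

print(infobox_to_turtle("ex:Example_City", params, property_map))
```

Parameters without a mapping (here `image`) are simply skipped, which matches the idea that only part of a template's output is meant for the RDF side.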
Thanks! My picture of the process is becoming clearer... :-)
To reiterate: Template definitions would be extended to generate not just wikitext aimed at the HTML generator, but also output that is processed by the RDF generator and ignored by the HTML generator (or at least by the browser).
Maybe it would sometimes be better for the RDF generator to have access to the unexpanded templates?
Property values contain all kinds of stuff, and DBpedia experience shows that one often needs specialized parsers to extract only the desired info. One way to distinguish between desired and undesired info is to have some metadata about the targets of wikilinks. The RDF extension would have to be quite sophisticated...
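To make the wikilink idea concrete, here is a small Python sketch of such a specialized parser: it pulls the link targets out of a property value and keeps only those whose target has a desired type. The type lookup table is a stand-in for the "metadata about the targets of wikilinks"; the names and data are invented for illustration, and real wikitext has many more cases (nested templates, files, interwiki links) than this regular expression covers.

```python
import re

# Matches [[Target]] and [[Target|label]]; a deliberate simplification.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def wikilink_targets(value: str) -> list:
    """Return the link targets found in a property value, in order."""
    return [m.group(1).strip() for m in WIKILINK.finditer(value)]


def filter_by_type(targets: list, target_types: dict, wanted: str) -> list:
    """Keep only the targets whose (hypothetical) metadata type matches."""
    return [t for t in targets if target_types.get(t) == wanted]


# Made-up property value and metadata, for illustration only.
value = "[[Jane Example]] ([[Example Party|EP]]), since [[2005]]"
target_types = {"Jane Example": "Person", "Example Party": "Party", "2005": "Year"}

targets = wikilink_targets(value)
print(targets)
print(filter_by_type(targets, target_types, "Person"))
```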
How are updates distributed? Do subscribers regularly poll the server for recent changes? Or is there some kind of store-and-forward / publish-subscribe?
There is the RSS/Atom feed (human-readable, not easy to parse), and an OAI-PMH interface ("live update feed"). There's also the web API for polling data in a machine-readable form, and there's the RC ("recent changes") channel on IRC (human-readable, can't be parsed reliably). True XMPP-based pubsub is being worked on, see http://brightbyte.de/page/RecentChanges_via_Jabber.
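For the polling option, a minimal Python sketch against the MediaWiki web API could look like the following. It uses the `list=recentchanges` query module; the endpoint URL, the User-Agent string, and the helper names are assumptions chosen for illustration, and a real client would also need continuation handling and rate limiting.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; any MediaWiki installation exposes a similar api.php.
API_URL = "https://en.wikipedia.org/w/api.php"


def recent_changes_url(limit: int = 10) -> str:
    """Build a query URL for the most recent changes in JSON format."""
    query = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        "rclimit": str(limit),
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(query)


def changed_titles(response_text: str) -> list:
    """Extract page titles from a recentchanges JSON response."""
    data = json.loads(response_text)
    return [rc["title"] for rc in data.get("query", {}).get("recentchanges", [])]


def poll_once(limit: int = 10) -> list:
    """Fetch one batch of recent changes (performs a network request)."""
    request = urllib.request.Request(
        recent_changes_url(limit),
        headers={"User-Agent": "rc-poll-sketch/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request) as response:
        return changed_titles(response.read().decode("utf-8"))
```

A subscriber would call `poll_once()` on a timer and then fetch the article text for each changed title separately.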
Looks great! But if I understand this correctly, tools that need the whole article text would still have to pull it from the Wikipedia servers. Pushing the whole text might improve scalability. A hierarchical structure (like a content distribution network) would be cool...
An RDF extension could simply be one of these 'text updates' consumers. Performance-wise, that would duplicate some of the effort of expanding the templates etc., but distribution and separation would come "for free".
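A toy Python sketch of that arrangement, with the RDF extraction as just another subscriber to a text-update feed. The class names and the in-process feed are hypothetical stand-ins for whatever push mechanism would actually be used, and the extraction function is passed in because that is where the duplicated template expansion would happen.

```python
from typing import Callable, Dict, List


class TextUpdateFeed:
    """In-process stand-in for a feed that pushes full article text."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str, str], None]] = []

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, title: str, wikitext: str) -> None:
        # Every subscriber receives the same update, independently of the others.
        for callback in self._subscribers:
            callback(title, wikitext)


class RdfConsumer:
    """Keeps the latest extracted statements per page."""

    def __init__(self, extract: Callable[[str], List[str]]) -> None:
        self._extract = extract
        self.store: Dict[str, List[str]] = {}

    def on_update(self, title: str, wikitext: str) -> None:
        # Replace, rather than append to, the statements of an updated page.
        self.store[title] = self._extract(wikitext)


def toy_extract(wikitext: str) -> List[str]:
    """Placeholder extractor: keeps lines that already look like statements."""
    return [line for line in wikitext.splitlines() if line.startswith("ex:")]


feed = TextUpdateFeed()
consumer = RdfConsumer(toy_extract)
feed.subscribe(consumer.on_update)
feed.publish("Example", 'Some prose.\nex:Example ex:population "12345" .')
print(consumer.store)
```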
Christopher