On Tue, Apr 21, 2009 at 21:15, Daniel Kinzler <daniel(a)brightbyte.de> wrote:
More or less - the parser parses the text, and hands the bit that is RDF
(Turtle) to the RDF-Extension for analysis. It analyzes the statements and would
save them to the database (this is not yet implemented).
There is a preprocessor that expands all templates recursively. After that, the
real "parser" (read: munger) is invoked to turn wiki text into HTML.
In the case of a "semantified" infobox, the substitution process would generate
RDF/Turtle statements using the template parameters. These would in turn be
handed to the RDF extension, which would write the resulting triples to the
database.
Thanks! My picture of the process is becoming clearer... :-)
To reiterate: Template definitions would be extended to generate
not just Wikitext aimed at the HTML generator, but also stuff that
is processed by the RDF generator but ignored by the HTML
generator (or at least by the browser).
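
To make the idea concrete, here is a rough Python sketch of what such a
substitution step might emit for an infobox. Everything in it is an
assumption for illustration: the function names, the "ex:" prefix, and the
one-to-one mapping from parameter names to properties are hypothetical and
not taken from the RDF extension, and the escaping is deliberately minimal.

```python
def escape_literal(value):
    """Escape backslashes, double quotes and newlines for a Turtle string."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def infobox_to_turtle(subject, params, prefix="ex"):
    """Render template parameters as one Turtle block for a page.

    Assumption: subject and parameter names are already valid local names.
    A real mapping would need a proper vocabulary and IRI escaping.
    """
    # Skip parameters that were left empty in the template call.
    items = [(name, value.strip()) for name, value in params.items()
             if value.strip()]
    if not items:
        return ""
    lines = [f"{prefix}:{subject}"]
    for i, (name, value) in enumerate(items):
        # The last predicate/object pair ends the statement with " ."
        end = " ." if i == len(items) - 1 else " ;"
        lines.append(f'    {prefix}:{name} "{escape_literal(value)}"{end}')
    return "\n".join(lines)
```

Called as infobox_to_turtle("Example_Town", {"name": "Example Town"}), this
yields a subject line followed by one indented predicate/object line. The
HTML side of the template would be generated separately and is not shown.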
Maybe it would sometimes be better for the RDF generator to have
access to the unexpanded templates?
Property values contain all kinds of stuff, and DBpedia experience
shows that one often needs specialized parsers to extract only
the desired info. One way to distinguish between desired and
undesired info is to have some metadata about the targets of
wikilinks. The RDF extension would have to be quite sophisticated...
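
As a small sketch of such a specialized parser, the Python below pulls
wikilink targets out of a raw property value and keeps only those whose
metadata matches a wanted type. The page_types lookup is a hypothetical
stand-in for "metadata about the targets of wikilinks"; where that metadata
would really come from is left open, and the regular expression only covers
plain [[Target]], [[Target|label]] and [[Target#section]] links.

```python
import re

# Captures the page title; an optional #section and |label are skipped.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def link_targets(value):
    """Return the page titles linked from a raw property value, in order."""
    return [m.group(1).strip() for m in WIKILINK.finditer(value)]


def desired_targets(value, page_types, wanted_type):
    """Keep only link targets whose (hypothetical) type matches wanted_type."""
    return [title for title in link_targets(value)
            if page_types.get(title) == wanted_type]
```

For a "birth place" property, for example, one would ask for targets of type
"place" and drop linked people or years that happen to appear in the same
value.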
How are updates distributed? Do subscribers regularly poll
the server for recent changes? Or is there some kind of
store-and-forward / publish-subscribe?
There is the RSS/Atom feed (human readable, not easy to parse), and an OAI-PMH
interface ("live update feed"). There's also the web API for polling data in a
machine readable form, and there's the RC ("recent changes") channel on IRC
(human readable, can't be parsed reliably). True XMPP based pubsub is being
worked on, see <http://brightbyte.de/page/RecentChanges_via_Jabber>.
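
For the polling option, a minimal Python sketch could look like the
following. It builds a recent-changes query for the MediaWiki web API and
parses a JSON reply. The parameter names (action=query, list=recentchanges,
rcprop, rclimit, format=json) are those of the current API documentation and
may not match exactly what was available at the time of this thread; the
helper names and the api_base URL are hypothetical.

```python
import json
from urllib.parse import urlencode


def recent_changes_url(api_base, limit=10):
    """Build the polling URL for the most recent changes."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        "rclimit": str(limit),
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


def parse_recent_changes(body):
    """Extract (title, revid, timestamp) tuples from a JSON response body."""
    data = json.loads(body)
    changes = data.get("query", {}).get("recentchanges", [])
    return [(c["title"], c["revid"], c["timestamp"]) for c in changes]
```

A subscriber would fetch the URL periodically (for instance with
urllib.request.urlopen) and pass the response body to parse_recent_changes.
Note that this only reports which pages changed; the article text itself
still has to be pulled in a second request.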
Looks great! But if I understand this correctly, tools that need the whole
article text would still have to pull it from Wikipedia servers. Pushing the
whole text might improve scalability. A hierarchical structure (like a
content distribution network) would be cool...
An RDF extension could simply be one of these 'text updates' consumers.
Performance-wise, that would duplicate some of the effort of expanding the
templates etc., but distribution and separation would come "for free".
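
Roughly, in Python, the consumer arrangement might be sketched as below.
This is an in-process toy, not a real transport: TextUpdateFeed stands in
for whatever push channel would carry the full text, and the
expand_templates and extract_triples callables are hypothetical placeholders
for the template expansion and RDF analysis steps discussed above.

```python
class TextUpdateFeed:
    """Hypothetical in-process stand-in for a 'text updates' push feed."""

    def __init__(self):
        self._consumers = []

    def subscribe(self, consumer):
        """Register a callable taking (title, wikitext)."""
        self._consumers.append(consumer)

    def publish(self, title, wikitext):
        # Every consumer receives the full text, so none has to pull it.
        for consumer in self._consumers:
            consumer(title, wikitext)


class RdfConsumer:
    """One consumer among many; it repeats template expansion on its own."""

    def __init__(self, expand_templates, extract_triples):
        self.expand_templates = expand_templates
        self.extract_triples = extract_triples
        self.triples = {}  # page title -> list of triples

    def __call__(self, title, wikitext):
        # This is the duplicated effort: the wiki already expanded the text.
        expanded = self.expand_templates(wikitext)
        self.triples[title] = self.extract_triples(title, expanded)
```

The RDF consumer knows nothing about the wiki beyond the text it is handed,
which is where the separation comes from; the price is the second round of
template expansion.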
Christopher