On 11/18/05, Max Voelkel max@xam.de wrote:
Dear Wikipedia-Wizards, we are a group of four researchers building an extension for Wikipedia, called the "Semantic Wikipedia", which is technically a MediaWiki extension. As a short summary, it allows users to assign types to links, which leads to the creation of semantic metadata (page name, link type, link target). In a similar fashion we allow for the annotation of attributes. If this project is deployed on Wikipedia, a huge amount of machine-processable data could be generated. We will provide an RDF export per page and a SPARQL query endpoint for the whole Semantic Wikipedia (SPARQL is like SQL, but more adapted to the data model of RDF, a building block of the semantic web).
Currently, we have two problems and would be glad if you could help us:
The tool stack in the semantic web community is mainly built on Java. For C, there is only one "triple store" (which is needed for efficient RDF storage & querying). The only candidate we have, "3store", is not very mature, but many Java stores are. In particular, the open-source system "Sesame" (openrdf.org) would be our choice for implementation. But, as far as I understand Wikipedia, Java is not open source enough, as there is no open source implementation of Java itself? Is this true or just a rumor?
Look at [[GCJ]] on enwiki. There are free implementations but they are not good enough for all applications.
Perhaps I'm hopelessly out of touch with the times, but why can't this information be stored in a normal SQL database? I'm sure a parser could be written to rewrite SPARQL queries into efficiently executing SQL queries.
On my on-and-off-again analysis copy of Wikipedia I use the hstore module in PGsql to store name=value pairs for every revision. If you were only talking about storing metadata for the current versions of articles, it would likely be quite efficient to carry a relation which contains (source, type, dest) tuples for each of your links. Even MySQL could do a pretty good job of this.
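To make the (source, type, dest) idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `link`, the index name, and the helpers `open_store`, `add_link` and `match` are made up for illustration; they are not part of MediaWiki or the proposed extension. `match` turns a single triple pattern (with None standing for an unbound variable) into a parameterized SQL query, which is the simplest case of the SPARQL-to-SQL rewriting mentioned above.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store with one relation of (source, type, dest) rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS link ("
        " source TEXT NOT NULL,"
        " type TEXT NOT NULL,"
        " dest TEXT NOT NULL,"
        " PRIMARY KEY (source, type, dest))"
    )
    # Second index so lookups by (type, dest) do not need a full scan.
    conn.execute("CREATE INDEX IF NOT EXISTS link_by_dest ON link (type, dest)")
    return conn


def add_link(conn, source, type_, dest):
    """Record one typed link; duplicates are silently ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO link (source, type, dest) VALUES (?, ?, ?)",
        (source, type_, dest),
    )


def match(conn, source=None, type_=None, dest=None):
    """Evaluate one triple pattern; None means 'any value' for that position."""
    clauses, params = [], []
    for column, value in (("source", source), ("type", type_), ("dest", dest)):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT source, type, dest FROM link"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY source, type, dest"
    return conn.execute(sql, params).fetchall()
```

For example, after `add_link(conn, "London", "located in", "England")`, calling `match(conn, type_="located in", dest="England")` returns every page annotated as located in England. A query with several patterns sharing a variable would become a self-join on this table.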
Syntax. We had to extend the syntax slightly to enable annotations of links and data values. Currently we have settled on
[[link type::link target|optional alternate label]]
Sample, on page "London": ... is in [[located in::England]] ...
Renders as: ... is in England ... (England = linked)
for relations, and for attributes:
[[attribute type:=data value with unit|optional alternate label]]
Sample, on page "London": ... rains on [[rain:=234 days/year]] ...
Renders as: ... rains on 234 days/year ... (nothing linked)
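As a rough illustration of how the two forms could be recognized, here is a small Python sketch based on a regular expression. It is not the extension's actual parser; the names `ANNOTATION`, `Annotation`, `parse_annotations` and `render_plain` are invented for this example, and it assumes that type names contain no colon and that annotations are not nested.

```python
import re
from typing import List, NamedTuple, Optional

# [[type::target|label]] for relations, [[type:=value|label]] for attributes.
# Assumption: the type part contains no ':' and no '|' or brackets.
ANNOTATION = re.compile(
    r"\[\[([^\[\]|:]+)(::|:=)([^\[\]|]+)(?:\|([^\[\]]*))?\]\]"
)


class Annotation(NamedTuple):
    page: str             # page the annotation appears on
    kind: str             # "relation" or "attribute"
    type: str             # link type or attribute type
    value: str            # link target or data value with unit
    label: Optional[str]  # optional alternate label, None if absent


def parse_annotations(page: str, wikitext: str) -> List[Annotation]:
    """Collect every relation and attribute annotation found in wikitext."""
    found = []
    for m in ANNOTATION.finditer(wikitext):
        name, sep, value, label = m.groups()
        kind = "relation" if sep == "::" else "attribute"
        found.append(Annotation(page, kind, name.strip(), value.strip(), label))
    return found


def render_plain(wikitext: str) -> str:
    """Replace each annotation by its label, or by its target/value if no label."""
    return ANNOTATION.sub(lambda m: m.group(4) or m.group(3).strip(), wikitext)
```

Applied to the two samples from page "London", `parse_annotations` yields one relation (located in, England) and one attribute (rain, 234 days/year). `render_plain` only produces the visible text; turning a relation target into an actual link is left out of the sketch.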
For a full explanation of why and what we are trying to do, you can also have a look at a paper we wrote for a conference: http://www.aifb.uni-karlsruhe.de/Publikationen/showPublikation_english?publ_...
There would appear to be some potential confusion with namespaces. It might be advisable to use a character which currently cannot appear in an internal link. It might also be advisable not to make them look like internal links. We've already overloaded [[]] quite a bit with things which are not quite the same as internal links, conceptually or syntactically (categories, images).
Can semantic links carry an additional attribute beyond their type, e.g. on [[Bill Clinton]]: [[us presidential succession::George W. Bush:=next]], or would it have to be [[next us president::George W. Bush]]?
If it's just the latter case, then all it is is a typed directed graph, and there is a wealth of fast algorithms available for searching and storing the relationships.
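To show what "typed directed graph" means in practice, here is a small in-memory sketch in Python. The class `TypedGraph` and its methods are hypothetical names for this example only. Edges are grouped by type, and `reachable` runs a breadth-first search that follows edges of one type only. The "England located in United Kingdom" edge in the usage note is added purely for illustration.

```python
from collections import defaultdict, deque


class TypedGraph:
    """Directed graph whose edges carry a type (label)."""

    def __init__(self):
        # edge type -> source node -> set of destination nodes
        self._out = defaultdict(lambda: defaultdict(set))

    def add(self, source, type_, dest):
        """Add one typed edge source --type_--> dest."""
        self._out[type_][source].add(dest)

    def reachable(self, start, type_):
        """Nodes reachable from start via edges of type_, in BFS order."""
        edges = self._out.get(type_, {})
        seen = {start}
        queue = deque([start])
        order = []
        while queue:
            node = queue.popleft()
            # sorted() only makes the visiting order deterministic
            for nxt in sorted(edges.get(node, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order
```

With edges ("London", "located in", "England") and ("England", "located in", "United Kingdom"), `reachable("London", "located in")` walks the chain transitively, while edges of other types such as "next us president" are ignored. The search visits each node and each edge of the chosen type at most once.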