On Thursday 11 January 2007 18:31, Aerik Sylvan wrote:
Markus Krötzsch wrote:
I did not follow this discussion, but it seems appropriate to point to the
Semantic MediaWiki extension, which computes
"implied categories"
like "American actors" on request (it can combine unions, intersections,
namespace membership, and further "semantic properties").
We will assist in any effort towards developing an efficient way of doing
this, since our current implementation is probably not fast enough for
large wikis.
Hi Markus. The Semantic MediaWiki extension is very cool, but I think the
main issue at this point is exactly what you said in your second paragraph:
an efficient way to do the data retrieval portion of this stuff
(specifically, for me, category intersections). There are a few very neat
extensions (Semantic MediaWiki, DPL, a home-brewed category intersections
special page I did for MediaWiki 1.4x), but they are not fast enough for a
large wiki. This is the problem I'm trying to solve (for category
intersections, anyway), and then we can hash out interfaces etc. I've got
a test script using a MySQL fulltext index that may be good enough, and if
it isn't, I'll do one using Lucene (the PHP version).
Maybe, though, it's appropriate to talk about what features category
intersections and Semantic MediaWiki share, and see if we can't find (data
retrieval) solutions for both. I'm not familiar with the backend of
Semantic MediaWiki at all, so I can't comment on that.
SMW's backend trivially extends the DB layout to add some tables for storing
all semantic information. This is fast enough for small wikis, and only for
those. Querying generates a lot of joins among a few large tables, quite
similar to the situation with category intersections.
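To illustrate the kind of join involved, here is a minimal sketch in Python
using an in-memory SQLite database. The table name `page_category`, its
columns, and the sample rows are hypothetical simplifications for
illustration only; this is not the actual MediaWiki or SMW schema.

```python
import sqlite3

# Hypothetical, simplified membership table: one row per (page, category).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_category (page TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO page_category VALUES (?, ?)",
    [
        ("Page1", "Americans"),
        ("Page1", "Actors"),
        ("Page2", "Americans"),
        ("Page3", "Actors"),
    ],
)


def intersect(conn, cat_a, cat_b):
    """Return the pages that belong to both categories, via one self-join."""
    rows = conn.execute(
        """
        SELECT a.page
        FROM page_category AS a
        JOIN page_category AS b ON a.page = b.page
        WHERE a.category = ? AND b.category = ?
        ORDER BY a.page
        """,
        (cat_a, cat_b),
    )
    return [page for (page,) in rows]


print(intersect(conn, "Americans", "Actors"))  # prints ['Page1']
```

With this approach, every further category in the intersection adds one more
self-join on the same table.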
SMW differs from the category intersection problem in that it also considers
other properties (in addition to "is element of category"). In general, it
stores data of the form
A has_property B with value C
i.e. triples. Our current storage model is a so-called "single table
approach": have (essentially) one large table with (essentially) three
columns A, B, and C. Another approach is to have one table for each B, with
two columns A and C. This generates smaller tables, but you can end up with
a large number of tables. There are hybrid approaches that are better. But it
seems that smart caching strategies and not-quite-realtime computation could
be more robust solutions to achieve practical scalability.
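The two layouts can be compared with a small sketch in Python and in-memory
SQLite. All table names, the `prop_` naming scheme, and the sample facts are
made up for illustration; SMW's real tables may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Made-up facts of the form (A, B, C): "A has_property B with value C".
facts = [
    ("CityA", "country", "CountryX"),
    ("CityB", "country", "CountryY"),
    ("CityA", "mayor", "PersonX"),
]

# Layout 1: a single table with the three columns A, B, C.
conn.execute("CREATE TABLE triples (a TEXT, b TEXT, c TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", facts)


def property_table(prop):
    """Hypothetical naming scheme for the per-property tables."""
    # The name becomes part of an SQL identifier, so validate it first.
    if not prop.isidentifier():
        raise ValueError(f"unsupported property name: {prop!r}")
    return f"prop_{prop}"


# Layout 2: one two-column table (A, C) for each property B.
for a, b, c in facts:
    table = property_table(b)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (a TEXT, c TEXT)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (a, c))


def subjects_single(conn, prop, value):
    """Look up all A with property B = value in the single-table layout."""
    rows = conn.execute(
        "SELECT a FROM triples WHERE b = ? AND c = ? ORDER BY a", (prop, value)
    )
    return [a for (a,) in rows]


def subjects_per_property(conn, prop, value):
    """Same lookup in the per-property layout; B selects the table."""
    rows = conn.execute(
        f"SELECT a FROM {property_table(prop)} WHERE c = ? ORDER BY a", (value,)
    )
    return [a for (a,) in rows]


print(subjects_single(conn, "country", "CountryX"))
print(subjects_per_property(conn, "country", "CountryX"))
```

Both lookups return the same rows. The difference is that the second layout
needs one table per distinct property, so the number of tables grows with the
number of properties in use.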
Considering text-indexing for category intersection, I do not see how this
could be used for SMW, since the property (B) is implicit. A typical
SMW-search would be: give me all As with a property B1 with unknown value X
that has a property B2 with value C, i.e. search for the pattern "A -B1->
* -B2-> C" (example: give me all cities which have a mayor that is a member
of the democratic party). Not an easy task.
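On a single triples table, such a pattern could be written as a self-join.
The following is only a sketch in Python with in-memory SQLite; the table
layout, property names, and sample rows are assumptions made for
illustration, not SMW's actual query code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (a TEXT, b TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Made-up sample data.
        ("CityA", "mayor", "PersonX"),
        ("CityB", "mayor", "PersonY"),
        ("PersonX", "member_of", "Democratic Party"),
        ("PersonY", "member_of", "Other Party"),
    ],
)


def path_query(conn, b1, b2, c):
    """Find all A matching the pattern  A -b1-> * -b2-> c."""
    rows = conn.execute(
        """
        SELECT DISTINCT t1.a
        FROM triples AS t1
        JOIN triples AS t2 ON t2.a = t1.c  -- the unknown value X links both
        WHERE t1.b = ? AND t2.b = ? AND t2.c = ?
        ORDER BY t1.a
        """,
        (b1, b2, c),
    )
    return [a for (a,) in rows]


print(path_query(conn, "mayor", "member_of", "Democratic Party"))  # prints ['CityA']
```

Each additional step in the pattern adds another join on the same table,
which is why such queries get expensive as the table grows.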
Anyway, if you have results for category intersections we would be interested
to hear about them. We can also provide our not-too-slow Wikipedia
test-server for large scale experiments.
Regards,
Markus
Best Regards,
Aerik
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
--
Markus Krötzsch
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
mak(a)aifb.uni-karlsruhe.de phone +49 (0)721 608 7362
www.aifb.uni-karlsruhe.de/WBS/ fax +49 (0)721 693 717