Re: [Wikitech-l] Category Intersections: "Proof of Concept

14 Jan 2007


      On Thursday 11 January 2007 18:31, Aerik Sylvan wrote:
...
Markus Krötzsch wrote:
I did not follow this discussion, but it seems appropriate to point to the
...
Semantic MediaWiki extension, which computes "implied categories"
like "American actors" on request (it can combine unions, intersections,
namespace membership, and further "semantic properties").
We will assist in any effort towards developing an efficient way of doing
this, since our current implementation is probably not fast enough for
large wikis.
Hi Markus.  The Semantic MediaWiki extension is very cool, but I think the
main issue at this point is exactly what you said in your second paragraph:
an efficient way to do the data retrieval portion of this stuff
(specifically, for me, category intersections).  There are a few very neat
extensions (Semantic MediaWiki, DPL, a home-brewed category intersections
special page I did for MediaWiki 1.4x) but they are not fast enough for a
large wiki.  This is the problem I'm trying to solve (for category
intersections, anyway), and then we can hash out interfaces etc.  I've got
a test script using a MySQL fulltext index that may be good enough, and if
it isn't, I'll do one using Lucene (the PHP version).
Maybe, though, it's appropriate to talk about what features category
intersections and Semantic MediaWiki share, and see if we can't find (data
retrieval) solutions for both.  I'm not familiar with the backend of
Semantic MediaWiki at all, so I can't comment on that.
SMW's backend trivially extends the DB layout to add some tables for storing 
all semantic information. This is fast enough for small wikis, and only for 
those. Querying generates a lot of joins among a few large tables, quite 
similar to the situation with category intersections.
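To make the join pattern concrete, here is a minimal sketch of a two-category intersection as a self-join. It assumes a simplified link table modelled on MediaWiki's categorylinks (only the columns cl_from and cl_to), uses an in-memory SQLite database instead of MySQL, and the sample rows and the helper name intersect are made up for illustration.

```python
import sqlite3

# Simplified stand-in for MediaWiki's categorylinks table (assumption:
# only the page id and the category name are kept).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [
        (1, "Americans"), (1, "Actors"),  # page 1 is in both categories
        (2, "Americans"),                 # page 2 only in one
        (3, "Actors"),                    # page 3 only in the other
    ],
)


def intersect(conn, cat_a, cat_b):
    """Return ids of pages that are members of both categories."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.cl_from
        FROM categorylinks AS a
        JOIN categorylinks AS b ON b.cl_from = a.cl_from
        WHERE a.cl_to = ? AND b.cl_to = ?
        ORDER BY a.cl_from
        """,
        (cat_a, cat_b),
    )
    return [row[0] for row in rows]


result = intersect(conn, "Americans", "Actors")
print(result)
```

Each additional category in the intersection adds one more self-join of the same large table, which is where the cost comes from on a big wiki.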
SMW differs from the category intersection problem in that it also considers
other properties (in addition to "is element of category"). In general, it
stores data of the form
A has_property B with value C
i.e. triples. Our current storage model is a so-called "single table
approach": have (essentially) one large table with (essentially) three
columns A, B, and C. Another approach is to have one table for each B, with
two columns A and C. This generates smaller tables, but you can end up with a
large number of tables. There are hybrid approaches that are better. But it
seems that smart caching strategies and a not-quite-realtime computation could
be more robust solutions to achieve practical scalability.
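The two layouts can be compared side by side in a small sketch. This is not SMW's actual schema: the table names (triples, prop_...), the column names, the helper property_table, and the sample facts are all assumptions chosen for illustration, and SQLite stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Illustrative facts of the form (A, B, C): "A has_property B with value C".
facts = [
    ("Berlin", "located_in", "Germany"),
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "located_in", "France"),
]

# Layout 1: a single table with three columns A, B, C.
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, value TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", facts)


def property_table(prop):
    """Hypothetical naming scheme: one table per property B."""
    if not prop.isidentifier():
        raise ValueError(f"unsafe property name: {prop!r}")
    return f"prop_{prop}"


# Layout 2: one two-column table (A, C) for each property B.
for subject, prop, value in facts:
    table = property_table(prop)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (subject TEXT, value TEXT)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (subject, value))

# The same question asked against each layout.
single = conn.execute(
    "SELECT subject FROM triples WHERE property = ? AND value = ?",
    ("located_in", "Germany"),
).fetchall()
split = conn.execute(
    "SELECT subject FROM prop_located_in WHERE value = ?", ("Germany",)
).fetchall()

# Number of per-property tables grows with the number of distinct properties.
table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master "
    "WHERE type = 'table' AND name LIKE 'prop!_%' ESCAPE '!'"
).fetchone()[0]
print(single, split, table_count)
```

In the second layout the property is encoded in the table name, so each query touches a smaller table, at the price of one table per distinct property.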
Considering text-indexing for category intersection, I do not see how this
could be used for SMW, since the property (B) is implicit. A typical
SMW-search would be: give me all As with a property B1 with unknown value X
that has a property B2 with value C, i.e. search for the pattern
"A -B1-> * -B2-> C"
(example: give me all cities which have a mayor that is a member of the
democratic party). Not an easy task.
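On a single triple table, the pattern "A -B1-> * -B2-> C" becomes a self-join in which the value of the first triple is the subject of the second. The sketch below assumes a table named triples with columns subject, property and value; those names, the helper chain_query, and the sample cities and mayors are invented for illustration and do not reflect SMW's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Made-up sample data.
        ("Springfield", "mayor", "Alice"),
        ("Shelbyville", "mayor", "Bob"),
        ("Alice", "member_of", "Democratic Party"),
        ("Bob", "member_of", "Other Party"),
    ],
)


def chain_query(conn, b1, b2, c):
    """Find all A such that A -b1-> X and X -b2-> c for some unknown X."""
    rows = conn.execute(
        """
        SELECT DISTINCT t1.subject
        FROM triples AS t1
        JOIN triples AS t2 ON t2.subject = t1.value  -- X links both triples
        WHERE t1.property = ? AND t2.property = ? AND t2.value = ?
        ORDER BY t1.subject
        """,
        (b1, b2, c),
    )
    return [row[0] for row in rows]


cities = chain_query(conn, "mayor", "member_of", "Democratic Party")
print(cities)
```

The unknown intermediate value X never appears as a search term, which is why a plain text index over the category names of a page does not obviously help here.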
Anyway, if you have results for category intersections, we would be
interested to hear about them. We can also provide our not-too-slow Wikipedia
test server for large-scale experiments.
Regards,
Markus
...
Best Regards,
Aerik
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
http://lists.wikimedia.org/mailman/listinfo/wikitech-l
-- 
Markus Krötzsch
Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe
mak@aifb.uni-karlsruhe.de        phone +49 (0)721 608 7362
www.aifb.uni-karlsruhe.de/WBS/     fax +49 (0)721 693  717
