I have set up a highly experimental (read: pre-pre-alpha) tool on the toolserver to search for combinations of key/value pairs in templates. Like TemplateTiger, but for the live site (one day...)
URL: http://toolserver.org/~magnus/knights_template.php (as in "Knights Templar", sorry, couldn't resist)
What does it do (or should do): Want to find actors on Wikipedia that have a book by publisher McFarland cited as a source? No problem: http://tinyurl.com/chadva (will take a few seconds)
All templates used on [[Roy Scheider]], with key/value pairs: http://tinyurl.com/dkpbh7
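Under the hood it is basically one big table of (page, template, key, value) rows. Roughly what the McFarland query boils down to, as a sketch in Python/SQLite (the real thing is PHP against MySQL on the toolserver, and the table, columns and sample rows below are made up for illustration):

import sqlite3

# Illustrative schema only -- not the real toolserver tables.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE template_kv (
    page_title    TEXT,   -- article the template appears on
    template_name TEXT,   -- e.g. 'Infobox actor' or 'cite book'
    kv_key        TEXT,   -- parameter name, e.g. 'publisher'
    kv_value      TEXT    -- parameter value
)""")
con.executemany("INSERT INTO template_kv VALUES (?, ?, ?, ?)", [
    ("Roy Scheider", "Infobox actor", "birth_date", "1932-11-10"),
    ("Roy Scheider", "cite book", "publisher", "McFarland & Company"),
])

# "Actors with a McFarland book cited as a source": self-join on the page.
rows = con.execute("""
    SELECT DISTINCT a.page_title
    FROM template_kv a
    JOIN template_kv b ON a.page_title = b.page_title
    WHERE a.template_name = 'Infobox actor'
      AND b.template_name = 'cite book'
      AND b.kv_key = 'publisher'
      AND b.kv_value LIKE '%McFarland%'
""").fetchall()
print(rows)   # [('Roy Scheider',)]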
Things that need fixing (like, everything):
- Slooow
- Only en.wikipedia
- Only 16K pages indexed at the moment (some Recent Changes snapshots)
- Only templates used directly in the article are indexed
- Needs continuous update (can run on cron, but disabled because text retrieval and DB storage too slow to keep up with RC)
- Needs better query interface
- Might cause database size problems
All in all, it would be much better directly integrated into MediaWiki (no need for text retrieval/parsing, no bulk updates). But I've been saying that for years, at least this is a first attempt.
Cheers, Magnus
I see "Live", and then I see "Search", and then I think this has to do with Microsoft...
Magnus Manske wrote:
All in all, it would be much better directly integrated into MediaWiki (no need for text retrieval/parsing, no bulk updates). But I've been saying that for years, at least this is a first attempt.
Actually, this is part of my grand plan for world domination. I'm pushing for it behind the scenes... I have a few ideas on how it may be done nicely.
I think the main problem is that semantic mediawiki looks like the obvious answer. But i doubt it is. I only want a small subset of that functionality on wikipedia. Maybe SMW can be chopped up to fit that, but i'm personally more inclined to extend the RDF extension to store triples in the DB.
-- daniel
On Tue, Apr 21, 2009 at 9:25 AM, Daniel Kinzler daniel@brightbyte.de wrote:
Magnus Manske wrote:
All in all, it would be much better directly integrated into MediaWiki (no need for text retrieval/parsing, no bulk updates). But I've been saying that for years, at least this is a first attempt.
Actually, this is part of my grand plan for world domination. I'm pushing for it behind the scenes... I have a few ideas on how it may be done nicely.
Excellent! I'll hold further development on the tool for now.
I think the main problem is that semantic mediawiki looks like the obvious answer. But i doubt it is. I only want a small subset of that functionality on wikipedia. Maybe SMW can be chopped up to fit that, but i'm personally more inclined to extend the RDF extension to store triples in the DB.
I agree about Semantic MediaWiki, which is a different beast (and might one day be used on Wikipedia).
The question seems to be scalability. Extrapolating from my sample data set, just the key/value pairs of templates directly included in articles would come to over 200 million rows for en.wikipedia at the moment. A MediaWiki-internal solution would want to store templates included in templates as well, which can be a lot for complicated meta-templates. I think a billion rows for the current English Wikipedia is not too far-fetched in that model. The table would be both constantly updated (potentially hundreds of writes for a single article update) and heavily searched (with LIKE "%stuff%", no less).
Would the RDF extension be up to that?
Cheers, Magnus
Magnus Manske wrote:
On Tue, Apr 21, 2009 at 9:25 AM, Daniel Kinzler daniel@brightbyte.de wrote:
Magnus Manske wrote:
All in all, it would be much better directly integrated into MediaWiki (no need for text retrieval/parsing, no bulk updates). But I've been saying that for years, at least this is a first attempt.
Actually, this is part of my grand plan for world domination. I'm pushing for it behind the scenes... I have a few ideas on how it may be done nicely.
Excellent! I'll hold further development on the tool for now.
Well, keep playing if you like; it'll be a while until it goes live. Just don't invest all that much into an interim solution.
I think the main problem is that semantic mediawiki looks like the obvious answer. But i doubt it is. I only want a small subset of that functionality on wikipedia. Maybe SMW can be chopped up to fit that, but i'm personally more inclined to extend the RDF extension to store triples in the DB.
I agree about Semantic MediaWiki, which is a different beast (and might one day be used on Wikipedia).
That's really the question. Should we work *now* on making it usable for wikipedia, or should we focus on something simpler?
The question seems to be scalability. Extrapolating from my sample data set, just the key/value pairs of templates directly included in articles would come to over 200 million rows for en.wikipedia at the moment. A MediaWiki-internal solution would want to store templates included in templates as well, which can be a lot for complicated meta-templates. I think a billion rows for the current English Wikipedia is not too far-fetched in that model. The table would be both constantly updated (potentially hundreds of writes for a single article update) and heavily searched (with LIKE "%stuff%", no less).
Would the RDF extension be up to that?
It would in a way: it just wouldn't store all parameters. It would store only things explicitly defined to be RDF values. That would greatly reduce the number of parameters to store, since all the templates used for maintenance, formatting, styling and navigation can be omitted. It would be used nearly exclusively for infobox-type templates, image meta-info, and cross-links like the PND template. Or at least, that's the idea. It also does away with problems caused by the various names that parameters with the same meaning may have in different templates (and different wikis).
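To make the storage side concrete: such a triple store is essentially one narrow table in the DB. A minimal sketch in Python/SQLite (this is not the actual schema the extension would use, and the property name is invented):

import sqlite3

# Toy triple table -- NOT the real RDF extension schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rdf_triple (subject TEXT, predicate TEXT, object TEXT)")

# Only values explicitly marked as RDF in a template definition get stored;
# maintenance/navigation templates never write anything here.
con.execute("INSERT INTO rdf_triple VALUES ('Roy_Scheider', 'ex:birthDate', '1932-11-10')")

# Lookups go by subject/predicate instead of LIKE '%stuff%' over raw key/value pairs.
print(con.execute(
    "SELECT subject, object FROM rdf_triple WHERE predicate = 'ex:birthDate'"
).fetchall())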
-- daniel
On Tue, Apr 21, 2009 at 10:13 AM, Daniel Kinzler daniel@brightbyte.de wrote:
Magnus Manske wrote:
I agree about Semantic MediaWiki, which is a different beast (and might one day be used on Wikipedia).
That's really the question. Should we work *now* on making it usable for wikipedia, or should we focus on something simpler?
IMHO we should try to harvest the data that is already in Wikipedia first. Semantic Wikipedia, technical issues aside, relies heavily on users learning a new syntax, which is a community (read: political;-) decision. And it will be fought about much harder and longer than the license question...
The question seems to be scalability. Extrapolating from my sample data set, just the key/value pairs of templates directly included in articles would come to over 200 million rows for en.wikipedia at the moment. A MediaWiki-internal solution would want to store templates included in templates as well, which can be a lot for complicated meta-templates. I think a billion rows for the current English Wikipedia is not too far-fetched in that model. The table would be both constantly updated (potentially hundreds of writes for a single article update) and heavily searched (with LIKE "%stuff%", no less).
Would the RDF extension be up to that?
It would in a way: it just wouldn't store all parameters. It would store only things explicitly defined to be RDF values. That would greatly reduce the number of parameters to store, since all the templates used for maintenance, formatting, styling and navigation can be omitted. It would be used nearly exclusively for infobox-type templates, image meta-info, and cross-links like the PND template. Or at least, that's the idea. It also does away with problems caused by the various names that parameters with the same meaning may have in different templates (and different wikis).
Nice! I was thinking along the lines of a template whitelist/blacklist, but yours would be much more efficient. And it would hide most of the technical "ugliness" in the templates.
Magnus Manske wrote:
On Tue, Apr 21, 2009 at 10:13 AM, Daniel Kinzler daniel@brightbyte.de wrote:
Magnus Manske wrote:
I agree about Semantic MediaWiki, which is a different beast (and might one day be used on Wikipedia).
That's really the question. Should we work *now* on making it usable for wikipedia, or should we focus on something simpler?
IMHO we should try to harvest the data that is already in Wikipedia first. Semantic Wikipedia, technical issues aside, relies heavily on users learning a new syntax, which is a community (read: political;-) decision. And it will be fought about much harder and longer than the license question...
Well, semantic links would be used in template definitions, just like the RDF extension would be. That way, we can make use of all the data in template parameters directly. This applies to both.
-- daniel
IMHO we should try to harvest the data that is already in Wikipedia first.
That's what DBpedia does. Currently, we concentrate on infoboxes. We extract data from infobox templates and store the data as RDF. Queries against the data are usually done with SPARQL, e.g. at http://dbpedia.org/sparql
It also does away with problems caused by the various names that parameters with the same meaning may have in different templates (and different wikis).
DBpedia also struggled with these inconsistencies. The best solution seems to be a hand-made mapping between template properties and RDF properties. See
http://blog.georgikobilarov.com/2008/10/dbpedia-rethinking-wikipedia-infobox...
In the future, we would like to publish the mapping and let it grow wiki-style.
A few numbers: from the infoboxes, we extracted 7 million RDF triples with the 'hand-made' approach and 32 million triples with a generic approach.
Bye, Christopher
PS: Here's a toy example: kid actors (younger than 18) in Spielberg movies:
select ?film, ?release, ?actor, ?birth where {
  ?film <http://dbpedia.org/ontology/director> <http://dbpedia.org/resource/Steven_Spielberg> .
  ?film <http://dbpedia.org/ontology/starring> ?actor .
  ?actor <http://dbpedia.org/ontology/birthdate> ?birth .
  ?film <http://dbpedia.org/ontology/releaseDate> ?release .
  FILTER (bif:datediff('year', ?birth, ?release) < 18)
}
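If you'd rather script against the endpoint than use the web form, something along these lines should work (a rough sketch with a simpler query; the result-format parameter is what Virtuoso accepts as far as I know, so adjust if needed):

import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"

# A simpler query than the one above, just to show the plumbing.
query = """
SELECT ?film WHERE {
  ?film <http://dbpedia.org/ontology/director>
        <http://dbpedia.org/resource/Steven_Spielberg> .
} LIMIT 10
"""

url = ENDPOINT + "?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"})
with urllib.request.urlopen(url) as response:
    results = json.loads(response.read().decode("utf-8"))

for binding in results["results"]["bindings"]:
    print(binding["film"]["value"])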
On Tue, Apr 21, 2009 at 10:25, Daniel Kinzler daniel@brightbyte.de wrote:
Magnus Manske wrote:
All in all, it would be much better directly integrated into MediaWiki (no need for text retrieval/parsing, no bulk updates). But I've been saying that for years, at least this is a first attempt.
Actually, this is part of my grand plan for world domination. I'm pushing for it behind the scenes... I have a few ideas on how it may be done nicely.
I think the main problem is that semantic mediawiki looks like the obvious answer. But i doubt it is. I only want a small subset of that functionality on wikipedia. Maybe SMW can be chopped up to fit that, but i'm personally more inclined to extend the RDF extension to store triples in the DB.
I'm pretty new to MediaWiki and I'm not sure if I understand this correctly... Here's my attempt at spelling it out in a bit more detail:
When a user edits a page and sends the new text to the server, the server / the RDF extension parses the text, extracts the desired data and saves it in a RDF store.
I hope I got that about right - please correct me if not!
Now when I think about the pros and cons of having this process run integrated in MediaWiki or on a different server, a few questions come up... again, I'm new to MediaWiki, so these may be newbie questions... :-)
How much parsing does MediaWiki currently do when it stores new text for an article? Are templates expanded / transcluded?
How are updates distributed? Do subscribers regularly poll the server for recent changes? Or is there some kind of store-and-forward / publish-subscribe?
Bye, Christopher
On Tue, Apr 21, 2009 at 6:15 PM, Christopher Sahnwaldt jcsahnwaldt@gmail.com wrote:
On Tue, Apr 21, 2009 at 10:25, Daniel Kinzler daniel@brightbyte.de wrote:
Magnus Manske wrote:
All in all, it would be much better directly integrated into MediaWiki (no need for text retrieval/parsing, no bulk updates). But I've been saying that for years, at least this is a first attempt.
Actually, this is part of my grand plan for world domination. I'm pushing for it behind the scenes... I have a few ideas on how it may be done nicely.
I think the main problem is that semantic mediawiki looks like the obvious answer. But i doubt it is. I only want a small subset of that functionality on wikipedia. Maybe SMW can be chopped up to fit that, but i'm personally more inclined to extend the RDF extension to store triples in the DB.
I'm pretty new to MediaWiki and I'm not sure if I understand this correctly... Here's my attempt at spelling it out in a bit more detail:
When a user edits a page and sends the new text to the server, the server / the RDF extension parses the text, extracts the desired data and saves it in a RDF store.
I hope I got that about right - please correct me if not!
Now when I think about the pros and cons of having this process run integrated in MediaWiki or on a different server, a few questions come up... again, I'm new to MediaWiki, so these may be newbie questions... :-)
How much parsing does MediaWiki currently do when it stores new text for an article? Are templates expanded / transcluded?
Yes. The text is parsed to generate HTML, which is then cached, along with some metadata like categories, which is stored in the respective tables.
How are updates distributed? Do subscribers regularly poll the server for recent changes? Or is there some kind of store-and-forward / publish-subscribe?
There is an RSS/Atom feed, but that's it in terms of "pushing". There's an extensive API, though:
http://en.wikipedia.org/w/api.php
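For example, a cron-style updater for a tool like mine only needs two kinds of requests against that API: poll the recent changes, then fetch the wikitext of each changed page for template parsing. A rough sketch (the parameter choices are just one way of doing it):

import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def api_get(**params):
    # Minimal polling client; a real updater would also handle continuation,
    # rate limits and errors.
    params["format"] = "json"
    url = API + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": "template-index-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Step 1: the latest changes in the article namespace.
rc = api_get(action="query", list="recentchanges",
             rcnamespace=0, rclimit=5, rcprop="title|ids|timestamp")
titles = [c["title"] for c in rc["query"]["recentchanges"]]

# Step 2: current wikitext of one changed page, ready for template parsing.
page = api_get(action="query", prop="revisions", rvprop="content", titles=titles[0])
print(list(page["query"]["pages"].values())[0]["revisions"][0]["*"])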
Cheers, Magnus
P.S.: DBpedia does great work, it would just be so much better to have it all live and in one place...
Christopher Sahnwaldt wrote:
I'm pretty new to MediaWiki and I'm not sure if I understand this correctly... Here's my attempt at spelling it out in a bit more detail:
When a user edits a page and sends the new text to the server, the server / the RDF extension parses the text, extracts the desired data and saves it in a RDF store.
I hope I got that about right - please correct me if not!
More or less - the parser parses the text, and hands the bits that are RDF (Turtle) to the RDF extension for analysis. It analyzes the statements and would save them to the database (this is not yet implemented).
Now when I think about the pros and cons of having this process run integrated in MediaWiki or on a different server, a few questions come up... again, I'm new to MediaWiki, so these may be newbie questions... :-)
How much parsing does MediaWiki currently do when it stores new text for an article? Are templates expanded / transcluded?
There is a preprocessor that expands all templates recursively. After that, the real "parser" (read: munger) is invoked to turn wiki text into HTML.
In the case of a "semantified" infobox, the substitution process would generate RDF/Turtle statements using the template parameters. These would in turn be handed to the RDF extension, which would write the resulting triples to the database.
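As a toy illustration of that step (the parameter-to-property mapping is invented, and this is not the extension's actual interface), the expanded infobox would emit Turtle built from its parameters, roughly like this:

# Toy sketch of "infobox parameters -> Turtle statements"; the mapping of
# parameter names to properties is invented and would really live in the
# template definition itself.
def infobox_to_turtle(page_title, params):
    mapping = {"birth_date": "ex:birthDate", "occupation": "ex:occupation"}
    subject = "<" + page_title.replace(" ", "_") + ">"
    lines = []
    for key, prop in mapping.items():
        if key in params:
            lines.append('%s %s "%s" .' % (subject, prop, params[key]))
    return "\n".join(lines)

print(infobox_to_turtle("Roy Scheider",
                        {"name": "Roy Scheider", "birth_date": "1932-11-10"}))
# -> <Roy_Scheider> ex:birthDate "1932-11-10" .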
How are updates distributed? Do subscribers regularly poll the server for recent changes? Or is there some kind of store-and-forward / publish-subscribe?
There is the RSS/Atom feed (human readable, not easy to parse), and an OAI-PMH interface ("live update feed"). There's also the web API for polling data in a machine readable form, and there's the RC ("recent changes") channel on IRC (human readable, can't be parsed reliably). True XMPP-based pubsub is being worked on, see http://brightbyte.de/page/RecentChanges_via_Jabber.
-- daniel
On Tue, Apr 21, 2009 at 21:15, Daniel Kinzler daniel@brightbyte.de wrote:
More or less - the parser parses the text, and hands the bits that are RDF (Turtle) to the RDF extension for analysis. It analyzes the statements and would save them to the database (this is not yet implemented).
There is a preprocessor that expands all templates recursively. After that, the real "parser" (read: munger) is invoked to turn wiki text into HTML.
In the case of a "semantified" infobox, the substitution process would generate RDF/Turtle statements using the template parameters. These would in turn be handed to the RDF extension, which would write the resulting triples to the database.
Thanks! My picture of the process is becoming clearer... :-)
To reiterate: Template definitions would be extended to generate not just Wikitext aimed at the HTML generator, but also stuff that is processed by the RDF generator but ignored by the HTML generator (or at least by the browser).
Maybe it would sometimes be better for the RDF generator to have access to the unexpanded templates?
Property values contain all kinds of stuff, and DBpedia experience shows that one often needs specialized parsers to extract only the desired info. One way to distinguish between desired and undesired info is to have some metadata about the targets of wikilinks. The RDF extension would have to be quite sophisticated...
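For example, a single infobox value may mix wikilinks, dates and free text, and each piece needs its own little parser. A toy sketch of such value parsers (DBpedia's real extractors are considerably more involved):

import re

# Two tiny "value parsers" of the kind described above.
WIKILINK = re.compile(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]")   # link target only
ISO_DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

value = "[[Steven Spielberg|Spielberg]] (born 1946-12-18), director"

print(WIKILINK.findall(value))   # ['Steven Spielberg']
print(ISO_DATE.findall(value))   # ['1946-12-18']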
How are updates distributed? Do subscribers regularly poll the server for recent changes? Or is there some kind of store-and-forward / publish-subscribe?
There is the RSS/Atom feed (human readable, not easy to parse), and an OAI-PMH interface ("live update feed"). There's also the web API for polling data in a machine readable form, and there's the RC ("recent changes") channel on IRC (human readable, can't be parsed reliably). True XMPP-based pubsub is being worked on, see http://brightbyte.de/page/RecentChanges_via_Jabber.
Looks great! But if I understand this correctly, tools that need the whole article text would still have to pull it from the Wikipedia servers. Pushing the whole text might improve scalability. A hierarchical structure (like a content distribution network) would be cool...
An RDF extension could simply be one of these 'text update' consumers. Performance-wise, that would duplicate some of the effort of expanding the templates etc., but distribution and separation would come "for free".
Christopher
Hi all,
Magnus wrote: "...Like TemplateTiger..." Thanks for this honor. :-)
If you don't know this project, see here: http://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Vorlagenauswertung/en
Soon we want to support more languages: the same languages as in http://de.wikipedia.org/wiki/Benutzer:Stefan_K%C3%BChn/Check_Wikipedia The data for TemplateTiger will be extracted from the monthly dumps with the script from CheckWikipedia.
If you want to download and play with the data of TemplateTiger, see here: http://toolserver.org/~sk/
Best regards, Stefan