I'm interested in helping out with the development of MediaWiki. I think it's very good, but one thing that seems to be missing is a good system for metadata and cross-indexing information.
That might not be the best way to put it. As I see it, an article would include more than just its text. It would have a summary (possibly present on the page) and also metadata (like the classification of an organism). For example, on an article for a family of turtles, say Chelydridae, you might want to list all of the genera. Instead of searching and hoping you have found them all, then copying and pasting or writing your own summaries, you could just do something like
[[wikipedia searchForArticlesWithTags:genus,family=Chelydridae] listData:scientificname,commonname,conservationstatus]
This would then automatically fetch all of those genera, with the information specified, and would update whenever anything changed or new ones were added. You could also automatically pull in summaries, which are quite common, with something like [someArticle getSummary].
Something like this could also keep data in sync. I've seen several instances of conflicting information between an article and a small summary of it in another article.
Is there something like this in progress? I'd like to help, but I'd like to avoid duplicating effort on something someone is already working on. I know how to program, and know a small amount of PHP.
On 1/21/07, Henry Skelton dimensiondude.oss@gmail.com wrote:
The major issue here tends to be optimization. Semantic MediaWiki more or less has what you're looking for, but as I understand it, it scales far too poorly to be used for a site like Wikipedia. Category intersection is a more specific, much less ambitious thing to do, and there appears to be some progress in that direction right now, although it's not necessarily ready for prime time.
But as for the overall idea of what you posted, well, that strikes me as basically allowing anyone to run arbitrary SELECT queries. You just can't do that with a database of Wikipedia's size.
On 1/21/07, Simetrical Simetrical+wikitech@gmail.com wrote:
I once posted the idea (which, of course, was ignored;-) to store the names and values of variables passed to templates from articles in a SQL table. If you write {{xyz|a=1|b=2}} in article BLA and save, it would store

BLA | xyz | a | 1
BLA | xyz | b | 2

in said table. Applied to {{Persondata}} [1], you could search for a specific birth date, or for "%January%1980%" to find people born in January 1980, which you cannot do with the current category system, even with intersections, AFAIK.
Given the amount of data we put in navboxes via templates, this is a vast repository of unharvested metadata, IMHO.
Magnus
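A minimal sketch, in SQL, of the table Magnus describes (table and column names are illustrative, keyed by page and template names rather than IDs for readability, and the {{Persondata}} parameter name is assumed):

CREATE TABLE template_values (
  tv_page     VARCHAR(255) NOT NULL,  -- article, e.g. 'BLA'
  tv_template VARCHAR(255) NOT NULL,  -- template, e.g. 'xyz'
  tv_key      VARCHAR(255) NOT NULL,  -- parameter name, e.g. 'a'
  tv_value    VARCHAR(255) NOT NULL,  -- parameter value, e.g. '1'
  KEY (tv_template, tv_key)
);

-- Saving {{xyz|a=1|b=2}} in article BLA would add:
INSERT INTO template_values VALUES ('BLA', 'xyz', 'a', '1'), ('BLA', 'xyz', 'b', '2');

-- People born in January 1980 via {{Persondata}} (parameter name assumed):
SELECT tv_page FROM template_values
WHERE tv_template = 'Persondata' AND tv_key = 'DATE OF BIRTH'
  AND tv_value LIKE '%January%1980%';

A real implementation would presumably key on page and template IDs and rebuild the rows on save, along the lines discussed later in the thread.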
On 22/01/07, Magnus Manske magnusmanske@googlemail.com wrote:
I once posted the idea (which, of course, was ignored;-) to store the names and values of variables passed to templates from articles in a SQL table.
I had something along the same lines checked into an experimental branch; the user would insert metadata tags, and these would be dealt with in a similar manner to link updates:
[[Metadata:People|birth=1980]] etc.
It works on a "subject", "name" and "value" triplet concept - subjects group similar pieces of data, the name and value are self-evident. This was adapted from an idea Zocky had on IRC, which I hastily implemented.
It's in the repo, but incomplete - likely missing the upgraders and table definitions, and there's no interface for directly manipulating data outside of pages, nor any way of querying it at present.
Rob Church
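A rough sketch of a triplet table along the lines Rob describes (hypothetical names, not the actual schema from the experimental branch):

-- Subject/name/value triplets, rebuilt on save much like the link tables.
CREATE TABLE page_metadata (
  pm_page    INT UNSIGNED NOT NULL,   -- ID of the page carrying the tag
  pm_subject VARCHAR(255) NOT NULL,   -- e.g. 'People'
  pm_name    VARCHAR(255) NOT NULL,   -- e.g. 'birth'
  pm_value   VARCHAR(255) NOT NULL,   -- e.g. '1980'
  KEY (pm_subject, pm_name, pm_value)
);

-- [[Metadata:People|birth=1980]] on page 123 would then become:
INSERT INTO page_metadata VALUES (123, 'People', 'birth', '1980');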
On 1/22/07, Rob Church robchur@gmail.com wrote:
Yes, but that would require users to go over our X million articles and retag them. Granted, voluntary user slave labor is cheap ;-) but I wanted to use the already existing data from template inclusions.
It could even be limited to "first-level inclusion" - only templates that are included directly from the page. Templates in templates would only have variants of the originally passed variables, so that would only waste DB space. Also, templates without variables can be ignored - these can be checked through templatelinks just as well.
Magnus
Magnus,
Yes, but that would require users to go over our X million articles and retag them. Granted, voluntary user slave labor is cheap ;-) but I wanted to use the already existing data from template inclusions.
The concept (and code) for parsing the template information was written back at Frankfurt's Wikimania. The problem - how to efficiently index and query that afterwards.
On 22/01/07, Domas Mituzas midom.lists@gmail.com wrote:
The concept (and code) for parsing the template information was written back at Frankfurt's Wikimania. The problem - how to efficiently index and query that afterwards.
I think we can only really answer that question if we actually plan the thing properly, and to do that, we need to know
a. what we mean by "metadata"
b. how we would use it

It's all well and good implementing huge catch-all solutions like Wikidata and Semantic MediaWiki, but they're pretty much a sledgehammer to crack a nut if we haven't identified example use cases for this data. Obviously, what we use it for will lead into how our users access it, and from there, the all-important queries I know Domas loves to optimise to hell and back.
Rob Church
Rob Church wrote:
Hoi,
The functionality of "Wikidata" is being expressed better and better in OmegaWiki. Other example use cases were identified as long ago as May 2005. Obviously, the functionality that is being created is what it will be used for. It is currently feasible to develop the functionality to use OmegaWiki for tagging the pictures in Commons. And yes, Domas is always welcome to optimise queries.
Thanks, GerardM
http://omegawiki.org http://meta.wikimedia.org/wiki/Using_Ultimate_Wiktionary_for_Commons
On Mon, 2007-22-01 at 15:38 +0000, Rob Church wrote:
I'll go ahead and plug the existing RDF extension:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/RDF/
It works pretty darn well, plugs into existing MW without a patch, and supports an incremental approach to metadata implementation.
I've had it in production on Wikitravel for more than a year; we use it for applications as diverse as mapping, geographical hierarchy, and article status.
The system works with a custom <rdf> tag, in which you can put Turtle RDF (a very simple RDF syntax). So I can say that Evan lives in Montreal by writing:
<rdf> person:Evan :livesIn place:Montreal . </rdf>
...or something similar. It works fine with the templating system, so you could make a template Template:LivesIn with this content:
<rdf> person:{{{1}}} :livesIn place:{{{2}}} . </rdf>
...and then use it like {{LivesIn|Evan|Montreal}}.
And, yes, I did have to be tricky in order to get template arguments to expand within a custom tag.
-Evan
________________________________________________________________________
Evan Prodromou
evan@prodromou.name
http://evan.prodromou.name/
On 22/01/07, Evan Prodromou evan@prodromou.name wrote:
I'll go ahead and plug the existing RDF extension:
Plug away! I never looked at it before...
Rob Church
And, yes, I did have to be tricky in order to get template arguments to expand within a custom tag.
I was just trying to figure out how to do this; do you have the process documented anywhere? I looked through your code for a bit, but there is quite a lot of code there...
Thanks,
Ryan Lane
On 1/22/07, Domas Mituzas midom.lists@gmail.com wrote:
The concept (and code) for parsing the template information was written back at Frankfurt's Wikimania. The problem - how to efficiently index and query that afterwards.
Really? Who did that? Wasn't me, right? (creepy thoughts of that disease whose name I can't remember - something starting with A... ;-)
IMHO we can safely limit this to existing article => existing template, so the template_variable table could be:

tv_article_id int ;      // Article ID
tv_template_id int ;     // Template ID
tv_key varchar(255) ;    // Alternatively int, referring to a secondary table with unique words
tv_value varchar(255) ;  // Indexes better than MEDIUMBLOB, and values longer than 250 bytes are too long anyway ;-)
As queries would usually be like

WHERE tv_template_id = 123 AND tv_key = "year" AND tv_value LIKE "%1984%"

tv_article_id doesn't necessarily have to be indexed.

To save indices, we could also extend tv_key to be "TEMPLATE|KEY", as "|" is not allowed in either article or variable titles. We could then skip indexing tv_template_id as well and go like

WHERE tv_key = "Persondata|year" AND tv_value LIKE "%1984%"

Either way, the LIKE clause would only be invoked a rather limited number of times, as most of the work would be done by tv_key = "something", which is indexed.
My 2c.
Magnus
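To make the two indexing strategies Magnus outlines concrete, here is a sketch against the schema above (index names are made up):

-- Variant 1: index template and key separately; the leading-wildcard LIKE
-- is only evaluated against the rows the indexed equality match leaves over.
ALTER TABLE template_variable ADD KEY tv_tpl_key (tv_template_id, tv_key);
SELECT tv_article_id FROM template_variable
WHERE tv_template_id = 123 AND tv_key = 'year' AND tv_value LIKE '%1984%';

-- Variant 2: fold the template name into the key as 'TEMPLATE|KEY',
-- so a single indexed column does the narrowing.
ALTER TABLE template_variable ADD KEY tv_key_only (tv_key);
SELECT tv_article_id FROM template_variable
WHERE tv_key = 'Persondata|year' AND tv_value LIKE '%1984%';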
I'm thinking of starting work on a box content editor: it would have an edit link that opens a forms-based editor, and the content of the form fields would be stored in a parallel MySQL database. The idea is not at all fleshed out, but I'm wondering if it fits with this general topic... and, if so, whether people here have thoughts about it.
I'm planning to make this work for specific content boxes in a wiki I'm running, where we'd like structured content from users as well as freetext. The basic idea so far is to delimit the boxes of interest with some kind of tag, use that to pull the box in question out of the page text using a regex, and replace it with the updated content generated by a forms page. Has this already been done?
A potential side benefit is to make tables and boxes easier to populate for new users who are intimidated by the wiki markup in those areas, which tends to be a bit more complex than elsewhere on the pages.
Jim Hu
=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054
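One way the parallel storage Jim describes might look, as a very rough sketch (entirely hypothetical names; the tag/regex handling and the form itself aren't shown):

CREATE TABLE box_content (
  bc_page  VARCHAR(255) NOT NULL,  -- wiki page the box lives on
  bc_box   VARCHAR(64)  NOT NULL,  -- identifier taken from the delimiting tag
  bc_field VARCHAR(64)  NOT NULL,  -- form field name
  bc_value TEXT,                   -- value submitted through the form
  PRIMARY KEY (bc_page, bc_box, bc_field)
);

The form handler would update these rows and then regenerate the delimited box in the page text from them.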
"Magnus Manske" magnusmanske@googlemail.com wrote in message news:fab0ecb70701220032p2a20b5fcw86d18c916da7c5eb@mail.gmail.com...
I have been working on a WikiDB extension that would work as a direct substitution for infoboxes (i.e. it would require the page to be edited, but it is the kind of edit that could be done by a bot).
The result would be visually identical, but it additionally adds the data to a named table that can then be queried.
See http://www.kennel17.co.uk/testwiki/WikiDB for details. I would appreciate some feedback!
- Mark Clements (HappyDog)