Infoboxes in Wikipedia often contain information which is quite useful outside Wikipedia but can be surprisingly difficult to data-mine.
I would like to find all Wikipedia pages that use Template:Infobox_Language and parse the parameters iso3 and fam1...fam15
But my attempts to find such pages using either the Toolserver's Wikipedia database or the Mediawiki API have not been fruitful. In particular, SQL queries on the templatelinks table are intractably slow. Why are there no keys on tl_from or tl_title?
Andrew Dunbar (hippietrail)
-- http://wiktionarydev.leuksman.com http://linguaphile.sf.net
particular, SQL queries on the templatelinks table are intractably slow. Why are there no keys on tl_from or tl_title?
How are you planning to get the template parameters? Have I missed a recent schema change? I'd be interested in following your progress. I'm not extracting infobox data, but parameters of the coordinate template. Maybe a similar approach could be interesting for you:
The coordinate template stuffs all its parameters into an external link (which can easily be obtained from the externallinks table). Creating dummy links containing parameters for some infoboxes could be one way of making the data available for automatic extraction (yes, it's a hack, but I'd prefer better suggestions over flames).
The link could actually be made useful, it could point to a query page for the data in these infoboxes.
[[User:Dschwen]]
2009/10/22 Daniel Schwen lists@schwen.de:
particular, SQL queries on the templatelinks table are intractably slow. Why are there no keys on tl_from or tl_title?
How are you planning to get the template parameters? Have I missed a recent schema change?
I've been trying to parse the wikitext of section 0 with a minimal parser that uses just the tokens {{, }}, {{{ and }}}, but it already has problems when it sees }}}}.
I'd be interested in following your progress. I'm not extracting infobox data, but parameters of the coordinate template. Maybe a similar approach could be interesting for you:
The coordinate template stuffs all its parameters into an external link (which can easily be obtained from the externallinks table). Creating dummy links containing parameters for some infoboxes could be one way of making the data available for automatic extraction (yes, it's a hack, but I'd prefer better suggestions over flames).
The link could actually be made useful, it could point to a query page for the data in these infoboxes.
The template and parameters I'm interested in don't generate any such external links and probably couldn't very easily...
But I have just discovered the rvgeneratexml parameter to action=query&prop=revisions. This includes a <part> field for each template parameter, with a <name> and a <value> for each...
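For anyone following along, a minimal sketch of consuming that parse tree. The <part>/<name>/<value> element names are the ones mentioned above; the <title> element, the sample XML and the helper function are my own illustration, not verbatim API output:

```python
# Sketch: pulling template parameters out of the preprocessor XML parse
# tree that rvgeneratexml attaches to action=query&prop=revisions output.
# SAMPLE_PARSE_TREE is a hand-written sample in the same general shape;
# the real tree comes back in the API response.
import xml.etree.ElementTree as ET

SAMPLE_PARSE_TREE = """<root><template><title>Infobox language</title>
<part><name>iso3</name><equals>=</equals><value>eng</value></part>
<part><name>fam1</name><equals>=</equals><value>Indo-European</value></part>
</template></root>"""

def template_params(parse_tree_xml, template_prefix="Infobox"):
    """Yield (name, value) pairs for every <part> of matching templates."""
    root = ET.fromstring(parse_tree_xml)
    for tmpl in root.iter("template"):
        title = (tmpl.findtext("title") or "").strip()
        if not title.startswith(template_prefix):
            continue
        for part in tmpl.iter("part"):
            name = (part.findtext("name") or "").strip()
            value = (part.findtext("value") or "").strip()
            yield name, value

print(dict(template_params(SAMPLE_PARSE_TREE)))
```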
Andrew Dunbar (hippietrail)
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
This discussion brings to mind several historical threads.
I wonder if a project to simply mine the whole article contents and provide a DB of some sort with the articles and infobox contents would be worthwhile. Develop a specific parser and generate and publish the complete set of article-infobox-(key-value) sets...
George Herbert wrote:
This discussion brings to mind several historical threads.
I wonder if a project to simply mine the whole article contents and provide a DB of some sort with the articles and infobox contents would be worthwhile. Develop a specific parser and generate and publish the complete set of article-infobox-(key-value) sets...
I don't know anybody on the data side at Metaweb anymore, but I know that they did something like that to import a lot of structured Wikipedia data into their Freebase project. They publish some sort of data dump here:
http://download.freebase.com/wex/
Perhaps they'd be willing to open-source their parser.
William
2009/10/23 William Pietri william@scissor.com:
George Herbert wrote:
This discussion brings to mind several historical threads. I wonder if a project to simply mine the whole article contents and provide a DB of some sort with the articles and infobox contents would be worthwhile. Develop a specific parser and generate and publish the complete set of article-infobox-(key-value) sets...
I don't know anybody on the data side at Metaweb anymore, but I know that they did something like that to import a lot of structured Wikipedia data into their Freebase project. They publish some sort of data dump here: http://download.freebase.com/wex/ Perhaps they'd be willing to open-source their parser.
They're right into open source, I suspect they would.
- d.
On Fri, Oct 23, 2009 at 08:37, George Herbert george.herbert@gmail.com wrote:
I wonder if a project to simply mine the whole article contents and provide a DB of some sort with the articles and infobox contents would be worthwhile. Develop a specific parser and generate and publish the complete set of article-infobox-(key-value) sets...
That's what DBpedia is doing.
The extracted data can be found here, in N-Triples and CSV format:
http://wiki.dbpedia.org/Downloads
The entries in the row labelled 'Infoboxes' are files that contain the extracted values of all template properties in each page of a Wikipedia instance. For large Wikipedias like en, the unzipped files are pretty big (several GB).
Most of the extraction code can be found in these PHP classes:
https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/In... https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/in...
Christopher
I wonder if a project to simply mine the whole article contents and provide a DB of some sort with the articles and infobox contents would be worthwhile. Develop a specific parser and generate and publish the complete set of article-infobox-(key-value) sets...
That is a brilliant idea... ...that somebody else already had and implemented
Templatetiger http://toolserver.org/~kolossos/templatetiger/template-choice.php?lang=enwik...
Should have mentioned that earlier.
2009/10/23 Andrew Dunbar hippytrail@gmail.com:
But my attempts to find such pages using either the Toolserver's Wikipedia database or the Mediawiki API have not been fruitful. In particular, SQL queries on the templatelinks table are intractably slow. Why are there no keys on tl_from or tl_title?
There are:

CREATE UNIQUE INDEX /*i*/tl_from ON /*_*/templatelinks (tl_from,tl_namespace,tl_title);
CREATE UNIQUE INDEX /*i*/tl_namespace ON /*_*/templatelinks (tl_namespace,tl_title,tl_from);
It's just that tl_title is always coupled with tl_namespace because that's how you should be using it (tl_namespace=10 for the template namespace). Note that the former index can be used as an index on (tl_from) as well.
Roan Kattouw (Catrope)
Hi Hippietrail!
What do you mean by "intractably slow"? Just how fast must it be?
If I do http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eitit... it says (on one given try) that it was served in 0,047 seconds. How long can it take to read them all? A few minutes?
Seems to me that time would be swamped by the time it takes to pull the wikitext for the pages?
And methinks you might be trying too hard to parse the text, some fairly simple regex or such can extract the template invocation and the parameters; people use it in a pretty regular way.
Oh, and do remember to look for "Template:Infobox language" as well, depending on which way you find them.
Robert
2009/10/23 Robert Ullmann rlullmann@gmail.com:
Hi Hippietrail!
What do you mean by "intractably slow"? Just how fast must it be?
If I do http://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eitit... it says (on one given try) that it was served in 0,047 seconds. How long can it take to read them all? A few minutes?
Yes I found how to get it through the API now. It was actually just the Toolserver database that was intractably slow.
Seems to me that time would be swamped by the time it takes to pull the wikitext for the pages?
And methinks you might be trying too hard to parse the text, some fairly simple regex or such can extract the template invocation and the parameters; people use it in a pretty regular way.
I've been spending hours on the parsing now and don't find it simple at all due to the fact that templates can be nested. Just extracting the Infobox as one big lump is hard due to the need to match nested {{ and }}
Andrew Dunbar (hippietrail)
Oh, and do remember to look for "Template:Infobox language" as well, depending on which way you find them.
Robert
Given the fairly obvious utility for data mining, it might make sense for someone to extend the Mediawiki API to generate a list of template calls and the parameters sent in each case.
-Robert Rohde
2009/10/23 Robert Rohde rarohde@gmail.com:
Given the fairly obvious utility for data mining, it might make sense for someone to extend the Mediawiki API to generate a list of template calls and the parameters sent in each case.
We had a discussion about this Tuesday in the tech staff meeting, and decided that we want to put this data mining possibility in core at some point (using a table like pagelinks to store these key/value pairs and modifying the parser). As you may understand this is not a very high priority project, and I don't know if any of the paid developers are gonna do it any time soon.
Roan Kattouw (Catrope)
On Fri, Oct 23, 2009 at 7:04 AM, Roan Kattouw roan.kattouw@gmail.com wrote:
Out of curiosity, did you discuss internal uses for this? The biggest class of users would probably be external data miners, but I suspect there might be some uses for such data within MediaWiki as well, so I'm curious if you had some application(s) in mind.
-Robert Rohde
I am so glad that someone re-re-resurrects this topic :-)
On Fri, Oct 23, 2009 at 1:27 PM, Andrew Dunbar hippytrail@gmail.com wrote:
I've been spending hours on the parsing now and don't find it simple at all due to the fact that templates can be nested. Just extracting the Infobox as one big lump is hard due to the need to match nested {{ and }}
Not perfect, but try http://toolserver.org/~magnus/wiki2xml/w2x.php
1. Uncheck "Use API", choose "Do not use templates"
2. Enter article name(s)
3. Get XML
4. Parse XML, re-submit the wiki text in templates to process the next level of templates
I should really offer #4 in this...
Caveat: Will break on things like HTML attributes that are filled by templates etc.
Cheers, Magnus
I've been spending hours on the parsing now and don't find it simple at all due to the fact that templates can be nested. Just extracting the Infobox as one big lump is hard due to the need to match nested {{ and }}
Andrew Dunbar (hippietrail)
Hi,
Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in the text, then count braces. Start at depth=2, count up and down 'till you reach 0, and you are at the end of the template. (you can be picky about only counting them if paired if you like ;-)
Then just regex match the lines/parameters you want.
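A rough sketch of that approach (the helper names are made up for illustration, not from any existing library): count {{/}} pairs to carve out the infobox, then split parameters only at nesting depth zero so that templates nested inside values survive intact.

```python
# Rough sketch of the brace-counting approach described above: find the
# start of the infobox, walk forward counting {{ and }}, then split out
# the top-level "| name = value" parameters.
import re

def extract_infobox(text, pattern=r"\{\{\s*[Ii]nfobox[ _][Ll]anguage"):
    """Return the full {{...}} infobox invocation, braces balanced."""
    m = re.search(pattern, text)
    if not m:
        return None
    depth, i = 0, m.start()
    while i < len(text):
        if text.startswith("{{", i):
            depth += 1
            i += 2
        elif text.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                return text[m.start():i]
        else:
            i += 1
    return None  # unbalanced braces

def top_level_params(infobox):
    """Split on | only at nesting depth zero, then on the first =."""
    body = infobox[2:-2]  # strip the outer {{ }}
    params, depth, current = [], 0, []
    for ch in body:
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        if ch == "|" and depth == 0:
            params.append("".join(current))
            current = []
        else:
            current.append(ch)
    params.append("".join(current))
    result = {}
    for p in params[1:]:  # params[0] is the template name itself
        if "=" in p:
            name, _, value = p.partition("=")
            result[name.strip()] = value.strip()
    return result

sample = "{{Infobox language|iso3=eng|fam1={{nowrap|Indo-European}}}}"
print(top_level_params(extract_infobox(sample)))
```

On the sample this yields iso3 and fam1, with the nested {{nowrap}} kept whole inside the fam1 value. It still ignores the edge cases raised elsewhere in the thread (stray single braces, <nowiki>, HTML comments).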
However, if you are pulling the wikitext with the API, the XML parse tree option sounds good; then you can just use elementTree (or the like) and pull out the parameters directly
Robert
2009/10/23 Robert Ullmann rlullmann@gmail.com:
I've been spending hours on the parsing now and don't find it simple at all due to the fact that templates can be nested. Just extracting the Infobox as one big lump is hard due to the need to match nested {{ and }}
Andrew Dunbar (hippietrail)
Hi,
Come now, you are over-thinking it. Find "{{Infobox [Ll]anguage" in the text, then count braces. Start at depth=2, count up and down 'till you reach 0, and you are at the end of the template. (you can be picky about only counting them if paired if you like ;-)
Actually you have to find "{{[Ii]nfobox[ _][Ll]anguage". And I wanted to be robust. It's perfectly legal for single unmatched braces to appear anywhere and I didn't want them to break my code. As it happens there don't seem to currently be any in the language infoboxes. I couldn't be sure whether there would be any cases where a {{{ or }}} might show up either. And there are a few other edge cases such as HTML comments, <nowiki> and friends, template invocations in values, and even possibly template invocations in names?
Then just regex match the lines/parameters you want.
However, if you are pulling the wikitext with the API, the XML parse tree option sounds good; then you can just use elementTree (or the like) and pull out the parameters directly
I've got it extracting the name/value pairs from the XML finally but parsing XML is always a pain. And it still misses Norwegian, Bokmal, and Nynorsk which wrap the infobox in another template...
Andrew Dunbar (hippietrail)
Robert
Robert Ullmann wrote:
I've been spending hours on the parsing now and don't find it simple at all due to the fact that templates can be nested. Just extracting the Infobox as one big lump is hard due to the need to match nested {{ and }}
Or you could use the pyparsing Python library, with which you can implement the grammar of your choice, making nested template extraction trivial. Using the psyco package to accelerate it, you can parse a whole en: dump in a few hours.
See the code below for a sample grammar...
-- Neil
------------------------------------------------
# Use pyparsing, enablePackrat() _and_ psyco for a considerable speed-up
from pyparsing import *
import psyco

# These two must be in the correct order, or bad things will happen
ParserElement.enablePackrat()
psyco.full()

wikitemplate = Forward()

wikilink = Combine("[[" + SkipTo("]]") + "]]")

wikiargname = CharsNotIn("|{}=")
wikiargval = ZeroOrMore(wikilink | Group(wikitemplate) |
                        CharsNotIn("[|{}") | "[" | "{" | Regex("}[^}]"))

wikiarg = Group(Optional(wikiargname + Suppress("="), default="??") + wikiargval)

wikitemplate << (Suppress("{{") + wikiargname +
                 Optional(Suppress("|") + delimitedList(wikiarg, "|")) +
                 Suppress("}}"))

wikitext = ZeroOrMore(CharsNotIn("{") | Group(wikitemplate) | "{")

def parse_page(text):
    return wikitext.parseString(text)
Fascinating! It seems to be a repeating pattern on these mailing lists that people ignore existing solutions and discuss re-inventing wheels (please correct me if I'm wrong here). While I agree this is fun for some, it rarely helps the OP...
[[User:Dschwen]]
On Fri, Oct 23, 2009 at 8:27 AM, Andrew Dunbar hippytrail@gmail.com wrote:
Yes I found how to get it through the API now. It was actually just the Toolserver database that was intractably slow.
There's nothing slow about the TS database here:
mysql> pager true
PAGER set to 'true'
mysql> SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND tl_title IN ('Infobox_Language', 'Infobox_language');
3144 rows in set (0.12 sec)
Your query might have been what was slow.
2009/10/23 Aryeh Gregor Simetrical+wikilist@gmail.com:
Yes, I didn't specify tl_namespace, and when I checked which columns have keys I could see none:

mysql> describe templatelinks;
+--------------+-----------------+------+-----+---------+-------+
| Field        | Type            | Null | Key | Default | Extra |
+--------------+-----------------+------+-----+---------+-------+
| tl_from      | int(8) unsigned | NO   |     | 0       |       |
| tl_namespace | int(11)         | NO   |     | 0       |       |
| tl_title     | varchar(255)    | NO   |     |         |       |
+--------------+-----------------+------+-----+---------+-------+
3 rows in set (0.01 sec)
But I don't know much about databases and SQL...
I have reached an important milestone, by the way: extracting all the name/value pairs for the language infobox's ISO 639 language codes and language family strings.
But the values still need some work before I can try to match them against ISO 639-5 language family codes which is my ultimate goal.
Thanks for all the tips.
Andrew Dunbar (hippietrail)
Because of result count restrictions, these queries don't return all ISO language codes extracted by DBpedia, but I think they give a good impression of the data quality and coverage (or sometimes lack thereof):
http://dbpedia.org/sparql?query=select+distinct+%3Fs%2C+%3Fo+where%7B%3Fs+%3...
http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&que...
Note: the trailing "}" is part of the URL. Some mail readers may cut it off.
2009/10/23 Jona Christopher Sahnwaldt jcsahnwaldt@gmail.com:
This is really amazing and I think I'm going to be learning the query language and possibly spending some time with dbpedia. Have you thought about doing the same for Wiktionary?
Andrew Dunbar (hippietrail)
On Mon, Oct 26, 2009 at 02:55, Andrew Dunbar hippytrail@gmail.com wrote:
Have you thought about doing the same for Wiktionary?
Interesting idea. I don't know much about Wiktionary. Are its pages structured similarly? How difficult would it be to extract structured data from them? What kind of data would you expect to extract?
I don't think there is much dictionary data on the Web of Data yet, so extracting data from wiktionary may be an interesting first step in that direction. We haven't really thought about it yet, and it probably won't get to the top of our to-do-list anytime soon, but it's an interesting idea!
Christopher
On Fri, Oct 23, 2009 at 12:20 PM, Andrew Dunbar hippytrail@gmail.com wrote:
Yes I didn't specify tl_namespace
In MySQL that will usually make it impossible to effectively use an index on (tl_namespace, tl_title), so it's essential that you specify the NS. (Which you should anyway to avoid hitting things like [[Template talk:Infobox language]].) Some DBMSes (including sometimes MySQL >= 5.0, although apparently not here) are smart enough to use this kind of index pretty well even if you don't specify the namespace, but it would still be somewhat more efficient to specify it -- the DB would have to do O(1/n) times as many index lookups, where n is the number of namespaces.
and when I checked which columns have keys I could see none:

mysql> describe templatelinks;
+--------------+-----------------+------+-----+---------+-------+
| Field        | Type            | Null | Key | Default | Extra |
+--------------+-----------------+------+-----+---------+-------+
| tl_from      | int(8) unsigned | NO   |     | 0       |       |
| tl_namespace | int(11)         | NO   |     | 0       |       |
| tl_title     | varchar(255)    | NO   |     |         |       |
+--------------+-----------------+------+-----+---------+-------+
3 rows in set (0.01 sec)
The toolserver database uses views. In MySQL, views can't have indexes themselves, but your query is rewritten to run against the real table -- which you can't access directly, but which does have indexes. EXPLAIN is your best bet here:
mysql> EXPLAIN SELECT tl_from FROM templatelinks WHERE tl_title IN ('Infobox_Language', 'Infobox_language');
+----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
| id | select_type | table         | type  | possible_keys | key     | key_len | ref  | rows      | Extra                    |
+----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
|  1 | SIMPLE      | templatelinks | index | NULL          | tl_from | 265     | NULL | 149740990 | Using where; Using index |
+----+-------------+---------------+-------+---------------+---------+---------+------+-----------+--------------------------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT tl_from FROM templatelinks WHERE tl_namespace=10 AND tl_title IN ('Infobox_Language', 'Infobox_language');
+----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
| id | select_type | table         | type  | possible_keys | key          | key_len | ref  | rows | Extra                    |
+----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | templatelinks | range | tl_namespace  | tl_namespace | 261     | NULL | 6949 | Using where; Using index |
+----+-------------+---------------+-------+---------------+--------------+---------+------+------+--------------------------+
1 row in set (0.00 sec)
Note the number of rows scanned in each case. Your query was scanning all of templatelinks, the other is retrieving the exact rows needed and not looking at any others ("type" = "index" vs. "range"). The reason for this is given in the "possible_keys" column: MySQL can find no keys that are usable for lookup, if you omit tl_namespace.
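The same effect can be reproduced on a toy scale with SQLite's EXPLAIN QUERY PLAN (this is just an illustration with a made-up table of the same shape, not the Toolserver schema, and MySQL's output differs in detail):

```python
# Toy reproduction of the index-prefix effect described above, using
# SQLite's EXPLAIN QUERY PLAN on a made-up table with a composite index
# shaped like templatelinks' (tl_namespace, tl_title, tl_from).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE templatelinks "
             "(tl_from INTEGER, tl_namespace INTEGER, tl_title TEXT)")
conn.execute("CREATE UNIQUE INDEX tl_namespace_idx ON templatelinks "
             "(tl_namespace, tl_title, tl_from)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the detail string
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# With tl_namespace specified, the composite index is usable for lookup
# (the plan says SEARCH); without it, the whole table or index is scanned.
print(plan("SELECT tl_from FROM templatelinks "
           "WHERE tl_namespace=10 AND tl_title='Infobox_language'"))
print(plan("SELECT tl_from FROM templatelinks "
           "WHERE tl_title='Infobox_language'"))
```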
2009/10/23 Aryeh Gregor Simetrical+wikilist@gmail.com:
Thanks for the very informative reply. I already knew most of this stuff passively except database/SQL views. Now I've just got to put it into more practice.
Andrew Dunbar (hippietrail)