This discussion brings to mind several historical threads.
I wonder if a project to simply mine the whole article contents and provide a DB of some sort with the articles and infobox contents would be worthwhile. Develop a specific parser and generate and publish the complete set of article-infobox-(key-value) sets...
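To make the idea a bit more concrete, here is a rough sketch of what such a published key-value store could look like, using SQLite from Python. The table name, column names and the sample rows are all made up for illustration; nothing here reflects an existing MediaWiki schema.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory DB with one row per (article, infobox, key, value)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one flat table is enough for a first cut.
    conn.execute(
        "CREATE TABLE infobox_param ("
        "article TEXT, infobox TEXT, key TEXT, value TEXT)"
    )
    # Index the columns that lookups by infobox type and key would filter on.
    conn.execute("CREATE INDEX idx_infobox_key ON infobox_param (infobox, key)")
    conn.executemany("INSERT INTO infobox_param VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Made-up sample rows, only to exercise the schema.
SAMPLE_ROWS = [
    ("Article A", "Infobox settlement", "country", "Iceland"),
    ("Article A", "Infobox settlement", "population", "12345"),
    ("Article B", "Infobox person", "name", "Somebody"),
]


def articles_with_key(conn, infobox, key):
    """Return the sorted article titles whose infobox sets the given key."""
    cur = conn.execute(
        "SELECT article FROM infobox_param "
        "WHERE infobox = ? AND key = ? ORDER BY article",
        (infobox, key),
    )
    return [row[0] for row in cur]
```

With the data in this shape, "which articles set parameter X in infobox Y" becomes a single indexed query instead of a scan over wikitext.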
On Thu, Oct 22, 2009 at 11:13 PM, Andrew Dunbar hippytrail@gmail.com wrote:
2009/10/22 Daniel Schwen lists@schwen.de:
particular, SQL queries on the templatelinks table are intractably slow. Why are there no keys on tl_from or tl_title?
How are you planning to get the template parameters? Have I missed a recent schema change?
I've been trying to parse the wikitext of section 0 with a minimal parser that uses just the tokens {{ }} {{{ and }}} but it already has problems when it sees }}}}
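The }}}} problem can be reproduced in a few lines. The sketch below is not the parser described above, just an illustration under simple assumptions: a greedy longest-match tokenizer reads }}}} as }}} followed by a stray }, while a scanner that remembers which opener is on top of a stack closes with the matching width. The function names are hypothetical, and the handling of long runs of opening braces is a crude heuristic.

```python
import re

# Greedy tokenizer: always prefers the three-brace token over the two-brace one.
TOKEN_RE = re.compile(r"\{\{\{|\{\{|\}\}\}|\}\}")


def greedy_tokens(text):
    """Return the brace tokens found by naive longest-match scanning."""
    return TOKEN_RE.findall(text)


def match_braces(text):
    """Return (start, end, kind) spans using a stack of open delimiters.

    kind is 'template' for {{...}} and 'param' for {{{...}}}.
    """
    spans = []
    stack = []  # entries are (width, start_index) of unclosed openers
    i = 0
    while i < len(text):
        # Heuristic: treat exactly three braces as {{{; with four or more,
        # peel off {{ first and look again.
        if text.startswith("{{{", i) and not text.startswith("{{{{", i):
            stack.append((3, i))
            i += 3
        elif text.startswith("{{", i):
            stack.append((2, i))
            i += 2
        elif text.startswith("}}", i) and stack:
            width, start = stack[-1]
            if text.startswith("}" * width, i):
                # Close with the same width as the opener on top of the stack.
                stack.pop()
                i += width
                kind = "param" if width == 3 else "template"
                spans.append((start, i, kind))
            else:
                # Closer is too short for the pending opener; skip one char.
                i += 1
        else:
            i += 1
    return spans
```

For the input `{{a|{{b}}}}` the greedy tokenizer yields `{{`, `{{`, `}}}` and leaves one brace over, whereas the stack version reports the inner template at 4..9 and the outer one at 0..11.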
I'd be interested in following your progress. I'm not extracting infobox data, but parameters of the coordinate template. Maybe a similar approach could be interesting for you:
The coordinate template stuffs all its parameters into an external link (which can easily be obtained from the externallinks table). Creating dummy links containing parameters for some infoboxes could be one way of making the data available for automatic extraction (yes, it's a hack, but I'd prefer better suggestions over flames).
The link could actually be made useful, it could point to a query page for the data in these infoboxes.
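As a sketch of the extraction side of this hack: if a template emitted a link whose query string carried its parameters, a consumer reading URLs out of the externallinks table could recover them with the standard library alone. The URL layout and parameter names below are purely hypothetical, not what any existing template produces.

```python
from urllib.parse import urlsplit, parse_qs


def params_from_link(url):
    """Recover template parameters from the query string of a dummy link."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; keep the first value for each.
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


# Hypothetical link of the kind an infobox could be made to emit.
EXAMPLE_LINK = (
    "http://example.org/infobox-query"
    "?template=Infobox_settlement&population=12345&country=Iceland"
)
```

Calling `params_from_link(EXAMPLE_LINK)` gives back a plain dict of the three parameters, and the same URL could double as the query page mentioned above.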
The template and parameters I'm interested in don't generate any such external links and probably couldn't very easily...
But I have just discovered the rvgeneratexml parameter to action=query&prop=revisions. This includes a <part> field for each template parameter, with a <name> and a <value> for each...
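A minimal sketch of how that output could be consumed from Python follows. The API endpoint, the exact set of query parameters and the sample XML are assumptions: the sample is hand-written to match the <part>/<name>/<value> shape described above, not captured from a live wiki, and rvgeneratexml may not be available on every MediaWiki version.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_query_url(title, api="https://en.wikipedia.org/w/api.php"):
    """Build a revisions query asking for the parse tree (endpoint assumed)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "content",
        "rvgeneratexml": 1,
        "format": "json",
    }
    return api + "?" + urlencode(params)


# Hand-written sample in the shape described in the post.
SAMPLE_XML = (
    "<root><template><title>Infobox example</title>"
    "<part><name>name</name>=<value>Some name</value></part>"
    "<part><name>year</name>=<value>1815</value></part>"
    "</template></root>"
)


def template_params(xml_text):
    """Map each template title to a dict of its parameter names and values."""
    root = ET.fromstring(xml_text)
    result = {}
    for tmpl in root.iter("template"):
        title = "".join(tmpl.find("title").itertext()).strip()
        params = {}
        for part in tmpl.findall("part"):
            name_el = part.find("name")
            # Unnamed parameters are assumed to carry an index attribute.
            name = "".join(name_el.itertext()).strip() or name_el.get("index", "")
            params[name] = "".join(part.find("value").itertext()).strip()
        result[title] = params
    return result
```

The XML string would come out of the API response for the page; once it is in hand, no brace matching is needed at all, which sidesteps the }}}} problem entirely.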
Andrew Dunbar (hippietrail)
[[User:Dschwen]]
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
-- http://wiktionarydev.leuksman.com http://linguaphile.sf.net