Hey,
I think you should take a look at GRDDL (http://www.w3.org/TR/grddl/) and ScraperWiki (https://scraperwiki.com/).
Martynas graphity.org
On Sun, Mar 18, 2012 at 11:55 AM, John Erling Blad jeblad@gmail.com wrote:
Thanks for the link. I will surely use this for some other screen-scraping project, but in this context I was looking for pointers to previous work on screen scraping in MediaWiki in general, and especially for Wikidata-like sites. The simple, REST-like, pre-built tables are pretty easy to handle in tag and parser functions, but the stateful pages where queries are built interactively are very hard to automate.
John
On Sun, Mar 18, 2012 at 11:32 AM, Leonard Wallentin leo_wallentin@hotmail.com wrote:
Are you trying to achieve this from within MediaWiki? Otherwise, Google Docs is a good tool for screen scraping that can be used to produce CSV files for your wiki from sources without an API. I wrote about it here, in Swedish: http://blogg.svt.se/nyhetslabbet/2012/01/screen-scraping-sa-har-gar-det-till... (it should be readable if you are Norwegian).
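For illustration, here is a minimal Python sketch of the last step of that approach: once the scraped data has been published as a plain CSV file (for example from a Google Docs spreadsheet), it can be fetched and rendered as MediaWiki table markup. The CSV URL below is a placeholder, not a real endpoint:

# A minimal sketch, assuming the scraped data is already available as a plain
# CSV file (e.g. published from a Google Docs spreadsheet). The URL is a
# placeholder, not a real endpoint.
import csv
import io
import urllib.request

CSV_URL = "https://example.org/scraped-data.csv"  # hypothetical published CSV

def csv_to_wikitable(url):
    """Fetch a CSV file and render it as MediaWiki table markup."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    lines = ['{| class="wikitable"']
    # Treat the first row as the header row.
    lines.append("! " + " !! ".join(rows[0]))
    for row in rows[1:]:
        lines.append("|-")
        lines.append("| " + " || ".join(row))
    lines.append("|}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(csv_to_wikitable(CSV_URL))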
/Leo
Leonard Wallentin leo_wallentin@hotmail.com +46 (0)735-933 543 Twitter: @leo_wallentin Skype: leo_wallentin
http://svt.se/nyhetslabbet http://s%C3%A4songsmat.nu WikiSkills: http://wikimediasverige.wordpress.com/2012/03/01/1519/ http://nairobikoll.se
Date: Sun, 18 Mar 2012 09:57:34 +0100
From: jeblad@gmail.com
To: wikidata-l@lists.wikimedia.org
Subject: [Wikidata-l] Import from external sources
Is there any previous work on importing data from external sources, especially those that do not have a prepared and well-defined API?
A rather simple example from the Statistics Norway website: an article like http://www.ssb.no/fobstud/ and a table like http://www.ssb.no/fobstud/tab-2002-11-21-02.html
In that example you must follow a link to a new page, which you then must monitor for changes. Inside that page you can use XPath to extract a field, and then optionally use something like a regexp to identify and split fields. As an alternative solution you might use XSLT to transform the whole page.
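For illustration, a minimal Python sketch of that XPath-plus-regexp step, using lxml. The XPath expression and the regexp are assumptions about the layout of the table page, not taken from the actual markup, and would need adjusting; lxml can also run XSLT transforms (lxml.etree.XSLT) if the whole-page approach is preferred:

# A minimal sketch of the XPath-plus-regexp approach. The XPath expression and
# the regexp are assumptions about the page layout, not the real ssb.no markup.
import re
import urllib.request

from lxml import html  # third-party: pip install lxml

PAGE_URL = "http://www.ssb.no/fobstud/tab-2002-11-21-02.html"

def extract_cells(url, xpath_expr="//table//tr/td", pattern=r"[\d\s]+\d"):
    """Fetch a page, select nodes with XPath, then clean each one with a regexp."""
    with urllib.request.urlopen(url) as response:
        doc = html.fromstring(response.read())
    values = []
    for node in doc.xpath(xpath_expr):
        text = node.text_content().strip()
        match = re.search(pattern, text)
        if match:
            values.append(match.group(0).strip())
    return values

if __name__ == "__main__":
    for value in extract_cells(PAGE_URL)[:10]:
        print(value)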
Anyhow, this can quite easily be formulated both as a parser function and a tag function.
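As a purely hypothetical illustration of that formulation (real MediaWiki tag extensions are written in PHP, and the tag and attribute names below are invented for the example), the attributes of an imagined <external-data /> tag could drive the same fetch-and-extract step:

# Hypothetical illustration only: this is not MediaWiki code, just a sketch of
# how tag attributes (url, xpath, regexp) could map onto the extraction step.
import re
import urllib.request

from lxml import html  # third-party: pip install lxml

def render_external_data_tag(attributes):
    """Turn the attributes of an imagined tag into a comma-separated string."""
    url = attributes["url"]
    xpath_expr = attributes.get("xpath", "//table//tr/td")
    pattern = attributes.get("regexp", r"\S.*")
    with urllib.request.urlopen(url) as response:
        doc = html.fromstring(response.read())
    values = []
    for node in doc.xpath(xpath_expr):
        match = re.search(pattern, node.text_content().strip())
        if match:
            values.append(match.group(0))
    return ", ".join(values)

# A page might then use something like:
#   <external-data url="http://www.ssb.no/fobstud/tab-2002-11-21-02.html"
#                  xpath="//table//tr/td[1]" regexp="\S.*" />
if __name__ == "__main__":
    print(render_external_data_tag({
        "url": "http://www.ssb.no/fobstud/tab-2002-11-21-02.html",
        "xpath": "//table//tr/td[1]",
        "regexp": r"\S.*",
    }))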
At the same site there is something called "Statistikkbanken" (http://statbank.ssb.no/statistikkbanken/) where you can (must) log on and then iterate through a sequence of pages.
Data similar to the previous example can be found at http://statbank.ssb.no/statistikkbanken/selectvarval/Define.asp?MainTable=Fo... but it is very difficult to formulate a kind of click-sequence inside that page.
Any idea? Some kind of click-sequence recording?
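One possible shape for such a recording, sketched in Python: capture the sequence of requests once (for example with the browser's network tools), write it down as a list of steps, and replay it as a series of POSTs that share a session. The form field names and values below are placeholders, not the actual Statistikkbanken parameters:

# A minimal sketch of replaying a recorded "click-sequence" as HTTP requests
# that share cookies (session state). All field names and values are
# placeholders that would come from the recorded sequence.
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Each entry is one "click": a URL plus the form fields it submits.
CLICK_SEQUENCE = [
    ("http://statbank.ssb.no/statistikkbanken/selectvarval/Define.asp",
     {"MainTable": "<table-id>", "SubjectCode": "<subject>"}),   # placeholders
    ("http://statbank.ssb.no/statistikkbanken/<next-step>",
     {"format": "<output-format>"}),                             # placeholders
]

def replay(sequence):
    """POST each step in order, carrying cookies between steps."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    body = b""
    for url, fields in sequence:
        data = urllib.parse.urlencode(fields).encode("utf-8")
        with opener.open(url, data=data) as response:
            body = response.read()
    return body  # the last response should hold the final table

if __name__ == "__main__":
    print(replay(CLICK_SEQUENCE)[:500])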
Statistics Norway publishes statistics about Norway for free reuse as long as it is credited appropriately. http://www.ssb.no/english/help/
John
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l