For now, Wikidata does not plan to scrape data automatically from the
Web, but only to provide a place where such data can be edited, stored,
and re-published, including references. My assumption is that the
community might create bots to perform such scraping and upload the
results to Wikidata, but whether they want this, and how it might
happen, is a decision for the community, once it exists.
Cheers,
Denny
2012/3/18 Martynas Jusevicius <martynas(a)graphity.org>
Hey,
I think you should take a look at:

GRDDL: http://www.w3.org/TR/grddl/
ScraperWiki: https://scraperwiki.com/
Martynas
graphity.org
On Sun, Mar 18, 2012 at 11:55 AM, John Erling Blad <jeblad(a)gmail.com> wrote:
Thanks for the link; I will surely use this for some other screen
scraping project, but in this context I was looking for pointers to
previous work on screen scraping in MediaWiki in general, and
especially for Wikidata-like sites. The simple REST-like, previously
built tables are pretty easy to handle in tag and parser functions, but
the stateful pages where queries are built interactively are very hard
to automate.
John
On Sun, Mar 18, 2012 at 11:32 AM, Leonard Wallentin
<leo_wallentin(a)hotmail.com> wrote:
> Are you trying to achieve this from within MediaWiki? Otherwise Google
> Docs is a good tool for screen scraping that can be used to produce
> csv files for your wiki from sources without an API. I wrote about it
> here, in Swedish (assuming you are Norwegian):
> http://blogg.svt.se/nyhetslabbet/2012/01/screen-scraping-sa-har-gar-det-til…
>
> /Leo
>
> ________________________________
> Leonard Wallentin
> leo_wallentin(a)hotmail.com
> +46 (0)735-933 543
> Twitter: @leo_wallentin
> Skype: leo_wallentin
> http://svt.se/nyhetslabbet
> http://säsongsmat.nu <http://xn--ssongsmat-v2a.nu>
> WikiSkills: http://wikimediasverige.wordpress.com/2012/03/01/1519/
> http://nairobikoll.se
>
>> Date: Sun, 18 Mar 2012 09:57:34 +0100
>> From: jeblad(a)gmail.com
>> To: wikidata-l(a)lists.wikimedia.org
>> Subject: [Wikidata-l] Import from external sources
>
>> How can data be imported from external sources, especially those that
>> do not have any prepared and well-defined API?
>>
>> A rather simple example from the website of Statistics Norway is an
>> article on a page like this:
>> http://www.ssb.no/fobstud/
>> and a table like this:
>> http://www.ssb.no/fobstud/tab-2002-11-21-02.html
>>
>> In that example you must follow a link to a new page, which you then
>> must monitor for changes. Inside that page you can use XPath to
>> extract a field, and then optionally use something like a regexp to
>> identify and split fields. As an alternative solution you might use
>> XSLT to transform the whole page.
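The extract-then-split step described above can be sketched in Python. This is a minimal illustration, not part of MediaWiki: the table fragment, the function names, and the use of the standard library's limited ElementTree XPath subset are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative, well-formed stand-in for the SSB table page; the real
# page would be fetched over HTTP (and tidied into XML) first.
PAGE = """<table>
  <tr><th>Municipality</th><th>Students</th></tr>
  <tr><td>Oslo</td><td>45 118</td></tr>
  <tr><td>Bergen</td><td>21 303</td></tr>
</table>"""

def extract_cells(source):
    """Return every <td> cell as plain text, via an XPath-style query."""
    tree = ET.fromstring(source)
    return ["".join(td.itertext()).strip() for td in tree.iterfind(".//td")]

def split_field(field):
    """The regexp step: pull the digit groups out of a formatted number."""
    return re.findall(r"\d+", field)

print(extract_cells(PAGE))    # ['Oslo', '45 118', 'Bergen', '21 303']
print(split_field("45 118"))  # ['45', '118']
```

A real scraper would fetch the page, tidy it, and then apply exactly these two steps, or use an XSLT transform over the whole document as suggested above.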
>>
>> Anyhow, this can quite easily be formulated both as a parser function
>> and as a tag function.
>>
>> At the same site there is something called "Statistikkbanken"
>> (http://statbank.ssb.no/statistikkbanken/) where you can (must) log on
>> and then iterate through a sequence of pages.
>>
>> Data similar to the previous example can be found in
>> http://statbank.ssb.no/statistikkbanken/selectvarval/Define.asp?MainTable=F…
>>
>> But it is very difficult to formulate a kind of click-sequence inside
>> that page.
>>
>> Any idea? Some kind of click-sequence recording?
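One way to approximate the click-sequence recording asked about above is to store each "click" as a (method, URL, form data) step and replay the steps over a cookie-carrying session. Below is a minimal sketch using only Python's standard library; the step format, function names, and example URLs are assumptions, and the real Statistikkbanken flow (log on, then iterate through pages) is not modeled.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

def build_request(method, url, form=None):
    """Turn one recorded click into an HTTP request object."""
    data = urllib.parse.urlencode(form).encode() if form else None
    return urllib.request.Request(url, data=data, method=method)

def replay(steps):
    """Replay recorded clicks in order, carrying cookies between them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return [opener.open(build_request(*step), timeout=10).read()
            for step in steps]

# A recorded sequence would look like this (hypothetical URLs):
# replay([
#     ("POST", "http://example.org/login", {"user": "x", "pass": "y"}),
#     ("GET",  "http://example.org/step2", None),
# ])
```

The cookie jar is what makes the session stateful: each replayed step sees the cookies set by the previous ones, which is the part that plain per-page fetching misses.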
>>
>> Statistics Norway publishes statistics about Norway for free reuse, as
>> long as it is credited appropriately.
>> http://www.ssb.no/english/help/
>>
>> John
>>
>> _______________________________________________
>> Wikidata-l mailing list
>> Wikidata-l(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata-l