Has anyone considered how to import data from external sources, especially those that do not have any prepared and well-defined API?
A rather simple example from the Statistics Norway website is an article like this http://www.ssb.no/fobstud/ and a table like this http://www.ssb.no/fobstud/tab-2002-11-21-02.html
In that example you must follow a link to a new page, which you then must monitor for changes. Inside that page you can use XPath to extract a field, and then optionally use something like a regexp to identify and split fields. As an alternative solution you might use XSLT to transform the whole page.
Anyhow, this can quite easily be formulated both as a parser function and a tag function.
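For illustration, a minimal sketch of that extraction step, assuming Python with requests and lxml (both my own choices); the XPath expression and the regexp below are purely hypothetical, since the real ones depend on the table's markup:

```python
import re

import requests
from lxml import html

URL = "http://www.ssb.no/fobstud/tab-2002-11-21-02.html"

# Fetch the table page and parse it into an element tree.
page = requests.get(URL)
page.raise_for_status()
doc = html.fromstring(page.content)

# Hypothetical XPath: pick one cell out of the statistics table.
cells = doc.xpath("//table//tr[2]/td[3]/text()")

# Hypothetical regexp: keep the digits, dropping whitespace thousands separators.
if cells:
    match = re.search(r"[\d\s]+", cells[0])
    if match:
        value = int(re.sub(r"\s", "", match.group(0)))
        print(value)
```

The same logic could sit behind a parser function, with the URL, XPath, and regexp passed in as arguments.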
At the same site there is something called "Statistikkbanken" (http://statbank.ssb.no/statistikkbanken/) where you can (indeed must) log on and then iterate through a sequence of pages.
Data similar to the previous example can be found at http://statbank.ssb.no/statistikkbanken/selectvarval/Define.asp?MainTable=Fo... but it is very difficult to formulate a kind of click-sequence inside that page.
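To illustrate the problem: each "click" has to become a stateful request in a session, roughly as below. This is only a sketch, assuming Python with requests; every form field name and the result endpoint are hypothetical placeholders, since I have not mapped out how the ASP pages actually keep their state.

```python
import requests

BASE = "http://statbank.ssb.no/statistikkbanken"

# One session object so cookies (the server-side state) survive between steps.
session = requests.Session()

# Step 1: open the variable-selection page for the table.
# The MainTable value is truncated here on purpose; the real one comes from the site.
define_url = BASE + "/selectvarval/Define.asp"
session.get(define_url, params={"MainTable": "Fo..."})

# Step 2: "click" through the selection form.
# Every field name below is a hypothetical placeholder.
session.post(define_url, data={
    "region": "municipality",
    "year": "2001",
    "action": "continue",
})

# Step 3: request the result table, e.g. as CSV (hypothetical endpoint).
result = session.post(BASE + "/selectout/ShowTable.asp", data={"format": "csv"})
print(result.text[:200])
```

A recorded click-sequence would essentially be a declarative description of steps 1-3.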
Any idea? Some kind of click-sequence recording?
Statistics Norway publishes statistics about Norway for free reuse, as long as it is credited appropriately. http://www.ssb.no/english/help/
John
Are you trying to achieve this from within MediaWiki? Otherwise, Google Docs is a good tool for screen scraping that can be used to produce CSV files for your wiki from sources without an API. I wrote about it here, in Swedish: http://blogg.svt.se/nyhetslabbet/2012/01/screen-scraping-sa-har-gar-det-till... (assuming you are Norwegian). /Leo
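Roughly, the idea is to let a Google spreadsheet pull the table in (e.g. with a function such as IMPORTHTML), publish the sheet to the web as CSV, and have the wiki side fetch that CSV. A small sketch of the consuming side, assuming Python; the published-sheet URL below is just a placeholder:

```python
import csv
import io

import requests

# Hypothetical URL of a Google spreadsheet published to the web as CSV.
CSV_URL = "https://docs.google.com/spreadsheet/pub?key=EXAMPLE_KEY&output=csv"

response = requests.get(CSV_URL)
response.raise_for_status()

# Parse the published CSV; a bot would turn these rows into wiki content
# instead of printing them.
for row in csv.reader(io.StringIO(response.text)):
    print(row)
```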
Leonard Wallentin leo_wallentin@hotmail.com +46 (0)735-933 543 Twitter: @leo_wallentin Skype: leo_wallentin http://svt.se/nyhetslabbet http://s%C3%A4songsmat.nu WikiSkills: http://wikimediasverige.wordpress.com/2012/03/01/1519/ http://nairobikoll.se
Thanks for the link; I will surely use it for some other screen-scraping project, but in this context I was looking for pointers to previous work on screen scraping in MediaWiki in general, and especially for Wikidata-like sites. The simple REST-like pre-built tables are pretty easy to handle in tag and parser functions, but the stateful pages, where queries are built interactively, are very hard to automate.
John
Hey,
I think you should take a look at GRDDL (http://www.w3.org/TR/grddl/) and ScraperWiki (https://scraperwiki.com/).
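For reference, GRDDL boils down to applying an XSLT stylesheet to the (X)HTML page to pull structured data out of it, which also matches the "transform the whole page" idea earlier in the thread; ScraperWiki instead hosts ordinary scraper scripts and runs them on a schedule. A small sketch of the transform step, assuming Python with lxml and a hypothetical stylesheet extract.xsl:

```python
import requests
from lxml import etree

# Fetch the page to be transformed.
page = requests.get("http://www.ssb.no/fobstud/tab-2002-11-21-02.html")
page.raise_for_status()
doc = etree.ElementTree(etree.HTML(page.content))

# Load a GRDDL-style stylesheet (hypothetical file) and apply it.
# The extraction rules live in the declarative stylesheet instead of ad hoc code.
transform = etree.XSLT(etree.parse("extract.xsl"))
result = transform(doc)

print(str(result))
```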
Martynas graphity.org
For now, Wikidata does not plan to cover scraping data automatically from the web, but only to provide a place where such data can be edited, stored, and re-published, including references. My assumption is that the community might create bots to perform such scraping and upload the results to Wikidata, but whether they want this and how it might happen is a decision for the community, once it exists.
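To make that assumption concrete, here is a sketch of what such a community bot might look like, assuming a pywikibot-style client library; this is speculative, since no such interface exists yet, and the item and property IDs below are hypothetical placeholders.

```python
import pywikibot

# Connect to the (future) Wikidata repository; assumes the client library
# grows Wikibase support, which it does not have at the time of writing.
site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

# Hypothetical item and property IDs.
item = pywikibot.ItemPage(repo, "Q12345")      # e.g. a Norwegian municipality
claim = pywikibot.Claim(repo, "P9999")         # e.g. "number of students"
claim.setTarget(pywikibot.WbQuantity(amount=5200, site=repo))
item.addClaim(claim, summary="Import from Statistics Norway")

# Attach the scraped page as a reference, so the statement carries its source.
source = pywikibot.Claim(repo, "P8888")        # e.g. "reference URL"
source.setTarget("http://www.ssb.no/fobstud/tab-2002-11-21-02.html")
claim.addSources([source])
```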
Cheers, Denny