Has anyone considered how to import data from external sources, especially those that do not have any prepared and well-defined API?
A rather simple example from the Statistics Norway website is an article like this http://www.ssb.no/fobstud/ and a table like this http://www.ssb.no/fobstud/tab-2002-11-21-02.html
In that example you must follow a link to a new page, which you then must monitor for changes. Inside that page you can use XPath to extract a field, and then optionally use something like a regexp to identify and split fields. As an alternative solution you might use XSLT to transform the whole page.
Anyhow, this can quite easily be formulated both as a parser function and a tag function.
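For illustration, a minimal sketch of that extraction step, assuming Python with requests and lxml (both my own choices); the XPath expression and the regexp below are purely hypothetical, since the real ones depend on the table's markup:

```python
import re

import requests
from lxml import html

URL = "http://www.ssb.no/fobstud/tab-2002-11-21-02.html"

# Fetch the table page and parse it into an element tree.
page = requests.get(URL)
page.raise_for_status()
doc = html.fromstring(page.content)

# Hypothetical XPath: pick one cell out of the statistics table.
cells = doc.xpath("//table//tr[2]/td[3]/text()")

# Hypothetical regexp: keep the digits, dropping whitespace thousands separators.
if cells:
    match = re.search(r"[\d\s]+", cells[0])
    if match:
        value = int(re.sub(r"\s", "", match.group(0)))
        print(value)
```

The same logic could sit behind a parser function, with the URL, XPath, and regexp passed in as arguments.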
At the same site there is something called "Statistikkbanken" (http://statbank.ssb.no/statistikkbanken/) where you can (indeed must) log on and then iterate through a sequence of pages.
Data similar to the previous example can be found at http://statbank.ssb.no/statistikkbanken/selectvarval/Define.asp?MainTable=Fo... but it is very difficult to formulate a kind of click-sequence inside that page.
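To illustrate the problem: each "click" has to become a stateful request in a session, roughly as below. This is only a sketch, assuming Python with requests; every form field name and the result endpoint are hypothetical placeholders, since I have not mapped out how the ASP pages actually keep their state.

```python
import requests

BASE = "http://statbank.ssb.no/statistikkbanken"

# One session object so cookies (the server-side state) survive between steps.
session = requests.Session()

# Step 1: open the variable-selection page for the table.
# The MainTable value is truncated here on purpose; the real one comes from the site.
define_url = BASE + "/selectvarval/Define.asp"
session.get(define_url, params={"MainTable": "Fo..."})

# Step 2: "click" through the selection form.
# Every field name below is a hypothetical placeholder.
session.post(define_url, data={
    "region": "municipality",
    "year": "2001",
    "action": "continue",
})

# Step 3: request the result table, e.g. as CSV (hypothetical endpoint).
result = session.post(BASE + "/selectout/ShowTable.asp", data={"format": "csv"})
print(result.text[:200])
```

A recorded click-sequence would essentially be a declarative description of steps 1-3.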
Any idea? Some kind of click-sequence recording?
Statistics Norway publishes statistics about Norway for free reuse, as long as it is credited appropriately. http://www.ssb.no/english/help/
John
Are you trying to achieve this from within MediaWiki? Otherwise, Google Docs is a good tool for screen scraping that can be used to produce CSV files for your wiki from sources without an API. I wrote about it here, in Swedish: http://blogg.svt.se/nyhetslabbet/2012/01/screen-scraping-sa-har-gar-det-till... (assuming you are Norwegian). /Leo
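Roughly, the idea is to let a Google spreadsheet pull the table in (e.g. with a function such as IMPORTHTML), publish the sheet to the web as CSV, and have the wiki side fetch that CSV. A small sketch of the consuming side, assuming Python; the published-sheet URL below is just a placeholder:

```python
import csv
import io

import requests

# Hypothetical URL of a Google spreadsheet published to the web as CSV.
CSV_URL = "https://docs.google.com/spreadsheet/pub?key=EXAMPLE_KEY&output=csv"

response = requests.get(CSV_URL)
response.raise_for_status()

# Parse the published CSV; a bot would turn these rows into wiki content
# instead of printing them.
for row in csv.reader(io.StringIO(response.text)):
    print(row)
```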
Leonard Wallentin leo_wallentin@hotmail.com +46 (0)735-933 543 Twitter: @leo_wallentin Skype: leo_wallentin http://svt.se/nyhetslabbet http://s%C3%A4songsmat.nu WikiSkills: http://wikimediasverige.wordpress.com/2012/03/01/1519/ http://nairobikoll.se
Thanks for the link; I will surely use it for some other screen-scraping project, but in this context I was looking for pointers to previous work on screen scraping in MediaWiki in general, and especially for Wikidata-like sites. The simple REST-like pre-built tables are pretty easy to handle in tag and parser functions, but the stateful pages, where queries are built interactively, are very hard to automate.
John
Hey,
I think you should take a look at GRDDL (http://www.w3.org/TR/grddl/) and ScraperWiki (https://scraperwiki.com/).
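For reference, GRDDL boils down to applying an XSLT stylesheet to the (X)HTML page to pull structured data out of it, which also matches the "transform the whole page" idea earlier in the thread; ScraperWiki instead hosts ordinary scraper scripts and runs them on a schedule. A small sketch of the transform step, assuming Python with lxml and a hypothetical stylesheet extract.xsl:

```python
import requests
from lxml import etree

# Fetch the page to be transformed.
page = requests.get("http://www.ssb.no/fobstud/tab-2002-11-21-02.html")
page.raise_for_status()
doc = etree.ElementTree(etree.HTML(page.content))

# Load a GRDDL-style stylesheet (hypothetical file) and apply it.
# The extraction rules live in the declarative stylesheet instead of ad hoc code.
transform = etree.XSLT(etree.parse("extract.xsl"))
result = transform(doc)

print(str(result))
```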
Martynas graphity.org
For now, Wikidata does not plan to cover scraping data automatically from the web, but only to provide a place where such data can be edited, stored, and re-published, including references. My assumption is that the community might create bots to perform such scraping and upload the results to Wikidata, but whether they want this and how it might happen is a decision for the community, once it exists.
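To make that assumption concrete, here is a sketch of what such a community bot might look like, assuming a pywikibot-style client library; this is speculative, since no such interface exists yet, and the item and property IDs below are hypothetical placeholders.

```python
import pywikibot

# Connect to the (future) Wikidata repository; assumes the client library
# grows Wikibase support, which it does not have at the time of writing.
site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

# Hypothetical item and property IDs.
item = pywikibot.ItemPage(repo, "Q12345")      # e.g. a Norwegian municipality
claim = pywikibot.Claim(repo, "P9999")         # e.g. "number of students"
claim.setTarget(pywikibot.WbQuantity(amount=5200, site=repo))
item.addClaim(claim, summary="Import from Statistics Norway")

# Attach the scraped page as a reference, so the statement carries its source.
source = pywikibot.Claim(repo, "P8888")        # e.g. "reference URL"
source.setTarget("http://www.ssb.no/fobstud/tab-2002-11-21-02.html")
claim.addSources([source])
```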
Cheers, Denny