Hi Sebastian,
This is huge! It will cover almost all currently existing German companies. Many of these will have similar names, so preparing for disambiguation is a concern.
A good approach would be to propose a property for an external identifier, load the data into Mix-n-match, create links for companies already in Wikidata, and add the rest (or perhaps only part of them - I’m not sure that having all of them in Wikidata makes sense, but that’s another discussion), preferably with location and/or sector of trade in the description field.
I’ve tried to figure out what could be used as the key for an external identifier property. However, it looks like the registry does not offer any (persistent) URL to its entries. So for looking up a company, there are apparently two options:
- conducting an extended search for the exact string “A&A Dienstleistungsgesellschaft mbH“
- copying the register number “32853”, selecting the court (Leipzig) from the corresponding dropdown list, and searching for that
Neither way is very intuitive, even if we can provide a link to the search form. This would make for a weak connection to the source of information. More importantly, it makes disambiguation in Mix-n-match difficult. This applies to the preparation of your initial load (you would not want to create duplicates), but even more so to everybody else who wants to match their data later on. Being forced to search for entries manually in a cumbersome way in order to disambiguate a new, possibly large and rich dataset is, in my eyes, not something we want to impose on future contributors. And often the free information they find in the registry (formal name, register number, legal form, address) will not easily match the information they have (common name, location, perhaps founding date, and, most importantly, sector of trade), so disambiguation may still be difficult.
Have you checked which parts of the accessible information shown below can legally be crawled and added to external databases such as Wikidata?
Cheers, Joachim
--
Joachim Neubert
ZBW – German National Library of Economics
Leibniz Information Centre for Economics
Neuer Jungfernstieg 21 20354 Hamburg
Phone +49-42834-462
From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On behalf of Sebastian Hellmann
Sent: Sunday, 15 October 2017 09:45
To: wikidata@lists.wikimedia.org
Subject: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata
Hi all,
the German business registry contains roughly 2.2 million organisations. Some of the information is paid, but some is public, i.e. the information you get by searching at the following URL and clicking on UT (see the example below):
https://www.handelsregister.de/rp_web/mask.do?Typ=e
I would like to add this to Wikidata, either by crawling or by raising money to use crowdsourcing services like CrowdFlower or Amazon Mechanical Turk.
It should meet notability criterion 2: https://www.wikidata.org/wiki/Wikidata:Notability
2. It refers to an instance of a clearly identifiable conceptual or material entity. The entity must be notable, in the sense that it can be described using serious and publicly available references. If there is no item about you yet, you are probably not notable.
The reference is the official German business registry, which is serious and public. Organisations are also, by definition, clearly identifiable legal entities.
How can I get clearance to proceed on this?
All the best, Sebastian
Entity data
Saxony District court Leipzig HRB 32853 – A&A Dienstleistungsgesellschaft mbH
Legal status: Gesellschaft mit beschränkter Haftung
Capital: 25.000,00 EUR
Date of entry: 29/08/2016 (When entering date of entry, wrong data input can occur due to system failures!)
Date of removal: -
Balance sheet available: -
Address (subject to correction): A&A Dienstleistungsgesellschaft mbH, Prager Straße 38-40, 04317 Leipzig
-- All the best, Sebastian Hellmann
Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
Hi all,
the technical challenges are not so difficult.
- 2.2 million is the exact number of German organisations, i.e. associations and companies. Each entry is also unique.
- Wikidata has 40k organisations:
https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
so there would be a maximum of 40k duplicates. These are easy to find and deduplicate.
- The crawl can be done easily; a colleague has done so before.
The issues here are:
- Do you want this data uploaded to Wikidata? It would be a really big extension. Can I go ahead?
- If the data were available externally as structured data under an open license, I would probably not suggest loading it into Wikidata, as the data could be retrieved from the official source directly. Here, however, the data will not be published in a decent format.
I thought that the way data is copied from copyrighted sources, i.e. only facts, is OK for Wikidata. This is done in a lot of places, I guess. The same goes for Wikipedia, i.e. news articles and copyrighted books are referenced. So Wikimedia or the Wikimedia community are the experts on this.
All the best,
Sebastian
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Ah yes, forgot to mention:
There is no URI or unique identifier provided by the Handelsregister system. However, the courts ensure that registrations are unique, so an identifier is implicit. The Handelsregister could easily create stable URIs from court + type + number, e.g. /Leipzig_HRB_32853
For Wikidata this is not a problem to handle. So there are no technical issues from this side either.
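As a rough illustration of the court + type + number idea, a key could be derived as in the sketch below. The function name `register_key` and the normalisation rules (underscores for spaces, upper-case register type) are assumptions made for illustration, not something the Handelsregister defines.

```python
def register_key(court: str, register_type: str, number: str) -> str:
    """Build a key such as 'Leipzig_HRB_32853' from court, type and number.

    Hypothetical helper: the normalisation rules are assumptions,
    not an official Handelsregister scheme.
    """
    court_part = "_".join(court.split())       # collapse whitespace to underscores
    type_part = register_type.strip().upper()  # e.g. "hrb" becomes "HRB"
    number_part = str(number).strip()
    if not (court_part and type_part and number_part):
        raise ValueError("court, register type and number are all required")
    return f"{court_part}_{type_part}_{number_part}"


print(register_key("Leipzig", "HRB", "32853"))
```

For the example entry in this thread, this yields Leipzig_HRB_32853.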
All the best,
Sebastian
The best way to avoid creating duplicates, then, is to look at all existing organizations in Wikidata, manually add the court and register number to those that are German, and then exclude these from the import.
That guarantees that there will be no duplicates.
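A minimal sketch of that exclusion step, assuming the crawled entries and the keys already recorded on Wikidata items are available as plain Python data. The record fields (`court`, `type`, `number`) and the helper names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def entry_key(record: Dict[str, str]) -> str:
    """Combine court, register type and number into one comparison key."""
    return "_".join(
        [
            "_".join(record["court"].split()),
            record["type"].strip().upper(),
            record["number"].strip(),
        ]
    )


def entries_to_import(
    crawled: Iterable[Dict[str, str]], existing_keys: Set[str]
) -> List[Dict[str, str]]:
    """Return crawled entries whose key is not yet present in Wikidata."""
    seen = set(existing_keys)
    new_entries = []
    for record in crawled:
        key = entry_key(record)
        if key in seen:
            continue  # already in Wikidata, or repeated within the crawl
        seen.add(key)
        new_entries.append(record)
    return new_entries
```

This only excludes entries whose court and number have actually been recorded on an existing item; anything not yet annotated would pass through the filter.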
So the technical side is feasible. Barriers are political and legal.
Sebastian
- Wikidata has 40k organisations:
https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
Hi,
I think Wikidata contains many more organizations than that. If we choose "instance of business enterprise", we get 135570 results. And I imagine there are many other classes that bring together commercial companies.
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A...
On the substance, a project to add all companies of a country would make Wikidata a kind of totally free clone of OpenCorporates https://opencorporates.com/. I would of course be delighted to see that, but is it not a challenge to maintain such a database? Companies are like humans: they appear and disappear every day.
Thanks Ettore for spotting that!
Wikidata types (P31) only make sense when you consider the "subclass of" (P279) property that we use to build the ontology (except in a few cases where the community has decided not to use any subclass for a particular type).
So, to retrieve all items of a certain type in SPARQL, you need to use something like this:
?item wdt:P31/wdt:P279* ?type
You can also have other variants to accept non-truthy statements.
Just with this truthy version, I currently get 1 208 227 items. But note that there are still a lot of items where P31 is not provided, or subclasses which have not been connected to "organization (Q43229)"…
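A minimal sketch of how the pattern above could be wrapped in a complete counting query and turned into a query-service link. The function names are hypothetical, and the use of DISTINCT is an assumption made here so that an item matched more than once is counted only once; it is not necessarily the exact query behind the figure quoted above.

```python
from urllib.parse import quote

# Query UI address; the URL-encoded query is placed after '#'.
QUERY_SERVICE = "https://query.wikidata.org/"


def build_count_query(class_qid: str, distinct: bool = True) -> str:
    """Count items whose P31 value is class_qid or one of its subclasses.

    Hypothetical helper: it only builds the query text and sends nothing.
    """
    counted = "DISTINCT ?item" if distinct else "?item"
    return (
        f"SELECT (COUNT({counted}) AS ?count) WHERE {{\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        f"}}"
    )


def query_link(query: str) -> str:
    """Build a link that opens the query in the query-service UI."""
    return QUERY_SERVICE + "#" + quote(query, safe="")


query = build_count_query("Q43229")  # organization (Q43229)
link = query_link(query)
print(query)
print(link)
```

Only truthy statements (the wdt: prefix) are matched, as in the pattern above; items without any P31 statement are not counted at all.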
So in general, it's very hard to have any "guarantees that there are no duplicates", just because you don't have any guarantees that the information currently in Wikidata is complete or correct.
I would recommend trying to import something a bit smaller to get acquainted with how Wikidata works and what the matching process looks like in practice. And beyond a one-off import, as Ettore said, it is important to think about how the data will be maintained in the future…
Antonin
On 16/10/2017 13:46, Ettore RIZZA wrote:
- Wikidata has 40k organisations (query at https://query.wikidata.org/):
  SELECT ?item ?itemLabel
  WHERE
  {
    ?item wdt:P31 wd:Q43229.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  }
Hi,
I think Wikidata contains many more organizations than that. If we choose "instance of business enterprise" (Q4830453), we get 135570 results. And I imagine there are many other categories that bring together commercial companies.
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
On the substance, the project to add all companies of a country would make Wikidata a kind of totally free clone of Open Corporates (https://opencorporates.com/). I would of course be delighted to see that, but is it not a challenge to maintain such a database? Companies are like humans: they appear and disappear every day.
2017-10-16 13:41 GMT+02:00 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>:
Hi all,
the technical challenges are not so difficult.
- 2.2 million is the exact number of German organisations, i.e. associations and companies. They are also unique.
- Wikidata has 40k organisations (query at https://query.wikidata.org/):
  SELECT ?item ?itemLabel
  WHERE
  {
    ?item wdt:P31 wd:Q43229.
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  }
  so there would be a maximum of 40k duplicates. These are easy to find and deduplicate.
- The crawl can be done easily; a colleague has done so before.
The issues here are:
- Do you want to upload the data to Wikidata? It would be a really big extension. Can I go ahead?
- If the data were available externally as structured data under an open license, I would probably not suggest loading it into Wikidata, as the data could be retrieved from the official source directly. However, here this data will not be published in a decent format.
I thought that the way data is copied from copyrighted sources, i.e. only facts, is OK for Wikidata. This is done in a lot of places, I guess. Same for Wikipedia, i.e. news articles and copyrighted books are referenced. So Wikimedia or the Wikimedia community are experts on this.
All the best,
Sebastian
And… my own count was wrong too, because I forgot to add DISTINCT in my query (if there are multiple paths from the class to "organization (Q43229)", items will appear multiple times).
So, I get 1 168 084 now. http://tinyurl.com/yaeqlsnl
It's easy to get these things wrong!
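A toy model of how the same item can be matched more than once. The data below is made up, not real Wikidata statements, and the sketch assumes only one source of duplicates: an item with several P31 values that each lead to the target class. How the query service treats multiple subclass paths is not modelled here.

```python
from collections import deque

# Hypothetical toy data, not real Wikidata statements.
SUBCLASS_OF = {  # P279: class -> direct superclasses
    "company": ["organization"],
    "publisher": ["organization"],
    "organization": [],
}
INSTANCE_OF = {  # P31: item -> classes
    "item_a": ["company", "publisher"],
    "item_b": ["company"],
}


def reaches(start: str, target: str) -> bool:
    """True if target is reachable from start via zero or more P279 steps."""
    seen, queue = {start}, deque([start])
    while queue:
        cls = queue.popleft()
        if cls == target:
            return True
        for parent in SUBCLASS_OF.get(cls, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return False


def matching_rows(target: str) -> list[str]:
    """One row per (item, P31 value) pair whose class reaches target."""
    return [
        item
        for item, classes in INSTANCE_OF.items()
        for cls in classes
        if reaches(cls, target)
    ]


rows = matching_rows("organization")
plain_count = len(rows)          # analogous to COUNT(?item)
distinct_count = len(set(rows))  # analogous to COUNT(DISTINCT ?item)
print(plain_count, distinct_count)  # prints "3 2"
```

Here item_a contributes two rows because both of its classes are subclasses of the target, so the plain count overstates the number of items by one.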
Antonin
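@Antonin: Thanks for this counting method, it seems very effective (I already knew that there were 3.6 M humans (Q5) in Wikidata).
https://query.wikidata.org/#%23compter%20le%20nombre%20d%27%C3%A9l%C3%A9ments%20appartenant%20%C3%A0%20la%20cat%C3%A9gorie%0A%23organisation%20ou%20%C3%A0%20ses%20enfants%0ASELECT%20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem%29%20AS%20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31%2Fwdt%3AP279%2a%29%20wd%3AQ5.%20%7D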
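Query-service links like the one above carry the SPARQL text URL-encoded after the '#'. The sketch below decodes such a link so the query can be read directly. The link constant is reassembled from the copies of this message in the thread, with line-wrap spaces removed, which is an assumption; the helper name is hypothetical.

```python
from urllib.parse import unquote

# Link reassembled from the thread (assumption: spaces inside it were
# introduced by line wrapping and can be dropped).
LINK = (
    "https://query.wikidata.org/#%23compter%20le%20nombre%20d%27%C3%A9l%C3%A9ments"
    "%20appartenant%20%C3%A0%20la%20cat%C3%A9gorie%0A%23organisation%20ou%20%C3%A0"
    "%20ses%20enfants%0ASELECT%20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem%29%20AS"
    "%20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31%2Fwdt%3AP279%2a%29"
    "%20wd%3AQ5.%20%7D"
)


def decode_query(link: str) -> str:
    """Return the SPARQL text stored after the first '#' of a query link."""
    _, _, fragment = link.partition("#")
    return unquote(fragment)


sparql = decode_query(LINK)
print(sparql)
```

The decoded text starts with two French comment lines, followed by a query that counts distinct items whose P31 value is human (Q5) or one of its subclasses.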
While I'm on the subject, I would like to draw attention to the Neckar project (http://event.ifi.uni-heidelberg.de/?page_id=532), which aims precisely to classify Wikidata entities into people, places and organizations. Frequently updated JSON dumps are available.
2017-10-16 16:08 GMT+02:00 Ettore RIZZA ettorerizza@gmail.com:
@Antonin : Thanks for this counting method, it seems very effective (I already knew that there were 3.6 M of humans (Q5) in Wikidata).
https://query.wikidata.org/#%23compter%20le%20nombre%20d% 27%C3%A9l%C3%A9ments%20appartenant%20%C3%A0%20la%20cat%C3%A9gorie%0A% 23organisation%20ou%20%C3%A0%20ses%20enfants%0ASELECT% 20DISTINCT%20%28COUNT%28DISTINCT%20%3Fitem%29%20AS% 20%3Fcount%29%20WHERE%20%7B%20%3Fitem%20%28wdt%3AP31% 2Fwdt%3AP279%2a%29%20wd%3AQ5.%20%7D
2017-10-16 15:34 GMT+02:00 Antonin Delpeuch (lists) < lists@antonin.delpeuch.eu>:
And… my own count was wrong too, because I forgot to add DISTINCT in my query (if there are multiple paths from the class to "organization (Q43229)", items will appear multiple times).
So, I get 1 168 084 now. http://tinyurl.com/yaeqlsnl
It's easy to get these things wrong!
Antonin
On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
Thanks Ettore for spotting that!
Wikidata types (P31) only make sense when you consider the "subclass of" (P279) property that we use to build the ontology (except in a few cases where the community has decided not to use any subclass for a particular type).
So, to retrieve all items of a certain type in SPARQL, you need to use something like this:
?item wdt:P31/wdt:P279* ?type
You can also have other variants to accept non-truthy statements.
Just with this truthy version, I currently get 1 208 227 items. But note that there are still a lot of items where P31 is not provided, or subclasses which have not been connected to "organization (Q43229)"…
So in general, it's very hard to have any "guarantees that there are no duplicates", just because you don't have any guarantees that the information currently in Wikidata is complete or correct.
I would recommend trying to import something a bit smaller to get acquainted with how Wikidata works and what the matching process looks like in practice. And beyond a one-off import, as Ettore said it is important to think how the data will be maintained in the future…
Antonin
On 16/10/2017 13:46, Ettore RIZZA wrote:
- Wikidata has 40k organisations: https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
Hi,
I think Wikidata contains many more organizations than that. If we choose the "instance of Business enterprise", we get 135570 results. And I imagine there are many other categories that bring together commercial companies.
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
On the substance, the project to add all companies of a country would make Wikidata a kind of totally free clone of Open Corporates https://opencorporates.com/. I would of course be delighted to see that, but is it not a challenge to maintain such a database? Companies are like humans: they appear and disappear every day.
2017-10-16 13:41 GMT+02:00 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>:
Hi all,
the technical challenges are not so difficult.
- 2.2 million is the exact number of German organisations, i.e. associations and companies. They are also unique.
- Wikidata has 40k organisations: https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A} so there would be a maximum of 40k duplicates. These are easy to find and deduplicate.
- The crawl can be done easily; a colleague has done so before.
The issues here are:
- Do you want to upload the data to Wikidata? It would be a really big extension. Can I go ahead?
- If the data were available externally as structured data under an open license, I would probably not suggest loading it into Wikidata, as the data could be retrieved from the official source directly. Here, however, the data will not be published in a decent format.
I thought that the way data is copied from copyrighted sources, i.e. only facts, is OK for Wikidata. This is done in a lot of places, I guess. Same for Wikipedia, i.e. news articles and copyrighted books are referenced. So Wikimedia or the Wikimedia community are experts on this.
All the best, Sebastian
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
ah, ok, sorry, I was assuming that Blazegraph would transitively resolve this automatically.
Ok, so let's divide the problem:
# Task 1:
Connect all existing organisations with the data from the Handelsregister. (No new identifiers added, we can start right now.)
Add a constraint that all German organisations should be connected to a court, i.e. to the registering organisation as well as to the ID assigned by the court.
@all: any properties I can reuse for this?
I will focus on this as it seems quite easy. We can first filter orgs by other criteria, e.g. country as a blocking key, and then string-match the rest.
# Task 2:
Add all missing identifiers for the remaining orgs in the Handelsregister. Task 2 can be rediscussed and decided once Task 1 is sufficiently finished.
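The blocking-then-matching idea from Task 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout (`id`, `name`, `block`), the placeholder Wikidata ids, the 0.9 threshold and the choice of `difflib.SequenceMatcher` from the Python standard library are all mine, not something agreed on in this thread.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def match_by_block(registry, wikidata, threshold=0.9):
    """Pair registry and Wikidata records that share a blocking key and
    whose normalized names are at least `threshold` similar."""
    # Group the Wikidata side by blocking key (e.g. country) so that each
    # registry record is only compared within its own block.
    blocks = defaultdict(list)
    for rec in wikidata:
        blocks[rec["block"]].append(rec)

    matches = []
    for reg in registry:
        for cand in blocks.get(reg["block"], []):
            score = SequenceMatcher(
                None, normalize(reg["name"]), normalize(cand["name"])
            ).ratio()
            if score >= threshold:
                matches.append((reg["id"], cand["id"], round(score, 2)))
    return matches


# Hypothetical sample records; the Q-ids are placeholders, not real items.
registry = [
    {"id": "Leipzig HRB 32853",
     "name": "A&A Dienstleistungsgesellschaft mbH", "block": "DE"},
]
wikidata = [
    {"id": "Q_EXAMPLE_1",
     "name": "A&A  Dienstleistungsgesellschaft mbH", "block": "DE"},
    {"id": "Q_EXAMPLE_2",
     "name": "A&A Dienstleistungsgesellschaft mbH", "block": "FR"},
]
matches = match_by_block(registry, wikidata)
print(matches)
```

With country alone as the blocking key the German block would still be very large, so a finer key (for example the city or the registering court) would probably be needed in practice.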
# regarding maintenance:
I find Wikidata as such very hard to maintain, as all data is eventually copied from somewhere else, but Wikipedia has the same problem. In the case of the German business register, maintenance is especially easy, as the orgs are stable and uniquely identifiable. Even the fact that a company gets shut down should still be in Wikidata, so you have historical information. I mean, you also keep the Roman Empire, the Hanse and even finished projects in Wikidata. So even if an org ceases to exist, the entry in Wikidata should stay.
# regarding Opencorporates
I have a critical opinion of Opencorporates. It appears to be open, but you actually cannot get the data. If somebody has a data dump, please forward it to me. Thanks. On top of that, I consider Opencorporates a danger to open data. It appears to push open availability of data, but then it is limited to open licenses. Usefulness is limited as there are no free dumps and no possibility to duplicate it effectively. Wikipedia and Wikidata provide dumps and an API for exactly this reason. Every time somebody wants to create an open organisation dataset with no barriers, the existence of Opencorporates is blocking this.
Cheers, Sebastian
Dear All,
it is great that we are having this discussion, but may I please suggest having it on the RfP page on Wikidata? People have already asked similar questions there, and, in my experience, on-wiki discussion will likely lead to a refined request which will accommodate all suggestions.
Cheers Yaroslav
On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann < hellmann@informatik.uni-leipzig.de> wrote:
ah, ok, sorry, I was assuming that Blazegraph would transitively resolve this automatically.
Ok, so let's divide the problem:
# Task 1:
Connect all existing organisations with the data from the handelsregister. (No new identifiers added, we can start right now)
Add a constraint that all German organisations should be connected to a court, i.e. the registering organisation as well as the id assigned by the court.
@all: any properties I can reuse for this?
I will focus on this as it seems quite easy. We can first filter orgs by other criteria, i.e. country as a blocking key and then string match the rest.
# Task 2:
Add all missing identifiers for the remaining orgs in Handelsregister. Whereas 2 can be rediscussed and decided, if 1 is finished sufficiently.
# regarding maintenance: I find Wikidata as such very hard to maintain as all data is copied from somewhere else eventually, but Wikipedia has the same problem. In the case of the German Business register, maintenance is especially easy as the orgs are stable and uniquely identifiable. Even the fact that a company gets shut down should still be in Wikidata, so you have historical information. I mean, you also keep the Roman Empire, the Hanse and even finished projects in Wikidata. So even if an org ceases to exist, the entry in Wikidata should stay.
# regarding Opencorporates I have a critical opinion with Opencorporates. It appears to be open, but you actually can not get the data. If somebody has a data dump, please forward to me. Thanks. More on top, I consider Opencorporates a danger to open data. It appears to push open availability of data, but then it is limited to open licenses. Usefulness is limited as there are no free dumps and no possibility to duplicate it effectlively. Wikipedia and Wikidata provide dumps and an API for exactly this reason. Everytime somebody wants to create an open organisation dataset with no barriers, the existence of Opencorporates is blocking this.
Cheers, Sebastian
On 16.10.2017 15:34, Antonin Delpeuch (lists) wrote:
And… my own count was wrong too, because I forgot to add DISTINCT in my query (if there are multiple paths from the class to "organization (Q43229)", items will appear multiple times).
So, I get 1 168 084 now.http://tinyurl.com/yaeqlsnl
It's easy to get these things wrong!
Antonin
On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
Thanks Ettore for spotting that!
Wikidata types (P31) only make sense when you consider the "subclass of" (P279) property that we use to build the ontology (except in a few cases where the community has decided not to use any subclass for a particular type).
So, to retrieve all items of a certain type in SPARQL, you need to use something like this:
?item wdt:P31/wdt:P279* ?type
You can also have other variants to accept non-truthy statements.
Just with this truthy version, I currently get 1 208 227 items. But note that there are still a lot of items where P31 is not provided, or subclasses which have not been connected to "organization (Q43229)"…
So in general, it's very hard to have any "guarantees that there are no duplicates", just because you don't have any guarantees that the information currently in Wikidata is complete or correct.
I would recommend trying to import something a bit smaller to get acquainted with how Wikidata works and what the matching process looks like in practice. And beyond a one-off import, as Ettore said it is important to think how the data will be maintained in the future…
Antonin
On 16/10/2017 13:46, Ettore RIZZA wrote:
- Wikidata has 40k organisations: https://query.wikidata.org/#SELECT <https://query.wikidata.org/#SELECT> <https://query.wikidata.org/#SELECT> %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
Hi,
I think Wikidata contains many more organizations than that. If we choose the "instance of Business enterprise", we get 135570 results. And I imagine there are many other categories that bring together commercial companies.
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A...
On the substance, the project to add all companies of a country would make Wikidata a kind of totally free clone of Open Corporateshttps://opencorporates.com/ https://opencorporates.com/. I would of course be delighted to see that, but is it not a challenge to maintain such a database? Companies are like humans, it appears and disappears every day.
2017-10-16 13:41 GMT+02:00 Sebastian Hellmann <hellmann@informatik.uni-leipzig.demailto:hellmann@informatik.uni-leipzig.de hellmann@informatik.uni-leipzig.de>:
Hi all, the technical challenges are not so difficult. - 2.2 million are the exact number of German organisations, i.e. associations and companies. They are also unique. - Wikidata has 40k organisations: https://query.wikidata.org/#SELECT <https://query.wikidata.org/#SELECT> <https://query.wikidata.org/#SELECT> %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A} so there would be a maximum of 40k duplicates These are easy to find and deduplicate - The crawl can be done easily, a colleague has done so before. The issues here are: - Do you want to upload the data in Wikidata? It would be a real big extension. Can I go ahead - If the data were available externally as structured data under open license, I would probably not suggest loading it into wikidata, as the data can be retrieved from the official source directly, however, here this data will not be published in a decent format. I thought that the way data is copied from coyrighted sources, i.e. only facts is ok for wikidata. This done in a lot of places, I guess. Same for Wikipedia, i.e. News articles and copyrighted books are referenced. So Wikimedia or the Wikimedia community are experts on this. All the best, Sebastian On 16.10.2017 10:18, Neubert, Joachim wrote: Hi Sebastian,____ __ __ This is huge! It will cover almost all currently existing German companies. 
Many of these will have similar names, so preparing for disambiguation is a concern.____ __ __ A good way for such an approach would be proposing a property for an external identifier, loading the data into Mix-n-match, creating links for companies already in Wikidata, and adding the rest (or perhaps only parts of them - I’m not sure if having all of them in Wikidata makes sense, but that’s another discussion), preferably with location and/or sector of trade in the description field.____ __ __ I’ve tried to figure out what could be used as key for a external identifier property. However, it looks like the registry does not offer any (persistent) URL to its entries. So for looking up a company, apparently there are two options:____ __ __ - conducting an extended search for the exact string “A&A Dienstleistungsgesellschaft mbH“____ - copying the register number “32853” plus selecting the court (Leipzig) from the according dropdown list and search that____ __ __ Both ways are not very intuitive, even if we can provide a link to the search form. This would make a weak connection to the source of information. Much more important, it makes disambiguation in Mix-n-match difficult. This applies for the preparation of your initial load (you would not want to create duplicates). But much more so for everybody else who wants to match his or her data later on. Being forced to search for entries manually in a cumbersome way for disambiguation of a new, possibly large and rich dataset is, in my eyes, not something we want to impose on future contributors. 
-- All the best, Sebastian Hellmann
Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
Hi Yaroslav,
in addition to this list, I added it here:
https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Handelsr...
and here:
https://www.wikidata.org/wiki/Wikidata:Project_chat#Handelsregister
but I received more and longer answers on this list.
All the best,
Sebastian
On 16.10.2017 18:06, Yaroslav Blanter wrote:
Dear All,
it is great that we are having this discussion, but may I please suggest having it on the RfP page on Wikidata? People have already asked similar questions there, and, in my experience, on-wiki discussion will likely lead to a refined request which will accommodate all suggestions.
Cheers Yaroslav
On Mon, Oct 16, 2017 at 5:53 PM, Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:
ah, ok, sorry, I was assuming that Blazegraph would transitively resolve this automatically. Ok, so let's divide the problem:
# Task 1: Connect all existing organisations with the data from the Handelsregister. (No new identifiers added, we can start right now.)
Add a constraint that all German organisations should be connected to a court, i.e. the registering organisation, as well as the id assigned by the court. @all: any properties I can reuse for this?
I will focus on this as it seems quite easy. We can first filter orgs by other criteria, i.e. country as a blocking key, and then string match the rest.
# Task 2: Add all missing identifiers for the remaining orgs in the Handelsregister.
Task 2 can be rediscussed and decided once Task 1 is sufficiently finished.
# regarding maintenance:
I find Wikidata as such very hard to maintain, as all data is eventually copied from somewhere else, but Wikipedia has the same problem. In the case of the German business register, maintenance is especially easy as the orgs are stable and uniquely identifiable. Even the fact that a company gets shut down should still be in Wikidata, so you have historical information. I mean, you also keep the Roman Empire, the Hanse and even finished projects in Wikidata. So even if an org ceases to exist, the entry in Wikidata should stay.
# regarding Opencorporates
I have a critical opinion of Opencorporates. It appears to be open, but you actually cannot get the data. If somebody has a data dump, please forward it to me. Thanks.
On top of that, I consider Opencorporates a danger to open data. It appears to push open availability of data, but then it is limited to open licenses. Usefulness is limited as there are no free dumps and no possibility to duplicate it effectively. Wikipedia and Wikidata provide dumps and an API for exactly this reason. Every time somebody wants to create an open organisation dataset with no barriers, the existence of Opencorporates is blocking this.
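The matching step described under Task 1 (a blocking key first, then string matching within each block) could look roughly like the sketch below. It is only an illustration under assumptions: the record layout, the field names "id", "name" and "block", the 0.9 similarity threshold and the placeholder Q-ids are all made up here, and difflib's SequenceMatcher is just one possible string matcher, not one named in this thread.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case and collapse whitespace so trivial differences do not prevent a match."""
    return " ".join(name.lower().split())


def match_by_block(registry_rows, wikidata_rows, threshold=0.9):
    """Pair registry entries with Wikidata items that share a blocking key.

    Both inputs are lists of dicts with the hypothetical keys "id", "name"
    and "block" (e.g. a country code). Only rows in the same block are
    compared, which keeps the number of string comparisons manageable.
    """
    blocks = defaultdict(list)
    for row in wikidata_rows:
        blocks[row["block"]].append(row)

    matches = []
    for reg in registry_rows:
        best, best_score = None, 0.0
        for item in blocks.get(reg["block"], []):
            score = SequenceMatcher(
                None, normalise(reg["name"]), normalise(item["name"])
            ).ratio()
            if score > best_score:
                best, best_score = item, score
        # Keep only the best candidate, and only if it is similar enough.
        if best is not None and best_score >= threshold:
            matches.append((reg["id"], best["id"], round(best_score, 2)))
    return matches


# Made-up sample data; the Q-ids are placeholders, not real items.
registry = [
    {"id": "Leipzig HRB 32853", "name": "A&A Dienstleistungsgesellschaft mbH", "block": "DE"},
]
wikidata = [
    {"id": "Q_PLACEHOLDER_1", "name": "A&A  Dienstleistungsgesellschaft mbH", "block": "DE"},
    {"id": "Q_PLACEHOLDER_2", "name": "A&A Dienstleistungsgesellschaft mbH", "block": "AT"},
]
print(match_by_block(registry, wikidata))
```

Candidates below the threshold are dropped rather than guessed at, so they would still need manual review, for example in Mix-n-match.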
Cheers, Sebastian
On 16.10.2017 15:34, Antonin Delpeuch (lists) wrote:
And… my own count was wrong too, because I forgot to add DISTINCT in my query (if there are multiple paths from the class to "organization (Q43229)", items will appear multiple times). So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl
It's easy to get these things wrong!
Antonin
On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
Thanks Ettore for spotting that!
Wikidata types (P31) only make sense when you consider the "subclass of" (P279) property that we use to build the ontology (except in a few cases where the community has decided not to use any subclass for a particular type). So, to retrieve all items of a certain type in SPARQL, you need to use something like this:
    ?item wdt:P31/wdt:P279* ?type
You can also have other variants to accept non-truthy statements. Just with this truthy version, I currently get 1 208 227 items. But note that there are still a lot of items where P31 is not provided, or subclasses which have not been connected to "organization (Q43229)"…
So in general, it's very hard to have any "guarantees that there are no duplicates", just because you don't have any guarantees that the information currently in Wikidata is complete or correct.
I would recommend trying to import something a bit smaller to get acquainted with how Wikidata works and what the matching process looks like in practice. And beyond a one-off import, as Ettore said, it is important to think about how the data will be maintained in the future…
Antonin
On 16/10/2017 13:46, Ettore RIZZA wrote:
- Wikidata has 40k organisations:
https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
Hi,
I think Wikidata contains many more organizations than that. If we choose the "instance of Business enterprise", we get 135570 results. And I imagine there are many other categories that bring together commercial companies.
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
On the substance, the project to add all companies of a country would make Wikidata a kind of totally free clone of Open Corporates <https://opencorporates.com/>. I would of course be delighted to see that, but is it not a challenge to maintain such a database? Companies are like humans: they appear and disappear every day.
2017-10-16 13:41 GMT+02:00 Sebastian Hellmann <hellmann@informatik.uni-leipzig.de>:
Hi all,
the technical challenges are not so difficult.
- 2.2 million is the exact number of German organisations, i.e. associations and companies. They are also unique.
- Wikidata has 40k organisations:
https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel { bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}
so there would be a maximum of 40k duplicates. These are easy to find and deduplicate.
- The crawl can be done easily; a colleague has done so before.
The issues here are:
- Do you want to upload the data to Wikidata? It would be a really big extension. Can I go ahead?
- If the data were available externally as structured data under an open license, I would probably not suggest loading it into Wikidata, as the data could be retrieved from the official source directly. Here, however, the data will not be published in a decent format.
I thought that the way data is copied from copyrighted sources, i.e. only facts, is ok for Wikidata. This is done in a lot of places, I guess. The same goes for Wikipedia, i.e. news articles and copyrighted books are referenced. So Wikimedia or the Wikimedia community are experts on this.
All the best, Sebastian
On 16.10.2017 10:18, Neubert, Joachim wrote:
Hi Sebastian,
This is huge! It will cover almost all currently existing German companies. Many of these will have similar names, so preparing for disambiguation is a concern.
A good way for such an approach would be proposing a property for an external identifier, loading the data into Mix-n-match, creating links for companies already in Wikidata, and adding the rest (or perhaps only parts of them - I’m not sure if having all of them in Wikidata makes sense, but that’s another discussion), preferably with location and/or sector of trade in the description field.
I’ve tried to figure out what could be used as key for an external identifier property. However, it looks like the registry does not offer any (persistent) URL to its entries. So for looking up a company, apparently there are two options:
- conducting an extended search for the exact string “A&A Dienstleistungsgesellschaft mbH“
- copying the register number “32853” plus selecting the court (Leipzig) from the corresponding dropdown list and searching for that
Both ways are not very intuitive, even if we can provide a link to the search form. This would make a weak connection to the source of information. Much more important, it makes disambiguation in Mix-n-match difficult. This applies to the preparation of your initial load (you would not want to create duplicates), but much more so to everybody else who wants to match his or her data later on. Being forced to search for entries manually in a cumbersome way for disambiguation of a new, possibly large and rich dataset is, in my eyes, not something we want to impose on future contributors. And often, the free information they find in the registry (formal name, register number, legal form, address) will not easily match the information they have (common name, location, perhaps founding date, and most important, sector of trade), so disambiguation may still be difficult.
Have you checked which parts of the accessible information as below can be crawled and added legally to external databases such as Wikidata?
Cheers, Joachim
--
Joachim Neubert
ZBW – German National Library of Economics
Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-42834-462
From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org] On behalf of Sebastian Hellmann
Sent: Sunday, 15 October 2017 09:45
To: wikidata@lists.wikimedia.org
Subject: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata
Hi all,
the German business registry contains roughly 2.2 million organisations. Some information is paid, but other information is public, i.e. the info you get when searching at the following URL and clicking on UT (see example below):
https://www.handelsregister.de/rp_web/mask.do?Typ=e
I would like to add this to Wikidata, either by crawling or by raising money to use crowdsourcing concepts like CrowdFlower or Amazon Mechanical Turk.
It should meet notability criterion 2: https://www.wikidata.org/wiki/Wikidata:Notability
2. It refers to an instance of a *clearly identifiable conceptual or material entity*. The entity must be notable, in the sense that it *can be described using serious and publicly available references*. If there is no item about you yet, you are probably not notable.
The reference is the official German business registry, which is serious and public. Orgs are also by definition clearly identifiable legal entities.
How can I get clearance to proceed on this?
All the best, Sebastian
Entity data
Saxony District court *Leipzig HRB 32853* – A&A Dienstleistungsgesellschaft mbH
Legal status: Gesellschaft mit beschränkter Haftung
Capital: 25.000,00 EUR
Date of entry: 29/08/2016 (When entering date of entry, wrong data input can occur due to system failures!)
Date of removal: -
Balance sheet available: -
Address (subject to correction): A&A Dienstleistungsgesellschaft mbH, Prager Straße 38-40, 04317 Leipzig
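Since the registry apparently offers no persistent URL per entry, any identifier would have to be assembled from the court and the register number, as in the entity data above. The following is a minimal sketch of that idea, not an official scheme: the regular expression, the list of register types (only HRB appears in this thread) and the "<court> <type> <number>" key layout are all assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class RegisterKey(NamedTuple):
    court: str
    register_type: str
    number: str

    def as_string(self) -> str:
        # Hypothetical key layout: "<court> <type> <number>"
        return f"{self.court} {self.register_type} {self.number}"


# Assumed register types; the example in this thread only shows HRB.
_PATTERN = re.compile(
    r"District court\s+(?P<court>.+?)\s+(?P<type>HRA|HRB|GnR|PR|VR)\s+(?P<number>\d+)"
)


def parse_register_key(header: str) -> Optional[RegisterKey]:
    """Extract court, register type and number from an entity-data header line."""
    cleaned = header.replace("*", " ")  # drop the emphasis markers of the mail archive
    match = _PATTERN.search(cleaned)
    if match is None:
        return None
    return RegisterKey(match["court"].strip(), match["type"], match["number"])


key = parse_register_key(
    "Saxony District court *Leipzig HRB 32853 * – A&A Dienstleistungsgesellschaft mbH"
)
print(key.as_string() if key else "no key found")
```

A key of this kind only identifies the entry; looking it up still requires going through the search form by hand.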
Ok, I put some effort into https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Handelsr... to move the discussion there.
All the best,
Sebastian