Thanks Ettore for spotting that!
Wikidata types (P31) only make sense when you consider the "subclass of"
(P279) property that we use to build the ontology (except in a few cases
where the community has decided not to use any subclass for a particular
type).
So, to retrieve all items of a certain type in SPARQL, you need to use
something like this:
    ?item wdt:P31/wdt:P279* ?type
You can also have other variants to accept non-truthy statements.
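A minimal sketch of the two variants, in Python purely for illustration. The helper name instance_of_pattern is hypothetical and not part of any Wikidata tooling. The non-truthy form assumes the standard p:/ps: statement prefixes of the query service, which also match P31 statements that are not best-rank.

```python
# Sketch only: instance_of_pattern is a hypothetical helper.

def instance_of_pattern(type_qid: str, truthy: bool = True) -> str:
    """Return a SPARQL triple pattern matching items whose type is
    type_qid or any transitive subclass of it."""
    if truthy:
        # wdt: follows only "truthy" (best-rank) P31 statements.
        path = "wdt:P31/wdt:P279*"
    else:
        # p:/ps: goes through the statement node, so P31 statements
        # of any rank are accepted.
        path = "p:P31/ps:P31/wdt:P279*"
    return f"?item {path} wd:{type_qid} ."


print(instance_of_pattern("Q43229"))
print(instance_of_pattern("Q43229", truthy=False))
```

In both variants the subclass hierarchy is still walked with truthy P279 statements; only the handling of P31 changes.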
Just with this truthy version, I currently get 1 208 227 items. But note
that there are still a lot of items where P31 is not provided, or
subclasses which have not been connected to "organization (Q43229)"…
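A count like this can be reproduced with a query along the following lines. This is a sketch: the function names are made up for illustration, and it assumes the query service reads the percent-encoded query from the URL fragment, as the links elsewhere in this thread do. The number returned will differ from the figure above as Wikidata changes.

```python
from urllib.parse import quote


def count_query(type_qid: str) -> str:
    """SPARQL that counts items typed as type_qid or one of its subclasses."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{type_qid} .\n"
        "}"
    )


def query_service_link(sparql: str) -> str:
    """Build a link that opens the query in the query.wikidata.org UI."""
    # safe="" percent-encodes every reserved character, including "?" and ":".
    return "https://query.wikidata.org/#" + quote(sparql, safe="")


link = query_service_link(count_query("Q43229"))
print(link)
```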
So in general, it's very hard to have any "guarantees that there are no
duplicates", just because you don't have any guarantees that the
information currently in Wikidata is complete or correct.
I would recommend trying to import something a bit smaller to get
acquainted with how Wikidata works and what the matching process looks
like in practice. And beyond a one-off import, as Ettore said, it is
important to think about how the data will be maintained in the future…
Antonin
On 16/10/2017 13:46, Ettore RIZZA wrote:
> - Wikidata has 40k organisations:
> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%0AWHERE%20%0A%7B%0A%20%3Fitem%20wdt%3AP31%20wd%3AQ43229.%0A%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
Hi,
I think Wikidata contains many more organizations than that. If we
choose the "instance of Business enterprise", we get 135570 results. And
I imagine there are many other categories that bring together commercial
companies.
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
On the substance, the project to add all companies of a country would
make Wikidata a kind of totally free clone of Open Corporates
<https://opencorporates.com/>. I would of course be delighted to see
that, but is it not a challenge to maintain such a database? Companies
are like humans: they appear and disappear every day.
2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
<hellmann(a)informatik.uni-leipzig.de>:
Hi all,
the technical challenges are not so difficult.
- 2.2 million is the exact number of German organisations, i.e.
associations and companies. They are also unique.
- Wikidata has 40k organisations:
https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%0AWHERE%20%0A%7B%0A%20%3Fitem%20wdt%3AP31%20wd%3AQ43229.%0A%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
so there would be a maximum of 40k duplicates. These are easy to find
and deduplicate.
- The crawl can be done easily, a colleague has done so before.
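As a rough illustration of the deduplication step mentioned in the list above, here is a minimal sketch of name-based candidate matching. Everything in it is an assumption: the helper names, the short LEGAL_FORMS list, and the idea of matching on normalised names alone. It only proposes candidates; similar or identical company names would still need manual review or a stronger key such as court plus register number.

```python
import re
import unicodedata

# Hypothetical, deliberately short list of German legal-form tokens.
LEGAL_FORMS = {"gmbh", "mbh", "ag", "kg", "ohg", "ug"}


def normalize_name(name: str) -> str:
    """Casefold, strip accents and punctuation, drop legal-form tokens."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text.casefold())
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def candidate_duplicates(registry_names, wikidata_labels):
    """Map each registry name to Wikidata labels with the same normalised form."""
    index = {}
    for label in wikidata_labels:
        key = normalize_name(label)
        if key:  # skip labels that consist only of legal-form tokens
            index.setdefault(key, []).append(label)
    matches = {}
    for name in registry_names:
        key = normalize_name(name)
        if key in index:
            matches[name] = index[key]
    return matches


matches = candidate_duplicates(
    ["A&A Dienstleistungsgesellschaft mbH"],
    ["A & A Dienstleistungsgesellschaft mbH", "Other AG"],
)
print(matches)
```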
The issues here are:
- Do you want to upload the data into Wikidata? It would be a really big
extension. Can I go ahead?
- If the data were available externally as structured data under an
open license, I would probably not suggest loading it into Wikidata,
as the data could be retrieved from the official source directly.
However, here this data will not be published in a decent format.
I thought that the way data is copied from copyrighted sources, i.e.
only facts, is OK for Wikidata. This is done in a lot of places, I
guess. Same for Wikipedia, i.e. news articles and copyrighted books
are referenced. So Wikimedia or the Wikimedia community are experts
on this.
All the best,
Sebastian
On 16.10.2017 10:18, Neubert, Joachim wrote:
> Hi Sebastian,
>
> This is huge! It will cover almost all currently existing German
> companies. Many of these will have similar names, so preparing for
> disambiguation is a concern.
>
> A good way for such an approach would be proposing a property for
> an external identifier, loading the data into Mix-n-match,
> creating links for companies already in Wikidata, and adding the
> rest (or perhaps only parts of them - I’m not sure if having all
> of them in Wikidata makes sense, but that’s another discussion),
> preferably with location and/or sector of trade in the description
> field.
>
> I’ve tried to figure out what could be used as key for an external
> identifier property. However, it looks like the registry does not
> offer any (persistent) URL to its entries. So for looking up a
> company, apparently there are two options:
>
> - conducting an extended search for the exact string “A&A
> Dienstleistungsgesellschaft mbH”
>
> - copying the register number “32853”, selecting the court
> (Leipzig) from the corresponding dropdown list, and searching for that
>
> Both ways are not very intuitive, even if we can provide a link to
> the search form. This would make a weak connection to the source
> of information. Much more important, it makes disambiguation in
> Mix-n-match difficult. This applies to the preparation of your
> initial load (you would not want to create duplicates). But much
> more so for everybody else who wants to match his or her data
> later on. Being forced to search for entries manually in a
> cumbersome way for disambiguation of a new, possibly large and
> rich dataset is, in my eyes, not something we want to impose on
> future contributors. And often, the free information they find in
> the registry (formal name, register number, legal form, address)
> will not easily match with the information they have (common name,
> location, perhaps founding date, and most important sector of
> trade), so disambiguation may still be difficult.
>
> Have you checked which parts of the accessible information as
> below can be crawled and added legally to external databases such
> as Wikidata?
>
> Cheers, Joachim
>
> --
> Joachim Neubert
>
> ZBW – German National Library of Economics
> Leibniz Information Centre for Economics
> Neuer Jungfernstieg 21
> 20354 Hamburg
> Phone +49-42834-462
>
> *From:* Wikidata [mailto:wikidata-bounces@lists.wikimedia.org]
> *On behalf of* Sebastian Hellmann
> *Sent:* Sunday, 15 October 2017 09:45
> *To:* wikidata@lists.wikimedia.org
> *Subject:* [Wikidata] Kickstartet: Adding 2.2 million German
> organisations to Wikidata
>
> Hi all,
>
> the German business registry contains roughly 2.2 million
> organisations. Some information is paid, but some is public, i.e.
> the info you get when searching at the URL below and clicking on UT
> (see example below):
>
> https://www.handelsregister.de/rp_web/mask.do?Typ=e
>
> I would like to add this to Wikidata, either by crawling or by
> raising money to use crowdsourcing platforms like CrowdFlower or
> Amazon Mechanical Turk.
>
> It should meet notability criterion 2:
> https://www.wikidata.org/wiki/Wikidata:Notability
>
> 2. It refers to an instance of a *clearly identifiable
> conceptual or material entity*. The entity must be notable, in
> the sense that it *can be described using serious and publicly
> available references*. If there is no item about you yet, you
> are probably not notable.
>
>
> The reference is the official German business registry, which is
> serious and public. Orgs are also by definition clearly
> identifiable legal entities.
>
> How can I get clearance to proceed on this?
>
> All the best,
> Sebastian
>
> Entity data
>
> Saxony District court *Leipzig HRB 32853* – A&A
> Dienstleistungsgesellschaft mbH
>
> Legal status:             Gesellschaft mit beschränkter Haftung
> Capital:                  25.000,00 EUR
> Date of entry:            29/08/2016
>                           (When entering date of entry, wrong data
>                           input can occur due to system failures!)
> Date of removal:          -
> Balance sheet available:  -
> Address (subject to correction):
>                           A&A Dienstleistungsgesellschaft mbH
>                           Prager Straße 38-40
>                           04317 Leipzig
>
> --
> All the best,
> Sebastian Hellmann
>
> Director of Knowledge Integration and Linked Data Technologies
> (KILT) Competence Center
> at the Institute for Applied Informatics (InfAI) at Leipzig University
> Executive Director of the DBpedia Association
> Projects: http://dbpedia.org, http://nlp2rdf.org,
> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
> Homepage: http://aksw.org/SebastianHellmann
> Research Group: http://aksw.org
>
>
> _______________________________________________
> Wikidata mailing list
> Wikidata(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata