Re: [Wikidata] Kickstartet: Adding 2.2 million German organisations to Wikidata

16 Oct 2017

And… my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).

So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl
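
To illustrate how a count without DISTINCT can overshoot, here is a minimal Python sketch on toy data. It is a simplified model and not how the query service evaluates anything: it assumes a row is produced once per P31 class that reaches the target through the subclass chain. All item and class names are hypothetical.

```python
# Toy data with hypothetical names; none of this is real Wikidata content.
SUBCLASS_OF = {
    "bank": {"company"},
    "company": {"organization"},
    "association": {"organization"},
    "organization": set(),
    "city": set(),
}
INSTANCE_OF = {
    "item_a": ["bank"],
    "item_b": ["company", "association"],  # two classes, both reach the target
    "item_c": ["city"],                    # unrelated class
}


def reaches(cls, target, seen=None):
    """True if `target` is `cls` itself or reachable via subclass links."""
    seen = set() if seen is None else seen
    if cls == target:
        return True
    if cls in seen:
        return False  # already explored, also guards against cycles
    seen.add(cls)
    return any(reaches(parent, target, seen) for parent in SUBCLASS_OF.get(cls, ()))


def matching_rows(target):
    """One row per (item, class) pair whose class reaches `target`."""
    return [
        item
        for item, classes in INSTANCE_OF.items()
        for cls in classes
        if reaches(cls, target)
    ]


rows = matching_rows("organization")
print(len(rows))       # row count, as without DISTINCT
print(len(set(rows)))  # item count, as with DISTINCT
```

In this toy graph item_b is counted twice in the first number and once in the second, which is the kind of gap DISTINCT closes.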

It's easy to get these things wrong!

Antonin

On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
...
  Thanks Ettore for spotting that!

 Wikidata types (P31) only make sense when you consider the "subclass of"
 (P279) property that we use to build the ontology (except in a few cases
 where the community has decided not to use any subclass for a particular
 type).

 So, to retrieve all items of a certain type in SPARQL, you need to use
 something like this:

 ?item wdt:P31/wdt:P279* ?type

 You can also have other variants to accept non-truthy statements.
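
As a rough sketch, the pattern above can be wrapped into a counting query and turned into a query service link from Python. The helper names are hypothetical, the query uses COUNT(DISTINCT ...) so that an item matched several times is counted once, and the link format simply follows the https://query.wikidata.org/# links used in this thread. Nothing here contacts the endpoint.

```python
from urllib.parse import quote


def count_query(type_qid: str) -> str:
    """Build a SPARQL query counting distinct items of a type or its subclasses."""
    return (
        "SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {\n"
        f"  ?item wdt:P31/wdt:P279* wd:{type_qid} .\n"
        "}"
    )


def query_service_link(sparql: str) -> str:
    """Percent-encode the query and append it to the query service URL fragment."""
    return "https://query.wikidata.org/#" + quote(sparql, safe="")


query = count_query("Q43229")  # Q43229 is "organization", as in the thread
print(query)
print(query_service_link(query))
```

The resulting count still only covers items that carry a P31 statement whose class is connected to the target.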

 Just with this truthy version, I currently get 1 208 227 items. But note
 that there are still a lot of items where P31 is not provided, or
 subclasses which have not been connected to "organization (Q43229)"…

 So in general, it's very hard to have any "guarantees that there are no
 duplicates", just because you don't have any guarantees that the
 information currently in Wikidata is complete or correct.

 I would recommend trying to import something a bit smaller to get
 acquainted with how Wikidata works and what the matching process looks
 like in practice. And beyond a one-off import, as Ettore said, it is
 important to think about how the data will be maintained in the future…

 Antonin

 On 16/10/2017 13:46, Ettore RIZZA wrote:
      - Wikidata has 40k organisations: 

     https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
     %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
     bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}

 Hi, 

 I think Wikidata contains many more organizations than that. If we
 choose the "instance of Business enterprise", we get 135570 results. And
 I imagine there are many other categories that bring together commercial
 companies.

https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0…

 On the substance, the project to add all companies of a country would
 make Wikidata a kind of totally free clone of Open Corporates
 <https://opencorporates.com/>. I would of course be delighted to see
 that, but is it not a challenge to maintain such a database? Companies
 are like humans: they appear and disappear every day.

 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
 <hellmann(a)informatik.uni-leipzig.de>:

     Hi all,

     the technical challenges are not so difficult.

     - 2.2 million is the exact number of German organisations, i.e.
     associations and companies. They are also unique.

     - Wikidata has 40k organisations:

     https://query.wikidata.org/#SELECT %3Fitem %3FitemLabel %0AWHERE
     %0A{%0A %3Fitem wdt%3AP31 wd%3AQ43229.%0A SERVICE wikibase%3Alabel {
     bd%3AserviceParam wikibase%3Alanguage "[AUTO_LANGUAGE]%2Cen". }%0A}

     so there would be a maximum of 40k duplicates. These are easy to find
     and deduplicate.

     - The crawl can be done easily, a colleague has done so before.  

     The issues here are:

     - Do you want to upload the data to Wikidata? It would be a really big
     extension. Can I go ahead?

     - If the data were available externally as structured data under an
     open license, I would probably not suggest loading it into Wikidata,
     as the data could be retrieved from the official source directly.
     Here, however, the data will not be published in a decent format.

     I thought that the way data is copied from copyrighted sources, i.e.
     only facts, is OK for Wikidata. This is done in a lot of places, I
     guess. The same goes for Wikipedia, where news articles and copyrighted
     books are referenced. So Wikimedia or the Wikimedia community are
     experts on this.

     All the best,

     Sebastian

     On 16.10.2017 10:18, Neubert, Joachim wrote:

     Hi Sebastian,

     This is huge! It will cover almost all currently existing German
     companies. Many of these will have similar names, so preparing for
     disambiguation is a concern.

     A good way for such an approach would be proposing a property for
     an external identifier, loading the data into Mix-n-match,
     creating links for companies already in Wikidata, and adding the
     rest (or perhaps only parts of them - I’m not sure if having all
     of them in Wikidata makes sense, but that’s another discussion),
     preferably with location and/or sector of trade in the description
     field.

     I’ve tried to figure out what could be used as the key for an external
     identifier property. However, it looks like the registry does not
     offer any (persistent) URL to its entries. So for looking up a
     company, apparently there are two options:

     - conducting an extended search for the exact string “A&A
       Dienstleistungsgesellschaft mbH“
     - copying the register number “32853”, selecting the court
       (Leipzig) from the corresponding dropdown list, and searching for that
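
The second lookup option suggests a composite key made of court, register type, and register number. Below is a minimal Python sketch of such a key, using only the example values from this thread. The class and function names are hypothetical, and whether this combination is unique or stable enough to serve as an external identifier is an assumption that would need checking against the registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterKey:
    """Hypothetical composite key: court + register type + register number."""
    court: str
    register_type: str
    number: str

    def __str__(self) -> str:
        return f"{self.court} {self.register_type} {self.number}"


def parse_register_key(text: str) -> RegisterKey:
    """Parse a string such as 'Leipzig HRB 32853'.

    The last two words are taken as register type and number; everything
    before them is treated as the court name, which may contain spaces.
    """
    parts = text.split()
    if len(parts) < 3:
        raise ValueError(f"expected 'court type number', got {text!r}")
    *court_words, register_type, number = parts
    return RegisterKey(" ".join(court_words), register_type, number)


key = parse_register_key("Leipzig HRB 32853")
print(key)
```

Because the dataclass is frozen, keys are hashable and can be used directly in a set or dict when checking a batch for duplicates.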

     Neither way is very intuitive, even if we can provide a link to
     the search form. This would make a weak connection to the source
     of information. Much more importantly, it makes disambiguation in
     Mix-n-match difficult. This applies to the preparation of your
     initial load (you would not want to create duplicates), but even
     more so to everybody else who wants to match his or her data
     later on. Being forced to search for entries manually in a
     cumbersome way for disambiguation of a new, possibly large and
     rich dataset is, in my eyes, not something we want to impose on
     future contributors. And often, the free information they find in
     the registry (formal name, register number, legal form, address)
     will not easily match the information they have (common name,
     location, perhaps founding date, and most importantly sector of
     trade), so disambiguation may still be difficult.

     Have you checked which parts of the accessible information as
     below can be crawled and added legally to external databases such
     as Wikidata?

     Cheers, Joachim

     --
     Joachim Neubert
     ZBW – German National Library of Economics
     Leibniz Information Centre for Economics
     Neuer Jungfernstieg 21
     20354 Hamburg
     Phone +49-42834-462

     From: Wikidata [mailto:wikidata-bounces@lists.wikimedia.org]
     On behalf of Sebastian Hellmann
     Sent: Sunday, 15 October 2017 09:45
     To: wikidata(a)lists.wikimedia.org
     Subject: [Wikidata] Kickstartet: Adding 2.2 million German
     organisations to Wikidata

     Hi all,

     the German business registry contains roughly 2.2 million
     organisations. Some of the information is paid, but some is public,
     i.e. the information you get by searching at the link below and
     clicking on UT (see example below):

     https://www.handelsregister.de/rp_web/mask.do?Typ=e

     I would like to add this to Wikidata, either by crawling or by
     raising money to use crowdsourcing services like CrowdFlower or
     Amazon Mechanical Turk.

     It should meet notability criterion 2:
     https://www.wikidata.org/wiki/Wikidata:Notability

         2. It refers to an instance of a *clearly identifiable
         conceptual or material entity*. The entity must be notable, in
         the sense that it *can be described using serious and publicly
         available references*. If there is no item about you yet, you
         are probably not notable.

     The reference is the official German business registry, which is
     serious and public. Organisations are also by definition clearly
     identifiable legal entities.

     How can I get clearance to proceed on this?

     All the best,
     Sebastian

     Entity data

     Saxony District court Leipzig HRB 32853 – A&A
     Dienstleistungsgesellschaft mbH

     Legal status: Gesellschaft mit beschränkter Haftung
     Capital: 25.000,00 EUR
     Date of entry: 29/08/2016
     (When entering date of entry, wrong data input can occur due to
     system failures!)
     Date of removal: -
     Balance sheet available: -
     Address (subject to correction):
     A&A Dienstleistungsgesellschaft mbH
     Prager Straße 38-40
     04317 Leipzig

     -- 
     All the best,
     Sebastian Hellmann

     Director of Knowledge Integration and Linked Data Technologies
     (KILT) Competence Center
     at the Institute for Applied Informatics (InfAI) at Leipzig University
     Executive Director of the DBpedia Association
     Projects: http://dbpedia.org, http://nlp2rdf.org,
     http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
     Homepage: http://aksw.org/SebastianHellmann
     Research Group: http://aksw.org

     _______________________________________________
     Wikidata mailing list
     Wikidata(a)lists.wikimedia.org
     https://lists.wikimedia.org/mailman/listinfo/wikidata
