2017-10-16 15:34 GMT+02:00 Antonin Delpeuch (lists) <lists@antonin.delpeuch.eu>:

And… my own count was wrong too, because I forgot to add DISTINCT in my
query (if there are multiple paths from the class to "organization
(Q43229)", items will appear multiple times).

So, I get 1 168 084 now.
http://tinyurl.com/yaeqlsnl
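
For reference, here is a minimal Python sketch of how such a count query could be put together. The exact query behind the short link is not reproduced here: the query text is an assumption based on the wdt:P31/wdt:P279* pattern quoted below, and the helper names count_query and query_service_link are made up for illustration. It only builds the query string and a link to the query editor; it does not contact the service.

```python
from urllib.parse import quote, unquote

# Assumption: the class being counted is "organization (Q43229)".
ORGANIZATION = "Q43229"


def count_query(class_qid: str, distinct: bool = True) -> str:
    """Build a SPARQL query counting items whose P31 value is class_qid
    or any class connected to it through "subclass of" (P279)."""
    # Without DISTINCT, an item matched more than once is counted more than once.
    counted = "DISTINCT ?item" if distinct else "?item"
    return (
        f"SELECT (COUNT({counted}) AS ?count) WHERE {{\n"
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}"
    )


def query_service_link(sparql: str) -> str:
    """Return a link that opens the query in the query.wikidata.org editor."""
    # The editor reads the percent-encoded query from the URL fragment.
    return "https://query.wikidata.org/#" + quote(sparql, safe="")


query = count_query(ORGANIZATION)
link = query_service_link(query)
print(query)
print(link)
```

Calling count_query(ORGANIZATION, distinct=False) gives the variant without DISTINCT, which makes it easy to compare the two counts side by side.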

It's easy to get these things wrong!

Antonin

On 16/10/2017 14:16, Antonin Delpeuch (lists) wrote:
> Thanks Ettore for spotting that!
>
> Wikidata types (P31) only make sense when you consider the "subclass of"
> (P279) property that we use to build the ontology (except in a few cases
> where the community has decided not to use any subclass for a particular
> type).
>
> So, to retrieve all items of a certain type in SPARQL, you need to use
> something like this:
>
> ?item wdt:P31/wdt:P279* ?type
>
> You can also have other variants to accept non-truthy statements.
>
> Just with this truthy version, I currently get 1 208 227 items. But note
> that there are still a lot of items where P31 is not provided, or
> subclasses which have not been connected to "organization (Q43229)"…
>
> So in general, it's very hard to have any "guarantees that there are no
> duplicates", just because you don't have any guarantees that the
> information currently in Wikidata is complete or correct.
>
> I would recommend trying to import something a bit smaller to get
> acquainted with how Wikidata works and what the matching process looks
> like in practice. And beyond a one-off import, as Ettore said it is
> important to think how the data will be maintained in the future…
>
> Antonin
>
> On 16/10/2017 13:46, Ettore RIZZA wrote:
>> - Wikidata has 40k organisations:
>>
>> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%0AWHERE%20%0A%7B%0A%20%3Fitem%20wdt%3AP31%20wd%3AQ43229.%0A%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>>
>>
>> Hi,
>>
>> I think Wikidata contains many more organizations than that. If we
>> query for "instance of business enterprise (Q4830453)", we get 135570
>> results. And I imagine there are many other categories that bring
>> together commercial companies.
>>
>>
>> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ4830453.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>>
>> On the substance, the project to add all companies of a country would
>> make Wikidata a kind of totally free clone of Open Corporates
>> <https://opencorporates.com/>. I would of course be delighted to see
>> that, but is it not a challenge to maintain such a database? Companies
>> are like humans: they appear and disappear every day.
>>
>>
>>
>> 2017-10-16 13:41 GMT+02:00 Sebastian Hellmann
>> <hellmann@informatik.uni-leipzig.de>:
>>
>> Hi all,
>>
>> the technical challenges are not so difficult.
>>
>> - 2.2 million is the exact number of German organisations, i.e.
>> associations and companies. They are also unique.
>>
>> - Wikidata has 40k organisations:
>>
>> https://query.wikidata.org/#SELECT%20%3Fitem%20%3FitemLabel%20%0AWHERE%20%0A%7B%0A%20%3Fitem%20wdt%3AP31%20wd%3AQ43229.%0A%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D
>>
>> so there would be a maximum of 40k duplicates. These are easy to find
>> and deduplicate.
>>
>> - The crawl can be done easily, a colleague has done so before.
>>
>>
>> The issues here are:
>>
>> - Do you want to upload the data to Wikidata? It would be a really big
>> extension. Can I go ahead?
>>
>> - If the data were available externally as structured data under
>> an open license, I would probably not suggest loading it into Wikidata,
>> as the data could be retrieved from the official source directly.
>> However, here this data will not be published in a decent format.
>>
>> I thought that the way data is copied from copyrighted sources, i.e.
>> only facts, is ok for Wikidata. This is done in a lot of places, I
>> guess. Same for Wikipedia, i.e. news articles and copyrighted books
>> are referenced. So Wikimedia or the Wikimedia community are experts
>> on this.
>>
>> All the best,
>>
>> Sebastian
>>
>>
>> On 16.10.2017 10:18, Neubert, Joachim wrote:
>>>
>>> Hi Sebastian,
>>>
>>> This is huge! It will cover almost all currently existing German
>>> companies. Many of these will have similar names, so preparing for
>>> disambiguation is a concern.
>>>
>>> A good approach would be to propose a property for an external
>>> identifier, load the data into Mix-n-match, create links for
>>> companies already in Wikidata, and add the rest (or perhaps only
>>> parts of them - I’m not sure if having all of them in Wikidata
>>> makes sense, but that’s another discussion), preferably with
>>> location and/or sector of trade in the description field.
>>>
>>> I’ve tried to figure out what could be used as the key for an
>>> external identifier property. However, it looks like the registry
>>> does not offer any (persistent) URL to its entries. So for looking
>>> up a company, apparently there are two options:
>>>
>>> - conducting an extended search for the exact string “A&A
>>>   Dienstleistungsgesellschaft mbH“
>>>
>>> - copying the register number “32853”, selecting the court
>>>   (Leipzig) from the corresponding dropdown list, and searching
>>>   for that
>>>
>>> Neither way is very intuitive, even if we can provide a link to
>>> the search form. This would make a weak connection to the source
>>> of information. More importantly, it makes disambiguation in
>>> Mix-n-match difficult. This applies to the preparation of your
>>> initial load (you would not want to create duplicates), but even
>>> more so to everybody else who wants to match his or her data
>>> later on. Being forced to search for entries manually in a
>>> cumbersome way for disambiguation of a new, possibly large and
>>> rich dataset is, in my eyes, not something we want to impose on
>>> future contributors. And often, the free information they find in
>>> the registry (formal name, register number, legal form, address)
>>> will not easily match the information they have (common name,
>>> location, perhaps founding date, and most importantly sector of
>>> trade), so disambiguation may still be difficult.
>>>
>>> Have you checked which parts of the accessible information shown
>>> below can be crawled and added legally to external databases such
>>> as Wikidata?
>>>
>>> Cheers, Joachim
>>>
>>> --
>>> Joachim Neubert
>>>
>>> ZBW – German National Library of Economics
>>> Leibniz Information Centre for Economics
>>> Neuer Jungfernstieg 21
>>> 20354 Hamburg
>>> Phone +49-42834-462
>>>
>>> *From:* Wikidata [mailto:wikidata-bounces@lists.wikimedia.org]
>>> *On behalf of* Sebastian Hellmann
>>> *Sent:* Sunday, 15 October 2017 09:45
>>> *To:* wikidata@lists.wikimedia.org
>>> *Subject:* [Wikidata] Kickstartet: Adding 2.2 million German
>>> organisations to Wikidata
>>>
>>> Hi all,
>>>
>>> the German business registry contains roughly 2.2 million
>>> organisations. Some information is paid, but other information is
>>> public, i.e. the info you get by searching at the link below and
>>> clicking on UT (see example below):
>>>
>>> https://www.handelsregister.de/rp_web/mask.do?Typ=e
>>>
>>> I would like to add this to Wikidata, either by crawling or by
>>> raising money to use crowdsourcing platforms like CrowdFlower or
>>> Amazon Mechanical Turk.
>>>
>>> It should meet notability criterion 2:
>>> https://www.wikidata.org/wiki/Wikidata:Notability
>>>
>>> 2. It refers to an instance of a *clearly identifiable
>>> conceptual or material entity*. The entity must be notable, in
>>> the sense that it *can be described using serious and publicly
>>> available references*. If there is no item about you yet, you
>>> are probably not notable.
>>>
>>>
>>> The reference is the official German business registry, which is
>>> serious and public. Orgs are also by definition clearly
>>> identifiable legal entities.
>>>
>>> How can I get clearance to proceed on this?
>>>
>>> All the best,
>>> Sebastian
>>>
>>> Entity data
>>>
>>> Saxony District court *Leipzig HRB 32853* – A&A
>>> Dienstleistungsgesellschaft mbH
>>>
>>> Legal status: Gesellschaft mit beschränkter Haftung
>>> Capital: 25.000,00 EUR
>>> Date of entry: 29/08/2016
>>> (When entering date of entry, wrong data input can occur due to
>>> system failures!)
>>> Date of removal: -
>>> Balance sheet available: -
>>> Address (subject to correction):
>>> A&A Dienstleistungsgesellschaft mbH
>>> Prager Straße 38-40
>>> 04317 Leipzig
>>>
>>> --
>>> All the best,
>>> Sebastian Hellmann
>>>
>>> Director of Knowledge Integration and Linked Data Technologies
>>> (KILT) Competence Center
>>> at the Institute for Applied Informatics (InfAI) at Leipzig University
>>> Executive Director of the DBpedia Association
>>> Projects: http://dbpedia.org, http://nlp2rdf.org,
>>> http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
>>> Homepage: http://aksw.org/SebastianHellmann
>>> Research Group: http://aksw.org
>>>
>>>
>>>
>>> _______________________________________________
>>> Wikidata mailing list
>>> Wikidata@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>