Dear all
Over the past year or so I've been working quite a lot on Wikidata
documentation and have been thinking more about how these individual
documents link together to allow different kinds of people to understand
and contribute to Wikidata. To help the community understand and plan
what resources are needed, I've started an RFC to collate this
information. Please take a look:
https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Improving_Wikid…
Thanks very much
John
DBpedia Databus (alpha version)
The DBpedia Databus is a platform that allows multiple stakeholders to
exchange, curate and access data. Any data entering the bus is
versioned, cleaned, mapped and linked, and its licenses and provenance
are tracked. Hosting in multiple formats is provided, so the data can
be accessed either as a dump download or via an API. Data governance
stays with the data contributors.
Vision
Working with data is hard and repetitive. We envision a hub where
everybody can upload data, and where useful operations like versioning,
cleaning, transformation, mapping, linking, merging and hosting are done
automagically on a central communication system (the bus) and then
dispersed again in a decentralized network to the consumers and applications.
On the databus, data flows from data producers through the platform to
the consumers (left to right); errors and feedback flow in the
opposite direction back to the data source, providing a continuous
integration service and improving the data at the source.
Open Data vs. Closed (paid) Data
We have studied the data network for 10 years now, and we conclude that
organisations with open data struggle to work together properly,
although they could and should; they are hindered by technical and
organisational barriers, and they duplicate work on the same data. On the
other hand, companies selling data cannot do so in a scalable way. The
loser is the consumer, left to choose between inferior open data and a
jungle-like market.
Publishing data on the databus
If you are grinding your teeth over how to publish data on the web, you
can simply use the databus to do so. Data loaded on the bus will be highly
visible, available and queryable. You should think of it as a service:
* Visibility guarantees that your citations and reputation go up
* Besides a web download, we can also provide a Linked Data interface,
SPARQL endpoint, lookup (autocomplete) or many other means of
availability (such as AWS or Docker images)
* Any distribution we do will funnel feedback and collaboration
opportunities your way to improve your dataset and your internal
data quality
* You will receive an enriched dataset, which is connected and
complemented with any other available data (see the same folder
names in the data and fusion folders)
Data Sellers
If you are selling data, the databus provides numerous opportunities for
you. You can link your offering to the open entities on the databus,
which lets consumers discover your services more easily, as they are
shown with each request.
Data Consumers
Open data on the databus will be a commodity. We are greatly lowering the
cost of understanding, retrieving and reformatting the data. We are
constantly extending the ways the data can be used and are willing to
implement any formats and APIs you need.
If you are lacking a certain kind of data, we can also scout for it and
load it onto the databus.
How the Databus works at the moment
We are still in an initial state, but we already load 10 datasets (6
from DBpedia, 4 external) on the bus, using these phases:
1. Acquisition: data is downloaded from the source and logged in
2. Conversion: data is converted to N-Triples and cleaned (syntax
parsing, datatype validation and SHACL)
3. Mapping: the vocabulary is mapped onto the DBpedia Ontology and
converted (we have been doing this for Wikipedia's infoboxes and
Wikidata, but now we do it for other datasets as well)
4. Linking: links are mainly collected from the sources, cleaned and
enriched
5. IDing: all entities found are given a new Databus ID for tracking
6. Clustering: IDs are merged into clusters, using one of the Databus
IDs as cluster representative
7. Data comparison: each dataset is compared with all other datasets.
We have an algorithm that decides on the best value, but the main
goal here is transparency, i.e. to see which data value was chosen
and how it compares to the other sources.
8. Fusion: a main knowledge graph is fused from all the sources, i.e. a
transparent aggregate
9. Complements: for each source, we produce a local fused version called
the "Databus Complement". This is a major feedback mechanism for all
data providers, where they can see what data they are missing, what
data differs in other sources and what links are available for their
IDs.
10. Comparison webservice: you can compare all data via a webservice
(early prototype, currently only works for the Eiffel Tower):
http://88.99.242.78:9000/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F12HpzV&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Farchitect&src=general
We aim for a real-time system, but at the moment we run a monthly
cycle.
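As a rough illustration of phases 5 and 6, the global-ids output maps each source IRI to a Databus cluster ID. The triples below are a hypothetical sketch: the global ID 12HpzV is taken from the comparison link above, where it denotes the Eiffel Tower, but the exact predicate and file layout are assumptions, not the documented output format:

```ntriples
<http://www.wikidata.org/entity/Q243> <http://www.w3.org/2002/07/owl#sameAs> <http://id.dbpedia.org/global/12HpzV> .
<http://nl.dbpedia.org/resource/Eiffeltoren> <http://www.w3.org/2002/07/owl#sameAs> <http://id.dbpedia.org/global/12HpzV> .
```

With such a mapping, every value any source states about the same entity can be grouped under one cluster representative and compared across sources, which is what the data comparison step exposes.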
Is it free?
Maintaining the Databus is a lot of work, and the servers incur high
costs. As a rule of thumb, we provide for free everything that we
can afford to provide for free. DBpedia provided everything for
free in the past, but this is not a healthy model, as we can neither
maintain quality properly nor grow.
On the Databus, everything is provided "as is", without any guarantees or
warranty. Improvements can be made by the volunteer community. The
DBpedia Association will provide a business interface to allow
guarantees, major improvements, stable maintenance and hosting.
License
Final databases are licensed under ODC-By. This covers our work on the
recomposition of data. Each fact is individually licensed; e.g.
Wikipedia abstracts are CC BY-SA, some facts are CC BY-NC, some are
copyrighted. This means that the data is available for research,
informational and educational purposes. We recommend contacting us for
any professional use of the data (clearing), so we can guarantee that
legal matters are handled correctly. Otherwise, professional use is at
your own risk.
Download
The databus data is available at http://downloads.dbpedia.org/databus/
ordered into three main folders:
* Data: the data that is loaded on the databus at the moment
* Global: a folder that contains provenance data and the mappings to
the new IDs
* Fusion: the output of the databus
Most notably you can find:
* Provenance mapping of the new IDs in
global/persistence-core/cluster-iri-provenance-ntriples/
<http://downloads.dbpedia.org/databus/global/persistence-core/cluster-iri-pr…>
and global/persistence-core/global-ids-ntriples/
<http://downloads.dbpedia.org/databus/global/persistence-core/global-ids-ntr…>
* The final fused version for the core: fusion/core/fused/
<http://downloads.dbpedia.org/databus/fusion/core/fused/>
* A detailed JSON-LD file for data comparison: fusion/core/json/
<http://downloads.dbpedia.org/databus/fusion/core/json/>
* Complements, e.g. the enriched Dutch DBpedia version:
fusion/core/nl.dbpedia.org/
<http://downloads.dbpedia.org/databus/fusion/core/nl.dbpedia.org/>
(Note that the file and folder structure are still subject to change)
Sources
Glue:
Source                  Target                  Amount
de.dbpedia.org          www.viaf.org               387,106
diffbot.com             www.wikidata.org           516,493
d-nb.info               viaf.org                 5,382,783
d-nb.info               dbpedia.org                 80,497
d-nb.info               sws.geonames.org            50,966
fr.dbpedia.org          www.viaf.org                   266
sws.geonames.org        dbpedia.org                545,815
kb.nl                   viaf.org                 2,607,255
kb.nl                   www.wikidata.org           121,012
kb.nl                   dbpedia.org                 37,676
www.wikidata.org        permid.org                   5,133
wikidata.dbpedia.org    www.wikidata.org        45,344,233
wikidata.dbpedia.org    sws.geonames.org         3,495,358
wikidata.dbpedia.org    viaf.org                 1,179,550
wikidata.dbpedia.org    d-nb.info                  601,665
Plan for the next releases
* Include more existing data from DBpedia
* Renew all DBpedia releases separately:
  o DBpedia Wikidata is already running: http://78.46.100.7/wikidata/
  o Basic extractors like infobox properties and mappings will be
    active soon
  o Text extraction will take a while
* Load all data into the comparison tool:
http://88.99.242.78:9000/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F12HpzV&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Farchitect&src=general
* Load all data into a SPARQL endpoint
* Create a simple open-source tool that lets everybody push data
onto the databus in an automated way
Hoi,
When you imply that I do not support Creative Commons and its work on
licenses, you are explicitly wrong. It is because of CC that a
harmonisation has taken place. It is thanks to this harmonisation that a
lot of material gained a license and became accessible. This does not mean
that the practice of copyright is not evil; it means that, thanks to CC,
copyright became less open to abuse.
I am old-school Wikipedia. I strongly believe that our mission is to "share
the sum of all knowledge". When people like you aim to claim copyright on
Wikipedia articles, you do not argue how this would play out. You do not
consider how this is a knife that cuts both ways and, most prominently, will
hinder our quest to share the sum of all knowledge with all people. When a
company abuses our content by ignoring the license, they build an audience
for our content. When this is done right, we benefit; there is a symbiotic
relation with Google, for instance. The only disadvantage is when, because
of a lack of attribution, people do not come to Wikipedia or Wikidata to
curate the data. Practically, the whole license issue of Wikipedia is a
mess, because it is not enforced and because there are too many copyright
warriors claiming that things should be different, who never stop arguing
and never come to a practical point.
What I am saying is that when multiple sources claim the same thing, it
follows that none of them can claim exclusive copyright to it. For me,
the databus of DBpedia will show how little is original in databases.
On the one hand this is cool because it indicates that such things are
likely correct; on the other hand it is cool because it indicates what
to curate in order to gain a better understanding. It also follows that,
in order to bring things into doubt, you must publish facts and strongly
support the underlying data in order to be noticed. This is why the work
on the gender gap is so important. This is why work needs to be done
where all of us / all the databases are weak. This is why fake news is
so easy: there is nothing that easily finds where the data goes off the
rails.
<grin> so then we get to </grin> This is why we need the databus of
DBpedia; this is why we should stop mocking DBpedia and collaborate with
them instead of saying, as some do: "everything you can do, we can do
better". The fact of the matter is that they do what we might do, and we
have to learn to collaborate.
Now why would you use Wikidata when DBpedia by definition can include all
of Wikidata and is better equipped to bring all the data together? You
would because of its superior functionality, not because of the copyright.
Thanks,
GerardM
On 17 May 2018 at 17:39, Rob Speer <rob(a)luminoso.com> wrote:
> > As always, copyright is predatory. As we can prove that copyright is the
> enemy of science and knowledge
>
> Well, this kind of gets to the heart of the issue, doesn't it.
>
> I support the Creative Commons license, including the share-alike term,
> which requires copyright in order to work, and I've contributed to multiple
> Wikimedia projects with the understanding that my work would be protected
> by CC-By-SA.
>
> Wikidata is engaged in a project-wide act of disobedience against CC-By-SA.
> I would say that GerardM has provided an excellent summary of the attitude
> toward Creative Commons that I've encountered on Wikidata: "it's holding us
> back", "it's the enemy", "you can't copyright knowledge", "you can't make
> us follow it", etc.
>
> The result of this, by the way, is that commercial entities sell modified
> versions of Wikidata with impunity. It undermines the terms of other
> resources such as DBpedia, which also contains facts extracted from
> Wikipedia and respects its Share-Alike terms. Why would anyone use DBpedia
> and have to agree to share alike, when they can get similar data from
> Wikidata which promises them it's CC-0?
>
> On Wed, 16 May 2018 at 21:43 Gerard Meijssen <gerard.meijssen(a)gmail.com>
> wrote:
>
> > Hoi,
> > Thank you for the overly broad misrepresentation. As always, copyright is
> > predatory. As we can prove that copyright is the enemy of science and
> > knowledge we should not be upset that *copyright *is abused we should
> > welcome it as it proves the point. Also when we use texts from everywhere
> > and rephrase it in Wikipedia articles "we" are not lily white either.
> >
> > In "them old days" generally we felt that when people would use
> Wikipedia,
> > it would only serve our purpose; share the sum of all knowledge. I still
> > feel really good about that. And, it has been shown that what we do;
> > maintain / curate / update that data that it is not easily given to do as
> > well as "we" do it.
> >
> > When we are to be more precise with our copyright, there are a few things
> > we could do to make copyright more transparent. When data is to be
> uploaded
> > (Commons / Wikipedia or Wikidata) we should use a user that is OWNED and
> > operated by the copyright holder. The operation may be by proxy and as a
> > consequence there is no longer a question about copyright as the
> copyright
> > holder can do as we wants. This makes any future noises just that,
> > annoying.
> >
> > As to copyright on Wikidata, when you consider copyright using data from
> > Wikipedia. The question is: "What Wikipedia" I have copied a lot of data
> > from several Wikipedias and believe me, from a quality point of view
> there
> > is much to be gained by using Wikidata as an instrument for good because
> it
> > is really strong in identifying friends and false friends. It is superior
> > as a tool for disambiguation.
> >
> > About the copyright on data, the overriding question with data is: do you
> > copy data wholesale in Wikidata. That is what a database copyright is
> > about. As I wrote on my blog [1], the best data to include is data that
> is
> > corroborated by the fact that it is present in multiple sources. This
> > negates the notion of a single source, it also underscores that much of
> the
> > data everywhere is replicated a lot. It also underscores, again, the
> notion
> > that data that is only present in single sources is what needs attention.
> > It needs tender loving care, it needs other sources to establish
> > credentials. That is in its own right what makes any claim of copyright
> > moot. It is in this process that it becomes a "creative" process negating
> > the copyright held on databases.
> >
> > I welcome the attention that is given to copyright in Wikidata. However
> our
> > attention to copyright is predatory in two ways. It is how can we get
> > around existing copyright and how can we protect our own. As argued,
> > Wikidata shines when it is used for what it is intended to be; the place
> > that brings data, of Wikipedias first and elsewhere second, together to
> be
> > used as a repository of quality, open and linked data.
> > Thanks,
> > GerardM
> >
> > [1]
> > https://ultimategerardm.blogspot.nl/2018/05/wikidata-copyright-and-linked-data.html
If one knows the Q code (or URI) for an entity on Wikidata, how can one find the DBpedia ID and the information linked to it?
Thank you.
Sent from my iPad
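One way to go from a Wikidata Q-ID to the corresponding DBpedia resource is DBpedia's public SPARQL endpoint at http://dbpedia.org/sparql, since DBpedia publishes owl:sameAs links to Wikidata entities. A hedged sketch, using Q243 (the Eiffel Tower) as an example; if it returns nothing for a given entity, the inverse link direction may be worth trying, as it can vary between releases:

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?dbpediaResource WHERE {
  # find DBpedia resources declared identical to the Wikidata entity
  ?dbpediaResource owl:sameAs <http://www.wikidata.org/entity/Q243> .
}
```

Once you have the DBpedia IRI (here presumably http://dbpedia.org/resource/Eiffel_Tower), a DESCRIBE query on that IRI, or its Linked Data page, returns the information linked to it.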
Hi,
I am looking for the most efficient way of getting the following
information out of WDQS:
* One language only (e.g. fr.wikipedia.org)
* All instances of human, e.g. for the abstraction wd:Q9916: Dwight
David Eisenhower | États-Unis | Dwight | Eisenhower |
<https://fr.wikipedia.org/wiki/Dwight_D._Eisenhower> | militaire
américain, président des États-Unis
Let's say we have a list of all sovereign states (Q16, Q30, Q142, ...)
and all letters of the requested language (French: a, b, c, ...); we
can automate requests and get a lot of results. Unfortunately, this is
costly and inefficient: it takes about a day to complete.
SELECT ?person ?personLabel ?countryLabel ?givenNameLabel
       ?familyNameLabel ?article ?persondesc
WHERE
{
  ?person wdt:P31 wd:Q5;
          wdt:P27 wd:Q30;
          wdt:P27 ?country;
          wdt:P734 ?familyName;
          wdt:P735 ?givenName ;
          rdfs:label ?personLabel.
  ?familyName rdfs:label ?familyNameLabel.
  ?country rdfs:label ?countryLabel.
  ?givenName rdfs:label ?givenNameLabel.
  ?person schema:description ?persondesc.
  FILTER(LANG(?personLabel) = "fr").
  FILTER(LANG(?familyNameLabel) = "en").
  FILTER(LANG(?countryLabel) = "fr").
  FILTER(LANG(?givenNameLabel) = "en").
  FILTER(LANG(?persondesc) = "fr").
  FILTER(STRSTARTS(?personLabel, "D")).
  FILTER(STRSTARTS(?familyNameLabel, "E")).

  ?article schema:about ?person;
           schema:inLanguage "fr";
           schema:isPartOf <https://fr.wikipedia.org/> .
}
https://query.wikidata.org/#SELECT%20%3Fperson%20%3FpersonLabel%20%3Fcountr…
Such a request takes an average of 20 seconds to complete.
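A possibly faster variant (an untested sketch, not a benchmarked answer) is to let the WDQS label service resolve the labels that are only displayed, and keep explicit rdfs:label triples only where the STRSTARTS filters need them; this removes two joins and their language filters. Note one behavioral difference: ?countryLabel and ?givenNameLabel would come back in French here, whereas the original fetched some of them in English:

```sparql
SELECT ?person ?personLabel ?countryLabel ?givenNameLabel
       ?familyNameLabel ?article ?persondesc
WHERE
{
  ?person wdt:P31 wd:Q5;
          wdt:P27 wd:Q30;
          wdt:P27 ?country;
          wdt:P734 ?familyName;
          wdt:P735 ?givenName;
          rdfs:label ?personLabel.
  # explicit labels kept only for the prefix filters
  FILTER(LANG(?personLabel) = "fr").
  FILTER(STRSTARTS(?personLabel, "D")).
  ?familyName rdfs:label ?familyNameLabel.
  FILTER(LANG(?familyNameLabel) = "en").
  FILTER(STRSTARTS(?familyNameLabel, "E")).
  ?person schema:description ?persondesc.
  FILTER(LANG(?persondesc) = "fr").
  ?article schema:about ?person;
           schema:inLanguage "fr";
           schema:isPartOf <https://fr.wikipedia.org/> .
  # label service fills ?countryLabel and ?givenNameLabel automatically
  SERVICE wikibase:label { bd:serviceParam wikibase:language "fr". }
}
```

Whether this actually beats the original depends on the optimizer; splitting the work by sitelink (the ?article pattern) rather than by country may also help, but that is an assumption worth testing.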
Any help will be much appreciated. Thanks for your time.
Justin
Hoi,
Thank you for the overly broad misrepresentation. As always, copyright is
predatory. As we can prove that copyright is the enemy of science and
knowledge, we should not be upset that copyright is abused; we should
welcome it, as it proves the point. Also, when we use texts from everywhere
and rephrase them in Wikipedia articles, "we" are not lily white either.
In "them old days" we generally felt that when people used Wikipedia,
it would only serve our purpose: share the sum of all knowledge. I still
feel really good about that. And it has been shown that what we do
(maintain / curate / update that data) is not easily done as well as
"we" do it.
If we are to be more precise with our copyright, there are a few things
we could do to make copyright more transparent. When data is uploaded
(to Commons, Wikipedia or Wikidata), we should use a user account that is
owned and operated by the copyright holder. The operation may be by proxy,
and as a consequence there is no longer a question about copyright, as the
copyright holder can do as they want. This makes any future noises just
that: annoying.
As to copyright on Wikidata, consider copyright when using data from
Wikipedia. The question is: "Which Wikipedia?" I have copied a lot of data
from several Wikipedias and, believe me, from a quality point of view there
is much to be gained by using Wikidata as an instrument for good, because it
is really strong in identifying friends and false friends. It is superior
as a tool for disambiguation.
About the copyright on data: the overriding question is whether you
copy data wholesale into Wikidata. That is what a database copyright is
about. As I wrote on my blog [1], the best data to include is data that is
corroborated by its presence in multiple sources. This negates the notion
of a single source; it also underscores that much of the data everywhere
is replicated a lot. It underscores, again, the notion that data present
in only a single source is what needs attention. It needs tender loving
care; it needs other sources to establish credentials. That is in its own
right what makes any claim of copyright moot. It is in this process that
it becomes a "creative" process, negating the copyright held on databases.
I welcome the attention that is given to copyright in Wikidata. However,
our attention to copyright is predatory in two ways: how can we get
around existing copyright, and how can we protect our own. As argued,
Wikidata shines when it is used for what it is intended to be: the place
that brings data, from Wikipedias first and elsewhere second, together to
be used as a repository of quality, open and linked data.
Thanks,
GerardM
[1]
https://ultimategerardm.blogspot.nl/2018/05/wikidata-copyright-and-linked-d…
On 11 May 2018 at 23:10, Rob Speer <rob(a)luminoso.com> wrote:
> Wow, thanks for the heads up. When I was getting upset about projects that
> change the license on Wikimedia content and commercialize it, I had no idea
> that Wikidata was providing them the cover to do so. The Creative Commons
> violation is coming from inside the house!
>
> On Tue, 8 May 2018 at 03:48 mathieu stumpf guntz <
> psychoslave(a)culture-libre.org> wrote:
>
> > Hello everybody,
> >
> > There is a phabricator ticket on Solve legal uncertainty of Wikidata
> > <https://phabricator.wikimedia.org/T193728> that you might be interested
> > to look at and participate in.
> >
> > As Denny suggested in the ticket to give it more visibility through the
> > discussion on the Wikidata chat
> > <
> > https://www.wikidata.org/wiki/Wikidata:Project_chat#
> Importing_datasets_under_incompatible_licenses>,
> >
> > I thought it was interesting to highlight it a bit more.
> >
> > Cheers
> >
> > _______________________________________________
> > Wikimedia-l mailing list, guidelines at:
> > https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
> > https://meta.wikimedia.org/wiki/Wikimedia-l
> > New messages to: Wikimedia-l(a)lists.wikimedia.org
> > Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> > <mailto:wikimedia-l-request@lists.wikimedia.org?subject=unsubscribe>