Hi all,
I would like to throw in a slightly different angle here. The
GlobalFactSync Project
https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE
will start in June.
As a preparation we wrote this paper describing the engine behind it:
https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf
There have already been very constructive comments at
https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncR…
which led us to focus on syncing music (bands, singles, albums) as one of
the ten sync targets. Other proposals for domains are very welcome.
The rationale behind GlobalFactSync is this:
Managing data quality follows the Pareto principle: the first 80% are easy
to achieve, and each percent after that gets much more expensive,
following the law of diminishing returns. As a consequence for Wikidata:
WD is probably at 80% now, so maintaining it gets harder because you
need to micro-optimize to find the new errors and fill in missing
information. This is compounded by growing Wikidata further in terms
of entities.
GlobalFactSync does not overcome this Pareto effect, but it cheats it:
we hope that it will pool the manpower of Wikipedia editors and
Wikidata editors and also mobilize DBpedia users to edit either in WP or
WD.
In general, Wikimedia runs the 6th largest website in the world. They
are in the same league as Google or Facebook, and I have absolutely no
doubt that they have ample expertise in tackling scalability of hosting,
e.g. by doubling the number of servers or adding web caching. The problem I
see is that you cannot as easily double the editor manpower or bot edits.
Hence the GlobalFactSync grant.
We will send out an announcement in a week or two. Feel free to suggest
sync targets. We are still looking into the complexity of managing
references, as this is bread and butter for the project.
All the best,
Sebastian
On 05.05.19 18:07, Yaroslav Blanter wrote:

Indeed, these collaborations in high-energy physics are not static
quantities; they change essentially every day (people getting hired
and contracts expiring), and most likely every two papers have a
slightly different author list.

Cheers
Yaroslav
On Sun, May 5, 2019 at 5:58 PM Darren Cook <darren(a)dcook.org> wrote:

We may also want to consider whether Wikidata is actually the best
store for all kinds of data. Let's consider an example:
https://www.wikidata.org/w/index.php?title=Q57009452
This is an entity that is almost 2M in size, with almost 3000
statements ... A paper with 2884 authors!

arxiv.org deals with it by calling them the "Atlas Collaboration":
https://arxiv.org/abs/1403.0489
The actual paper does the same (with the full list of names and
affiliations in the Appendix).
The nice thing about graph databases is we should be able to set
author to point to an "Atlas Collaboration" node, and then have that
node point to the 2884 individual author nodes (and each of those
nodes point to their affiliation).

What are the reasons not to re-organize it that way?

My first thought was that who is in the collaboration changes over
time. But does it change day to day, or only each academic year?
Either way, maybe we need to point the author field to something like
"Atlas Collaboration 2014a", and clone-and-modify that node each time
we come to a paper that describes a different membership?

Or is it better to model each person's membership of such a group with
a start and end date?
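For what it's worth, the second option (per-person memberships with start and end dates) can be sketched in a few lines. This is a toy Python model with invented names (Membership, Collaboration, members_on), not actual Wikidata structures:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set

# Toy model for illustration only -- not real Wikidata properties.

@dataclass
class Membership:
    person: str
    start: date
    end: Optional[date] = None  # None = still a member

@dataclass
class Collaboration:
    name: str
    memberships: List[Membership] = field(default_factory=list)

    def members_on(self, day: date) -> Set[str]:
        """Resolve the author list as it stood on a given day."""
        return {
            m.person
            for m in self.memberships
            if m.start <= day and (m.end is None or day <= m.end)
        }

atlas = Collaboration("ATLAS Collaboration", [
    Membership("A. Author", date(2010, 1, 1)),                     # still active
    Membership("B. Author", date(2012, 6, 1), date(2014, 1, 31)),  # left in 2014
])

# A paper's author field would then point at the collaboration plus the
# paper's date, rather than listing thousands of names inline.
print(atlas.members_on(date(2013, 7, 1)))  # both members
print(atlas.members_on(date(2015, 1, 1)))  # only A. Author
```

Note that the first option (snapshot nodes like "Atlas Collaboration 2014a") is essentially just materializing the result of such a date lookup as a separate node for each distinct membership state.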
(BTW, arxiv.org tells me there are 1059 results for "ATLAS
Collaboration"; I don't know if one "result" corresponds to one
"paper", though.)
While I am not against storing this as such, I do wonder if it's
sustainable to keep this kind of data together with other Wikidata
data in a single database.
It feels like it belongs in "core" Wikidata. Being able to ask "which
papers has this researcher written?" seems like a good example of a
Wikidata query. Similarly, "which papers has the ATLAS Collaboration
worked on?"

But, also, are queries like "which authors of physics papers went to a
high school that had more than 1000 students?" part of the goal of
Wikidata? If so, Wikidata needs optimizing in such a way that makes
such queries both possible and tractable.
Darren
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
All the best,
Sebastian Hellmann
Director of Knowledge Integration and Linked Data Technologies (KILT)
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects:
http://dbpedia.org,
http://nlp2rdf.org,
http://linguistics.okfn.org,
https://www.w3.org/community/ld4lt
Homepage:
http://aksw.org/SebastianHellmann
Research Group:
http://aksw.org