Hi all,
I would like to throw in a slightly different angle here. The
GlobalFactSync Project
https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE
will start in June.
As preparation, we wrote this paper describing the engine behind
it: https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf
There have already been some very constructive comments at
https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE#Interfacing_with_Wikidata's_data_quality_issues_in_certain_areas
which led us to focus on syncing music (bands, singles, albums) as
1 of the 10 sync targets. Other proposals for domains are very
welcome.
The rationale behind GlobalFactSync is this:
Managing data quality follows the Pareto principle: the first 80% is
easy to achieve, and each percent after that gets much more
expensive, following the law of diminishing returns. As a
consequence for Wikidata: WD is probably at 80% now, so
maintaining it gets harder because you need to micro-optimize to
find the new errors and fill in missing information. This is
compounded by growing Wikidata further in terms of entities.
GlobalFactSync does not overcome this Pareto effect, but it
sidesteps it: we hope that it will pool the manpower of Wikipedia
editors and Wikidata editors, and also mobilize DBpedia users to
edit in either WP or WD.
In general, Wikimedia runs the 6th-largest website in the world.
They are in the same league as Google or Facebook, and I have
absolutely no doubt that they have ample expertise in tackling the
scalability of hosting, e.g. by doubling the number of servers or
by web caching. The problem I see is that you cannot easily double
the editor manpower or bot edits. Hence the GlobalFactSync Grant.
We will send out an announcement in a week or two. Feel free to
suggest sync targets. We are still looking into the complexity of
managing references as this is bread and butter for the project.
All the best,
Sebastian
Yaroslav:
Indeed, these collaborations in high-energy physics are not static
quantities; they change essentially every day (people get hired and
have their contracts expire), and most likely every two papers have
a slightly different author list.
Cheers
On Sun, May 5, 2019 at 5:58 PM Darren Cook <darren@dcook.org> wrote:
> We may also want to consider if Wikidata is actually the best store for
> all kinds of data. Let's consider example:
>
> https://www.wikidata.org/w/index.php?title=Q57009452
>
> This is an entity that is almost 2M in size, almost 3000 statements ...
A paper with 2884 authors! arxiv.org deals with it by crediting them
as the "ATLAS Collaboration": https://arxiv.org/abs/1403.0489
The actual paper does the same (with the full list of names and
affiliations in the Appendix).
The nice thing about graph databases is that we should be able to
set the author field to point to an "ATLAS Collaboration" node, and
then have that node point to the 2884 individual author nodes (with
each of those nodes pointing to their affiliation).
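That grouping could be sketched as a toy graph in Python (all node
IDs and property names below are made up for illustration; this is
not the actual Wikidata schema):

```python
# Toy graph: a paper points to one collaboration node, which points
# to the individual authors; each author points to an affiliation.
# All IDs and property names here are hypothetical.

nodes = {
    "Q_paper": {"type": "paper", "author": "Q_atlas"},
    "Q_atlas": {"type": "collaboration",
                "member": ["Q_alice", "Q_bob"]},   # 2884 members in reality
    "Q_alice": {"type": "person", "affiliation": "Q_cern"},
    "Q_bob":   {"type": "person", "affiliation": "Q_cern"},
    "Q_cern":  {"type": "organization"},
}

def authors_of(paper_id):
    """Resolve a paper's author list via its collaboration node."""
    collab = nodes[paper_id]["author"]
    return nodes[collab]["member"]

print(authors_of("Q_paper"))  # -> ['Q_alice', 'Q_bob']
```

The paper itself then carries a single author statement, however
large the collaboration is.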
What are the reasons to not re-organize it that way?
My first thought was: who is in the collaboration changes over time.
But does it change day to day, or only each academic year?
Either way, maybe we need to point the author field to something like
"ATLAS Collaboration 2014a", and clone-and-modify that node each time
we come to a paper that describes a different membership?
Or is it better to model each person's membership of such a group
with a start date and an end date?
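Both options could be sketched like this (again with made-up IDs and
a hypothetical schema, not actual Wikidata properties):

```python
from datetime import date

# Option 1: immutable snapshot nodes, cloned whenever the membership
# changes; each paper points at the snapshot that matches its author list.
snapshots = {
    "AtlasCollaboration2014a": {"member": ["Q_alice", "Q_bob"]},
    "AtlasCollaboration2014b": {"member": ["Q_alice", "Q_carol"]},
}

# Option 2: one collaboration node plus per-person memberships
# qualified with start/end dates (end=None means "still a member").
memberships = [
    {"person": "Q_alice", "start": date(2010, 1, 1), "end": None},
    {"person": "Q_bob",   "start": date(2010, 1, 1), "end": date(2014, 6, 30)},
    {"person": "Q_carol", "start": date(2014, 7, 1), "end": None},
]

def members_on(day):
    """Reconstruct the author list for a given date under option 2."""
    return [m["person"] for m in memberships
            if m["start"] <= day and (m["end"] is None or day <= m["end"])]

print(members_on(date(2014, 8, 1)))  # -> ['Q_alice', 'Q_carol']
```

Option 1 trades storage for simple lookups; option 2 stores each
membership once but makes the author list a date-dependent query.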
(BTW, arxiv.org tells me there are 1059 results for "ATLAS
Collaboration"; I don't know if one "result" corresponds to one
"paper", though.)
> While I am not against storing this as such, I do wonder if it's
> sustainable to keep such kind of data together with other Wikidata data
> in a single database.
It feels like it belongs in "core" Wikidata. Being able to ask "which
papers has this researcher written?" seems like a good example of a
Wikidata query. Similarly, "which papers has the ATLAS Collaboration
worked on?"
But, also, are queries like "Which authors of Physics papers went to a
high school that had more than 1000 students?" part of the goal of
Wikidata? If so, Wikidata needs to be optimized in a way that makes
such queries both possible and tractable.
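To make those two query shapes concrete, here is what they look like
over a toy triple store in Python (on the real Wikidata Query Service
these would be SPARQL queries; every ID and property name below is
made up):

```python
# Toy statements as (subject, property, object) triples.
triples = [
    ("Q_paper1", "author", "Q_alice"),
    ("Q_paper2", "author", "Q_alice"),
    ("Q_paper2", "author", "Q_bob"),
    ("Q_alice",  "educated_at", "Q_school"),
    ("Q_school", "students", 1500),
]

def papers_by(person):
    """'Which papers has this researcher written?' - a reverse lookup."""
    return sorted(s for s, p, o in triples if p == "author" and o == person)

def authors_from_big_schools(min_students=1000):
    """'Which authors went to a school with more than N students?' - a join."""
    big = {s for s, p, o in triples
           if p == "students" and isinstance(o, int) and o > min_students}
    return sorted({o for s, p, o in triples
                   if p == "author"
                   and any((o, "educated_at", sch) in triples for sch in big)})

print(papers_by("Q_alice"))        # -> ['Q_paper1', 'Q_paper2']
print(authors_from_big_schools())  # -> ['Q_alice']
```

The first query is a single reverse edge lookup; the second joins
across three properties, which is exactly the kind of traversal that
needs indexing to stay tractable at Wikidata's scale.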
Darren
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata