Hi Stas,
Many thanks for writing this down! It is very useful to have a clear
statement like this from the dev team.
Given the sustainability concerns that you mention, I think the way
forward for the community could be to hold an RFC to determine a
stricter admissibility criterion for scholarly articles.
It could be one of (or a boolean combination of) these:
- having a site link;
- being used as a reference for a statement on Wikidata;
- being cited in a sister project;
- being cited in a sister project via a template, such as {{cite Q}},
that fetches its metadata from Wikidata;
- being authored by someone with a Wikipedia page about them;
- … any other criterion that comes to mind.
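A boolean combination of such criteria could be sketched roughly as
below. This is only an illustration: the field names and the helper
predicates are hypothetical stand-ins for whatever data the RFC would
actually check, not an existing Wikidata API.

```python
from typing import Any, Callable, Dict, List

# A criterion is a predicate over some (hypothetical) entity metadata.
Criterion = Callable[[Dict[str, Any]], bool]

# Illustrative criteria; the dict keys are assumptions, not real API fields.
has_sitelink: Criterion = lambda e: bool(e.get("sitelinks"))
used_as_reference: Criterion = lambda e: e.get("reference_count", 0) > 0
cited_in_sister_project: Criterion = lambda e: bool(e.get("sister_citations"))

def admissible(entity: Dict[str, Any], criteria: List[Criterion]) -> bool:
    """An article is kept if it satisfies at least one of the criteria."""
    return any(c(entity) for c in criteria)

# An article with no sitelink but used as a reference on Wikidata:
article = {"sitelinks": [], "reference_count": 3, "sister_citations": []}
print(admissible(article, [has_sitelink, used_as_reference]))  # True
```

Stricter combinations (all criteria required, or weighted mixes) would
just swap any() for all() or a score threshold.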
This way, the size of the corpus could be kept under control, and the
criterion could be loosened later if the scalability concerns are
addressed.
Cheers,
Antonin
On 5/4/19 8:37 AM, Stas Malyshev wrote:
Hi!
> For the technical guys, consider our growth and plan for at least one
> year. When the impression exists that the current architecture will not
> scale beyond two years, start a project to future proof Wikidata.
We may also want to consider whether Wikidata is actually the best store
for all kinds of data. Let's consider an example:
https://www.wikidata.org/w/index.php?title=Q57009452
This is an entity that is almost 2 MB in size, with almost 3,000
statements, and each edit to it produces another 2 MB data structure.
Its dump, albeit slightly smaller, is still 780 KB and will need to be
updated on each edit.
Our database is obviously not optimized for such entities, and they
won't perform very well. We have 21 million scientific articles in the
DB, and if even 2% of them were like this, that would be almost a
terabyte of data (multiplied by the number of revisions) and billions
of statements.
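The back-of-envelope figure can be checked with a quick calculation
(the 2 MB entity size and 3,000-statement count come from the example
entity above; the 2% fraction is the hypothetical in the mail):

```python
# Rough storage estimate for large scholarly-article entities on Wikidata.
TOTAL_ARTICLES = 21_000_000      # scientific articles currently in the DB
FRACTION_LARGE = 0.02            # hypothetical share as big as Q57009452
ENTITY_SIZE_MB = 2               # ~2 MB per entity revision
STATEMENTS_PER_ENTITY = 3_000    # ~3,000 statements each

large_entities = round(TOTAL_ARTICLES * FRACTION_LARGE)        # 420,000
total_tb = large_entities * ENTITY_SIZE_MB / 1_000_000         # 0.84 TB
total_statements = large_entities * STATEMENTS_PER_ENTITY      # 1.26 billion

print(f"{large_entities:,} entities ≈ {total_tb:.2f} TB per revision set, "
      f"{total_statements / 1e9:.2f} billion statements")
```

That is 0.84 TB for a single revision of each such entity; every
subsequent edit adds another full copy, which is where the "multiplied
by the number of revisions" factor bites.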
While I am not against storing this as such, I do wonder whether it is
sustainable to keep this kind of data together with the rest of the
Wikidata data in a single database. After all, each query that you run -
even if not related to those 21 million articles in any way - will still
have to run within the same enormous database and be hosted on the same
hardware. This is especially important for services like the Wikidata
Query Service, where all data (at least currently) occupies a shared
space and cannot be easily separated.
Any thoughts on this?