Hoi, Wikidata grows like mad. This is something we all experience in the really bad response times we are suffering. It is so bad that people are asked what kind of updates they are running because it makes a difference in the lag times there are.
Given that Wikidata is growing like a weed, it follows that there are two issues. Technical - what is the maximum that the current approach supports - how long will this last us. Fundamental - what funding is available to sustain Wikidata.
For the financial guys, growth like Wikidata is experiencing is not something you can reliably forecast. As an organisation we have more money than we need to spend, so there is no credible reason to be stingy.
For the technical guys, consider our growth and plan for at least one year. When the impression exists that the current architecture will not scale beyond two years, start a project to future proof Wikidata.
It will grow and the situation will get worse before it gets better. Thanks, GerardM
PS I know about the Phabricator tickets; they do not give the answers to the questions we need to address.
Hi!
Indeed, Wikidata grows and will continue growing. But I don't see clearly what the purpose of this thread is. Is it to propose possible technical and financial improvements?
Regards, David
I agree it is not clear what is being discussed here. It is growing, but (in my opinion) in a positive way, i.e. it is being accepted as a viable knowledge graph.
Regards,
Andra
Gerard mentioned the PROBLEM in the 2nd sentence. I read it clearly....
we all experience in the really bad response times we are suffering. It is so bad that people are asked what kind of updates they are running because it makes a difference in the lag times there are.
From what I have seen, the response times are typically attributed to SPARQL queries, as well as to applying many edits with scripts or mass operations. I recall there is a light queue mechanism inherent in the Blazegraph architecture that contributes to this, and I am fine with slower writes.
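(For illustration: a lot of the "what kind of updates are you running" question comes down to whether bulk-edit scripts honour the API's maxlag parameter and back off when the site reports lag. Below is a minimal sketch with plain requests; frameworks such as Pywikibot handle this for you, and the values here are only illustrative.)

```python
import time
import requests

API = "https://www.wikidata.org/w/api.php"

def api_call(params, maxlag=5):
    """Call the Wikidata API, backing off while the site reports high lag.

    maxlag is a standard MediaWiki API parameter: if current lag exceeds it,
    the request is rejected and we sleep instead of adding more load.
    """
    params = dict(params, format="json", maxlag=maxlag)
    while True:
        resp = requests.get(API, params=params, timeout=30)
        data = resp.json()
        if data.get("error", {}).get("code") == "maxlag":
            time.sleep(int(resp.headers.get("Retry-After", 5)))  # be polite, wait
            continue
        return data

# A harmless read; the same pattern matters most for mass-editing scripts.
print(api_call({"action": "query", "meta": "siteinfo"})["query"]["general"]["sitename"])
```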
What most users are not comfortable with is the slower reads in different areas of Wikidata. We need to identify those slow read areas, or figure out a way to get consensus on which parts of reading Wikidata affect our users the most.
So let's be constructive here. Gerard: do you have specific areas that affect your daily work, and what form of work is that (reading/writing, which areas)?
Hoi, This mail thread is NOT about the issues that I or others face at this time. They are serious enough but that is not for this thread. People are working hard to find a solution for now. That is cool.
What I want to know is: are we technically and financially ready for continued exponential growth? If so, what are the plans, and what if those plans are needed in half the time expected? Are we ready for continued growth? If we hesitate, we will lose the opportunities that are currently open to us. Thanks, GerardM
Gerard, I like Wikidata a lot; kudos to the community for keeping it going. But keep it real: there is no exponential growth here.
We are looking at slow and sustainable growth at the moment, with a possible plateauing of the number of users and of the total number of Wikidata items. Just take a look at the statistics.
Date   | Content pages | Page edits since Wikidata was set up | Registered users | Active users
4/2015 | 13,911,417    | 213,027,375                          | 1,913,828        | 15,168
5/2016 | 17,432,789    | 328,781,525                          | 2,688,788        | 16,833
7/2017 | 28,037,196    | 514,252,789                          | 2,835,219        | 18,081
7/2018 | 49,081,962    | 701,319,718                          | 2,970,150        | 18,578
4/2019 | 56,377,647    | 931,449,205                          | 3,236,569        | 20,857
When you refer to "growing like a weed", what is that: page views? Queries per day? Mentions in the media?
Best, Marco
Hoi, Lies, damned lies and statistics. The quality of Wikidata suffers; it could be so much better if we truly wanted Wikidata to grow. Your numbers only show growth within the limits of what has been made possible. Traffic and numbers could be much higher. Thanks, GerardM
Looks like you are ready for the weekend, Gerard :-) I don't see a scaling issue at the moment for the type of Wikidata use cases I come across. Even the total number of triples is plateauing at 7.6bn* (of course it's easy to write "bad" queries that bring down the server). Allowing people to set up their own local instances with their own triple stores in the future is a good approach for distributed and decentralized data management here.
That said, a faster and better Wikidata instance is always appreciated, and can certainly be provided. What's the current cost of running/hosting the service with Wikibase + Blazegraph per month?
Marco
* https://grafana.wikimedia.org/d/000000489/wikidata-query-service?refresh=1m&...
I imagine the federation of Wikibase instances becoming part of the solution:
https://addshore.com/2018/04/wikibase-of-wikibases/
Along those lines, I wonder if there is potential for Wikibase as a component in the W3C Solid project’s effort:
https://www.w3.org/community/solid/wiki/Main_Page
Jeff
Wikidata grows like mad. This is something we all experience in the really bad response times we are suffering. It is so bad that people are asked what kind of updates they are running because it makes a difference in the lag times there are.
Given that Wikidata is growing like a weed, ...
As I've delved deeper into Wikidata, I get the feeling it is being developed with the assumption of infinite resources, and with no strong guidelines on exactly what the scope is (i.e. where you draw the line between what belongs in Wikidata and what does not).
This (and concerns about it being open to data vandalism) has personally made me back off a bit. I'd originally planned to have Wikidata be the primary data source, but I'm now leaning towards keeping data tables and graphs outside, with scheduled scripts to import into Wikidata, and export from Wikidata.
For the technical guys, consider our growth and plan for at least one year.
The 37GB (JSON, bz2) data dump file (it was already 33GB, twice the size of the English Wikipedia dump, when I grabbed it last November) is unwieldy. And, as there are no incremental changes being published, it is hard to create a mirror.
Can that dump file be split up in some functional way, I wonder?
Darren
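(For illustration, one crude "functional split" can already be done client-side by streaming the compressed dump and keeping only entities of a chosen class. This is only a sketch: it assumes the usual one-entity-per-line layout of the JSON dump, and the file name and the P31 target are placeholders.)

```python
import bz2
import json

DUMP = "wikidata-20190429-all.json.bz2"   # placeholder path to the full dump
KEEP_P31 = {"Q523"}                       # e.g. Q523 (star) for an astronomy subset

with bz2.open(DUMP, "rt", encoding="utf-8") as src, \
        open("wikidata-subset.ndjson", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.rstrip().rstrip(",")
        if not line or line in ("[", "]"):
            continue                      # skip the enclosing JSON array brackets
        entity = json.loads(line)
        classes = {
            claim["mainsnak"]["datavalue"]["value"]["id"]
            for claim in entity.get("claims", {}).get("P31", [])
            if claim["mainsnak"].get("snaktype") == "value"
        }
        if classes & KEEP_P31:
            dst.write(json.dumps(entity) + "\n")
```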
I hope that splitting the wikidata dump into smaller, more functional chunks is something the wikidata project considers.
It's probably less about splitting the dumps up and more about starting to split the main Wikidata namespace into more discrete areas, because without that it is hard to partition the full Wikidata graph or split the dumps into anything functional. For example, the latest Wikidata news was "The sixty-three millionth item, about a protein, is created." (yay!) - but there are lots and lots of proteins. If someone is mirroring Wikidata locally to speed up their queries for, say, an astronomy use case, having to download, store, and process a bunch of triples about a huge collection of proteins is only making their life harder. Maybe some of these specialized collections should go into their own namespace, like "wikidata-proteins" or "wikidata-biology". The project can have some guidelines about how "notable" an item has to be before it gets moved into "wikidata-core". Hemoglobin, yeah, that probably belongs in "wikidata-core". "MGG_03181-t26_1" aka Q63000000 (which is some protein that's been found in rice blast fungus) - well, maybe that's not quite notable enough just yet, but is certainly still valuable to some subset of the community.
Federated queries mean that this isn't too much harder to manage from a usability standpoint. If my local graph query processor/database knows that it has large chunks of wikidata mirrored into it, it doesn't need to use federated SPARQL to make remote network calls to wikidata.org's WDQS to resolve my query - but if it stumbles across a graph item that it needs to follow back across the network to wikidata.org, it can.
And wikidata.org could and still should strive to manage as many entities in its knowledge base as possible, and load as many of these different datasets into its local graph database to feed the WDQS, potentially even knowledgebases that aren't from wikidata.org. That way, federated queries that previously would have had to have made network calls can instead be just integrated into the local query plan and hopefully go much faster.
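(A minimal sketch of what such a federated query could look like, assuming a hypothetical local mirror endpoint that allows outbound SERVICE calls; the SERVICE block is standard SPARQL 1.1 federation and the local endpoint URL is illustrative.)

```python
import requests

LOCAL = "http://localhost:9999/bigdata/namespace/wdq/sparql"   # placeholder local mirror

QUERY = """
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?star ?label WHERE {
  ?star wdt:P31 wd:Q523 .                        # answered from the local mirror
  SERVICE <https://query.wikidata.org/sparql> {  # followed over the network only when needed
    ?star rdfs:label ?label .
    FILTER(LANG(?label) = "en")
  }
}
LIMIT 10
"""

resp = requests.get(LOCAL, params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"}, timeout=60)
for row in resp.json()["results"]["bindings"]:
    print(row["star"]["value"], row["label"]["value"])
```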
-Erik
Erik Paulson, 12/05/19 01:54:
It's probably less about splitting the dumps up and more about starting to split the main wikidata namespace into more discrete areas [...]
In fact that was one of the proposals in "The future of bibliographic data in Wikidata: 4 possible federation scenarios". https://www.wikidata.org/wiki/Wikidata:WikiCite/Roadmap
Federico
Hoi, When you consider splitting things up into parts, the question becomes how we bring them back together. Where a person has publications and awards, and awards are in a separate area, how do you find the person in either environment? It is one thing to consider splitting up because it may help with "operational" issues; the key thing is what the interaction will be like. Without a practical way to mix and match, it will lead to towers of knowledge that hardly interact. Thanks, GerardM
Hi!
For the technical guys, consider our growth and plan for at least one year. When the impression exists that the current architecture will not scale beyond two years, start a project to future proof Wikidata.
We may also want to consider whether Wikidata is actually the best store for all kinds of data. Let's consider an example:
https://www.wikidata.org/w/index.php?title=Q57009452
This is an entity that is almost 2M in size, with almost 3000 statements, and each edit to it produces another 2M data structure. Its dump, albeit slightly smaller, is still 780K and will need to be updated on each edit.
Our database is obviously not optimized for such entities, and they won't perform very well. We have 21 million scientific articles in the DB, and if even 2% of them were like this, it would be almost a terabyte of data (multiplied by the number of revisions) and billions of statements.
While I am not against storing this as such, I do wonder if it's sustainable to keep such kind of data together with other Wikidata data in a single database. After all, each query that you run - even if not related to that 21 million in any way - will still have to run within the same enormous database and be hosted on the same hardware. This is especially important for services like Wikidata Query Service where all data (at least currently) occupies a shared space and can not be easily separated.
Any thoughts on this?
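(For anyone who wants to see the numbers, a quick sketch that fetches the item via Special:EntityData and counts its statements; exact figures will vary by revision.)

```python
import json
import requests

# Fetch the full JSON for the item via Special:EntityData, then measure the blob
# that has to be re-serialised (and re-processed by WDQS) on every edit.
url = "https://www.wikidata.org/wiki/Special:EntityData/Q57009452.json"
entity = requests.get(url, timeout=60).json()["entities"]["Q57009452"]

size_mb = len(json.dumps(entity)) / 1e6
statements = sum(len(claims) for claims in entity.get("claims", {}).values())
print(f"{size_mb:.1f} MB of JSON, {statements} statements")
```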
Yeah, the Wikibase storage doesn't sound right here, but these are two different issues: one with Wikibase (SQL) and one with the Wikidata Query Service (Blazegraph).
Is that 2M footprint the SQL DB blob? And each additional 2M per edit is the version history, correct?
So the issue you are referring to here is in the design of the SQL-based "Wikibase Repository"? How do the 2M footprint and its versions compare to a large Wikipedia blob?
WQS data doesn't have versions, it doesn't have to be in one space and can easily be separated. The whole point of LOD is to decentralize your data. But I understand that Wikidata/WQS is currently designed as a centralized closed-shop service, for several reasons, granted.
Hi!
WQS data doesn't have versions, it doesn't have to be in one space and can easily be separated. The whole point of LOD is to decentralize your data. But I understand that Wikidata/WQS is currently designed as a centralized closed-shop service, for several reasons, granted.
True, WDQS does not have versions. But each time an edit is made, we now have to download and work through the whole 2M... It wasn't a problem when we were dealing with regular-sized entities, but the current system certainly is not good for such giant ones.
As for decentralizing, WDQS supports federation, but for obvious reasons federated queries are slower and less efficient. That said, if there were a separate store for such kind of data, it might work, as cross-querying against other Wikidata data wouldn't be very frequent. But this is something that the Wikidata community needs to figure out how to do.
Maybe it would be a good idea to run SPARQL updates directly against the endpoint, rather than taking the detour via SQL blobs here.
How large is the RDF TTL of the page?
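(A rough sketch of what "updating the endpoint directly" could look like against a private Wikibase/Blazegraph instance that accepts SPARQL Update; the public WDQS does not. The endpoint URL and the author items are placeholders.)

```python
import requests

UPDATE_ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"  # placeholder

# Replace a single author statement instead of re-loading the whole ~2 MB entity.
# Q_OLD_AUTHOR / Q_NEW_AUTHOR are placeholders, not real items.
update = """
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
DELETE DATA { wd:Q57009452 wdt:P50 wd:Q_OLD_AUTHOR };
INSERT DATA { wd:Q57009452 wdt:P50 wd:Q_NEW_AUTHOR }
"""

requests.post(UPDATE_ENDPOINT, data={"update": update}, timeout=60).raise_for_status()
```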
Hoi, Your approach is technically valid. It is equally obvious that it is, in part, the wrong approach. Where you say we have to consider whether Wikidata is the best store for all kinds of data, you may be pointing at the inadequacies of Wikidata in relation to particular kinds of data we already store and want to use. The fact is that this is what Wikidata is being used for. In addition, there is more data that people want to include in Wikidata that will provide a real service, a service that blends in really well with our mission.
For me it does not really matter how and where things are stored. In this thread it is relevant to pursue an answer to the question of how we will scale, how we will serve the needs that are now served by Wikidata and the needs that are not yet served by Wikidata. Wikidata is the project, and as long as data comes together to be manipulated or queried in a consistent manner, it may be Wikibase or whatever.
The issue is how we scale, not why we should accept too few resources by restricting the functionality of Wikidata. Thanks, GerardM
Hi Stas,
Many thanks for writing this down! It is very useful to have a clear statement like this from the dev team.
Given the sustainability concerns that you mention, I think the way forward for the community could be to hold an RFC to determine a stricter admissibility criterion for scholarly articles.
It could be one of (or a boolean combination of) these:
- having a site link;
- being used as a reference for a statement on Wikidata;
- being cited in a sister project;
- being cited in a sister project using a template that fetches the metadata from Wikidata such as {{cite Q}};
- being authored by someone with a Wikipedia page about them;
- … any other criterion that comes to mind.
This way, the size of the corpus could be kept under control, and the criterion could be loosened later if the scalability concerns are addressed.
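(For illustration, the second criterion above, being used as a reference for a statement, can already be checked per item with an ASK query against WDQS; a minimal sketch, relying on the wd:/prov:/pr: prefixes that WDQS predefines.)

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

def used_as_reference(qid: str) -> bool:
    """True if any Wikidata statement cites this item via 'stated in' (P248)."""
    query = f"""
    ASK {{
      ?statement prov:wasDerivedFrom ?ref .
      ?ref pr:P248 wd:{qid} .
    }}
    """
    resp = requests.get(WDQS, params={"query": query},
                        headers={"Accept": "application/sparql-results+json"}, timeout=60)
    return resp.json()["boolean"]

print(used_as_reference("Q57009452"))  # the ATLAS paper discussed above, as an example
```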
Cheers, Antonin
Hoi, Yes, we could do that. What follows is that functionality of Wikidata is killed. Completely dead.
Also, this thread is about us being ready for what we could be, not for what we are. And at that, we are not as good as we could be. In your suggestions for inclusion, we could have the "concept cloud" for Wikipedia articles defined by their wikilinks and define them in Wikidata. We don't. We could have a usable user interface like Reasonator. We don't.
The reason for this thread is: what does it take for us to have a performing system, because we don't. Our growth is less than what it could be. Our functionality is less than what it could be. At the same time we are restricted, it seems, by annual budgets that do not take into account what functionality we provide and could provide. The notion that we should restrict our content for performance's sake.. A rich organisation that is the Wikimedia Foundation. REALLY!! Thanks, GerardM
So, I'm not particularly involved with the scholarly-papers work, but with my day-job bibliographic analysis hat on...
Papers like this are a *remarkable* anomaly - hyperauthorship like this is confined to some quite specific areas of physics, and is still relatively uncommon even in those. I don't think we have to worry about it approaching anything like 2% of papers any time soon :-)
For 2018 publications, the global mean number of authors/paper is slightly under five (all disciplines). Over all time, allowing for there being more new papers than old ones, I'd guess it's something like three.
Andrew.
Hoi, For your information, these huge numbers of authors are particularly noticeable when organisations like CERN are involved. Those people all have an ORCID identifier, and slowly but surely more authors are being associated with publications. As a consequence, papers are getting to be complete for their authors. As more authors become available, it will be possible to get more initial value in the first instance of a paper.
Given that the SOURCEMD jobs run in a narrow batch mode, more jobs running concurrently offset the impact of "CERN" jobs. Over time, fewer edits associated with big articles with large numbers of co-authors will be processed.
NB this is an answer to off-topic issues raised. This is only one instance of functionality that we support. Thanks, GerardM
We may also want to consider whether Wikidata is actually the best store for all kinds of data. Let's consider an example:
https://www.wikidata.org/w/index.php?title=Q57009452
This is an entity that is almost 2M in size, with almost 3000 statements ...
A paper with 2884 authors! arxiv.org deals with it by calling them the "Atlas Collaboration": https://arxiv.org/abs/1403.0489 The actual paper does the same (with the full list of names and affiliations in the Appendix).
The nice thing about graph databases is we should be able to set author to point to an "Atlas Collaboration" node, and then have that node point to the 2884 individual author nodes (and each of those nodes point to their affiliation).
What are the reasons to not re-organize it that way?
My first thought was that who is in the collaboration changes over time. But does it change day to day, or only each academic year?
Either way, maybe we need to point the author field to something like "Atlas Collaboration 2014a", and clone-and-modify that node each time we come to a paper that describes a different membership?
Or is it better to model each person's membership of such a group with a start and end date?
(BTW, arxiv.org tells me there are 1059 results for ATLAS Collaboration; don't know if one "result" corresponds to one "paper", though.)
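(A sketch of the shape of that indirection, using rdflib just to spell out the triples. P50 (author) and P527 (has part) are real Wikidata properties; all item IDs below are placeholders, and start/end dates could be added with qualifiers such as P580/P582 in the full data model.)

```python
from rdflib import Graph, Namespace

WD = Namespace("http://www.wikidata.org/entity/")
WDT = Namespace("http://www.wikidata.org/prop/direct/")

g = Graph()
paper = WD["Q_EXAMPLE_PAPER"]           # placeholder for the article item
collab = WD["Q_EXAMPLE_COLLAB_2014a"]   # placeholder for "ATLAS Collaboration 2014a"

g.add((paper, WDT.P50, collab))         # author -> one collaboration snapshot
for i in range(2884):                   # the snapshot -> each individual researcher
    g.add((collab, WDT.P527, WD[f"Q_EXAMPLE_AUTHOR_{i}"]))

print(len(g), "triples in total, but only one author statement on the paper itself")
```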
While I am not against storing this as such, I do wonder if it's sustainable to keep such kind of data together with other Wikidata data in a single database.
It feels like it belongs in "core" Wikidata. Being able to ask "which papers has this researcher written?" seems like a good example of a Wikidata query. Similarly, "which papers has the ATLAS Collaboration worked on?"
But, also, are queries like "Which authors of Physics papers went to a high school that had more than 1000 students?" part of the goal of Wikidata? If so, Wikidata needs optimizing in such a way that makes such queries both possible and tractable.
Darren
Indeed, these collaborations in high-energy physics are not static quantities; they change essentially every day (people get hired and their contracts expire), and most likely every two papers have a slightly different author list.
Cheers Yaroslav
Hi all,
I would like to throw in a slightly different angle here. The GlobalFactSync Project https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE will start in June.
As a preparation we wrote this paper describing the engine behind it: https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf
There have already been very constructive comments at https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE... which led us to focus on syncing music (bands, singles, albums) as 1 of the 10 sync targets. Other proposals for domains are very welcome.
The rationale behind GlobalFactSync is this:
Managing data quality is Pareto-efficient, i.e. the first 80% is easy to achieve and each percent after that gets much more expensive, following the law of diminishing returns. As a consequence for Wikidata: WD is probably at 80% now, so maintaining it gets harder, because you need to micro-optimize to find the new errors and fill in missing information. This is compounded by growing Wikidata further in terms of entities.
GlobalFactSync does not solve the Pareto-efficiency problem, but it cheats it, as we hope that it will pool the manpower of Wikipedia editors and Wikidata editors and also mobilize DBpedia users to edit either in WP or WD.
In general, Wikimedia runs the 6th largest website in the world. They are in the same league as Google or Facebook, and I have absolutely no doubt that they have ample expertise in tackling the scalability of hosting, e.g. by doubling the number of servers or web caching. The problem I see is that you cannot easily double the editor manpower or bot edits. Hence the GlobalFactSync grant.
We will send out an announcement in a week or two. Feel free to suggest sync targets. We are still looking into the complexity of managing references, as this is bread and butter for the project.
All the best,
Sebastian