Dear Wikimedia, Let me introduce myself first. My name is Ivan Heibi, and I am a researcher at the University of Bologna working at OpenCitations (directed by Silvio Peroni), where I am responsible for the technical infrastructure.
We are currently facing a technical issue with our triplestore that I wanted to share with you, hoping that your experience with similar issues might give us some new insights. Thank you in advance for your time and support; below I briefly explain the issue.
Currently OpenCitations stores and maintains its data (citations and bibliographic metadata) in one big triplestore (JNL format) using the Blazegraph database. The current JNL file has reached almost 1.5 TB and is updated regularly (roughly every two months) with new triples (data regarding new citations). However, the current JNL file no longer seems to accept any further data, even though its size and total number of triples (almost 8 billion) are below the limit that Blazegraph states (50 billion). Any attempt to load additional triples into the JNL file (DATA LOAD) hangs indefinitely, with no effect on the triplestore.
We tried to LOAD new data into the JNL file using different properties when launching the Blazegraph triplestore, but every test gave the same negative result.
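For readers unfamiliar with the operation, a load of this kind can be submitted as a SPARQL 1.1 UPDATE request to the triplestore's HTTP endpoint. The following is a minimal Python sketch and not the actual OpenCitations setup: the endpoint URL, the namespace name (kb), the file path and the timeout value are all assumptions chosen for illustration. It sets an explicit client-side timeout so that a stalled load surfaces as an error instead of blocking the caller forever.

```python
import urllib.parse
import urllib.request

# Assumed values for illustration only; adjust to the actual deployment.
ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"
DATA_FILE = "file:///data/new_citations.nt"  # hypothetical path


def build_load_request(endpoint: str, data_uri: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 UPDATE request asking the server to LOAD a file."""
    body = urllib.parse.urlencode({"update": f"LOAD <{data_uri}>"}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def run_load(endpoint: str, data_uri: str, timeout_s: float = 3600.0) -> int:
    """Send the LOAD request and return the HTTP status code.

    The timeout applies to blocking socket operations, so a server that
    stays silent for longer than timeout_s raises an exception here.
    """
    request = build_load_request(endpoint, data_uri)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status

# Example call (needs a running server, so it is left commented out):
# run_load(ENDPOINT, DATA_FILE)
```

A timeout does not fix a hanging load, but it makes the failure visible to scripts that drive periodic updates.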
Have you ever faced similar behaviour? Are you aware of any Blazegraph limits that we are overlooking? If you have faced such problems, what solutions did you adopt, and what would you suggest?
Thank you in advance for your support and help, Have a nice day, Ivan Heibi
----------------------------------------------------------------
Ivan Heibi, Ph.D.
Digital Humanities Advanced Research Centre (DHARC), Department of Classical Philology and Italian Studies, University of Bologna, Bologna (Italy)
E-mail: ivan.heibi2@unibo.it
Twitter: @ivanHeiB https://twitter.com/ivanheib
Personal web site: ivanhb.it http://ivanhb.it
University web page: unibo.it/sitoweb/ivan.heibi2 https://www.unibo.it/sitoweb/ivan.heibi2/
Hello!
Wikidata currently has ~15B triples [1] for an on-disk journal size of ~1.1 TB. We have previously experienced issues with running out of allocators (some docs in [2]), which prevents adding more triples to the store. We did some limited tuning with the help of a Blazegraph expert a while back ([3][4]), but those changes are not particularly well documented.
I would take the 50B triples limit with a grain of salt. I suspect that this number comes from a fairly specific data set and workload that might not reflect real world usage.
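As a rough way to compare the two deployments, the figures quoted in this thread can be turned into an average on-disk footprint per triple. This is a back-of-the-envelope sketch only: it assumes 1 TB = 10^12 bytes and treats the approximate numbers mentioned above (almost 1.5 TB and almost 8 billion triples for OpenCitations, ~1.1 TB and ~15B triples for Wikidata) as exact.

```python
TB = 10**12  # assumption: decimal terabytes


def bytes_per_triple(journal_bytes: float, triples: float) -> float:
    """Average number of journal bytes used per stored triple."""
    return journal_bytes / triples


# Approximate figures taken from the messages in this thread.
opencitations = bytes_per_triple(1.5 * TB, 8e9)
wikidata = bytes_per_triple(1.1 * TB, 15e9)

print(f"OpenCitations: ~{opencitations:.1f} bytes/triple")
print(f"Wikidata:      ~{wikidata:.1f} bytes/triple")
print(f"Ratio:         ~{opencitations / wikidata:.1f}x")
```

Under these assumptions the OpenCitations journal uses noticeably more bytes per triple than the Wikidata one, which is consistent with the point that a triple count alone says little about how close a journal is to its practical limits.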
So overall, I'm afraid that we don't have great advice to share. We are struggling ourselves with scaling Blazegraph to the size of Wikidata. You might want to try the Blazegraph issues on GitHub [5], but activity there is limited.
Good luck! And let us know what you find!
Have fun!
Guillaume
[1] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&...
[2] https://github.com/blazegraph/database/wiki/RWStore
[3] https://phabricator.wikimedia.org/T213210
[4] https://phabricator.wikimedia.org/T238362
[5] https://github.com/blazegraph/database/issues
In reply to the message from Ivan Heibi (ivan.heibi2@unibo.it) of Tue, 17 Jan 2023 at 16:59.
_______________________________________________
Discovery mailing list -- discovery@lists.wikimedia.org
To unsubscribe send an email to discovery-leave@lists.wikimedia.org
Hi,
A couple more things we learned about Blazegraph: it is pretty bad at reclaiming free space, especially if you use multiple namespaces; dropping a namespace will not reduce the size of the journal. We are also experiencing a deadlock that makes the service totally unresponsive. We believe it is triggered by some query load, since it only happens on the public-facing nodes [0]; the main symptom is the thread count increasing steadily, blocking all queries. I would suggest taking a few thread dumps of Blazegraph when this happens, as there might be things to learn. You are very welcome to join our office hours [1] to discuss Blazegraph further.
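One possible way to act on the thread-dump suggestion is sketched below in Python. It is an illustration and not a tool used by the Wikimedia team: it assumes the JDK's jstack command is on the PATH and that the process id of the Blazegraph JVM is known. The function names, the number of dumps and the interval are hypothetical choices. The summary function counts threads per JVM state, which makes a steadily growing number of BLOCKED or WAITING threads easy to spot across successive dumps.

```python
import re
import subprocess
import time
from collections import Counter

# Matches the state line that HotSpot prints for each thread in a dump.
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def summarize_states(dump_text: str) -> Counter:
    """Count threads per JVM state (RUNNABLE, BLOCKED, WAITING, ...) in one dump."""
    return Counter(STATE_RE.findall(dump_text))


def capture_dumps(pid: int, count: int = 3, interval_s: float = 10.0) -> list[str]:
    """Run `jstack <pid>` several times, pausing between runs.

    Assumes jstack is installed and the caller may attach to the JVM.
    """
    dumps = []
    for i in range(count):
        result = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, check=True
        )
        dumps.append(result.stdout)
        if i < count - 1:
            time.sleep(interval_s)
    return dumps

# Example (needs a running JVM, so it is left commented out):
# for dump in capture_dumps(pid=12345):
#     print(summarize_states(dump))
```

Comparing the summaries of dumps taken a few seconds apart shows whether the same threads stay stuck, which is usually more informative than a single snapshot.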
Hope it helps,
David.
0: https://phabricator.wikimedia.org/T242453
1: https://www.mediawiki.org/wiki/Wikimedia_Search_Platform#Office_Hours
In reply to the message from Guillaume Lederrey (he/him), Engineering Manager, Wikimedia Foundation (glederrey@wikimedia.org), of Wed, Jan 18, 2023 at 10:57 AM.