Hopefully this is the right mailing list for my topic.
The German Verein für Computergenealogie is the largest genealogical society in Germany, with more than 3,700 members. We are currently considering whether Wikibase is a suitable system for us. Of particular interest is its use for our *prosopographical data*.
Prosopographical data can be divided into three classes:
a) well-known and well-studied personalities, typically authors
b) lesser-known but well-studied personalities that can be clearly and easily identified in historical sources
c) persons whose identifiability in various sources (such as church records, civil records, city directories) has to be established using (mostly manual) record linkage
Data from class (a) can be found in the GND of the German National Library. For data from class (b), systems such as FactGrid exist. The Verein für Computergenealogie mostly works with data from class (c). We have a huge amount of that kind of data, more than 40 million records, currently stored in several MySQL and MongoDB databases.
This leads me to the crucial question: Is the performance of Wikibase sufficient for such an amount of data? One record for a person will typically result in maybe ten statements in Wikibase. Using QuickStatements or the WDI library I have not been able to insert more than two or three statements per second. At that rate, importing 40 million records would take months, if not years.
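For reference, this is roughly the pattern I am using with the WDI library (a simplified sketch from memory; the instance URL, credentials, property IDs, and values are placeholders):

```python
from wikidataintegrator import wdi_core, wdi_login

API_URL = "https://wikibase.example.org/w/api.php"  # placeholder instance

login = wdi_login.WDLogin(user="ImportBot", pwd="...",
                          mediawiki_api_url=API_URL)

# One person record becomes roughly ten statements; P1, P2, ... are
# placeholder properties of our own Wikibase instance.
statements = [
    wdi_core.WDString(value="Jansen", prop_nb="P1"),    # surname
    wdi_core.WDString(value="Westerau", prop_nb="P2"),  # place named in the record
    # ... about eight more statements per person ...
]

item = wdi_core.WDItemEngine(data=statements, new_item=True,
                             mediawiki_api_url=API_URL)
item.set_label("Hans Jansen", lang="de")
item.write(login)  # one API write per person
```

Even with all statements for a person bundled into a single write() call, one sequential process like this does not get past a few statements per second.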
Another question is whether the edit history of the entries can be preserved. For some data sets the edit history goes back to 2004.
I hope someone can give me hints on these questions.
Best wishes Jesper
Hi Jesper,
I haven't tried it yet, but maybe wikibase-cli performs better: https://github.com/maxlath/wikibase-cli
Regarding the edit history: do you need the actual user names and timestamps of the edits, or just the history per se? I think the easiest approach would be not to import it, but to keep the history in some kind of archive...
Best regards Johannes
Hi Jesper,
regarding import performance, the bottleneck seems to be the Java query updater. Adam has apparently coded a performance boost for this in his wbstack service that can/will soon be integrated into the main repository used by Wikibase. See this blog post:
https://addshore.com/2020/04/wbstack-2020-update-1/ (headline "Queryservice")
Once you have tried it out, please let us know in detail what difference it makes to performance...
Best regards Johannes
Hi all
The performance boost in that context is only for updating multiple wikibases from a single process.
The WMF is currently also working on a streamlined update service for the WDQS deployment.
I'm not sure whether WDI or QuickStatements makes any asynchronous requests by default, so if you're only running one copy, that's probably why. I expect some asynchronous write requests would definitely raise your edit rate.
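As a rough illustration of what I mean (an untested sketch using plain Python requests against the action API; login, token fetching, and error handling are omitted, and the URL and records are placeholders):

```python
import json
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "https://wikibase.example.org/w/api.php"  # placeholder
CSRF_TOKEN = "..."  # obtain via action=query&meta=tokens after logging in

# Person records already converted to Wikibase entity JSON.
records = [
    {"labels": {"de": {"language": "de", "value": "Hans Jansen"}}},
    # ...
]

def create_item(entity_json):
    # One wbeditentity call creates the item together with all of its
    # statements, so each request carries ~10 statements at once.
    r = requests.post(API_URL, data={
        "action": "wbeditentity",
        "new": "item",
        "data": json.dumps(entity_json),
        "token": CSRF_TOKEN,
        "bot": 1,
        "format": "json",
    })
    r.raise_for_status()
    return r.json()

# Several writers in parallel instead of one sequential loop: at ~10
# statements per item, even a few edits per second per worker adds up.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(create_item, records))
```

How far you can push max_workers will depend on the rate limits and resources of your own instance.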
The next limit you'll run into comes down to the resources and specification of the MediaWiki environment, the SQL server, and their configuration.
There are many points that can be tuned; docs exist for MediaWiki itself, but nothing specific to Wikibase. I did once write down some pointers, but I'm not sure where they are (someone else on this list might remember).
Let me know how it goes with the pointers above. If you don't make much headway I can try to write some more docs for all of this.
This leads me to the crucial question: Is the performance of Wikibase sufficient for such an amount of data?
Does someone have experience with inserting Wikibase data directly into the database? That was one suggestion I received on how to import 80 million+ items into a Wikibase installation.
I have tried to import the data into the page, text, and revision tables, as I did a few years ago on a normal MediaWiki installation. However, Wikibase seems to use more than these three tables.
Best regards Jesper