Hello,
I'm trying to write a pywikibot script which reads and creates items / properties on my Wikibase instance. Following pieces of tutorials and example scripts, I managed to write something that works.
1/ The idea is to read a CSV file and create an item with its properties for each line. So I have to loop over thousands of lines and, for each one, create an item and multiple associated claims, which takes quite some time (at least 1 hour to create 1000 items). I guess it's because for each line I create a new entity and then new claims, which means multiple requests per line. Some pseudo-code from my script: to create a new item, I use `repo.editEntity({}, {}, summary='new item')`, assuming `repo = site.data_repository()`; to create a new claim, I use `self.user_add_claim_unless_exists(item, claim)`, assuming my bot inherits from WikidataBot.
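Roughly, a standalone version of that per-line flow looks like the sketch below (the file name, the `name` column and the `P2` property are just placeholders; in the real script the claims go through user_add_claim_unless_exists):
'''
import csv

import pywikibot

site = pywikibot.Site()
repo = site.data_repository()

with open('items.csv', newline='') as f:          # placeholder file name
    for row in csv.DictReader(f):
        # Request 1: create an empty item
        item = pywikibot.ItemPage(repo)
        item.editEntity({}, summary='new item')

        # Requests 2, 3, ...: one more API call per claim
        claim = pywikibot.Claim(repo, 'P2')       # placeholder string property
        claim.setTarget(row['name'])              # placeholder CSV column
        item.addClaim(claim, summary='adding claim')
'''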
Is there a better way to optimize that kind of bulk import?
--
2/ I kind of have the same problem if I want to check whether an item already exists, because first I need to get all existing items and check whether they are in my CSV or not. (The CSV does not contain QIDs, but it does contain a "custom" ID I've created and added as a property to each item.)
--
I hope I was clear enough; any relevant example, idea, or advice would be much appreciated. Bear in mind that I'm a beginner with the whole ecosystem, so I'm open to any recommendation. Thanks!
On Mon, 4 Feb 2019 at 16:36, Kévin Bois kevin.bois@biblissima-condorcet.fr wrote:
Is there a better way to optimize that kind of bulk import?
Not sure about this, but you might consider using low-level API functions directly, or even crafting your API calls by hand. That kind of defeats the purpose of using pwb, but oh well...
--
2/ I kind of have the same problem if I want to check whether an item already exists, because first I need to get all existing items and check whether they are in my CSV or not. (The CSV does not contain QIDs, but it does contain a "custom" ID I've created and added as a property to each item.)
This sounds like a great job for a SPARQL query (see https://query.wikidata.org for the public endpoint of Wikidata). Is it feasible to add such an interface to your instance?
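For example, once an endpoint is wired up, a lookup by your custom-ID property could look roughly like this with pywikibot's SPARQL helper (the endpoint URLs and `P2` are placeholders, and it assumes the query service exposes the usual `wdt:` prefix):
'''
from pywikibot.data import sparql

# Placeholders: point these at your own query service and concept URI.
endpoint = sparql.SparqlQuery(
    endpoint='https://query.example.org/sparql',
    entity_url='https://wikibase.example.org/entity/',
)

def find_qid(custom_id):
    """Return the QID of the item holding this custom ID, or None."""
    query = 'SELECT ?item WHERE { ?item wdt:P2 "%s" . }' % custom_id
    result = endpoint.select(query)
    if result:
        # Results are full entity URIs; keep only the trailing QID.
        return result[0]['item'].rsplit('/', 1)[-1]
    return None
'''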
Regards, Strainu
--
On Mon, 4 Feb 2019 15:36:08 +0100, Kévin Bois kevin.bois@biblissima-condorcet.fr wrote:
Is there a better way to optimize that kind of bulk import?
--
2/ I kind of have the same problem if I want to check whether an item already exists, because first I need to get all existing items and check whether they are in my CSV or not. (The CSV does not contain QIDs, but it does contain a "custom" ID I've created and added as a property to each item.)
--
I do not know if this message will be delivered. I hope so.
About the first question: I think you can split the workload among different Python threads.
About the second: could you generate the QID with an injective function of your custom ID? Then you would just have to run the function on your ID and check whether the corresponding QID exists.
Pellegrino
Hello,
I'll answer inline in the body of both mails. Thank you so much for your help!
On 4 Feb 2019 at 17:14, Strainu strainu10@gmail.com wrote:
Not sure about this, but you might consider using low-level API functions directly, or even crafting your API calls by hand. That kind of defeats the purpose of using pwb, but oh well...
=> I see. I think I'll try to figure it out with pywikibot first, for simplicity's sake. If I can't find a good enough solution with pwb, I may try that.
This sounds like a great job for a SPARQL query (see https://query.wikidata.org for the public endpoint of Wikidata). Is it feasible to add such an interface to your instance?
=> Yes, I'll plug in a SPARQL endpoint soon. I assume that kind of request is fast, so this is definitely something I'll try!
--
On 4 Feb 2019 at 17:59, Pellegrino Prevete pellegrinoprevete@gmail.com wrote:
About the first question: I think you can split the workload among different Python threads.
=> That sounds awesome, I'll look into that.
About the second: could you generate the QID with an injective function of your custom ID? Then you would just have to run the function on your ID and check whether the corresponding QID exists.
=> It sounds like what I had in mind, but I'm not sure I understood correctly what you mean. To expand on what I wanted to do: before adding anything with the script, I wanted to build a big mapping (in a Python dictionary) from my custom ID to its corresponding QID, something like id_mapping = {custom_id1: QID1, custom_id2: QID2, ...}. Then I could easily look into that dictionary when needed before actually adding an item. This is why I'm trying to retrieve all existing items as a first step.
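For instance, once the SPARQL endpoint is in place, the whole mapping could probably be built with a single query, something like this (again with `P2` standing in for my custom-ID property and placeholder endpoint URLs):
'''
from pywikibot.data import sparql

# Placeholder endpoint/entity URLs and property, as in Strainu's suggestion.
endpoint = sparql.SparqlQuery(
    endpoint='https://query.example.org/sparql',
    entity_url='https://wikibase.example.org/entity/',
)

rows = endpoint.select(
    'SELECT ?item ?custom_id WHERE { ?item wdt:P2 ?custom_id . }'
)

# {custom_id1: 'Q1', custom_id2: 'Q2', ...}
id_mapping = {row['custom_id']: row['item'].rsplit('/', 1)[-1] for row in rows}
'''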
--
Thank you for your help! I got some time to test some of your suggested solutions, and I drastically reduced the time to import my data by creating each item with all its claims in one call. For future readers: site.editEntity() already uses the `wbeditentity` API call; you only need to fill in the `data` argument with labels, descriptions, aliases, claims, and so on.
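Roughly, the per-item call ends up looking like this sketch (here via ItemPage.editEntity, which goes through the same API module; the label, description, `P2` and the value are placeholders taken from the current CSV line, and the claim part follows the raw wbeditentity JSON format):
'''
import pywikibot

site = pywikibot.Site()
repo = site.data_repository()

data = {
    'labels': {'en': {'language': 'en', 'value': 'Some label'}},
    'descriptions': {'en': {'language': 'en', 'value': 'Some description'}},
    'claims': [{
        'mainsnak': {
            'snaktype': 'value',
            'property': 'P2',   # placeholder property
            'datavalue': {'value': 'custom-id-0001', 'type': 'string'},
        },
        'type': 'statement',
        'rank': 'normal',
    }],
}

# A single wbeditentity request creates the item together with its label,
# description and claim.
item = pywikibot.ItemPage(repo)
item.editEntity(data, summary='new item with claims')
'''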
It's great, but I'm trying to optimize even more, and if I'm not mistaken there is no way to further reduce the number of requests needed to import my items, so the next step would be implementing some sort of multiprocessing, as suggested by Pellegrino Prevete.
I tried to implement that, but unfortunately Pywikibot raises an APIError: "invalid CSRF token". It sounds like the multiple processes share the same CSRF token to create / edit an item, which is a bit weird. All in all, it got me thinking: is it even possible to use multiprocessing with Pywikibot at all? Here's some pseudo-code to show what I did:
'''
from multiprocessing.dummy import Pool  # thread-based pool, despite the module name

pool = Pool(processes=4)
results = pool.map(self.process_line, csv)
pool.close()
pool.join()
'''
where `csv` is a list (the already-parsed CSV file) and `self.process_line` is my method that reads the data from the current CSV line and creates the item from it.
On 5 Feb 2019 at 09:35, Andra Waagmeester andra@micel.io wrote:
1/ The idea is to read a CSV file and create an item with its properties for each line. So I have to loop over thousands of lines and, for each one, create an item and multiple associated claims, which takes quite some time (at least 1 hour to create 1000 items). I guess it's because for each line I create a new entity and then new claims, which means multiple requests per line.
There is an API call, wbeditentity [1], that allows preparing an item with multiple claims, which are then written to the Wikibase in one call. Are you aware of Wikibase Universal Bot [2] and wikibase-tools [3]? Both cover functionality that should let you do what you describe above, and both use the wbeditentity call.
[1] https://www.wikidata.org/w/api.php?action=help&modules=wbeditentity [2] https://github.com/dcodings/Wikibase_Universal_Bot [3] https://github.com/stuppie/wikibase-tools