11 December 2024 - 24 January 2025
Hello, all,
I start this update by thanking everyone who took a look at the last report https://meta.wikimedia.org/wiki/BHL/Our_outcomes/WiR/Status_updates/2025-01-10, especially those who gave feedback on the 6 tasks. It seems we are all aligned on refining the BHL Image Structured Data Model https://docs.google.com/spreadsheets/d/1ocqDQBFaKAQvPsP3HMlrh52faiHiaDU-D9P3yz1oV_M/edit?gid=0#gid=0 and getting metadata transformed into 5-star Linked Open Data. That has therefore been the focus of the last 2 weeks.
https://meta.wikimedia.org/wiki/File:Oiseaux_brillans_du_Br%C3%A9sil_(20118041314).jpg The #P180 shortcut enables a direct link to the "depicts" statements for this blue-bellied parrot illustration.
Bite-sized news
- After a kind invitation from Fiona (WMF) https://meta.wikimedia.org/wiki/User:FRomeo_(WMF), I presented on Jan 15 at the Commons Community Call https://commons.wikimedia.org/wiki/Commons:WMF_support_for_Commons/Commons_community_calls a bit of what we are doing in the BHL-Wiki Working Group. The community is aligned on showing support for Structured Data on Commons, which might mean more and better tools for SDC in the future and more support for the Commons Query Service.
- At the request of Heidi Meudt https://en.wikipedia.org/wiki/User:Stitchbird2 on the WikiProject Biodiversity chat, the hasDepicts.js https://commons.wikimedia.org/wiki/User:TiagoLubiana/hasDepicts.js tool is now set to be "on" by default. It adds green or white balls under file thumbnails, showing whether or not they have depicts (P180) statements. It helps a bit when curating this information manually on Commons.
- Thanks to User:Nikki https://www.wikidata.org/wiki/User:Nikki, I have learned that we can share links directly to SDC statements on Commons by adding #P180 (or another property ID) to the end of the URL. For example: File:Oiseaux_brillans_du_Brésil_(20118041314).jpg#P180 https://commons.wikimedia.org/wiki/File:Oiseaux_brillans_du_Br%C3%A9sil_(20118041314).jpg#P180
- The BHL Arena app got good attention and feedback; I thank everyone who tried it! Good requests were raised by RichardLitt https://github.com/RichardLitt and dpriskorn https://github.com/dpriskorn on the GitHub repo https://github.com/lubianat/bhl-arena/issues and by Gio https://meta.wikimedia.org/wiki/User:GFontenelle_(WMF) on Telegram. Hopefully I will find the time to implement them soon.
- The Commons Impact Metrics Dashboard https://tiago.bio.br/impact_metrics/ is now tracking metrics for the BHL subcategories for South America https://tiago.bio.br/impact_metrics/?category=Files_from_the_Biodiversity_Heritage_Library_in_South_America and Africa https://tiago.bio.br/impact_metrics/?category=Files_from_the_Biodiversity_Heritage_Library_in_Africa. As these are newly tracked by the Commons Impact Metrics system, there might be bugs.
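To make the two SDC items in the list above concrete (the #P180 link shortcut and the check for depicts statements), here is a minimal Python sketch. The function names are hypothetical, and the second function assumes an entity dictionary shaped like what the Commons `wbgetentities` API returns for a MediaInfo entity. It is only an illustration of the idea, not how hasDepicts.js itself is implemented.

```python
from urllib.parse import quote

COMMONS_BASE = "https://commons.wikimedia.org/wiki/"


def sdc_statement_link(file_title: str, property_id: str = "P180") -> str:
    """Build a Commons file URL ending in '#<property id>' (hypothetical helper)."""
    title = file_title.replace(" ", "_")
    if not title.startswith("File:"):
        title = "File:" + title
    # Percent-encode non-ASCII characters, keep common title punctuation readable.
    return COMMONS_BASE + quote(title, safe=":/()_,-.") + "#" + property_id


def has_depicts(entity: dict) -> bool:
    """Return True if a MediaInfo entity dict carries at least one P180 statement.

    Assumption: 'statements' is a dict keyed by property ID, or an empty
    list when the file has no structured data statements at all.
    """
    statements = entity.get("statements") or {}
    return bool(statements.get("P180"))


if __name__ == "__main__":
    print(sdc_statement_link("Oiseaux_brillans_du_Brésil_(20118041314).jpg"))
```

Passing a different property ID as the second argument gives the link for that property instead of depicts.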
PIWG connections
Last Wednesday, Jan 22, I attended the Persistent Identifiers Working Group (PIWG) meeting to meet Nicole https://www.wikidata.org/wiki/User:nicolekearney and others, and to get to know a bit more about the identifiers work. From the conversations, some tasks emerged that could be useful for BHL's mission:
*Task 1. Develop more metrics about reuse of BHL content (i.e. DOIs) on Wikipedia*
Rod Page wrote a few queries in that direction, listed on Biodiversity Heritage Library/Our outcomes/Quarry https://meta.wikimedia.org/wiki/Biodiversity_Heritage_Library/Our_outcomes/Quarry. So far, they show about 7680 uses of BHL DOIs plus 10139 links directly to the BHL website. Hopefully I can extend some of these metrics to get, for example, how many times these pages are visited per month.
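One possible route to the monthly visit counts is the public Wikimedia Pageviews REST API, which serves per-article view counts. The sketch below only builds the request URL and sums a response that has already been parsed. The function names are hypothetical, and the defaults (English Wikipedia, all access methods, human "user" traffic) are assumptions chosen for illustration, not a decision made for this project.

```python
from urllib.parse import quote

PAGEVIEWS_API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def monthly_pageviews_url(article: str, start: str, end: str,
                          project: str = "en.wikipedia.org") -> str:
    """Build a per-article, monthly-granularity Pageviews API URL.

    start and end are dates in YYYYMMDD format.
    """
    title = quote(article.replace(" ", "_"), safe="")
    return f"{PAGEVIEWS_API}/{project}/all-access/user/{title}/monthly/{start}/{end}"


def total_views(response: dict) -> int:
    """Sum the 'views' field over the 'items' list of a parsed API response."""
    return sum(item["views"] for item in response.get("items", []))
```

The API expects requests to carry a descriptive User-Agent header, so any script that actually fetches these URLs should set one.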
*Task 2. Update citations on Wikipedia to link more clearly to BHL using the "via" https://meta.wikimedia.org/wiki/Template:Cite_book#via parameter*
Citations currently show up as:
- Salvadori, Tommaso (1874). "Altre nuove specie di uccelli della Nuova Guinea e di Goram raccolte dal Signor L.M. D'Albertis" https://www.biodiversitylibrary.org/page/10812996. *Annali del Museo Civico di Storia Naturale di Genova* (in Italian and Latin) *6*: 81–88 [86]. OCLC https://en.wikipedia.org/wiki/OCLC 820904343 https://www.worldcat.org/oclc/820904343.
As you can see, BHL is hidden behind the links. The alternative makes the link to BHL much clearer, and could be used both for BHL DOIs and for direct website links:
- Salvadori, Tommaso (1874). "Altre nuove specie di uccelli della Nuova Guinea e di Goram raccolte dal Signor L.M. D'Albertis" https://www.biodiversitylibrary.org/page/10812996. *Annali del Museo Civico di Storia Naturale di Genova* (in Italian and Latin) *6*: 81–88 [86]. OCLC https://en.wikipedia.org/wiki/OCLC 820904343 https://www.worldcat.org/oclc/820904343 – via Biodiversity Heritage Library https://meta.wikimedia.org/wiki/Biodiversity_Heritage_Library.
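The change from the first citation to the second could, in principle, be made by a script over the article wikitext. Here is a minimal sketch, assuming the citations are flat `{{cite ...}}` templates with no other templates nested inside them and that they link directly to biodiversitylibrary.org. The function name and the exact value written to `via` are illustrative assumptions. The DOI case is left out, and a real bot would need a proper wikitext parser and community approval.

```python
import re

BHL_HOST = "biodiversitylibrary.org"
VIA_VALUE = "[[Biodiversity Heritage Library]]"  # assumed wording for the via value

# Matches a flat {{cite ...}} template, i.e. one without nested templates.
CITE_RE = re.compile(r"\{\{\s*cite\s+\w+[^{}]*\}\}", re.IGNORECASE)


def add_via(wikitext: str) -> str:
    """Append |via= to cite templates that link to BHL and do not have it yet."""

    def patch(match: re.Match) -> str:
        template = match.group(0)
        if BHL_HOST not in template:
            return template  # not a BHL citation, leave untouched
        if re.search(r"\|\s*via\s*=", template):
            return template  # already has a via parameter
        # Drop the closing braces, append the parameter, close again.
        return template[:-2].rstrip() + " |via=" + VIA_VALUE + "}}"

    return CITE_RE.sub(patch, wikitext)
```

Running the function twice on the same text leaves it unchanged after the first pass, which matters for a tool that may revisit pages.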
*Task 3. Update the Mix'n'Match catalog for BHL Authors D*
Deprecated IDs still linger there. Siobhan https://meta.wikimedia.org/wiki/User:Ambrosia10 contacted Lucy Schrader https://www.wikidata.org/wiki/Q124513124, who kindly provided instructions on how to update it. If all goes well, this should be done shortly. For reference, see all BHL catalogs in Wikidata:WikiProject_BHL/Statistics https://www.wikidata.org/wiki/Wikidata:WikiProject_BHL/Statistics.
If you have more tasks related to persistent identifiers in the BHL-Wiki interface, please let me know!
Structured Data Case Studies
After putting together a Tutorial for editing BHL images SDC on OpenRefine https://docs.google.com/document/d/18jVwiOsqxFoMAHJnJ864iTBcQx39ZliLefmQh26uZpg/edit?tab=t.0#heading=h.70mrgxdwb3ri, based on work by Siobhan and Sandra, I proceeded to do a few case studies to detect corner cases before scaling up. I am versioning the BHL Image Data Model https://docs.google.com/spreadsheets/d/1ocqDQBFaKAQvPsP3HMlrh52faiHiaDU-D9P3yz1oV_M/edit?gid=0#gid=0, and it is currently at v0.1.1.
Documentation for Each Case Study
Each case study corresponds to a different publication for which the images have been loaded into Commons. I am keeping track of these changes on a master Google Spreadsheet https://docs.google.com/spreadsheets/d/1YhMSb_iBylJaWPX37kZbVzdyWoFidT9a31Pl0oY3buc/edit?gid=0, with some general notes, criteria and links. Each case study is getting its own Google Doc, at least for now:
Case Study 01 - Model v. 0.1.1 - Albert Spear Hitchcock https://docs.google.com/document/d/1P96wHjYyDkMQw-0flYamJOEDE7b5hyrw7btXtI7pIWQ/edit?tab=t.0#heading=h.7ufwi5o62qyu
This photo album served to test the tutorial. One complexity was that it needed manual curation of the number of photos, done in OpenRefine. One detail I noticed was that some images in this album appear to have been digitized twice. Other little modelling challenges like this one are listed in the case study documents, for those interested.
- [image: Frailejones in the BHL collection] https://meta.wikimedia.org/wiki/File:Albert_Spear_Hitchcock_-_brief_report_on_a_trip_to_Ecuador,_Peru_and_Bolivia,_May_25,_1923-February_18,_1924_(Page_40)_BHL48116141.jpg (link https://commons.wikimedia.org/wiki/File:Albert_Spear_Hitchcock_-_brief_report_on_a_trip_to_Ecuador,_Peru_and_Bolivia,_May_25,_1923-February_18,_1924_(Page_40)_BHL48116141.jpg )
- [image: Frailejones from SLA, but not in the BHL collection] https://meta.wikimedia.org/wiki/File:Frailejones_in_northern_Ecuador,_1923.jpg
Case Study 02 - Model v. 0.1.1 - Apuntes sobre los insectos de Chile https://docs.google.com/document/d/1MQBmFePJESWhDj0uLp8EChhjINVkMN0FLmooHnlAQhY/edit?tab=t.0#heading=h.alrv62u4dvp
A small category, where some files had the {{BHL}} template and some did not. We decided, from now on, to parse only files with the {{BHL}} template, which include more metadata. This template is used over 230,000 times on Commons https://templatecount.toolforge.org/?lang=commons&name=BHL&namespace=10#bottom, which already makes it a decently sized dataset.
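The "only files with the {{BHL}} template" filter can be sketched in a few lines of Python. This is a simplified illustration rather than the bot's actual code: the function name is hypothetical, the parameter names used in any example are made up, and the regular expression assumes the template contains no nested templates or piped wikilinks.

```python
import re
from typing import Optional

# Matches {{BHL}} or {{BHL|...}} as long as nothing nested sits inside it.
BHL_TEMPLATE_RE = re.compile(r"\{\{\s*[Bb]HL\s*(\|[^{}]*)?\}\}")


def parse_bhl_template(wikitext: str) -> Optional[dict]:
    """Return the named parameters of the first {{BHL}} template, or None.

    None means the file page has no {{BHL}} template and would be skipped.
    """
    match = BHL_TEMPLATE_RE.search(wikitext)
    if match is None:
        return None
    params = {}
    body = match.group(1) or ""
    for part in body.split("|")[1:]:
        key, sep, value = part.partition("=")
        if sep:  # ignore unnamed (positional) parameters in this sketch
            params[key.strip()] = value.strip()
    return params
```

A caller would keep only the pages for which the function returns a dictionary, and read the metadata fields it needs from it.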
At this point, I moved from OpenRefine to a custom script using WikibaseIntegrator (GitHub Repository: Reconciliation Bot https://github.com/lubianat/bhl_sdc_exploration/tree/main/reconciliation_bot). The main advantages are scalability and the opportunity to automate a few of the manual parts of the workflow.
Case Study 03 - Model v. 0.1.1 - Beitrag zur Flora Brasiliens https://docs.google.com/document/d/1OC3qMzJ3iMAR8J7lv78y989UhO8xKMXddwLiFlDVBqs/edit?tab=t.0#heading=h.alrv62u4dvp
A nice set of plant illustrations, part of the famed contributions of von Martius https://en.wikipedia.org/wiki/Von_Martius and of the Prince of Wied https://en.wikipedia.org/wiki/Prince_Maximilian_of_Wied-Neuwied in Brazil. The illustrator, "T. Wild https://www.wikidata.org/wiki/Q131760409", did not even have a Wikidata item. It seems to be common for illustrators, engravers and lithographers to be forgotten in history. Crediting them in metadata is, I think, a starting point for doing them justice.
https://meta.wikimedia.org/wiki/File:Beitrag_zur_Flora_Brasiliens_(Pl._2)_(8226085515).jpg One of the plants drawn by T. Wild.
At this point I integrated editgroups https://github.com/Wikidata/editgroups, so that batches of edits can be tracked, and added a parser to get Sponsor and Collection based on the Bibliography IDs. The images in the category also had some structured data, added by the FlickypediaBackfillrBot https://commons.wikimedia.org/wiki/User:FlickypediaBackfillrBot, which led me to unearth bugs in the core of the WikibaseIntegrator package. Luckily, I was able to hand-fix the bugs, and now the bot runs with the custom package at https://github.com/lubianat/WikibaseIntegrator.
Case Study 04 - Model v. 0.1.1 - Abbildungen zur Naturgeschichte Brasiliens https://docs.google.com/document/d/16N5mzTALk-TsYN75j52iX3aTW54bY_liZ_WNPYeaLJE/edit?tab=t.0#heading=h.iy75xqj41y80
https://meta.wikimedia.org/wiki/File:Prinz_Maximilian_zu_Wied-Neuwied_mit_Joachim_Qu%C3%A4ck_auf_der_Jagd_im_brasilianischen_Urwald.jpg Max, the Prince of Wied, hunting with Joaquim Kuêk https://pt.wikipedia.org/wiki/Joaquim_Ku%C3%AAk.
Another work by Max, the Prince of Wied, but this time I was unable to figure out the illustrator. When studying the work, I found this image of Max with a native "helper", Joaquim Kuêk https://pt.wikipedia.org/wiki/Joaquim_Ku%C3%AAk, and was shocked. The dead macaw and the exploitation of indigenous people (Kuêk's skull was put in a museum after his death) hint at the ugly, untold parts of these beautiful books. In a way, images like this remind us of the importance of re-signifying the biodiversity treasure in BHL, and of making sure it has a positive and long-lasting impact for humankind and biodiversity. I think that is, ultimately, the mission behind this whole Wiki work.
Anyway, coming back to the structured data, some files in the category had, in the metadata, species names loaded from BHL OCR. I then wrote code to parse these as "depicts" statements. Some names detected by the BHL OCR don't seem to be used anymore, e.g. *Noctilio dorsatus*, which now seems to be called *Noctilio leporinus* https://www.gbif.org/pt/species/2433343. I am investigating technical ways to make these leaps automatically, maybe via GBIF.
Case Study 05 - Model v. 0.1.1 - Historia naturalis Brasiliae https://docs.google.com/document/d/1pDC9pMLK46bOKaB5mjI9aNaMLf7MdLmATtQggTYGAac/edit?tab=t.0#heading=h.alrv62u4dvp
This 1648 book is outstanding for its historical importance. It was published in Latin, in Leiden, and reflects a lot of the natural history of Dutch Brazil. There was, recently, a 1.5 million euro Horizon 2020 grant https://cordis.europa.eu/project/id/715423 just for studying this book. There are many copies known and catalogued https://www.taylorfrancis.com/chapters/oa-edit/10.4324/9781003362920-9/census-copies-willem-piso-georg-marcgraf-historia-naturalis-brasiliae-leiden-amsterdam-elzevier-1648-alex-alsemgeest-jeroen-bos and BHL has at least 5 different Title identifiers for it, all connected to the same Wikidata item https://www.wikidata.org/wiki/Q339343.
That detail adds some complexity to the workflow but, luckily, the {{BHL}} template stores the exact Item each image came from. The copies seem to have been colored by hand, and not by the illustrator, so some images look funny and differ a lot from copy to copy.
https://meta.wikimedia.org/wiki/File:Jacana_-_Historia_naturalis_Brasiliae_(Page_190)_BHL289283_(cropped).jpg https://meta.wikimedia.org/wiki/File:Jacana_jacana_-_Tiago_Lubiana_-_458859541.jpeg This jacana was colored green in one book, an alien look for a bird familiar to Brazilian birders.
Case Study 06 - Model v. 0.1.1 - Les bois indigènes de São Paulo https://docs.google.com/document/d/1q2vNGAEREgI4yj-GD9kEnmFjImR4FZIZDtKbDKesXu0/edit?tab=t.0#heading=h.alrv62u4dvp
A rare case of a public domain book about Brazilian biodiversity that was actually authored by Brazilians and edited in Brazil! This collection is about woody trees of São Paulo, where I live, and has a mix of photos and illustrations of the plants. I modified Magnus Manske's sdc_tool so that it can be used for "instance of" statements, to curate this information on-wiki. The modified script is available at User:TiagoLubiana/sdc_tool.js https://commons.wikimedia.org/wiki/User:TiagoLubiana/sdc_tool.js.
The PDF/DJVU of the book was not available on Commons. Adding it to Commons via the Internet Archive Upload tool (ia-upload.wmcloud.org) was surprisingly straightforward.
Tech adventures on Phabricator tickets
Navigating the Commons Structured Data waters more intensively led me to find some bugs, inconsistencies and missing features, which in turn led to conversations with the Wikimedia tech community via Phabricator, including:
- T298672 https://phabricator.wikimedia.org/T298672, a bug when new users try to revert Structured Data
- T383584 https://phabricator.wikimedia.org/T383584 and T304391 https://phabricator.wikimedia.org/T304391, feature requests to be able to link, using SDC, two media files (e.g. an image and the scan it comes from). The tickets are for two different solutions: one is a quick workaround, the other a larger Wikibase restructuring, which would enable better queries for this data
- T384221 https://phabricator.wikimedia.org/T384221, based on feedback from Siobhan on the BHL Arena https://bhl-arena.toolforge.org app, a feature request for sharing links to Commons files that open directly in the "Structured Data" view
*Next steps*
And that was it so far. I will keep navigating these categories for the next few weeks, refining the model and the bot code. Let me know if there is any particular category that you would like me to study/add metadata to.
*Thank you* for reading this update, and if you have questions or suggestions, don't hesitate to reach out.
Cheers, Tiago
*——————————————————————————* *Tiago Lubiana* *Wikimedian-in-Residence, Biodiversity Heritage Library https://www.biodiversitylibrary.org/*
*tiago.bio.br https://tiago.bio.br*
Wow Tiago, this is absolutely amazing! You are going from strength to strength. I’m going to set aside time to deep dive into the details of your update but just want to say after my first reading of this email that I love what you are managing to achieve.
I particularly appreciate your willingness to help solve the BHL Creator ID mix’n’match dataset issues raised during the meeting with the BHL PIDS working group as I’m aware this has been an issue for members of that group for several years.
Thanks again for all your hard work,
Siobhan
On 25 Jan 2025, at 9:24 AM, Tiago Lubiana tiagolubiana@gmail.com wrote: