jayvdb added a subscriber: tommorris. jayvdb added a comment.
In https://phabricator.wikimedia.org/T78416#938277, @murfel wrote:
I think to implement it in the following way: catch all page which link to a given template, get HTML for each page, look for table with id="template_name" inside of HTML, parse key-values in the table and add them to Wikibase.
Did I get it right?
Maybe, but maybe not. My inclusion of {{Persondata}} as an example was perhaps misleading.
This harvest_microformats script should not be based on templates, as that is the job of harvest_template.py.
This script will accept pagegenerators arguments to select which pages should be processed; -page:"..." is the easiest to use for testing.
For each page, get the HTML as you've said, and look for __microformats__ (http://microformats.org/) in the HTML. Microformats are usually described using HTML class=".." attributes, such as:
view-source:https://en.wikipedia.org/wiki/Benjamin_Franklin
<span class="bday">1706-01-17</span> <span class="dday deathdate">1790-04-17</span>
and
view-source:https://en.wikipedia.org/wiki/Manchester_Ship_Canal
<th colspan="2" class="fn org" style="text-align:center;font-size:125%;font-weight:bold;font-size: larger; background-color: #CEDEFF">Manchester Ship Canal</th>
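The lookup described above can be sketched with nothing but the standard library. This is only an illustration, not the proposed script: the names ClassTextCollector and find_class_text are made up for this example, the sample HTML is the Benjamin Franklin snippet quoted above, and it assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying one of the wanted classes."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []  # one entry per open element: (classes, parts) or None
        self.found = []  # (class_name, text) pairs, appended as elements close

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        matched = [c for c in classes if c in self.wanted]
        self.stack.append((matched, []) if matched else None)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        entry = self.stack.pop()
        if entry:
            matched, parts = entry
            text = "".join(parts).strip()
            self.found.extend((name, text) for name in matched)

    def handle_data(self, data):
        # Text belongs to every matched element that is currently open.
        for entry in self.stack:
            if entry:
                entry[1].append(data)


def find_class_text(html, wanted):
    """Return (class_name, text) pairs for elements with a wanted class."""
    parser = ClassTextCollector(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


SAMPLE = ('<span class="bday">1706-01-17</span> '
          '<span class="dday deathdate">1790-04-17</span>')

if __name__ == "__main__":
    for name, text in find_class_text(SAMPLE, {"bday", "dday"}):
        print(name, text)
```

An element with several classes, such as class="fn org", yields one pair per matching class, which is usually what you want when one element carries more than one microformat property.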
The two most important standardised microformats are http://microformats.org/wiki/hCard and http://microformats.org/wiki/hCalendar .
Another microformat that is very relevant to wikis is http://microformats.org/wiki/rel-license .
However, Wikimedia mostly uses its own non-standard microformats. For example, "licensetpl" is used by Wikisource and Wikimedia Commons instead of rel-license:
view-source:https://en.wikisource.org/wiki/The_Clipper_Ship_Era
<table class="licensetpl" style="display:none;"> <tr> <td><span class="licensetpl_short">Public domain</span><span class="licensetpl_long">Public domain</span><span class="licensetpl_link_req">false</span><span class="licensetpl_attr_req">false</span></td> </tr> </table>
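As a rough sketch of how the "licensetpl" block above could be turned into key-values, again using only the standard library: the class name LicenseTplParser and the function parse_licensetpl are hypothetical, and the code assumes each licensetpl_* span contains plain text only, as in the Wikisource example. Values are kept as strings, since what "false" should map to in Wikibase is a separate decision.

```python
from html.parser import HTMLParser


class LicenseTplParser(HTMLParser):
    """Map licensetpl_<key> class names to the text inside each element."""

    PREFIX = "licensetpl_"

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._key = None  # key of the licensetpl_* element currently open

    def handle_starttag(self, tag, attrs):
        for name in (dict(attrs).get("class") or "").split():
            if name.startswith(self.PREFIX):
                self._key = name[len(self.PREFIX):]
                self.fields.setdefault(self._key, "")

    def handle_data(self, data):
        if self._key is not None:
            self.fields[self._key] += data

    def handle_endtag(self, tag):
        # Assumes no nested markup inside a licensetpl_* element.
        self._key = None


def parse_licensetpl(html):
    """Return a dict such as {'short': ..., 'long': ...} from the HTML."""
    parser = LicenseTplParser()
    parser.feed(html)
    parser.close()
    return parser.fields


LICENSE_HTML = (
    '<table class="licensetpl" style="display:none;"> <tr> <td>'
    '<span class="licensetpl_short">Public domain</span>'
    '<span class="licensetpl_long">Public domain</span>'
    '<span class="licensetpl_link_req">false</span>'
    '<span class="licensetpl_attr_req">false</span>'
    '</td> </tr> </table>'
)

if __name__ == "__main__":
    print(parse_licensetpl(LICENSE_HTML))
```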
Once microformats have been found in the HTML, then yes: "parse key-values [from the microformat] and add them to Wikibase". However, there are Python libraries that already do most of the grunt work for you, so hopefully you don't need to do the parsing yourself; e.g. see http://microformats.org/wiki/parsers and search https://pypi.python.org/pypi/ . One library mentioned is https://github.com/tommorris/mf2py , which is maintained by @tommorris, an English Wikipedia admin among other things.
TASK DETAIL https://phabricator.wikimedia.org/T78416
To: murfel, jayvdb Cc: Aklapper, jayvdb, murfel, tommorris, pywikipedia-bugs