jayvdb added a subscriber: tommorris.
jayvdb added a comment.
In https://phabricator.wikimedia.org/T78416#938277, @murfel wrote:
I think to implement it in the following way: catch
all page which link to a given template, get HTML for each page, look for table with
id="template_name" inside of HTML, parse key-values in the table and add them to
Wikibase.
Did I get it right?
Maybe, but maybe not. My inclusion of {{Persondata}} as an example was perhaps
misleading.
This harvest_microformats script should not be based on templates, as that is the job
of harvest_template.py.
This script will use pagegenerators as arguments to select which pages should be
processed, and -page:"..." is the easiest to use for testing.
For each page, get the HTML as you've said, and look for __microformats__
(http://microformats.org/) in the HTML. Microformats are usually described using HTML
class="..." attributes, such as:
view-source:https://en.wikipedia.org/wiki/Benjamin_Franklin
<span class="bday">1706-01-17</span>
<span class="dday deathdate">1790-04-17</span>
and
view-source:https://en.wikipedia.org/wiki/Manchester_Ship_Canal
<th colspan="2" class="fn org"
style="text-align:center;font-size:125%;font-weight:bold;font-size: larger;
background-color: #CEDEFF">Manchester Ship Canal</th>
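As a rough illustration of the "look for class attributes" step, here is a minimal
sketch that uses only Python's standard html.parser module. The class name
ClassTextCollector and the choice of wanted class names are hypothetical and not part
of pywikibot or any microformats library; a real script would more likely delegate
this to an existing parser.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text of elements carrying any of the wanted class names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}    # class name -> list of text values
        self._stack = []   # open elements: (tag, matched class names, text buffer)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        matched = [name for name in classes if name in self.wanted]
        self._stack.append((tag, matched, []))

    def handle_data(self, data):
        # Text belongs to every open element that matched a wanted class.
        for _, matched, buf in self._stack:
            if matched:
                buf.append(data)

    def handle_endtag(self, tag):
        # Ignore stray closing tags that have no matching open element.
        if all(open_tag != tag for open_tag, _, _ in self._stack):
            return
        # Pop up to and including the matching element, recording any matches.
        while self._stack:
            open_tag, matched, buf = self._stack.pop()
            for name in matched:
                self.found.setdefault(name, []).append("".join(buf).strip())
            if open_tag == tag:
                break


sample = (
    '<span class="bday">1706-01-17</span>'
    '<span class="dday deathdate">1790-04-17</span>'
    '<th colspan="2" class="fn org">Manchester Ship Canal</th>'
)
collector = ClassTextCollector(["bday", "dday", "fn"])
collector.feed(sample)
print(collector.found)
```

The sketch only gathers raw text per class name; it does not know which hCard or
hCalendar root an element belongs to, which is one reason to prefer a real
microformats parser.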
The two most important standardised microformats are
http://microformats.org/wiki/hCard
http://microformats.org/wiki/hCalendar
Another microformat that is very relevant to wikis is
http://microformats.org/wiki/rel-license
However, Wikimedia mostly uses its own non-standard microformats. For example,
"licensetpl" is used by Wikisource and Wikimedia Commons instead of rel-license:
view-source:https://en.wikisource.org/wiki/The_Clipper_Ship_Era
<table class="licensetpl" style="display:none;">
<tr>
<td><span class="licensetpl_short">Public
domain</span><span class="licensetpl_long">Public
domain</span><span
class="licensetpl_link_req">false</span><span
class="licensetpl_attr_req">false</span></td>
</tr>
</table>
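To show what "parse key-values" could mean for this non-standard format, here is a
minimal sketch using only the standard library. The names LicenseTplParser and
parse_licensetpl are hypothetical, and the sketch assumes the licensetpl_* spans are
not nested inside each other, as in the snippet above. Values are kept as strings
(for example "false") rather than converted.

```python
from html.parser import HTMLParser


class LicenseTplParser(HTMLParser):
    """Collect the text of <span class="licensetpl_*"> elements."""

    PREFIX = "licensetpl_"

    def __init__(self):
        super().__init__()
        self.fields = {}       # key without the prefix -> raw text
        self._current = None   # key of the span currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        for name in (dict(attrs).get("class") or "").split():
            if name.startswith(self.PREFIX):
                self._current = name[len(self.PREFIX):]
                self.fields[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None


def parse_licensetpl(html):
    """Return the licensetpl key-values with whitespace normalised."""
    parser = LicenseTplParser()
    parser.feed(html)
    return {key: " ".join(value.split()) for key, value in parser.fields.items()}


sample = """<table class="licensetpl" style="display:none;">
<tr>
<td><span class="licensetpl_short">Public
domain</span><span class="licensetpl_long">Public
domain</span><span
class="licensetpl_link_req">false</span><span
class="licensetpl_attr_req">false</span></td>
</tr>
</table>"""
print(parse_licensetpl(sample))
```

How each key would map to a Wikibase property is left open here; that mapping is
not described in this comment.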
When microformats have been found in the HTML, then yes: "parse key-values [from the
microformat] and add them to Wikibase". However, there are Python libraries that
already do most of the grunt work for you, so hopefully you don't need to do the
parsing yourself; e.g. see
http://microformats.org/wiki/parsers and search
https://pypi.python.org/pypi/ . One library mentioned is
https://github.com/tommorris/mf2py , which is maintained by @tommorris, an English
Wikipedia admin among other things.
TASK DETAIL
https://phabricator.wikimedia.org/T78416
To: murfel, jayvdb
Cc: Aklapper, jayvdb, murfel, tommorris, pywikipedia-bugs