jayvdb created this task. jayvdb added a subscriber: jayvdb. jayvdb added a project: Pywikibot-Wikidata. jayvdb changed Security from none to none.
TASK DESCRIPTION Some data in Wikipedia is easier to extract from the rendered HTML than from the templates, because the HTML exposes the values as microformats. Other webpages also use microformats, and these could be used to extract information and add it to Wikidata. I expect this should be done in a new script, but it would be based on the existing script harvest_template.py
https://en.wikipedia.org/wiki/Help:Microformats .
Birth date and death date are good examples: on English Wikipedia they are placed in dedicated spans, using a consistent format.
view-source:https://en.wikipedia.org/wiki/Benjamin_Franklin
<span class="bday">1706-01-17</span> <span class="dday deathdate">1790-04-17</span>
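As a rough illustration of how such spans could be read from the rendered HTML, here is a minimal sketch using only Python's standard html.parser. The class name ClassTextCollector and the variable names are made up for this example, and it assumes each matching element contains only plain text (no nested tags).

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of elements whose class attribute has a wanted token."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}      # class token -> list of text values
        self._tokens = []    # wanted tokens of the element being read
        self._buffer = []    # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several tokens, e.g. "dday deathdate".
        classes = (dict(attrs).get("class") or "").split()
        matched = [c for c in classes if c in self.wanted]
        if matched:
            self._tokens = matched
            self._buffer = []

    def handle_data(self, data):
        if self._tokens:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first end tag closes the element being read.
        if self._tokens:
            text = "".join(self._buffer).strip()
            for token in self._tokens:
                self.found.setdefault(token, []).append(text)
            self._tokens = []


html = ('<span class="bday">1706-01-17</span> '
        '<span class="dday deathdate">1790-04-17</span>')
collector = ClassTextCollector(["bday", "dday"])
collector.feed(html)
collector.close()
print(collector.found)
```

A real script would more likely rely on an existing microformats parser; this only shows that the values are reachable through the class attribute alone.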
The {{Persondata}} template is relatively easy to parse from the wikitext, but its data is also well labelled in the HTML. https://en.wikipedia.org/wiki/Wikipedia:Persondata
<table id="persondata" class="persondata noprint" style="border:1px solid #aaa; display:none; speak:none;">
<tr> <th colspan="2"><a href="/wiki/Wikipedia:Persondata" title="Wikipedia:Persondata">Persondata</a></th> </tr>
<tr> <td class="persondata-label" style="color:#aaa;">Name</td> <td>Franklin, Benjamin</td> </tr>
<tr> <td class="persondata-label" style="color:#aaa;">Alternative names</td> <td></td> </tr>
<tr> <td class="persondata-label" style="color:#aaa;">Short description</td> <td>American printer, writer, politician</td> </tr>
<tr> <td class="persondata-label" style="color:#aaa;">Date of birth</td> <td>January 17, 1706</td> </tr>
<tr> <td class="persondata-label" style="color:#aaa;">Place of birth</td> <td>Boston, Massachusetts</td> </tr>
<tr> <td class="persondata-label" style="color:#aaa;">Date of death</td> <td>April 17, 1790</td> </tr>
<tr> <td class="persondata-label" style="color:#aaa;">Place of death</td> <td><a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>, Pennsylvania</td> </tr>
</table>
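To show what "well labelled" means in practice, the following sketch turns a shortened copy of the table above into a label-to-value dictionary. It uses only the standard library; the class name PersondataParser is hypothetical, and the sketch assumes the fragment passed in contains just the persondata table.

```python
from html.parser import HTMLParser


class PersondataParser(HTMLParser):
    """Pair each persondata-label cell with the cell that follows it."""

    def __init__(self):
        super().__init__()
        self.data = {}
        self._in_cell = False
        self._is_label = False
        self._label = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            self._in_cell = True
            self._is_label = "persondata-label" in classes
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (such as links) is collected as well.
        if self._in_cell:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self._in_cell:
            return
        text = "".join(self._buffer).strip()
        if self._is_label:
            self._label = text
        elif self._label is not None:
            self.data[self._label] = text
            self._label = None
        self._in_cell = False


fragment = """
<table id="persondata" class="persondata noprint">
<tr> <td class="persondata-label">Name</td> <td>Franklin, Benjamin</td> </tr>
<tr> <td class="persondata-label">Alternative names</td> <td></td> </tr>
<tr> <td class="persondata-label">Place of death</td>
<td><a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>, Pennsylvania</td> </tr>
</table>
"""
parser = PersondataParser()
parser.feed(fragment)
parser.close()
print(parser.data)
```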
More at https://en.wikipedia.org/wiki/Wikipedia:Metadata
A list of templates which generate microformats is at https://en.wikipedia.org/wiki/Category:Templates_generating_microformats , and sample pages can be found by using 'whatlinkshere'.
e.g. vcard with fn org can be seen in the source of the infobox here:
view-source:https://en.wikipedia.org/wiki/Manchester_Ship_Canal
TASK DETAIL https://phabricator.wikimedia.org/T78416
REPLY HANDLER ACTIONS Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign <username>.
EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: jayvdb Cc: Aklapper, jayvdb, pywikipedia-bugs
jayvdb added a comment.
This has been proposed as a GCI task https://www.google-melange.com/gci/task/view/google/gci2014/5857599308169216
jayvdb added a comment.
The parsed version of a wiki page can be obtained using the parse module https://en.wikipedia.org/w/api.php?action=parse&page=Benjamin_Franklin
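A small sketch of how a request URL for the parse module might be assembled. The parameters prop=text and format=json are additions not present in the URL above, and the helper names are invented for illustration. The extract_html helper assumes the default JSON layout, where the HTML sits under parse, text, "*"; the sample response here is hand-written, not fetched.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(title, endpoint=API_ENDPOINT):
    """Build an action=parse URL that asks for the rendered HTML as JSON."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",      # assumption: only the HTML is wanted
        "format": "json",    # assumption: JSON output
    }
    return endpoint + "?" + urlencode(params)


def extract_html(response):
    """Pull the HTML string out of a decoded JSON response (assumed layout)."""
    return response["parse"]["text"]["*"]


url = build_parse_url("Benjamin_Franklin")
print(url)

# Hand-written stand-in for a decoded API response.
fake_response = {"parse": {"title": "Benjamin Franklin",
                           "text": {"*": '<span class="bday">1706-01-17</span>'}}}
html = extract_html(fake_response)
```

Within pywikibot the request would normally go through the framework's own API layer rather than a hand-built URL.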
jayvdb added a project: Google-Code-in-2014.
jayvdb moved this task to pywikibot on the Google-Code-in-2014 workboard.
WORKBOARD https://phabricator.wikimedia.org/project/board/74/
murfel added a subscriber: murfel. murfel added a comment.
Copying the GCI task description:
This task is to create a new script to harvest data from HTML microformats in Wikipedia pages and other webpages, and add the data to items in Wikidata. The new script will be similar in nature to the existing script harvest_template.py http://git.wikimedia.org/blob/pywikibot%2Fcore.git/master/scripts%2Fharvest_template.py, except it will use HTML instead of wikitext, and it can offer automatic assignments of values to properties where the microformats describe the data in a standardised way that maps to properties on Wikidata.
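One way to picture the "automatic assignment" idea is a lookup table from microformat property names to Wikidata property IDs. The sketch below is only an illustration: the table PROPERTY_MAP and the function to_claims are hypothetical names, the mapping covers just the two date classes seen on English Wikipedia (P569 is date of birth, P570 is date of death), and it assumes the values are full ISO dates.

```python
from datetime import date

# Hypothetical mapping from microformat class names to Wikidata properties.
PROPERTY_MAP = {
    "bday": "P569",   # date of birth
    "dday": "P570",   # date of death
}


def to_claims(microformat_values):
    """Return (property id, year, month, day) tuples for known names."""
    claims = []
    for name, value in microformat_values.items():
        prop = PROPERTY_MAP.get(name)
        if prop is None:
            continue  # no standard mapping known; leave for manual handling
        parsed = date.fromisoformat(value)  # assumes YYYY-MM-DD input
        claims.append((prop, parsed.year, parsed.month, parsed.day))
    return claims


claims = to_claims({"bday": "1706-01-17", "dday": "1790-04-17", "nickname": "Ben"})
print(claims)
```

The year, month and day are kept separate because a Wikidata time value is built from those components; creating the actual claim objects is left out here.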
murfel claimed this task. murfel added a comment.
I plan to implement it in the following way: fetch all pages which link to a given template, get the HTML for each page, look for a table with id="template_name" inside the HTML, parse the key-value pairs in the table and add them to Wikibase.
Did I get it right?
jayvdb added a subscriber: tommorris. jayvdb added a comment.
In https://phabricator.wikimedia.org/T78416#938277, @murfel wrote:
I plan to implement it in the following way: fetch all pages which link to a given template, get the HTML for each page, look for a table with id="template_name" inside the HTML, parse the key-value pairs in the table and add them to Wikibase.
Did I get it right?
Maybe, but maybe not. My inclusion of {{Persondata}} as an example was perhaps misleading.
This harvest_microformats script should not be based on templates, as that is the job of harvest_template.py .
This script will use pagegenerators as arguments to select which pages should be processed, and -page:"..." is the easiest to use for testing.
For each page, get the HTML as you've said, and look for microformats (http://microformats.org/) in the HTML. Microformats are usually described using HTML class="..." attributes, such as:
view-source:https://en.wikipedia.org/wiki/Benjamin_Franklin
<span class="bday">1706-01-17</span> <span class="dday deathdate">1790-04-17</span>
and
view-source:https://en.wikipedia.org/wiki/Manchester_Ship_Canal
<th colspan="2" class="fn org" style="text-align:center;font-size:125%;font-weight:bold;font-size: larger; background-color: #CEDEFF">Manchester Ship Canal</th>
The two most important standardised microformats are hCard ( http://microformats.org/wiki/hCard ) and hCalendar ( http://microformats.org/wiki/hCalendar ).
Another microformat that is very relevant to wikis is http://microformats.org/wiki/rel-license
However, Wikimedia mostly uses its own non-standard microformats; for example, "licensetpl" is used by Wikisource and Wikimedia Commons instead of rel-license:
view-source:https://en.wikisource.org/wiki/The_Clipper_Ship_Era
<table class="licensetpl" style="display:none;"> <tr> <td><span class="licensetpl_short">Public domain</span><span class="licensetpl_long">Public domain</span><span class="licensetpl_link_req">false</span><span class="licensetpl_attr_req">false</span></td> </tr> </table>
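Non-standard markup like this could be read by looking for a class-name prefix instead of a registered microformat. The sketch below (standard library only; LicenseTplParser is an invented name) collects every class token starting with "licensetpl_" and converts the literal strings "true" and "false" to booleans. It assumes each such span holds a single piece of text.

```python
from html.parser import HTMLParser


class LicenseTplParser(HTMLParser):
    """Collect licensetpl_* spans into a dict keyed by the suffix."""

    PREFIX = "licensetpl_"

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._field = None  # suffix of the span currently being read

    def handle_starttag(self, tag, attrs):
        for token in (dict(attrs).get("class") or "").split():
            if token.startswith(self.PREFIX):
                self._field = token[len(self.PREFIX):]

    def handle_data(self, data):
        if self._field is not None:
            text = data.strip()
            # Map the literal strings to booleans, keep anything else as text.
            self.fields[self._field] = {"true": True, "false": False}.get(text, text)

    def handle_endtag(self, tag):
        self._field = None


html = ('<table class="licensetpl" style="display:none;"> <tr> <td>'
        '<span class="licensetpl_short">Public domain</span>'
        '<span class="licensetpl_long">Public domain</span>'
        '<span class="licensetpl_link_req">false</span>'
        '<span class="licensetpl_attr_req">false</span>'
        '</td> </tr> </table>')
parser = LicenseTplParser()
parser.feed(html)
parser.close()
print(parser.fields)
```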
Once microformats have been found in the HTML, yes: "parse key-values [from the microformat] and add them to Wikibase". But there are Python libraries that already do most of the grunt work for you, so hopefully you don't need to do the parsing yourself; e.g. see http://microformats.org/wiki/parsers and search https://pypi.python.org/pypi/ . One library mentioned is https://github.com/tommorris/mf2py , which is maintained by @tommorris , an English Wikipedia admin among other things.
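To give an idea of what is left to do once a parser library has done the grunt work, here is a sketch that walks a parsed result and lists its key-value pairs. The sample dictionary is hand-written and follows the general shape of microformats2 parsed JSON (an "items" list whose entries have "type" and "properties"); in real use it would presumably come from something like mf2py.parse(doc=html), and the exact output for a Wikipedia page is not verified here. The function name flatten_properties is hypothetical.

```python
def flatten_properties(parsed):
    """Return (item type, property name, value) triples for top-level items."""
    triples = []
    for item in parsed.get("items", []):
        for item_type in item.get("type", []):
            for name, values in item.get("properties", {}).items():
                for value in values:
                    # Nested items are dicts; this sketch keeps plain strings only.
                    if isinstance(value, str):
                        triples.append((item_type, name, value))
    return triples


# Hand-written stand-in for a parser result (an assumption, not real output).
sample = {
    "items": [
        {
            "type": ["h-card"],
            "properties": {
                "name": ["Benjamin Franklin"],
                "bday": ["1706-01-17"],
            },
        }
    ],
    "rels": {},
}
triples = flatten_properties(sample)
print(triples)
```

Each triple could then be matched against a table of known Wikidata properties before anything is written to the item.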
jayvdb added a comment.
A useful tool to see microformats in any webpage (courtesy of @murfel on IRC) https://mf2py.herokuapp.com/parse?url=https://en.wikipedia.org/wiki/Benjamin...
jayvdb added a comment.
The version packaged on PyPI (0.2.1) doesn't handle Wikipedia microformats. You will need to install the latest from GitHub: https://github.com/tommorris/mf2py
murfel placed this task up for grabs. murfel added a comment.
You will probably need bs4 4.2.1 (which depends on python >= 2.7.5-5~) for mf2py to work correctly.
You can start with or take a look at my code: http://pastebin.com/acLfCrfD
Prtksxna added a subscriber: Prtksxna.
pywikipedia-bugs@lists.wikimedia.org