Hi,
I'm grabbing this opportunity to bring up 3 bugs related to mwlib that deserve a larger discussion and should perhaps be implemented differently in the new version.
1. https://bugzilla.wikimedia.org/show_bug.cgi?id=56560 - PDF creation tool considers IPv6 addresses as users, not anonymous.
I've pushed a patch for this and it was merged; however, the detection is regex-based and, as a quick Google search will tell you, writing a regex that covers every IPv6 notation is far from trivial. Perhaps MW could send the anonymous/logged-in status of each contributor instead.
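As a rough illustration of a non-regex alternative (this is not the merged patch): the sketch below assumes Python's standard `ipaddress` module is available (Python 3.3 or later) and lets it decide whether a contributor name is an IP address. The function name `is_anonymous` is made up for this example and is not part of mwlib.

```python
import ipaddress


def is_anonymous(username: str) -> bool:
    """Return True if the contributor name parses as an IPv4 or IPv6 address.

    Hypothetical helper, not an mwlib function. It relies on the standard
    library parser instead of a hand-written regex.
    """
    try:
        ipaddress.ip_address(username.strip())
    except ValueError:
        # Not a valid IP literal, so treat the name as a registered user.
        return False
    return True


if __name__ == "__main__":
    for name in ["2001:DB8:0:0:0:0:0:1", "::1", "192.0.2.1", "Strainu"]:
        print(name, is_anonymous(name))
```

This only classifies the name string; having MW send the anon/logged-in flag would still be the more reliable source.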
2. https://bugzilla.wikimedia.org/show_bug.cgi?id=56219 - PDF creation tool excludes contributors with a "bot" substring in their username
I've also pushed a pull request for this one, but it was rejected based on the en.wp policy that prevents bot-like usernames for humans. The problem is more complex though:
a. Should bots be credited for their edits? While most of them do simple tasks, we have recently seen an increase in bot-created content. On ro.wp we even have a few lists only edited by robots.
b. If the robots should _not_ be credited, how do we detect them? Ideally, there should be an automatic way to do so, but according to http://www.mediawiki.org/wiki/Bots, it only works for recent changes. Less ideally, only users with "bot" at the end of their name should be removed, in order to keep users like https://ro.wikipedia.org/wiki/Utilizator:Vitalie_Ciubotaru (who is not a robot, but has "bot" in the name) in the contributor list.
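To make the "less ideal" option in 2b concrete, here is a minimal sketch of a suffix-only check. The names `looks_like_bot` and `credited`, and the account "ExampleBot", are invented for illustration; this is not the code from my rejected pull request nor mwlib's current behaviour.

```python
def looks_like_bot(username: str) -> bool:
    """Heuristic: treat a name as a bot only if it ends with "bot".

    Hypothetical helper. Case is ignored, so "ExampleBot" matches, while
    a name that merely contains "bot" in the middle does not.
    """
    return username.strip().lower().endswith("bot")


def credited(contributors):
    """Return the contributor names that survive the suffix heuristic."""
    return [name for name in contributors if not looks_like_bot(name)]


if __name__ == "__main__":
    print(credited(["Strainu", "Vitalie Ciubotaru", "ExampleBot"]))
```

It is still only a heuristic: a human whose name happens to end in "bot" would be dropped, which is why an automatic check based on the actual bot flag would be preferable.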
3. https://bugzilla.wikimedia.org/show_bug.cgi?id=2994 - Automatically generated count and list of contributors to an article (authorship tracking)
This is an old enhancement request, which I revived last month in a wikimedia-l thread: http://lists.wikimedia.org/pipermail/wikimedia-l/2013-October/128575.html . The idea is to decide if and how to credit:
a. vandals
b. reverters
c. contributors whose valid contributions were later rephrased or removed from the article
d. contributors with valid contributions but invalid names
I hope the people working on this feature will take the time to consider these issues and come up with solutions for them.
Thanks, Strainu
2013/11/13 Erik Moeller erik@wikimedia.org:
Hi folks,
for a long time we've relied on the mwlib libraries by PediaPress to generate PDFs on Wikimedia sites. These have served us well (we generate >200K PDFs/day), but they architecturally pre-date a lot of important developments in MediaWiki, and actually re-implement the MediaWiki parser (!) in Python. The occasion of moving the entire PDF service to a new data-center has given us reason to re-think the architecture and come up with a minimally viable alternative that we can support long term.
Most likely, we'll end up using Parsoid's HTML5 output, transform it to add required bits like licensing info and prettify it, and then render it to PDF via phantomjs, but we're still looking at various rendering options.
Thanks to Matt Walker, C. Scott Ananian, Max Semenik, Brad Jorsch and Jeff Green for joining the effort, and thanks to the PediaPress folks for giving background as needed. Ideally we'd like to continue to support printed book generation via PediaPress' web service, while completely replacing the rendering tech stack on the WMF side of things (still using the Collection extension to manage books). We may need to deprecate some output formats - more on that as we go.
We've got the collection-alt-renderer project set up on Labs (thanks Andrew) and can hopefully get a plan to our ops team soon as to how the new setup could work.
If you want to peek - work channel is #mediawiki-pdfhack on FreeNode.
Live notes here: http://etherpad.wikimedia.org/p/pdfhack
Stuff will be consolidated here: https://www.mediawiki.org/wiki/PDF_rendering
Some early experiments with different rendering strategies here: https://github.com/cscott/pdf-research
Some improvements to Collection extension underway: https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/...
More soon, Erik
-- Erik Möller VP of Engineering and Product Development, Wikimedia Foundation
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l