Happy Monday,
There are strange people who make links like this (kind of URL-encoded?):
[[Második világháború#Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban .28Huskey hadm.C5.B1velet.29|Huskey hadműveletben]]
So the section title must have been copied from the URL.
Do we have a ready tool to fix these?
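A minimal sketch of the decoding step would look like this (assuming these
are MediaWiki's legacy dot-encoded anchors, i.e. percent-encoding with '%'
replaced by '.'):

import re
from urllib.parse import unquote

def decode_anchor(anchor):
    # Legacy MediaWiki anchors percent-encode UTF-8 and then replace
    # '%' with '.', so '.C3.A1' stands for 'á'.
    return unquote(re.sub(r'\.([0-9A-F]{2})', r'%\1', anchor))

print(decode_anchor('Partrasz.C3.A1ll.C3.A1s Szic.C3.ADli.C3.A1ban'))
# -> Partraszállás Szicíliában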
--
Bináris
Hello all
From one of my assignments as a bot operator I have some code that
does template parsing and general text parsing (e.g. Image/File tags).
It does not use regex and is therefore able to correctly parse nested
templates and other such nasty things. I have written it as library
classes, with tests that cover almost all of the code.
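To give a flavor of the approach, here is a minimal brace-counting sketch
(just an illustration, not the actual library code):

def find_templates(text):
    """Return top-level {{...}} spans, handling nesting by counting braces."""
    spans, depth, start, i = [], 0, None, 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == '{{':
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif pair == '}}' and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 2])
            i += 2
        else:
            i += 1
    return spans

print(find_templates('x {{cite |a={{small|b}} }} y'))
# -> ['{{cite |a={{small|b}} }}']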
I would now really like to contribute that code back to the community.
Would you be interested in adding this code to the pywikibot
framework? If yes, can I send the code to someone for code review, or
how do you usually operate?
Greetings
Hannes
PS: wiki userpage is http://en.wikipedia.org/wiki/User:Hannes_R%C3%B6st
Hi all,
I'm trying to run weblinkchecker from my own computer. I've logged in to
the bot account, but when I try to run weblinkchecker like this:
>python weblinkchecker.py UserPage:usernamegoeshere
I just get a very long bunch of help text on the screen ("This bot is used
for checking external links found at the wiki" etc.). I don't know whether
I'm successfully running the script or not, or whether it's actually
checking any links.
What am I doing wrong?
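(My guess: the help text appears because the positional argument isn't
recognized, and the page should go through the standard -page: generator
option, e.g.:

python weblinkchecker.py -page:"User:UsernameGoesHere"

Is that right?)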
Thanks!
Hi Brandon,
On 14-05-2016 at 15:50, Brandon Black wrote:
> Hi Maarten,
>
> I've searched logstash going back 1 month, and there's only one pair
> of insecure requests from one user with Pywikibot in the User-Agent
> string. These two requests were on April 29th with the UA string
> "login (wmf:en; User:Abi%C3%A1n) Pywikibot/2.0b3 (g6606)
> requests/2.2.1 Python/2.7.6.final.0".
Thanks for looking into this, but are you sure the query is correct? I
expected more results, and at
https://en.wikipedia.org/wiki/Wikipedia:Bot_owners'_noticeboard#https_api_change
I see you found another Pywikibot. Are you sure you caught everything?
Maarten
>
> Thanks,
> -- Brandon
>
> On Sat, May 14, 2016 at 10:00 AM, Maarten Dammers
> <maarten(a)mdammers.nl> wrote:
>
> Hi guys,
>
> I'm pretty sure Pywikibot is not affected by this unless you're
> using an ancient version or you forced the bot to http in your
> configuration. Brandon, do you see any pywikibot based bots in
> your logs that would be affected by this?
>
> Maarten
>
>
> [forwarded announcement snipped; the full text appears below]
Hi,
I'm a student from Chennai, India, and my project is going to be about
performing image processing on images on Wikimedia Commons to automate
categorization. DrTrigon wrote the script catimages.py a few years ago
in the old pywikipedia-bot framework. I'll be working on porting the
script to the pywikibot-core framework, updating its dependencies, and
using newer techniques where possible.
catimages.py is a script that analyzes an image using various computer
vision algorithms and assigns categories to the image on Commons. For
example, it uses algorithms that detect faces, barcodes, etc. to
categorize images into Category:Unidentified People, Category:Barcode,
and so on.
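As a flavor of how such a detector feeds the categorizer, here is a minimal
sketch using OpenCV's bundled Haar cascade (a hypothetical helper, not the
actual catimages.py code):

import cv2

def suggest_categories(image_path):
    # Detect frontal faces with OpenCV's stock Haar cascade and map
    # the result to a category suggestion.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return ['Category:Unidentified People'] if len(faces) else []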
If you have any suggestions or categorizations you think might be useful
to you, drop in at #gsoc-catimages on freenode or my talk page[0]. You can
find out more about me on User:AbdealiJK[1] and about the project
at T129611[2].
Regards
[0] - https://commons.wikimedia.org/wiki/User_talk:AbdealiJK
[1] - https://meta.wikimedia.org/wiki/User:AbdealiJK
[2] - https://phabricator.wikimedia.org/T129611
Hi guys,
I'm pretty sure Pywikibot is not affected by this unless you're using an
ancient version or you forced the bot to http in your configuration.
Brandon, do you see any pywikibot based bots in your logs that would be
affected by this?
Maarten
-------- Forwarded message --------
Subject: [Wikitech-l] Insecure (non-HTTPS) API Requests to become
unsupported starting 2016-06-12
Date: Fri, 13 May 2016 22:34:20 +0000
From: Brandon Black <bblack(a)wikimedia.org>
Reply-to: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
To: mediawiki-api-announce(a)lists.wikimedia.org,
mediawiki-api(a)lists.wikimedia.org, Wikimedia developers
<wikitech-l(a)lists.wikimedia.org>
TL;DR:
----
* All access to Wikimedia production sites/APIs should use https://
URLs, not http:// -- your bot/tool will break in the near future if it
does not!
* 2016-06-12 - insecure access is unsupported; starting on this date
we plan to break (deny with 403) 10% of all insecure requests randomly
as a wake-up call.
* 2016-07-12 - we plan to break all insecure requests.
----
Hi all,
As you may remember, all production Wikimedia wikis switched to
HTTPS-only for all canonical domain names nearly a year ago:
https://blog.wikimedia.org/2015/06/12/securing-wikimedia-sites-with-https/
Since way back then, we've been forcing insecure HTTP requests to our
canonical domains over to HTTPS by using redirects and
Strict-Transport-Security, which is effective for the vast majority of
access from humans using browsers and apps.
In the time since, we've been chasing down various corner-case issues
where loopholes may arise in our HTTPS standards and enforcement. One
of the most-difficult loopholes to close has been the "Insecure POST"
loophole, which is discussed in our ticket system here:
https://phabricator.wikimedia.org/T105794 .
To briefly recap the "Insecure POST" issue:
* Most of our humans using browser UAs are not affected by it. They
start out doing GET traffic to our sites, their GETs get redirected to
HTTPS if necessary, and then any POSTs issued by their browser use
protocol-relative URIs which are also HTTPS.
* However, many automated/code UAs (bots, tools, etc) access the APIs
using initial POST requests to hardcoded service URLs using HTTP
(rather than HTTPS).
* For all of the code/library UAs out there in the world, there is no
universally-compatible way to redirect them to HTTPS. There are
different ways that work for some UAs, but many UAs used for APIs
don't handle redirects at all.
* Regardless of the above, even if we could reliably redirect POST
requests, that doesn't fix the security problem like it does with GET.
The private data has already been leaked in the initial insecure
request before we have a chance to redirect it. If we did some kind
of redirect first, we'd still just be putting off the inevitable
future date where we have to go through a breaking transition to
secure the data.
Basically, we're left with no good way to upgrade these insecure
requests without breaking them. The only way it gets fixed is if all
of our API clients in the world use explicit https:// URLs for
Wikimedia sites in all of their code and configuration, and the only
way we can really force them to do so is to break insecure POST
requests by returning a 403 error to tools that don't.
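To make that concrete, a minimal illustration (hypothetical client code
using the Python requests library; names and credentials are placeholders):

import requests

# Wrong: the form body (possibly credentials) leaves the machine in
# cleartext before any redirect can be issued.
requests.post('http://en.wikipedia.org/w/api.php',
              data={'action': 'login', 'lgname': 'MyBot',
                    'lgpassword': 's3cret'})

# Right: an explicit https:// URL keeps the request encrypted end to end.
requests.post('https://en.wikipedia.org/w/api.php',
              data={'action': 'login', 'lgname': 'MyBot',
                    'lgpassword': 's3cret'})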
Back in July 2015, I began making some efforts to statistically sample
the User-Agent fields of clients doing "Insecure POST" and tracking
down the most-prominent offenders. We were able to find and fix many
clients along the way since.
A few months ago Bryan Davis got us further when he committed a
MediaWiki core change to let our sites directly warn offending
clients. I believe that went live on Jan 29th of this year (
https://gerrit.wikimedia.org/r/#/c/266958 ). It allows insecure POSTs
to still succeed, but sends the clients a standard warning that says
"HTTP used when HTTPS was expected".
This actually broke some older clients that weren't prepared to handle
warnings at all, and caused several clients to upgrade. We've been
logging offending UAs and accounts which trigger the warning via
EventLogging since then, but after the initial impact the rate
flattened out again; clients and/or users that didn't notice the
warning fairly quickly likely never will.
Many of the remaining UAs we see in logs are simply un-updated. For
example, https://github.com/mwclient/mwclient switched to
HTTPS-by-default in 0.8.0, released in early January, but we're still
getting lots of insecure POST from older mwclient versions installed
out there in the world. Even in cases where the code is up to date
and supports HTTPS properly, bot/tool configurations may still have
hardcoded http:// site config URLs.
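For pywikibot-based tools in particular, the place to check is usually the
family file; a sketch of forcing HTTPS there (assuming pywikibot's family
layout, where Family.protocol() picks the URL scheme; the wiki name and
host below are placeholders):

from pywikibot import family

class Family(family.Family):
    name = 'mywiki'
    langs = {'en': 'wiki.example.org'}

    def protocol(self, code):
        # Always build https:// URLs, never http://.
        return 'https'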
We're basically out of "soft" ways to finish up this part of the HTTPS
transition, and we've stalled long enough on this.
** 2016-06-12 is the selected support cutoff date **
After this date, insecure HTTP POST requests to our sites are
officially unsupported. This date is:
* A year to the day after the public announcement that our sites are HTTPS only
* ~ 11 months after we began manually tracking down top offenders and
getting them fixed
* ~ 4 months after we began sending warning messages in the response
to all insecure POST requests to the MW APIs
* ~ 1 month after this email itself
On the support cutoff date, we’ll begin emitting a “403 Insecure POST
Forbidden - use HTTPS” failure for 10% of all insecure POST traffic
(randomly-selected). Some clients will retry around this, and
hopefully the intermittent errors will raise awareness more-strongly
than the API warning message and this email did.
A month later (two months out from this email) on 2016-07-12 we plan
to break insecure access completely (all insecure requests get the 403
response).
In the meantime, we'll be trying to track down offending bots/tools
from our logs and trying to contact owners who haven't seen these
announcements. Our Community team will be helping us communicate this
message more-directly to affected Bot accounts as well.
Thank you all for your help during this transition!
-- Brandon Black
Sr Operations Engineer
Wikimedia Foundation
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hello,
[[cs:User:Dvorapa]] wrote a script for cosmetic changes in infoboxes.
https://cs.wikipedia.org/wiki/Wikipedie:WikiProjekt_Strojové_zpracování/Skr…
It is a great tool, but there is a very annoying bug: the script sometimes
destroys a table
https://cs.wikipedia.org/w/index.php?title=Petr_Bříza&diff=prev&oldid=13643…
or an external link
https://cs.wikipedia.org/w/index.php?title=Jaroslav_Maxmilián_Kašparů&diff=…
It looks like the script ignores the 'exceptions' list.
Can somebody repair it?
JAnD
-----------
# Needs: import re; from pywikibot import textlib
def beautifyInfoboxes(self, text):
    """Cleanup multiple or trailing spaces."""
    exceptions = ['comment', 'math', 'nowiki', 'pre', 'table', 'link']
    items = text.split('|')
    newitems = []
    inInfobox = [False]
    inTemplate = [False]
    inLink = [False]
    for item in items:
        if inInfobox[-1] and not inLink[-1] and not inTemplate[-1]:
            item = textlib.replaceExcept(item.lstrip(), r'^', r' ',
                                         exceptions, site=self.site)
            # TODO: replace re.sub with textlib.replaceExcept once
            # T125307 is resolved
            item = re.sub(r' *= *', r' = ', item, count=1)
        if re.search(r'(?i)\{{2}(Infobox|NFPA 704)[^\}]*$', item) is not None:
            inInfobox.append(True)
        elif inInfobox[-1]:
            # Positions of every template/link opener and closer in this chunk
            brackets = [m.start() for m in re.finditer('}}', item)]
            brackets += [m.start() for m in re.finditer('{{', item)]
            brackets += [m.start() for m in re.finditer(']]', item)]
            brackets += [m.start() for m in re.finditer(r'\[{2}', item)]
            brackets.sort(key=int)
            if brackets:
                endInfobox = []
                for n in brackets:
                    if item[n] == '{':
                        inTemplate.append(True)
                    elif item[n] == '[':
                        inLink.append(True)
                    elif item[n] == ']':
                        inLink.pop()
                    else:
                        if inTemplate[-1]:
                            inTemplate.pop()
                        elif inInfobox[-1]:
                            endInfobox.append(n)
                            inInfobox.pop()
                            if not inInfobox[-1]:
                                break
                for n in reversed(endInfobox):
                    before = item[:n]
                    before = textlib.replaceExcept(before.rstrip(), r'$',
                                                   r'\n', exceptions,
                                                   site=self.site)
                    after = item[n:]
                    if not inInfobox[-1]:
                        after = textlib.replaceExcept(after, r'^}}\s*',
                                                      r'}}\n', exceptions,
                                                      site=self.site)
                    item = before + after
        if inInfobox[-1] and not inLink[-1] and not inTemplate[-1]:
            item = textlib.replaceExcept(item.rstrip(), r'$', r'\n ',
                                         exceptions, site=self.site)
        newitems.append(item)
    return '|'.join(newitems)
We now have a lot of branches in gerrit:
2.0 correction HEAD master nexqt PropertyPage textdata
Previously we had only the 2.0, HEAD and master branches in gerrit.
I think this happens when we push directly to the gerrit repo, i.e.
not using git review.
They are replicated to github, and as a result trigger builds on
Travis and Appveyor-CI.
https://github.com/wikimedia/pywikibot-core/branches
I guess this could be a good thing if a developer doesn't want to have
a github.com account but wants to do test builds on Travis/Appveyor.
The most recent one is `nexqt`, which is
https://gerrit.wikimedia.org/r/#/c/283940/ (newly uploaded today by
xqt), so I guess it can be deleted.
The other three are not in the Gerrit review system; all are by Andre
Engels <andreengels(a)gmail.com>:
textdata is Ic753f1b2727d5142705041a296241a04274e65da / 2146a2d475a
PropertyPage is Id988417bfe67119aed2773a6becd6c4bd229c0c0 / 595742edb24
correction is I229e05a2cd47059a1682a5b6c6a353af04968139 / e56d112fd
These have useful changes in them, so we shouldn't delete these branches.
One strange aspect is that `correction` is attracting commits from
l10n-bot(a)translatewiki.net, while the others are not.
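A quick way to check what a stray branch actually contains before deleting
it (standard git commands; nexqt as the example branch from above):

git fetch origin
git log --oneline origin/master..origin/nexqt  # commits unique to the branch
git push origin --delete nexqt  # only once confirmed safe to drop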
--
John Vandenberg
Does somebody have an idea how to get
https://gerrit.wikimedia.org/r/#/c/166629/ merged?
13 months ago, the script already did all I needed; if pywikibot's
requirements differ from mine, I'd like to know. I personally don't need
the code merged, but I know other users for whom it would be a
simplification to find it in the standard repository.
Nemo