For the pywikipedia-l listeners just tuning in: the toolserver has an
overload of interwiki bots, and we want to reduce this. As such, we want to
switch to a single bot that runs all the interwiki updates from the
toolserver.
On 16 January 2012 09:19, Merlijn van Deen <valhallasw(a)arctus.nl> wrote:
> The only reasonable action we can take to reduce the memory
> consumption is to let the OS do its job in freeing memory: using one
> process to track pages that have to be corrected (using the database,
> if possible), and one process to do the actual fixing (interwiki.py).
> This should be reasonably easy to implement (i.e. use a pywikibot page
> generator to generate a list of pages, use a database layer to track
> interlanguage links and popen('interwiki.py <page>') if this is a
> fixable situation)
>
>
I took some time yesterday to work out some details on this - see
http://piratepad.net/T29Uj4j1U4 . It boils down to this:
1) generation of a list of pages to work on: from the database, if possible
2) dispatching interwiki.py based on that list of pages and handling logging
3) interwiki.py itself
My suggestion is to split these tasks, creating a simple interface
(e.g. WSGI) between 1) and 2), and using subprocesses between 2) and 3).
Yesterday I worked mainly on speeding up the startup of
interwiki.py, so that we can spawn one process per Page.
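A minimal sketch of that dispatcher (step 2), assuming titles arrive as a plain Python list; the command passed in and the cap on concurrent children are illustrative, not the actual toolserver setup:

```python
import subprocess
import sys
import time

def dispatch(command, titles, max_procs=4):
    """Spawn one child process per page title, e.g.
    command = [sys.executable, 'interwiki.py'].  At most max_procs
    children run at once; the OS reclaims each child's memory when it
    exits, which is the whole point of one process per Page."""
    procs, running = [], []
    for title in titles:
        # Wait until a slot frees up before spawning the next child.
        while len(running) >= max_procs:
            running = [p for p in running if p.poll() is None]
            time.sleep(0.05)
        p = subprocess.Popen(command + [title])
        procs.append(p)
        running.append(p)
    # Collect the exit code of every child we started.
    return [p.wait() for p in procs]
```

In the real setup `command` would point at interwiki.py; here it can be any executable.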
On the Toolserver side, I would appreciate any comments/work/existing work
on the creation of an interwiki graph from the database - there are already
tools that suggest images based on interwiki links, so this code should be
around - and hopefully be adaptable. The only goal for this process would
be to create a list of starting pages interwiki.py can use - i.e. graphs
with one or more missing links, but without any double links.
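For the graph-building side, the langlinks table (ll_from, ll_lang, ll_title) already holds each page's outgoing interlanguage links. A toy illustration with an in-memory sqlite3 database standing in for the Toolserver replica (the sample rows are made up):

```python
import sqlite3

# Toy stand-in for the Toolserver replica; the real langlinks table has
# the same columns (ll_from = source page id, ll_lang, ll_title).
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE langlinks (ll_from INTEGER, ll_lang TEXT, ll_title TEXT)')
db.executemany('INSERT INTO langlinks VALUES (?, ?, ?)', [
    (1, 'de', 'Apfel'),
    (1, 'fr', 'Pomme'),
    (2, 'de', 'Birne'),
])

# Outgoing interlanguage links for one page: the raw material for
# building the graph and spotting missing or doubled links.
rows = db.execute('SELECT ll_lang, ll_title FROM langlinks'
                  ' WHERE ll_from = ? ORDER BY ll_lang', (1,)).fetchall()
print(rows)  # [('de', 'Apfel'), ('fr', 'Pomme')]
```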
On the Pywikipedia side, some thoughts on running interwiki.py in a new
process would be welcome, e.g. how we can improve startup time ('kill all
the regexps!') and effectively spawn multiple processes. What
parameters (throttles?) should be tuned, et cetera.
Best,
Merlijn
Hello Pywikipedians,
As part of the TREC Knowledge Base Acceleration[1] eval for 2012, I want
to generate snapshots of wikipedia's link graph around certain pages as of
a given date. Does anyone have advice about this? Anyone doing something
similar that I can learn from?
For example, given a wikipedia.Page() instance for
urlname=Takashi_Murakami, call it "central node," I want to iterate
through the Pages returned by getReferences() and only keep those that
ref'ed my central node on a date in the past, like November 30, 2011.
Is the best (only?) way to do this to iterate through previous revisions
of those Pages and verify that they ref'ed the central node before that
date? What about pages that did ref and no longer do? (probably rare?)
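For what it's worth, one approach: fetch each referring page's last revision at or before the cutoff (the MediaWiki API does this with prop=revisions, rvlimit=1, rvdir=older and rvstart set to the date) and test that wikitext for a link to the central node. The API call is only sketched in a comment; the link test below is a plain function and deliberately ignores subtleties like first-letter case and redirects:

```python
import re

def links_to(wikitext, target):
    """True if the wikitext contains a [[target]] or [[target|...]] link.
    Underscores and spaces in titles are interchangeable; redirects and
    first-letter capitalisation are deliberately ignored in this sketch."""
    want = target.replace('_', ' ').strip()
    for m in re.finditer(r'\[\[([^\]|#]+)', wikitext):
        if m.group(1).replace('_', ' ').strip() == want:
            return True
    return False

# Fetching the page text as of the cutoff would use something like
# (parameters per the MediaWiki API; client code omitted):
#   action=query&prop=revisions&titles=...&rvprop=content&rvlimit=1
#   &rvdir=older&rvstart=2011-11-30T23:59:59Z
```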
Thanks for any pointers!
John
1 - http://www.mit.edu/~jrf/knowledge-base-acceleration/
--
___________________________
John R. Frank <jrf(a)mit.edu>
Hi guys,
You probably know by now that the English Wikipedia will be closed
tomorrow [1]. Keep an eye on your bots if they're running on the English
Wikipedia (don't forget interwiki.py) because you might see some
unexpected behaviour or just crashes tomorrow.
Maarten
[1]
http://blog.wikimedia.org/2012/01/16/wikipedias-community-calls-for-anti-so…
[For others: trying to list all page titles that contain any characters, *other than*: alphanumeric, spaces, underscores, dashes]
I'm seeing the same. I went to this regex chat and asked them for a regex, which worked on the regex tester http://regexpal.com/ but not on the bot :-(.
python pagegenerators.py -titleregex:.*[^\w\s-].* ... - [1]
I also tried these:
(?=.*[^\w\s-])(.*[\w\s-].*)
(?=.*[^\w\s-]).*
#1 hits all pages, including ones that contain only the allowed characters; for example, it also shows "Apple".
I want the bot to show me pages with the following titles:
My apple is sweet (song)
He said "hello" to me
But it should not show:
My apple is sweet - song
He said hello to me
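For what it's worth, the filter described above can be reproduced in plain Python with re.search (note that matching, as -titleregex may do, anchors at the start of the title, while searching does not):

```python
import re

# A title qualifies if it contains anything besides letters, digits,
# underscores, spaces and dashes; \w already covers [A-Za-z0-9_].
bad_char = re.compile(r'[^\w\s-]')

titles = ['My apple is sweet (song)', 'He said "hello" to me',
          'My apple is sweet - song', 'He said hello to me']
flagged = [t for t in titles if bad_char.search(t)]
print(flagged)  # ['My apple is sweet (song)', 'He said "hello" to me']
```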
________________________________
From: Jon Harald Søby <jhsoby(a)gmail.com>
To: Eric K <ek79501(a)yahoo.com>
Sent: Sunday, January 15, 2012 1:50 PM
Subject: Re: [Pywikipedia-l] Need help for: Page rename / insert text / update links
I'm trying to figure this one out too... I've been using mostly the same regex: [^A-Za-z0-9-\s]*, but mine hits _every_ page no matter what. Something's fishy. Will look more into it.
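The `*` is the likely culprit: a pattern like [^A-Za-z0-9\s-]* can match the empty string, so it succeeds on every title, while + demands at least one disallowed character. A quick demonstration:

```python
import re

# Zero-length matches are still matches, so '*' succeeds everywhere:
assert re.search(r'[^A-Za-z0-9\s-]*', 'Apple') is not None

# '+' requires at least one disallowed character:
assert re.search(r'[^A-Za-z0-9\s-]+', 'Apple') is None
assert re.search(r'[^A-Za-z0-9\s-]+', 'He said "hello"') is not None
```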
On 15 January 2012 at 19:49, Eric K <ek79501(a)yahoo.com> wrote:
I'm trying this for example:
>python pagegenerators.py -titleregex:[^A-Za-z0-9-\s]+
>(trying to make it say: don't match any letters, numbers, spaces or dashes)
>And it only brings up titles that begin with a " (quotation mark), but it misses titles that have the " somewhere in the middle.
>
>
>Then I remove the ^ for the regex and see what that does, and it gets all titles including those which have " and so on.
>python pagegenerators.py -titleregex:[A-Za-z0-9-\s]+
>
>
>It's probably my regex that is flawed :).
>
>________________________________
> From: Eric K <ek79501(a)yahoo.com>
>To: Jon Harald Søby <jhsoby(a)gmail.com>
>Sent: Sunday, January 15, 2012 1:40 AM
>
>Subject: Re: [Pywikipedia-l] Need help for: Page rename / insert text / update links
>
>
>
>Thanks, I'm learning! I tried step one for generating the title list, but it said "incomplete XML data". It was doing a text replace, and I was able to make it work with "-debug", but it was slow.
>Then I found out this command:
>python pagegenerators.py -titleregex:apple
>http://www.mediawiki.org/wiki/Manual:Pywikipediabot/pagegenerators.py
>(there is no -save option for this script, but if I add "> filelist.txt" at the end, it redirects the screen output to that text file, which works)
>
>So this one works only on the page titles (so it's quicker), and it does work: e.g. for the above, it will list any pages beginning with the word "apple". I actually found there are other characters in the database, so the best way would be to do this kind of search:
>- If a page contains any character which is not:
>--- alphanumeric
>--- underscore
>--- dash
>--- space
>Then include that page in the list.
>So "Hello 123" would be excluded but "Hello 123$" would be included.
>pagegenerators.py does not have an "exclude" option like replace.py has.
>
>
>
>Do you know of a regex that will work?
>
>________________________________
> From: Jon Harald Søby <jhsoby(a)gmail.com>
>To: Eric K <ek79501(a)yahoo.com>
>Sent: Saturday, January 14, 2012 10:09 PM
>Subject: Re: [Pywikipedia-l] Need help for: Page rename / insert text / update links
>
>
>Good luck! :-)
>My regex skills are quite rudimentary, so be cautious when doing the replacement step 5 -- the regexes may catch something they shouldn't. Please let me know how it goes! :-)
>
>
>On 15 January 2012 at 04:43, Eric K <ek79501(a)yahoo.com> wrote:
>
>Hi Jon, wow, that really is awesome, that you were able to do this with the provided scripts. I could never have come up with that. I'll try this right away and let you know how it goes.
>>
>>________________________________
>> From: Jon Harald Søby <jhsoby(a)gmail.com>
>>To: Eric K <ek79501(a)yahoo.com>; Pywikipedia discussion list <pywikipedia-l(a)lists.wikimedia.org>
>>Sent: Saturday, January 14, 2012 8:47 PM
>>Subject: Re: [Pywikipedia-l] Need help for: Page rename / insert text / update links
>>
>>2012/1/15 Eric K <ek79501(a)yahoo.com>
>>
>>Hi guys,
>>>I just installed the pywikipedia bot on my wiki yesterday. I'm new to Python but I can try learn it since I'm familiar with PHP. It would take me a while though to make this first bot since I'm new to the language. The tasks are pretty straightforward. I would like the bot to run without any user input and do all of this by itself:
>>>
>>>
>>>1. For every page on the wiki, check if it has any of these three characters: ( , ) , : . Any page containing any of them (parentheses or a colon) will be moved to a new title. The original title is var_1.
>>>
>>>
>>
>>>
>>>2. For the new title, the parentheses are simply deleted, and the : (colon) is replaced with a " - " (a dash with a space on each side). The new title generated is var_2.
>>>
>>
>>>
>>>3. Insert this text at the top of this page: {{page_rename|var_1}}, and save page.
>>>
>>
>>>
>>>4. Find any existing links on the site to this page which would be in the format of [[var_1]], and change them to [[var_2|var_1]].
>>>
>>>
>>>I don't need any menus or other functionality. Is this something pretty straightforward to make? I would appreciate any tips/help, and if it's something that can be made pretty easily, I would be really thankful if someone could do this for me or give me a good start.
>>>I've looked at some of the existing pywikipedia bot scripts (basic.py, movepages.py), but none of them would work for me, and being new to Python it would take me a long time to do what I need; in any case I will learn a lot in this first attempt.
>>>
>>>thanks, Eric
>>>_______________________________________________
>>>Pywikipedia-l mailing list
>>>Pywikipedia-l(a)lists.wikimedia.org
>>>https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
>>>
>>>
>>
>>This is how I would do it. It is probably a hacky solution, and there may be better/more efficient ways of doing it, but it should work.
>>
>>Step 1: Getting list of pages to change
>>
>>Run this line:
>>
>>
>>python replace.py -regex -requiretitle:"\(|\)|:" "[A-Za-z0-9]" "test" -save:Pagestoberenamed.txt -start:!
>>
>>Press "a" when it prompts.
>>
>>This will not change anything, only save a list of all pages that
>>need to be renamed. The script assumes there is either a letter or a
>>number in all the pages that need to be changed.
>>
>>Step 2: Put that template on top of the pages
>>
>>Run this line:
>>
>>python add_text.py -up -text:"{{page_rename|{{subst:PAGENAME}}}}" -file:Pagestoberenamed.txt
>>
>>Step 3: Creating list for renaming files
>>
>>Open the file "Pagestoberenamed.txt" in a regex-supporting text editor and use the follow regex replacements:
>>
>>Replace:
>>
>>#\[\[([^:]*):([^\]]*)\]\]
>>with
>>
>>[[\1:\2]] [[\1 - \2]]
>>
>>
>>and replace
>>
>>#\[\[([^\(]*)\(([^\)]*)\)([^\]]*)\]\]
>>
>>with
>>
>>[[\1(\2)\3]] [[\1\2\3]]
>>
>>
>>
>>I don't actually have a text editor that supports regex, so instead I
>>copy-pasted the contents of that file into a sandbox page, and ran the
>>following line:
>>
>>python replace.py -page:SANDBOX -regex "#\[\[([^:]*):([^\]]*)\]\]" "[[\1:\2]] [[\1 - \2]]" "#\[\[([^\(]*)\(([^\)]*)\)([^\]]*)\]\]" "[[\1(\2)\3]] [[\1\2\3]]"
>>
>>Save the text as Pagerenaming.txt
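The two editor replacements from step 3 can also be applied with Python's re module; the sample lines below are invented, and the exact format replace.py's -save option writes may differ:

```python
import re

# Invented sample lines in the '#[[Title]]' shape described above.
lines = ['#[[Songs:My apple is sweet]]', '#[[He said (loudly) hi]]']

pairs = []
for line in lines:
    # colon -> ' - ', emitting 'old new' pairs for movepages.py
    line = re.sub(r'#\[\[([^:]*):([^\]]*)\]\]',
                  r'[[\1:\2]] [[\1 - \2]]', line)
    # parentheses stripped in the new title
    line = re.sub(r'#\[\[([^\(]*)\(([^\)]*)\)([^\]]*)\]\]',
                  r'[[\1(\2)\3]] [[\1\2\3]]', line)
    pairs.append(line)
print(pairs)
```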
>>
>>
>>
>>Hacky solution, but it should work.
>>
>>Step 4: Moving the pages
>>
>>Run this line:
>>
>>python movepages.py -pairs:Pagerenaming.txt
>>
>>It will not prompt you; it will move the pages as specified in Pagerenaming.txt.
>>If you do not want to have redirects from the old page names, use
>>-noredirect as an additional argument. This may depend on how your wiki
>>is set up; I know Wikipedias didn't have this option until relatively
>>recently (and maybe it is only for administrators now).
>>
>>Step 5: Fixing links
>>Links can be fixed using this line:
>>
>>python replace.py -regex "\[\[([^:]*):([^\]]*)\]\]" "[[\1 - \2|\1: \2]]" "\[\[([^\(|^\[]*)\(([^\)]*)\)([^\]]*)\]\]" "[[\1\2\3|\1(\2)\3]]" -start:!
>>
>>If you think it is too slow, you can append -pt:1 to that.
>>
>>With this last one you should be careful, and approve quite a few changes manually first (pressing "y" and not "a"), in case something is fishy with the regex.
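To see what the step-5 regexes actually do, here they are applied to a made-up line of wikitext:

```python
import re

text = 'See [[Songs:Apple]] and [[Parable (story)]].'
# colon titles: point the link at the new ' - ' title, keep the old text
text = re.sub(r'\[\[([^:]*):([^\]]*)\]\]', r'[[\1 - \2|\1: \2]]', text)
# parenthesised titles: point the link at the paren-free title
text = re.sub(r'\[\[([^\(|^\[]*)\(([^\)]*)\)([^\]]*)\]\]',
              r'[[\1\2\3|\1(\2)\3]]', text)
print(text)
# See [[Songs - Apple|Songs: Apple]] and [[Parable story|Parable (story)]].
```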
>>
>>Hope this helps.
>>
>>--
>>
>>mvh
>>Jon Harald Søby
--
mvh
Jon Harald Søby
Hi,
Page has this method:
def put_async(self, newtext,
comment=None, watchArticle=None, minorEdit=True,
force=False,
It says: All arguments are the same as for .put(), except callback:
But three of the parameters of Page.put are missing: *sysop=False,
botflag=True, maxTries=-1*
How can I use these?
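One possible workaround, sketched under the assumption that Page.put() itself accepts those keywords and tolerates being called from another thread (neither is verified here); put_async_all is an invented name, not part of the library:

```python
import threading

def put_async_all(page, newtext, **kwargs):
    """Invented helper: save on a background thread while forwarding
    *every* keyword argument to page.put(), including sysop, botflag
    and maxTries, which put_async's own signature leaves out."""
    t = threading.Thread(target=page.put, args=(newtext,), kwargs=kwargs)
    t.start()
    return t
```

Calling .join() on the returned thread before the script exits makes sure the edit is actually saved.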
--
Bináris
Hi,
I try to run upload.py under Windows 7.
Installed: Python 2.7
Pywikipediabot: Latest from SVN
Executing this line from Windows PowerShell:
c:\python27\python.exe D:\F_Programmierung\pywikipedia\upload.py -keep
-log:upload.log -filename:"Kulturpreis der Sparkassen-Kulturstiftung
Rheinland 2011-5606.jpg" -noverify "D:\Eigene Bilder\WP
Bilder\Kulturpreis der Sparkassen-Kulturstiftung Rheinland
2011-5606.jpg" "== {{int:filedesc}} ==......."
results in "No input filename given".
Same error if I simplify the line to
c:\python27\python.exe D:\F_Programmierung\pywikipedia\upload.py
-filename:xxx.jph .\xxx.jpg "...."
Yes, the paths are correct.
Any ideas? Thanks.
Raimond.
Hi,
the trunk version's user agent is *PythonWikipediaBot/1.0*, but the
description of wikipedia.py says:
setUserAgent(text): Sets the string being passed to the HTTP server as
the User-agent: header. Defaults to '*Pywikipediabot/1.0*'.
This should probably be corrected. Not a high priority. :-) But can somebody tell me
the UA of the rewrite version?
--
Bináris
I saw that inside config.py there are these lines:
# This is currently not used anywhere:
account_global = False
I don't know what this means technically, but from its name one may
guess it means "do not log in globally". But since almost every
account is global, why not allow global login with the bot?
Many users work on more than one wiki; for example, I work on both
it.wiki and it.wikt, and there are also global bots that work on a big
number of wikis. A global login would therefore be convenient, no?
Happy 2012
Nickanc