Is it possible for Pywikipediabot to access the number of editors of a certain page? Or at least, whether there has been more than one editor? If not, is there another tool I can use to create a list of pages on a wiki with only one editor?
I think there are quite a few old spam pages on our wiki, mainly in "User talk" space. A helpful script to identify them would be of the form:
If (number of editors == 1) then replace "(.*http://.*)" with "{{delete|tagged by ChriswaterguyBot as possible spam.}}\1"
Or, if there's another tool that can do the first part (produce a list of pages with only one editor) then replace.py can do the rest. Any suggestions?
(If we could also check whether that user has edited any other pages, that would be wonderful... but that might be asking a bit much.)
Thanks
lap = pywikibot.Page(site, 'Triatlon')
editorlist = set([h[2] for h in lap.getVersionHistory(getAll=True)])
This gives you a set of editors with unique names (that's why it is converted to a set); each entry of the version history is a tuple, and h[2] is the username. Use len(editorlist) to get the number of editors.
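A minimal sketch of how this could be combined with your tagging idea (assuming the compat branch with the usual "import wikipedia as pywikibot" alias; the page title and the tag text are only placeholders):

# Sketch only: tag a page if its whole history has exactly one editor.
import wikipedia as pywikibot  # compat branch assumed

site = pywikibot.getSite()
page = pywikibot.Page(site, 'User talk:SomeUser')  # placeholder title
history = page.getVersionHistory(getAll=True)
editors = set(h[2] for h in history)  # h[2] is the username of each revision
if len(editors) == 1 and 'http://' in page.get():
    newtext = u'{{delete|tagged by ChriswaterguyBot as possible spam.}}\n' + page.get()
    page.put(newtext, comment=u'Tagging possible spam (only one editor)')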
On 8 March 2012 20:04, Chris Watkins chriswaterguy@appropedia.org wrote:
Is it possible for Pywikipediabot to access the number of editors of a certain page? Or at least, whether there has been more than one editor? If not, is there another tool I can use to create a list of pages on a wiki with only one editor?
The best option for things like this is to use a database query. However, I'm not sure if you are able to run them for appropedia. The query would be something like
select page_namespace, page_title, count(distinct(rev_user)) as cnt
from revision
left join page on page_id = rev_page
group by page_namespace, page_title
having cnt = 1
limit 1;
Pywikipedia can tell you what the count is for a /specific/ page (as Bináris showed), but is unable to run such queries. The advantage of SQL queries is that you could even do this more specifically, for instance by listing only pages in user_talk that have at least one external link on them.
Last but not least; if all the links spam to one domain, you can consider using Special:LinkSearch instead. I'm not quite sure if pwb allows you to use such a list directly.
Best, Merlijn
2012/3/16 Merlijn van Deen valhallasw@arctus.nl
Last but not least; if all the links spam to one domain, you can consider using Special:LinkSearch instead. I'm not quite sure if pwb allows you to use such a list directly.
From the help of replace.py:
-weblink          Work on all articles that contain an external link to a given URL; may be given as "-weblink:url"
That means there is also a page generator that belongs to it.
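For example, something along these lines should tag every page linking to a given domain (the domain, the pattern and the summary are placeholders; check replace.py -help for the exact syntax of your version):

python replace.py -weblink:spamdomain.example -regex "(.*http://.*)" "{{delete|possible spam}}\1" -summary:"Tagging possible spam"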
On 16 March 2012 23:00, Bináris wikiposta@gmail.com wrote:
From the help of replace.py:
-weblink          Work on all articles that contain an external link to a given URL; may be given as "-weblink:url"
That means there is also a page generator that belongs to it.
Right. pagegenerators.LinksearchPageGenerator(link, step=500, site=None) is the one.
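In a script that would look roughly like this (compat assumed; the domain is just a placeholder):

import wikipedia as pywikibot  # compat branch assumed
import pagegenerators

site = pywikibot.getSite()
# Yields every page that contains an external link to the given domain.
gen = pagegenerators.LinksearchPageGenerator('spamdomain.example', site=site)
for page in gen:
    pywikibot.output(page.title())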
best, Merlijn
Thanks for all the help, Merlijn and Bináris.
There are a couple of options then. If the first solution, from Bináris, requires the page to be identified, I could make a list from AllPages for that namespace...
But for now (given my lack of skill in SQL and Python) it occurs to me that I can do a search for any match from a list of spam strings and replace with a delete tag. "(Florida|real estate|home insurance... )" - I have a list of a few hundred spammy phrases. And I'll store this email thread for future reference.
Thanks again!
On Sat, Mar 17, 2012 at 05:53, Merlijn van Deen valhallasw@arctus.nl wrote:
On 8 March 2012 20:04, Chris Watkins chriswaterguy@appropedia.org wrote:
Is it possible for Pywikipediabot to access the number of editors of a certain page? Or at least, whether there has been more than one editor? If not, is there another tool I can use to create a list of pages on a wiki with only one editor?
The best option for things like this is to use a database query. However, I'm not sure if you are able to run them for appropedia. The query would be something like
select page_namespace, page_title, count(distinct(rev_user)) as cnt
from revision
left join page on page_id = rev_page
group by page_namespace, page_title
having cnt = 1
limit 1;
Pywikipedia can tell you what the count is for a /specific/ page (as Bináris showed), but is unable to run such queries. The advantage of SQL queries is that you could even do this more specifically, for instance by listing only pages in user_talk that have at least one external link on them.
Last but not least; if all the links spam to one domain, you can consider using Special:LinkSearch instead. I'm not quite sure if pwb allows you to use such a list directly.
Best, Merlijn
2012/3/18 Chris Watkins chriswaterguy@appropedia.org
There are a couple of options then. If the first solution, from Bináris, requires the page to be identified, I could make a list from AllPages for that namespace...
Yes, and once you have cleaned it up properly, you may repeat the cleaning regularly, e.g. on a weekly basis, from Recentchanges, which is much faster.
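A rough sketch of such a run (I assume RecentchangesPageGenerator from pagegenerators here; check the exact name and parameters in your copy):

import wikipedia as pywikibot  # compat branch assumed
import pagegenerators

site = pywikibot.getSite()
# Assumed generator; goes through the most recently changed pages.
gen = pagegenerators.RecentchangesPageGenerator(number=500, site=site)
for page in gen:
    editors = set(h[2] for h in page.getVersionHistory(getAll=True))
    if len(editors) == 1:
        pywikibot.output(u'Possible spam candidate: %s' % page.title())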
But for now (given my lack of skill in SQL and Python)
Let me tell you how I became a Python programmer. I began to use Pywikipedia, then I realized that I wanted to modify something for my own needs and tried to understand it, then I began to experiment with basic.py, then I began to write my own scripts... Now I use Python as my general hobby programming language.
it occurs to me that I can do a search for any match from a list of spam strings and replace with a delete tag. "(Florida|real estate|home insurance... )" - I have a list of a few hundred spammy phrases.
That's a way, too, if your list is comprehensive enough. I strongly suggest you use fixes instead of command-line replacements. You may create a fix in fixes.py or user-fixes.py with all your stopwords, following the same pattern, whereas you can't type a few hundred words into a command. If your list is in a well-ordered form, you may put the words in a column of an Excel table, create the replacements with a text function, and copy them back to user-fixes.py. You may also want to replace these words with a special category instead of a delete tag and tell delete.py to kill them en masse.
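A hedged example of what such a fix might look like in user-fixes.py (the fix name, the phrases and the category are only placeholders):

# user-fixes.py (sketch): tag any line containing a known spam phrase.
fixes['spamtag'] = {
    'regex': True,
    'msg': {
        'en': u'Bot: tagging possible spam',
    },
    'replacements': [
        # One alternation built from the stopword list; extend with your own phrases.
        (ur'(.*(?:Florida|real estate|home insurance).*)',
         ur'\1\n[[Category:Possible spam]]'),
    ],
}

Then you can run it with something like "python replace.py -fix:spamtag -namespace:3" (3 being the User talk namespace) and let delete.py work on the category afterwards.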
One more question: do you know about the spam blacklist? This is an admins' tool for listing ugly websites. You just write the offending website on the blacklist, and users will be unhappy to see that they cannot save a page as long as it contains a link to that site. :-)
On Mon, Mar 19, 2012 at 02:05, Bináris wikiposta@gmail.com wrote:
Let me tell you how I became a Python programmer. I began to use Pywikipedia, then I realized that I wanted to modify something for my own needs and tried to understand it, then I began to experiment with basic.py, then I began to write my own scripts... Now I use Python as my general hobby programming language.
Hadn't seen basic.py - thanks. With all the commenting, it looks like a much better place to start than the other Python I've looked at.
it occurs to me that I can do a search for any match from a list of spam strings and replace with a delete tag. "(Florida|real estate|home insurance... )" - I have a list of a few hundred spammy phrases.
That's a way, too, if your list is comprehensive enough. I strongly suggest you use fixes instead of command-line replacements. You may create a fix in fixes.py or user-fixes.py with all your stopwords, following the same pattern, whereas you can't type a few hundred words into a command. If your list is in a well-ordered form, you may put the words in a column of an Excel table, create the replacements with a text function, and copy them back to user-fixes.py. You may also want to replace these words with a special category instead of a delete tag and tell delete.py to kill them en masse.
Thanks, good ideas. I wasn't familiar with fixes.py/user-fixes.py. I copy and paste from my working file to the command line rather than typing, but having the fixes in a file is still much better.
I don't expect to do this often - now we have AbuseFilter set up and it's catching new spam quite well. It's just a cleanup tool for old spam.
One more question: do you know about the spam blacklist? This is an admins' tool for listing ugly websites. You just write the offending website on the blacklist, and users will be unhappy to see that they cannot save a page as long as it contains a link to that site. :-)
Yes, SpamBlacklist (http://www.mediawiki.org/wiki/Extension:SpamBlacklist) - it apparently updates from the Wikimedia blacklist every 10-15 minutes, and we have added sites there as well. Very good, but we can't keep up with all the new spam sites (which makes the updates from Wikimedia very valuable).
Thanks again, Chris
-- Bináris