jayvdb added a comment.
Re -random, and even the pagegenerator: I doubt anyone using those cares about strict randomness.
I was looking for a MediaWiki API bug about allowing continuation, but I didn't find one.
TASK DETAIL
https://phabricator.wikimedia.org/T84944
REPLY HANDLER ACTIONS
Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign <username>.
EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: jayvdb
Cc: gerritbot, valhallasw, jayvdb, Aklapper, Mpaa, pywikipedia-bugs
valhallasw added a comment.
The underlying randomness algorithm is as follows:
- each page is stored with a random number, `page_random`, between 0 and 1
- generator=random runs `SELECT * FROM page WHERE page_random > {value} LIMIT {limit}`, where `{value}` is a random number between 0 and 1 and `{limit}` is the number of pages to retrieve
I suppose the API could actually expose `page_random` as an opaque 'continue' parameter, which would then allow actual continuation, and hence provide full random-without-replacement?
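The continuation idea above can be sketched in a minimal simulation. This is illustrative only: it assumes rows come back ordered by `page_random` (which is what an index scan would give), and the table, helper names, and page titles are all made up for the example.

```python
import random

# Hypothetical model of MediaWiki's generator=random storage: each page
# carries a fixed random page_random value in [0, 1).
pages = sorted((random.random(), 'Page%d' % i) for i in range(100))


def random_batch(threshold, limit):
    """Simulate SELECT * FROM page WHERE page_random > {threshold}
    LIMIT {limit}, with rows in page_random order (illustrative only)."""
    hits = [p for p in pages if p[0] > threshold]
    return hits[:limit]


# If the API exposed the last row's page_random as an opaque 'continue'
# value, a client could walk forward without ever seeing a duplicate:
seen = []
threshold = random.random()
while True:
    batch = random_batch(threshold, 10)
    if not batch:
        break
    seen.extend(title for _, title in batch)
    threshold = batch[-1][0]  # continue from the last row's page_random

# Within one continuation walk there is no replacement:
assert len(seen) == len(set(seen))
```

Note the walk only covers pages above the initial random threshold; a fresh request starts from a new threshold, which matches how the real generator restarts its sequence.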
As for //our// users: they would typically use -random from the command line, and IIRC generators from the command line are always filtered for uniqueness.
jayvdb added a comment.
To me, 'step' feels like it is breaking a batch into non-overlapping subsets, which isn't strictly true if each 'step' is a new random sequence, especially if each batch contains only unique items (which means the server algorithm is slightly reducing the randomness whenever a duplicate appears).
If we look at a very small wiki, the underlying generator doesn't repeat if the limit isn't reached.
https://www.molnac.unisa.it/BioTools/mdcons/api.php?action=query&generator=…
https://www.molnac.unisa.it/BioTools/mdcons/index.php/Special:ListFiles
IMO, in site.randompages we are trying to expose the underlying MediaWiki API, and it doesn't have continuation. A caller can't know whether a limit of 20 is two batches of 10 from the server algorithm, or a single batch of 20 from the server algorithm (which is unique?). The only way to have any chance of knowing that is to obtain the API limit from paraminfo, and use that.
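The batching point can be made concrete with a small sketch. This is a hypothetical helper, not pywikibot's actual code: whether a request for 20 pages is served as one batch or two depends entirely on the API limit, which the caller would have to learn from paraminfo.

```python
def server_batches(total, api_limit):
    """Split a requested total into the batch sizes the server would use.
    Each batch is an independent random sequence, so uniqueness holds only
    within a batch, never across batches. (Illustrative helper only.)"""
    while total > 0:
        size = min(total, api_limit)
        yield size
        total -= size


# With an API limit of 10, a request for 20 pages is two independent
# random batches; with a limit of 20 it is a single batch:
print(list(server_batches(20, 10)))  # [10, 10]
print(list(server_batches(20, 20)))  # [20]
```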
However, I don't know the underlying randomness algorithm well enough to speak with much authority about that, or how many of our users want to 'see' the underlying randomness versus being happy with any randomness that has slight oddities introduced by multiple disjoint batches.
Whatever we do, we need to update the docstring to explain what we are doing, in case the caller cares.
cpa199 added a comment.
Many thanks for that; I can see that it has indeed been updated.
TASK DETAIL
https://phabricator.wikimedia.org/T87248
valhallasw added a comment.
I don't see why having step is related to having a continuation mechanism. It's a parameter that's passed to the API, and it has a well-defined meaning.
I also don't see why returning duplicates is an issue. Random sampling is typically with replacement unless specified otherwise, so the caller should not be surprised to see duplicates, and should filter them out manually.
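Client-side filtering is a one-liner in spirit. A minimal sketch (illustrative, not pywikibot's actual filter implementation):

```python
def filter_duplicates(gen):
    """Yield items from gen, dropping any already seen.
    Turns sampling-with-replacement output into unique items;
    illustrative only, not pywikibot's implementation."""
    seen = set()
    for item in gen:
        if item not in seen:
            seen.add(item)
            yield item


# A random generator may legitimately repeat pages:
sample = ['A', 'B', 'A', 'C', 'B', 'D']
print(list(filter_duplicates(sample)))  # ['A', 'B', 'C', 'D']
```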
jayvdb created this task.
jayvdb added subscribers: pywikipedia-bugs, jayvdb.
jayvdb added a project: pywikibot-core.
TASK DESCRIPTION
If "generate interwiki links" is enabled when generating a family file for Wikidata or Commons, all Wikipedia sites are added to `self.langs`:
```
$ python pwb.py generate_family_file.py https://commons.wikimedia.org/wiki/Main_Page commons_generated
Generating family file from https://commons.wikimedia.org/wiki/Main_Page
==================================
api url: https://commons.wikimedia.org/w/api.php
MediaWiki version: 1.25wmf12
==================================
Determining other languages...aa ab ace af ak als am an ang ar arc arz as ast av ay az ba bar bat-smg bcl be be-tarask be-x-old bg bh bi bjn bm bn bo bpy br bs bug bxr ca cbk-zam cdo ce ceb ch cho chr chy ckb co cr crh cs csb cu cv cy da de diq dsb dv dz ee egl el eml en eo es et eu ext fa ff fi fiu-vro fj fo fr frp frr fur fy ga gag gan gd gl glk gn got gsw gu gv ha hak haw he hi hif ho hr hsb ht hu hy hz ia id ie ig ii ik ilo io is it iu ja jbo jv ka kaa kab kbd kg ki kj kk kl km kn ko koi kr krc ks ksh ku kv kw ky la lad lb lbe lez lg li lij lmo ln lo lt ltg lv lzh mai map-bms mdf mg mh mhr mi min mk ml mn mo mr mrj ms mt mus mwl my myv mzn na nah nan nap nb nds nds-nl ne new ng nl nn no nov nrm nso nv ny oc om or os pa pag pam pap pcd pdc pfl pi pih pl pms pnb pnt ps pt qu rm rmy rn ro roa-rup roa-tara ru rue rup rw sa sah sc scn sco sd se sg sgs sh si simple sk sl sm sn so sq sr srn ss st stq su sv sw szl ta te tet tg th ti tk tl tn to tpi tr ts tt tum tw ty tyv udm ug uk ur
uz ve vec vep vi vls vo vro wa war wo wuu xal xh xmf yi yo yue za zea zh zh-classical zh-cn zh-min-nan zh-tw zh-yue zu
There are 301 languages available.
Do you want to generate interwiki links? This might take a long time. ([y]es/[N]o/[e]dit)y
Loading wikis...
* aa... downloaded
* ab... downloaded
* ace... downloaded
* af... downloaded
* ak... downloaded
...
* zh... downloaded
* zh-classical... in cache
* zh-cn... in cache
* zh-min-nan... downloaded
* zh-tw... in cache
* zh-yue... in cache
* zu... downloaded
* en... in cache
Writing pywikibot/families/commons_generated_family.py...
$ head -20 pywikibot/families/commons_generated_family.py
# -*- coding: utf-8 -*-
"""
This family file was auto-generated by $Id: 2dd21e4aaf7a93cf8749be841552881a80684b52 $
Configuration parameters:
url = https://commons.wikimedia.org/wiki/Main_Page
name = commons_generated
Please do not commit this to the Git repository!
"""
from pywikibot import family

class Family(family.Family):
    def __init__(self):
        family.Family.__init__(self)
        self.name = 'commons_generated'
        self.langs = {
            'hu': 'hu.wikipedia.org',
            'vec': 'vec.wikipedia.org',
            'bpy': 'bpy.wikipedia.org',
```
TASK DETAIL
https://phabricator.wikimedia.org/T85657