On 06/12/25 13:11, mygreycooper(a)gmail.com wrote:
> I would like to begin with some background, because many non-Chinese Wikimedia contributors may not be aware of how significant CJO has been for judicial transparency in China and how sharply access to it has been reduced in recent years.
Thanks for this context, it's super interesting!
> For our purposes, the important point is this: CJO has removed or restricted access to large portions of its historical archive, including documents that were originally public, legally non-copyrightable under Chinese law, and crucial for understanding the functioning of China’s legal system. Many judgments that were once easily verifiable on the official site can no longer be checked against their original source. These documents are at risk of disappearing entirely from public access.
How strong is the presumption of copyright-ineligibility? What's the
legal source for it and could it change in the future? (I'm clueless
about the hierarchy of sources of law in China, sorry.)
> Have other Wikisources hosted similarly massive, uniform corpora of government or legal documents? How did you determine whether they fit the mission of Wikisource? Were there concerns about overwhelming the project or changing its character?
Nothing as massive, but Italian Wikisource hosts court rulings, usually
when they are especially newsworthy. In those cases (think powerful
politicians) there was always someone interested in getting them
removed, but I don't recall whether there were official requests for
redactions. However, we very intentionally do not copy all court rulings
from official court databases, because they are known to be riddled with
personal data. JurisWiki, a project by an experienced Italian lawyer and
free knowledge advocate (Simone Aliprandi), had to shut down over such
issues after importing "just" 400k court rulings.
> In our case, the source is an independent mirror of a government website that is now selectively removing documents. While Wikimedia projects have long preserved public domain government documents after originals were taken down or censored, I am unsure how Wikisource communities have handled this scenario in practice. Are mirrored datasets acceptable when the original public source has been altered or removed? How should we document provenance and authenticity for future readers?
I would say that relying on a mirror is *better* than using an official
source, because you can have an additional layer of vetting, just like
we do with PGDP.
Are you in contact with the people behind that database? Will they be
responsive when you find personal data that failed to be redacted?
(This is a "when", not an "if"; it is certain to happen.)
What's the added benefit that a Wikisource copy would bring to that
project? Find out, and focus on that. (Does it really need a
comprehensive copy?)
> If we proceed, how should we structure this corpus so the project remains usable? Are there recommended practices for:
> – titling, metadata, and Wikidata integration for legal documents,
Wikidata should be ruled out immediately, as it cannot handle this
volume of documents.
As for titles, categories etc., you should probably talk with Chinese
practitioners who can tell you how people usually search for these
documents. Say the rulings are organised in tidy partitions across 100
different provinces (I'm making this up) and people usually search
within each of them; then you can use those as prefixes and
disambiguation will be easy, as in the sketch below.
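To make that concrete, here is a tiny sketch of such a prefix scheme
(the partition fields and the title pattern are invented for
illustration; the real convention should come from those practitioners):

  # Hypothetical title scheme: prefix pages by the partition people
  # actually search within (e.g. the province), then disambiguate further.
  def page_title(province: str, court: str, year: int, case_no: str) -> str:
      return f"Judgments of {province}/{court}/{year}/{case_no}"

  # "Judgments of Guangdong/Guangzhou Intermediate People's Court/2014/Case 123"
  print(page_title("Guangdong", "Guangzhou Intermediate People's Court",
                   2014, "Case 123"))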
> – organizing millions of pages so they do not overwhelm categories and search,
> – mitigating strain on job queues, dumps, and indexing,
I would say don't worry too much about this part, as the WMF will let
you know if it becomes a problem. Just don't come up with exceedingly
esoteric templates, and don't rely on DynamicPageList or other
extensions known to be slow.
>
> 4. Political and archival importance
> Wikisource has historically preserved documents at risk of censorship or disappearance, whether due to authoritarian restrictions or institutional neglect. Do other communities have experience with politically sensitive archival projects where the preservation value itself was a central motivation?
Yes, see above, but not at this scale.
Best,
Federico
Hello,
if you can translate to any of the following languages:
ar as be cs cy de es ka km kn ko ky mn ne nn qu ro si sr to ur uz yo yue
zh zu
you can help by joining the current CLDR localisation round (see below
for some statistics). If you added translations in the past, simply
logging in will add them to this round as well. If you do not have an
account, see:
https://translatewiki.net/wiki/CLDR
Best,
Federico
-------- Forwarded Message --------
Subject: CLDR Survey Tool announcement: 2 weeks until vetting starts
Date: Wed, 28 May 2025 04:00:07 +0000 (UTC)
From: CLDR SurveyTool
This message is being sent to you on behalf of Annemarie Apple (Google)
- user #2303
From: Annemarie Apple (Google)
To: Everyone
Organization(s): All
Locale(s): ar as az be cs cy de es ka km kn ko kok ky mn ne nn qu ro si
sr to ur uz yo yue zh zu
Reminder, we currently only have 2 weeks remaining of submission before
starting vetting. It looks like your locale has more than 150 items in
either missing or provisional.
Please prioritize missing and provisional items since you will not be
able to add new data items once we start the vetting period on June
11th. Thank you!
Current overall progress in case it is helpful:
https://docs.google.com/spreadsheets/d/1515Ntysw61Rhy1ybv-NdUr81wXXQuzLElsC…
Do not reply to this message, instead please go to the Survey Tool.
https://st.unicode.org
It looks like prometheus-pushgateway.discovery.wmnet (as documented in https://wikitech.wikimedia.org/wiki/Prometheus#Ephemeral_jobs_(Pushgateway)) is not reachable from my VPS instance:
$ traceroute prometheus-pushgateway.discovery.wmnet
traceroute to prometheus-pushgateway.discovery.wmnet (10.64.0.82), 30 hops max, 60 byte packets
1 vlan-legacy.cloudinstances2b-gw.svc.eqiad1.wikimedia.cloud (172.16.0.1) 0.657 ms 0.632 ms 0.563 ms
2 vlan1107.cloudgw1004.eqiad1.wikimediacloud.org (185.15.56.234) 0.513 ms 0.486 ms 0.440 ms
 3-30  * * *  (hops 3 through 30: no response)
Is that the correct host to be using?
> On May 4, 2025, at 5:39 PM, Roy Smith <roy(a)panix.com> wrote:
>
> Thanks for the input. Yes, in the statsd world, these are what I would have called gauges. Histograms might be nice, but to get started, just the raw gauges will be a useful improvement over what we have now, so I figure I'd start with that. And, yes, I expect I'll implement this in some Python scripts launched by cron under the Toolforge jobs framework.
>
> So, I guess if I wanted to do this on the command line, I would do:
>
> echo "some_metric 3.14" | curl --data-binary @- http://prometheus-pushgateway.discovery.wmnet/???
>
> where the ??? is the name of my job. Do I just make up something that looks reasonable, or is there some namespace that I get allocated for my metrics?
>
>
>
>> On May 4, 2025, at 3:07 PM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
>>
>> On 01/05/25 20:17, Roy Smith wrote:
>>> I want a graph vs. time, which is what statsd/graphite was good at, so I assumed Prometheus would also be good at it. Why is this silly?
>>
>> It's not silly at all! If you use standard Prometheus metrics and some labels, you can later also get some basic statistical analysis for free on Grafana.
>>
>> What you described is called a Prometheus exporter. It would take the raw data (from the MediaWiki API?) and output the metrics in Prometheus format. You can hand-craft the metrics even in bash, but probably something like Python or Rust where you have both MediaWiki and Prometheus libraries will be easiest.
>>
>> The pushgateway is the traditional solution for a batch job like this. I don't know how authentication etc. is handled in WMF though.
>>
>> The metrics you described are mostly gauges. For things like the time spent sitting in queues, you may want a histogram (so you can calculate e.g. the 75th percentile or the longest-waiting proposal). This is definitely best done with a Prometheus library (but make sure to manually set the buckets to some reasonable intervals, probably in terms of hours and days, otherwise you might get some unhelpful defaults starting from ms).
>>
>> https://www.robustperception.io/how-does-a-prometheus-histogram-work/
>> https://prometheus.io/docs/practices/histograms/
>>
>> Best,
>> Federico
>
Thanks for the input. Yes, in the statsd world, these are what I would have called gauges. Histograms might be nice, but to get started, just the raw gauges will be a useful improvement over what we have now, so I figure I'd start with that. And, yes, I expect I'll implement this in some Python scripts launched by cron under the Toolforge jobs framework.
So, I guess if I wanted to do this on the command line, I would do:
echo "some_metric 3.14" | curl --data-binary @- http://prometheus-pushgateway.discovery.wmnet/???
where the ??? is the name of my job. Do I just make up something that looks reasonable, or is there some namespace that I get allocated for my metrics?
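A minimal sketch of what that push could look like with the
prometheus_client Python library, assuming the standard Pushgateway path
convention (/metrics/job/<job_name>) and a made-up job name; whether the
WMF instance restricts job names or requires authentication is not
covered here:

  # Sketch only: the host is taken from the thread, but the port, job
  # name, and lack of authentication are assumptions.
  from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

  registry = CollectorRegistry()
  g = Gauge("some_metric", "Example gauge pushed from a batch job",
            registry=registry)
  g.set(3.14)

  # Roughly equivalent to:
  #   echo "some_metric 3.14" | curl --data-binary @- \
  #     http://prometheus-pushgateway.discovery.wmnet/metrics/job/my_batch_job
  push_to_gateway("prometheus-pushgateway.discovery.wmnet",
                  job="my_batch_job", registry=registry)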
> On May 4, 2025, at 3:07 PM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
>
> On 01/05/25 20:17, Roy Smith wrote:
>> I want a graph vs. time, which is what statsd/graphite was good at, so I assumed Prometheus would also be good at it. Why is this silly?
>
> It's not silly at all! If you use standard Prometheus metrics and some labels, you can later also get some basic statistical analysis for free on Grafana.
>
> What you described is called a Prometheus exporter. It would take the raw data (from the MediaWiki API?) and output the metrics in Prometheus format. You can hand-craft the metrics even in bash, but probably something like Python or Rust where you have both MediaWiki and Prometheus libraries will be easiest.
>
> The pushgateway is the traditional solution for a batch job like this. I don't know how authentication etc. is handled in WMF though.
>
> The metrics you described are mostly gauges. For things like the time spent sitting in queues, you may want a histogram (so you can calculate e.g. the 75th percentile or the longest-waiting proposal). This is definitely best done with a Prometheus library (but make sure to manually set the buckets to some reasonable intervals, probably in terms of hours and days, otherwise you might get some unhelpful defaults starting from ms).
>
> https://www.robustperception.io/how-does-a-prometheus-histogram-work/
> https://prometheus.io/docs/practices/histograms/
>
> Best,
> Federico
On 01/05/25 20:17, Roy Smith wrote:
> I want a graph vs. time, which is what statsd/graphite was good at, so I assumed Prometheus would also be good at it. Why is this silly?
It's not silly at all! If you use standard Prometheus metrics and some
labels, you can later also get some basic statistical analysis for free
on Grafana.
What you described is called a Prometheus exporter. It would take the
raw data (from the MediaWiki API?) and output the metrics in Prometheus
format. You can hand-craft the metrics even in bash, but probably
something like Python or Rust where you have both MediaWiki and
Prometheus libraries will be easiest.
The pushgateway is the traditional solution for a batch job like this. I
don't know how authentication etc. is handled in WMF though.
The metrics you described are mostly gauges. For things like the time
spent sitting in queues, you may want a histogram (so you can calculate
e.g. the 75th percentile or the longest-waiting proposal). This is
definitely best done with a Prometheus library (but make sure to
manually set the buckets to some reasonable intervals, probably in terms
of hours and days, otherwise you might get some unhelpful defaults
starting from ms).
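To illustrate, a minimal exporter sketch (the metric names, bucket
boundaries, port and stubbed data are all invented for illustration, not
anything WMF-specific):

  # Sketch of a tiny exporter; the MediaWiki API call is stubbed out.
  import time
  from prometheus_client import Gauge, Histogram, start_http_server

  open_proposals = Gauge(
      "proposals_open", "Number of proposals currently open")

  # Buckets chosen manually, in seconds (hours to weeks), instead of the
  # default millisecond-oriented buckets.
  queue_wait = Histogram(
      "proposal_queue_wait_seconds",
      "Time proposals have spent waiting in the queue",
      buckets=[3600, 6 * 3600, 86400, 3 * 86400, 7 * 86400, 14 * 86400])

  def collect_once():
      # Here you would query the MediaWiki API; hard-coded stand-in values.
      open_proposals.set(42)
      for wait_seconds in (7200, 90000, 400000):
          queue_wait.observe(wait_seconds)

  if __name__ == "__main__":
      start_http_server(9100)  # port for Prometheus to scrape; arbitrary
      while True:
          collect_once()
          time.sleep(300)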
https://www.robustperception.io/how-does-a-prometheus-histogram-work/
https://prometheus.io/docs/practices/histograms/
Best,
Federico
[re-posting here as I inadvertently just replied to Federico privately; ah
the joys of MUAs...]
On Thu, Apr 10, 2025 at 8:56 PM Federico Leva (Nemo) <nemowiki(a)gmail.com>
wrote:
>
> >
> > The new policy isn’t more restrictive than the older one for general
> > crawling of the site or the API; on the contrary we allow higher limits
> > than previously stated.
>
> I find this hard to believe, considering this new sentence for
> upload.wikimedia.org: «Always keep a total concurrency of at most 2, and
> limit your total download speed to 25 Mbps (as measured over 10 second
> intervals).»
>
> This is a ridiculously low limit. It's a speed which is easy to breach
> in casual browsing of Wikimedia Commons categories, let alone with any
> kind of media-related bots.
>
>
First of all, each of the limits explicitly excludes web browsers and
human activity in general.
This limit (which we can discuss, see below) is intended to ensure that
a single unidentified agent cannot use a significant slice of our
available resources.
Second, there was no stated limit on downloads of media files in the
policy, IIRC, because it was written in 2009 when media downloads
weren't as big of an issue, which is why the quote you report explicitly
says "the site or the API" - any limit imposed on media downloads is
indeed by default more restrictive.
It was never my goal, in updating the policy, to limit what can be
done, but rather to eventually get to a point where we can safely
identify whether some traffic is coming from a user, a high-volume bot
we've identified, or random traffic from the internet.
This will help reduce the constant stream of incidents related to
predatory downloading of our images while also reducing the impact on
legitimate users[1].
Simply put, I want to be able to know who's doing what, and be able to put
general limits on unidentified actors that we can determine clearly aren't
a user-run browser.
As you can imagine, I have a personal interest in this - moving from the
game of whack-a-mole SRE plays nowadays to systematic enforcement of limits
on unidentified clients will improve my own quality of life.
I have no interest in, nor intention of, preventing people from
archiving Wikipedia, and I guess neither does the community, which I
hope could eventually grant tiers of usage to individual bots, leaving
me/us only the role of defining said tiers.
It was never my intention, in writing the limits, to impede any
activity, but rather to put ourselves in a position where we're more
aware of who is doing what.
> I appreciate that some exceptions for Wikimedia Cloud bots were added
> after the discussion at
> https://phabricator.wikimedia.org/T391020#10716478 , but the fact
> remains that this comes off as a big change.
>
>
Actually, the exception for WMCS has been around for years and has been
a pillar of the policy since I wrote the first draft. Protecting
community use while also protecting the infrastructure (and, honestly,
my weekends :) ) has always been my main goal.
Having said all of the above, I see how the 25 Mbps limit seems
stringent; in evaluating it, let me explain how I got to that number:
* Because of the nature of media downloads, it will be extremely hard
for us to enforce limits that are not per-IP - I don't want to get into
more details on that, but let me just say that fairly rate-limiting
usage of our media serving infrastructure isn't simple, especially if
you're trying very hard not to interfere with human consumption.
* I calculated what sustained bandwidth we can support in a single
datacenter without saturating more than 80% of our outgoing links, if a
crawler uses as many different IP addresses as the largest number we've
seen from one of these abuse actors.
So yes, the number is probably a bit defensive, and we can discuss
whether it's enough for non-archival bot usage.
I'd argue I'd be happy if an archival tool uses and needs more
resources; I would also like to be able not to worry about it and/or
block it in an emergency.
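For concreteness, this is roughly what staying within those numbers
would mean on the client side (a sketch only: the accounting
granularity, User-Agent string and error handling are assumptions, not
part of the policy):

  # Keep at most 2 downloads in flight and stay under 25 Mbps as measured
  # over 10-second windows.
  import threading
  import time
  import urllib.request

  MAX_CONCURRENCY = 2
  WINDOW_SECONDS = 10
  LIMIT_BITS_PER_WINDOW = 25_000_000 * WINDOW_SECONDS  # 25 Mbps over 10 s

  _sem = threading.Semaphore(MAX_CONCURRENCY)
  _lock = threading.Lock()
  _window_start = time.monotonic()
  _window_bits = 0

  def _account(nbytes):
      """Track bits per window; sleep out the window once the budget is spent."""
      global _window_start, _window_bits
      with _lock:
          now = time.monotonic()
          if now - _window_start >= WINDOW_SECONDS:
              _window_start, _window_bits = now, 0
          _window_bits += nbytes * 8
          sleep_for = 0
          if _window_bits >= LIMIT_BITS_PER_WINDOW:
              sleep_for = WINDOW_SECONDS - (now - _window_start)
      if sleep_for > 0:
          time.sleep(sleep_for)

  def fetch(url, dest):
      with _sem:  # never more than 2 downloads in flight
          req = urllib.request.Request(url, headers={
              # An identifying User-Agent, as the policy asks; this one is made up.
              "User-Agent": "ExampleArchiveBot/0.1 (someone@example.org)"})
          with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
              while True:
                  chunk = resp.read(1 << 20)  # 1 MiB at a time
                  if not chunk:
                      break
                  out.write(chunk)
                  _account(len(chunk))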
Again, the reason I've asked for feedback is that I'm open to changing
things, in particular the numbers I've settled on, which of course come
from the perspective of someone trying to preserve resources for
consumption.
If you have a suggestion about what you think would be a more reasonable
default limit, considering the above, please make it on the talk page.
If you have suggestions for making the intention of the policy clearer,
those are of course also welcome.
Cheers,
Giuseppe
[1] To give an example with a screwup of mine: two weekends ago, a
predatory scraper masquerading as Google Chrome and coming from all over
the internet brought down our media backend serving twice. I and others
intervened and saved the situation, but the ban I created cast a little
too large a net, and I forgot to remove it, which ended up causing
issues for users; see https://w.wiki/Dmfn
--
Giuseppe Lavagetto
Principal Site Reliability Engineer, Wikimedia Foundation
That's a good point. I actually need a W* media dump in my work right
now (incl. usage + all captions); if you do too, perhaps we should see
how effectively we can compile one via a WMC tool. Alternatively, if WME
can offer the same for a fee, I would be glad to pay it.
S.
🌍🌏🌎🌑
On Thu, Apr 10, 2025, 2:57 PM Federico Leva (Nemo) <nemowiki(a)gmail.com>
wrote:
> On 08/04/25 18:08, Giuseppe Lavagetto wrote:
> > I’ve updated our Robot Policy[0], which was vastly outdated, the main
> > revision being from 2009.
>
> Thanks for working on an update! It seems there was a misalignment of
> expectations, which is in itself a problem to fix.
>
> >
> > The new policy isn’t more restrictive than the older one for general
> > crawling of the site or the API; on the contrary we allow higher limits
> > than previously stated.
>
> I find this hard to believe, considering this new sentence for
> upload.wikimedia.org: «Always keep a total concurrency of at most 2, and
> limit your total download speed to 25 Mbps (as measured over 10 second
> intervals).»
>
> This is a ridiculously low limit. It's a speed which is easy to breach
> in casual browsing of Wikimedia Commons categories, let alone with any
> kind of media-related bots.
>
> At the suggested speed, it would take over 150 years for a person to
> download Wikimedia Commons files alone.
>
> Needless to say, I breached such a threshold all the time when I
> compiled the https://archive.org/details/wikimediacommons collection. I
> typically aimed to saturate my upload bandwidth at all times when
> updating it, so I must have tried to download at about 100 Mbps, and it
> still took me months. (I used to run those scripts from my home in
> Milan, downloading the files to an external HDD. I stopped updating the
> collection after 2016 in part because I don't have FTTH in Helsinki, and
> the daily downloads were far too big for any storage in Wikimedia Cloud.)
>
> I appreciate that some exceptions for Wikimedia Cloud bots were added
> after the discussion at
> https://phabricator.wikimedia.org/T391020#10716478 , but the fact
> remains that this comes off as a big change.
>
> On 09/04/25 19:10, AntiCompositeNumber wrote:
> > I'll just note that both API:Etiquette and the Robot Policy have been
> > incorporated by reference into the Terms of Use:
> > https://foundation.wikimedia.org/wiki/Policy:Terms_of_Use/en#12._API_Terms
> >
> > Undiscussed changes to the Terms of Use should be avoided.
>
> This is a good point.
>
> There are parts of the terms of use which assume the [[m:Right to fork]]
> is upheld by the availability of mirrored dumps. But the media tarballs
> have not been updated since 2012. Now in effect the WMF is explicitly
> saying that no mirrors are allowed for media, unless by gracious
> exemption to individual requesters.
>
> Best,
> Federico
On 08/04/25 18:08, Giuseppe Lavagetto wrote:
> I’ve updated our Robot Policy[0], which was vastly outdated, the main
> revision being from 2009.
Thanks for working on an update! It seems there was a misalignment of
expectations, which is in itself a problem to fix.
>
> The new policy isn’t more restrictive than the older one for general
> crawling of the site or the API; on the contrary we allow higher limits
> than previously stated.
I find this hard to believe, considering this new sentence for
upload.wikimedia.org: «Always keep a total concurrency of at most 2, and
limit your total download speed to 25 Mbps (as measured over 10 second
intervals).»
This is a ridiculously low limit. It's a speed which is easy to breach
in casual browsing of Wikimedia Commons categories, let alone with any
kind of media-related bots.
At the suggested speed, it would take over 150 years for a person to
download Wikimedia Commons files alone.
Needless to say, I breached such a threshold all the time when I
compiled the https://archive.org/details/wikimediacommons collection. I
typically aimed to saturate my upload bandwidth at all times when
updating it, so I must have tried to download at about 100 Mbps, and it
still took me months. (I used to run those scripts from my home in
Milan, downloading the files to an external HDD. I stopped updating the
collection after 2016 in part because I don't have FTTH in Helsinki, and
the daily downloads were far too big for any storage in Wikimedia Cloud.)
I appreciate that some exceptions for Wikimedia Cloud bots were added
after the discussion at
https://phabricator.wikimedia.org/T391020#10716478 , but the fact
remains that this comes off as a big change.
On 09/04/25 19:10, AntiCompositeNumber wrote:
> I'll just note that both API:Etiquette and the Robot Policy have been
> incorporated by reference into the Terms of Use:
> https://foundation.wikimedia.org/wiki/Policy:Terms_of_Use/en#12._API_Terms
>
> Undiscussed changes to the Terms of Use should be avoided.
This is a good point.
There are parts of the terms of use which assume the [[m:Right to fork]]
is upheld by the availability of mirrored dumps. But the media tarballs
have not been updated since 2012. Now in effect the WMF is explicitly
saying that no mirrors are allowed for media, unless by gracious
exemption to individual requesters.
Best,
Federico