Hello everyone,
I am Quinn (User:SuperGrey) from Chinese Wikisource (zh.wikisource.org). I am writing to request advice and precedent from the wider Wikisource community and the Wikimedia Foundation regarding a proposed large-scale import of Chinese court judgments from the national database known as China Judgments Online (中国裁判文书网, often abbreviated as CJO).
I would like to begin with some background, because many non-Chinese Wikimedia contributors may not be aware of how significant CJO has been for judicial transparency in China and how sharply access to it has been reduced in recent years.
China Judgments Online was launched in 2014 by the Supreme People’s Court (SPC) as a major transparency initiative. For nearly a decade, courts across the country uploaded tens of millions of decisions, creating what was widely regarded as one of the world’s largest publicly accessible judicial databases. At its peak, CJO hosted over 140 million documents and received tens of billions of page views. Researchers inside and outside China used the site extensively to study judicial behavior, local governance, criminal justice, and institutional changes.
However, since around 2021, and especially in 2023–2024, the Chinese government has significantly reversed this openness. Multiple independent investigations and media reports have documented the systematic removal of previously public judgments, particularly those that reflect poorly on local authorities, expose procedural misconduct, involve politically sensitive issues, or contradict preferred political narratives. In late 2023, leaked SPC documents revealed instructions to migrate judgments into a new internal-only database accessible solely within the court system, while sharply reducing what remains publicly visible. Studies have shown that vast numbers of cases have already disappeared from public view. Major news organizations such as MIT Technology Review, Radio Free Asia, the South China Morning Post, and Reuters have all reported on this rollback of judicial transparency:
– https://www.technologyreview.com/2023/12/20/1085741/china-judgements-online-...
– https://www.rfa.org/english/news/china/china-court-records-12142023132626.ht...
– https://www.scmp.com/news/china/politics/article/3246067/china-cut-back-acce...
– https://www.reuters.com/world/china/china-vows-judicial-disclosure-after-out...
For our purposes, the important point is this: CJO has removed or restricted access to large portions of its historical archive, including documents that were originally public, legally non-copyrightable under Chinese law, and crucial for understanding the functioning of China’s legal system. Many judgments that were once easily verifiable on the official site can no longer be checked against their original source. These documents are at risk of disappearing entirely from public access.
An independent archiving project, caseopen.org, has preserved a large HTML snapshot of CJO’s judgments spanning 2013 to October 2024. The maintainers of caseopen.org have donated this dataset to Chinese Wikisource. The files capture the “online version” as it originally appeared on CJO, including formatting and errors, and therefore represent a unique opportunity to preserve a historical record of China’s legal system prior to this wave of censorship and delisting. In practical terms, this may be the last comprehensive public snapshot that will ever exist.
On Chinese Wikisource, I have proposed importing this dataset through a bot (User:SuperGrey-bot). The local discussion, including technical details and code links, is here (in Chinese): https://zh.wikisource.org/wiki/Wikisource:%E6%9C%BA%E5%99%A8%E4%BA%BA#User:S...
The scale of the corpus is extremely large: tens of millions of judgments, potentially more if we include non-judgment document types such as 裁定书 (ruling document) and 通知书 (notification document). We are planning a staged import, beginning with small test batches, then individual months, and only later the full corpus, once the community settles questions about formatting, titling, metadata, and scope.
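As a rough illustration of the staged plan described above (a small test batch first, then one month at a time), here is a minimal Python sketch. The record shape, a `(doc_id, "YYYY-MM-DD")` tuple, and the function name `plan_stages` are assumptions made for the example; they are not taken from the caseopen.org dataset or from the bot's actual code.

```python
from collections import defaultdict


def plan_stages(records, test_size=100):
    """Split records into one small test batch plus per-month batches.

    `records` is an iterable of (doc_id, "YYYY-MM-DD") tuples; this shape
    is hypothetical and only serves the illustration.
    """
    # Sort by date so the test batch and the monthly batches are reproducible.
    ordered = sorted(records, key=lambda record: record[1])

    # Stage 1: a small test batch that the community can review by hand.
    test_batch = [doc_id for doc_id, _ in ordered[:test_size]]

    # Stage 2: everything else, grouped by the "YYYY-MM" prefix of the date.
    by_month = defaultdict(list)
    for doc_id, date in ordered[test_size:]:
        by_month[date[:7]].append(doc_id)

    return test_batch, dict(sorted(by_month.items()))
```

Each monthly batch can then be imported, reviewed, and rolled back on its own before the next one starts.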
Because this project includes politically sensitive material and an unusual archival value, and because the scale is unprecedented for our language Wikisource, I would greatly appreciate advice and precedent from the international community. This is not only a technical or organizational task; it is also a preservation effort. We are attempting to safeguard public domain legal documents that have been systematically removed from public access. Wikisource may be one of the last neutral, open, global platforms capable of preserving this historical record.
Given the potential size of the import, I would also appreciate input from the Wikimedia Foundation on any operational considerations. A multi-million–page import may affect storage, dumps, CirrusSearch indexing, and overall site performance. Before proceeding beyond small test batches, I would like to understand whether such an import is feasible within the current technical limits of Chinese Wikisource, and whether coordination with SRE or Cloud Services is recommended.
Specifically, I would like to ask for input on the following areas:
1. Scope and suitability
Have other Wikisources hosted similarly massive, uniform corpora of government or legal documents? How did you determine whether they fit the mission of Wikisource? Were there concerns about overwhelming the project or changing its character?
2. Verifiability and provenance
In our case, the source is an independent mirror of a government website that is now selectively removing documents. While Wikimedia projects have long preserved public domain government documents after originals were taken down or censored, I am unsure how Wikisource communities have handled this scenario in practice. Are mirrored datasets acceptable when the original public source has been altered or removed? How should we document provenance and authenticity for future readers?
3. Organizational and technical considerations
If we proceed, how should we structure this corpus so the project remains usable? Are there recommended practices for:
– titling, metadata, and Wikidata integration for legal documents,
– organizing millions of pages so they do not overwhelm categories and search,
– mitigating strain on job queues, dumps, and indexing,
– making future partial deletions or corrections feasible if political pressure or legal demands (e.g., DMCA takedown notices) ever arise?
4. Political and archival importance
Wikisource has historically preserved documents at risk of censorship or disappearance, whether due to authoritarian restrictions or institutional neglect. Do other communities have experience with politically sensitive archival projects where the preservation value itself was a central motivation?
At present, Chinese Wikisource is still deliberating basic formatting and policy questions. No large imports will be performed until a local consensus is clear. Although we are working from the independent caseopen.org snapshot rather than relying on ongoing availability of the official CJO site, the broader context is that public access to Chinese judicial decisions has already been substantially reduced in recent years. Because our dataset preserves a historical record that may not remain accessible through official channels, we believe this is an appropriate moment to seek broader input and learn from other Wikisource communities with similar archival experiences.
Thank you very much for your time, advice, and any examples or concerns you can share. Even understanding which questions we should be asking would be extremely helpful.
Best regards, Quinn Gao (User:SuperGrey) https://meta.wikimedia.org/wiki/User:SuperGrey
Hoi, I wonder if this information is available at archive.org. If it is, having it at wikisource is somewhat redundant. Thanks, GerardM
Hello Quinn / SuperGrey
Here is my advice -
1. Select a few documents, either 1, 5, or 10, which are extra interesting and ideally which you can relate to Wikipedia articles.
2. Upload those documents to Wikimedia Commons.
3. Mirror / format into Wikisource.
4. Figure out how to do court citations. For English language, this is super hard. I have no experience with Chinese court citations. Get your citation data into Wikidata as part of https://meta.wikimedia.org/wiki/WikiCite
5. Interconnect everything - Wikipedia, Wikimedia Commons, Wikidata, and Wikisource, in usual wiki ways.
6. Now come back and ask here again about doing this for 1000 more documents.
You said you have tens of millions. The Wikimedia platform is not an exhaustive archive, and we probably only want documents which can be of general interest, but the Wikimedia platform is a good place for you to sort out your process and showcase a select set of the important ones, whether that is tens, hundreds, or thousands of them, or whatever is interesting. Also, in the Wikimedia platform you will be able to develop a general-use data model for organizing these, for if and when you or anyone else finds or creates an appropriate complete archive.
An appropriate on-wiki place to do your data modeling discussion is https://meta.wikimedia.org/wiki/Talk:WikiCite . There is an active WikiCite community and your project is a sort of document metadata sorting project, but we have never done legal documents there, nor do we have much Chinese language document curation.
I think you have an interesting project and I would like to see you get at least 1 document into the Wikimedia platform as a demonstration.
yours
On 06/12/25 13:11, mygreycooper@gmail.com wrote:
I would like to begin with some background, because many non-Chinese Wikimedia contributors may not be aware of how significant CJO has been for judicial transparency in China and how sharply access to it has been reduced in recent years.
Thanks for this context, it's super interesting!
For our purposes, the important point is this: CJO has removed or restricted access to large portions of its historical archive, including documents that were originally public, legally non-copyrightable under Chinese law, and crucial for understanding the functioning of China’s legal system. Many judgments that were once easily verifiable on the official site can no longer be checked against their original source. These documents are at risk of disappearing entirely from public access.
How strong is the presumption of copyright-ineligibility? What's the legal source for it and could it change in the future? (I'm clueless about the hierarchy of sources of law in China, sorry.)
Have other Wikisources hosted similarly massive, uniform corpora of government or legal documents? How did you determine whether they fit the mission of Wikisource? Were there concerns about overwhelming the project or changing its character?
Nothing as massive, but Italian Wikisource hosts court rulings, usually when they are especially newsworthy. In those cases (think powerful politicians) there was always someone interested in getting them removed, but I don't recall whether there were official requests for redactions. However, we very intentionally do not copy all court rulings from official court databases, because they are known to be riddled with personal data. JurisWiki, a project by Simone Aliprandi, an experienced Italian lawyer and free knowledge advocate, had to shut down over such issues after importing "just" 400k court rulings.
In our case, the source is an independent mirror of a government website that is now selectively removing documents. While Wikimedia projects have long preserved public domain government documents after originals were taken down or censored, I am unsure how Wikisource communities have handled this scenario in practice. Are mirrored datasets acceptable when the original public source has been altered or removed? How should we document provenance and authenticity for future readers?
I would say that relying on a mirror is *better* than using an official source, because you can have an additional layer of vetting, just like we do with PGDP.
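One way to document provenance for a mirrored file is to store a cryptographic digest of the snapshot next to each imported page, so that future readers can check a page against the donated dataset even when the original is gone. The sketch below is only an illustration: the field names and the function `provenance_record` are hypothetical, and nothing here reflects how caseopen.org or the import bot actually stores metadata.

```python
import hashlib


def provenance_record(html_bytes, original_url, snapshot_name, retrieved):
    """Build a small provenance record for one snapshot file.

    All field names are illustrative assumptions, not an existing schema.
    """
    return {
        # SHA-256 of the exact bytes in the snapshot, before any reformatting.
        "sha256": hashlib.sha256(html_bytes).hexdigest(),
        "size_bytes": len(html_bytes),
        # Where the document was originally published, if known.
        "original_url": original_url,
        # Which mirrored dataset the file came from.
        "snapshot": snapshot_name,
        # Retrieval date as reported by the mirror (free-form string here).
        "retrieved": retrieved,
    }
```

The digest has to be computed over the untouched snapshot bytes; a hash of the wikitext after conversion would not let anyone verify the page against the mirror.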
Are you in contact with the people behind that database? Are they going to be responsive when you find personal data that was not redacted? (This is a "when", not an "if". It's certain to happen.)
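A pre-import scan can catch some of the unredacted personal data mentioned above. The following Python sketch flags strings that look like 18-character resident identity numbers and pass the standard MOD 11-2 check character test. It is a narrow illustration under stated assumptions: the function names are hypothetical, and names, addresses, phone numbers, and older 15-digit numbers are not covered, so it would complement human review rather than replace it.

```python
import re

# 17 digits followed by a check character (a digit or X), not embedded in a
# longer run of digits.
_CANDIDATE = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9])")

# Position weights and check characters of the MOD 11-2 scheme used by
# 18-character resident identity numbers.
_WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
_CHECK_CHARS = "10X98765432"


def has_valid_check_char(candidate):
    """Return True if the 18th character matches the weighted checksum."""
    total = sum(int(digit) * weight
                for digit, weight in zip(candidate[:17], _WEIGHTS))
    return _CHECK_CHARS[total % 11] == candidate[17].upper()


def find_id_numbers(text):
    """Return all substrings of `text` that look like valid identity numbers."""
    return [match.group() for match in _CANDIDATE.finditer(text)
            if has_valid_check_char(match.group())]
```

Pages with at least one hit could be held back from the batch and listed for manual review instead of being imported automatically.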
What's the added benefit that a Wikisource copy would bring to that project? Find out, and focus on that. (Does it really need a comprehensive copy?)
If we proceed, how should we structure this corpus so the project remains usable? Are there recommended practices for: – titling, metadata, and Wikidata integration for legal documents,
Wikidata should be immediately ruled out as it cannot stand this volume of documents.
As for titles, categories etc., you should probably talk with Chinese practitioners who can tell you how people usually search these documents.
If, say, the rulings are organised in tidy partitions of 100 different provinces (I'm inventing) and people usually search within each of them, you can use those partitions as title prefixes and it will be easy to disambiguate.
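To make the prefix idea concrete, here is a minimal Python sketch that builds a page title from a partition name and a case number. The `partition/case number` layout and the function name `make_title` are assumptions for illustration only; the real scheme is up to the local community. The length check reflects the MediaWiki limit of 255 bytes per page title, which matters for Chinese text because most characters take three bytes in UTF-8.

```python
MAX_TITLE_BYTES = 255  # MediaWiki limit on page titles, counted in bytes


def make_title(partition, case_number):
    """Build a page title of the form "<partition>/<case number>".

    The layout is a hypothetical example, not an agreed naming convention.
    """
    title = f"{partition}/{case_number}"
    size = len(title.encode("utf-8"))
    if size > MAX_TITLE_BYTES:
        raise ValueError(f"title is {size} bytes, limit is {MAX_TITLE_BYTES}")
    return title
```

For example, `make_title("北京市", "（2019）京01民终1234号")` would give a title under the `北京市/` prefix, so that a prefix search lists one partition at a time.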
– organizing millions of pages so they do not overwhelm categories and search, – mitigating strain on job queues, dumps, and indexing,
This part I would say don't worry too much about, as WMF will let you know if it becomes a problem. Maybe don't come up with exceedingly esoteric templates and don't rely on DynamicPageList or other extensions known to be slow.
- Political and archival importance
Wikisource has historically preserved documents at risk of censorship or disappearance, whether due to authoritarian restrictions or institutional neglect. Do other communities have experience with politically sensitive archival projects where the preservation value itself was a central motivation?
Yes, see above, but not at this scale.
Best, Federico