On Fri, 21 Sep 2012 10:04:50 -0700, Gabriel Wicke gwicke@wikimedia.org wrote:
On 09/20/2012 07:40 PM, MZMcBride wrote:
Scanning dumps (or really dealing with them in any form) is pretty awful. There's been some brainstorming in the past for how to set up a system where users (or operators) could run arbitrary regular expressions on all of the current wikitext regularly, but such a setup requires _a lot_ of everything involved (disk space, RAM, bandwidth, processing power, etc.). Maybe one day Labs will have something like this.
We have a dump grepper tool in the Parsoid codebase (see js/tests/dumpGrepper.js) that takes about 25 minutes to grep an XML dump of the English Wikipedia. The memory involved is minimal and constant; the thing is mostly CPU-bound.
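For anyone curious about the general shape of such a tool, here is a minimal sketch of the same idea (not the actual js/tests/dumpGrepper.js code). It assumes an uncompressed pages-articles XML dump in which <title> elements sit on their own lines, streams it line by line, and prints the title of every page whose <text> matches a regex. Only the current line and title are kept in memory, which is why usage stays small and constant and the work ends up CPU-bound:

// dump-grepper-sketch.js (hypothetical file name)
'use strict';
const fs = require('fs');
const readline = require('readline');

// Stream the dump, calling onMatch(title) once per matching page and
// onDone() when the whole dump has been read.
function grepDump(dumpPath, regex, onMatch, onDone) {
    const rl = readline.createInterface({
        input: fs.createReadStream(dumpPath),
        crlfDelay: Infinity,
    });
    let title = null;   // title of the page we are currently inside
    let inText = false; // inside a <text> element?
    let hit = false;    // already reported this page?

    rl.on('line', (line) => {
        const m = line.match(/<title>(.*?)<\/title>/);
        if (m) {
            title = m[1];
            hit = false;
            return;
        }
        if (line.includes('<text')) { inText = true; }
        if (inText && !hit && regex.test(line)) {
            onMatch(title);
            hit = true;
        }
        if (line.includes('</text>')) { inText = false; }
    });
    rl.on('close', () => { if (onDone) { onDone(); } });
}

// CLI usage, e.g.:
//   node dump-grepper-sketch.js enwiki-pages-articles.xml -moz-border-radius
if (require.main === module) {
    const [, , dumpPath, pattern] = process.argv;
    grepDump(dumpPath, new RegExp(pattern), (t) => console.log(t));
}

module.exports = grepDump;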
It should not be hard to hook this up to a web service. Our parser web service in js/api could serve as a template for that.
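As a rough illustration of what that could look like (in the spirit of, but not copied from, the js/api service), a tiny HTTP wrapper around the hypothetical grepDump() sketched above could stream matching titles back as plain text; the dump path and port here are made-up placeholders:

'use strict';
const http = require('http');
const { URL } = require('url');
const grepDump = require('./dump-grepper-sketch');

const DUMP_PATH = '/dumps/enwiki-pages-articles.xml'; // placeholder

http.createServer((req, res) => {
    const pattern = new URL(req.url, 'http://localhost').searchParams.get('regex');
    let regex;
    try {
        regex = new RegExp(pattern);
    } catch (e) {
        regex = null;
    }
    if (!pattern || !regex) {
        res.writeHead(400, { 'Content-Type': 'text/plain' });
        res.end('missing or invalid ?regex= parameter\n');
        return;
    }
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    // Stream each matching title as soon as the grep finds it.
    grepDump(DUMP_PATH, regex,
        (title) => res.write(title + '\n'),
        () => res.end());
}).listen(8000);

You could then query it with something like: curl 'http://localhost:8000/?regex=-moz-border-radius'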
Gabriel
Another option would be to start indexing tag/attr/property usage. I've thought of doing this before. Sometimes you want to clean up the use of certain tags. Other times you want to stop using a parser function or tag hook from an extension in your pages. Other times your wiki is full of -moz-border-radius properties added by people who never quite realized that border-radius is a standardized property and that the other forms need to be included as well.
So aggregating this information into parser output properties that we can display on a special page would make these usages easier for users to track down.
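To sketch what such an index would collect (just a standalone illustration, not how it would actually be wired into the parser, where it would presumably run from a parser hook and be stored as page properties):

// Tally tags, parser functions, and inline CSS properties used in a
// page's wikitext. All names here are illustrative.
function collectUsage(wikitext) {
    const usage = { tags: {}, parserFunctions: {}, cssProperties: {} };
    const bump = (bucket, key) => { bucket[key] = (bucket[key] || 0) + 1; };

    // HTML-ish tags and extension tags, e.g. <center>, <ref>, <font>
    for (const m of wikitext.matchAll(/<([a-zA-Z][\w-]*)\b/g)) {
        bump(usage.tags, m[1].toLowerCase());
    }
    // Parser functions, e.g. {{#if: ...}}, {{#expr: ...}}
    for (const m of wikitext.matchAll(/\{\{\s*#(\w+):/g)) {
        bump(usage.parserFunctions, '#' + m[1].toLowerCase());
    }
    // CSS properties inside style="..." attributes
    for (const m of wikitext.matchAll(/style\s*=\s*"([^"]*)"/gi)) {
        for (const decl of m[1].split(';')) {
            const prop = decl.split(':')[0].trim().toLowerCase();
            if (prop) { bump(usage.cssProperties, prop); }
        }
    }
    return usage;
}

// collectUsage('<div style="-moz-border-radius: 5px">{{#if:x|y}}</div>')
// -> { tags: { div: 1 }, parserFunctions: { '#if': 1 },
//      cssProperties: { '-moz-border-radius': 1 } }

A special page would then only need to sum these per-page counts across the wiki.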
...of course we could always opt for the easier [[Category:Pages using deprecated WikiText]] built-in maintenance category.
Another thing I've wanted to do is build an on-wiki mass-replacement tool: one that properly uses the job queue, has a good UI, and offers some extra features. That could help clean up smaller wikis too.
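The per-page step is the easy part. A crude standalone sketch of it via the action API (not the job-queue-based design I have in mind, and it assumes you are already authenticated, or the wiki allows the edit; the API URL is a placeholder):

'use strict';
// Replace a pattern in one page via the MediaWiki action API. A real
// tool would queue one such job per affected page instead of doing
// this from a client script.
const API = 'https://example.org/w/api.php'; // placeholder wiki

async function replaceInPage(title, regex, replacement, summary) {
    // Fetch the current wikitext of the page.
    const q = new URLSearchParams({
        action: 'query', prop: 'revisions', rvprop: 'content',
        rvslots: 'main', titles: title, format: 'json', formatversion: '2',
    });
    const page = (await (await fetch(`${API}?${q}`)).json()).query.pages[0];
    const text = page.revisions[0].slots.main.content;

    const newText = text.replace(regex, replacement);
    if (newText === text) { return; } // nothing to do

    // Fetch a CSRF token and save the edit.
    const t = new URLSearchParams({ action: 'query', meta: 'tokens', format: 'json' });
    const token = (await (await fetch(`${API}?${t}`)).json()).query.tokens.csrftoken;

    await fetch(API, {
        method: 'POST',
        body: new URLSearchParams({
            action: 'edit', title, text: newText, summary, token, format: 'json',
        }),
    });
}

// e.g. replaceInPage('Sandbox', /-moz-border-radius/g, 'border-radius',
//                    'replace deprecated CSS property');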