Great, if you remove ETL from the way, the task is pretty simple:
* Get a CSV dump of your table(s).
* Use google refine to draw rules for cleaning up the data - I can
provide you some, based on my very old database dump, and we might work
that out together through Skype.
* Reload the table(s) with the exported (fixed) CSV.
The great thing about it is that you get a record of "cleaning actions",
which you can replay later - e.g., if you get a more updated database
dump which didn't get cleaned up.
--
Nuno Tavares
Wikimedia Portugal
http://www.wikimedia.pt
Imagine um mundo onde é dada a qualquer pessoa a possibilidade de ter
livre acesso ao somatório de todo o conhecimento humano. É isso o que
estamos a fazer.
Participe também:
http://www.wikimedia.pt
Vicenç Riullop escreveu:
> Ok, thanks, I think I've got it. It will be better to clean the
> municipality field moving any comment or multiple municipalities to
> another field more suitable for notes. The pipped links and sorting keys
> can be generated with an auxiliary template-table that can be useful
> also for uniformity in other types of lists.
>
> Obtaining the source has been a PITA, so regeneration is not an issue.
> We will think later how to check for updates.
>
> Vicenç
>
>> Date: Thu, 25 Aug 2011 12:33:35 +0100
>> From: nuno.tavares@wikimedia.pt
>> To: wikilovesmonuments@lists.wikimedia.org
>> Subject: Re: [Wiki Loves Monuments] Spanish database [Was: European
> stats and other cool tools]
>>
>> Dear Vicenç [I believe all others should ignore this message],
>>
>>
>> Yes, I've just noticed, e.g., "[[Ogassa]] i [[Ribes de Freser]]". My
>> (professional) opinion is that this is a huge mistake, since that field
>> represents 2 values, which is a clear violation of standard database
>> rules. Having this in mind, I'd suggest splitting the object in two.
>>
>> I found a pattern for "multiple municipalities in the same object",
>> which seems to be "]] i [[", and there seems to be only 6 of them. You
>> can just wipe out one of them from the field, or manually duplicate the
>> record them for splitting - the latter being more complex, think of a
>> good ID disambiguation though. Regarding link disambiguations, you can
>> just add a column for that - in a "working table", as they are of no use
>> for a listing like this (alas, that kind of data is only relevant if you
>> plan to regenerate the lists).
>>
>>
>> We had similar problems - we even had descriptions names in wiki syntax
>> already, our lists date back to 2006. As I wanted to preserve the links
>> and original "names" (designations), I've done some transformations on
>> the data before it gets on the list - this kind of approach was actually
>> the reason to have our tools/plist [1] - notice multiple links near the
>> end of each description, and notice that, if you export to wiki format,
>> you'll find municipalities disambiguated as well.
>>
>> [1]
http://www.wikilovesmonuments.org.pt/tools/plist/?region=Braga
>>
>> That's why you currently can "regenerate" the lists with their original
>> descriptions and links.
>>
>> This is, however, a more complex approach which I won't have time to
>> assist you, but I may point you directions if you know ETL.
>>
>>
>> --
>> Nuno Tavares
>> Wikimedia Portugal
>>
http://www.wikimedia.pt
>>
>> Imagine um mundo onde é dada a qualquer pessoa a possibilidade de ter
>> livre acesso ao somatório de todo o conhecimento humano. É isso o que
>> estamos a fazer.
>>
>> Participe também:
http://www.wikimedia.pt
>>
>>
>>
>> Vicenç Riullop escreveu:
>> > A list may include different municipalities so links, disambiguations
>> > and sorting keys are included in the municipality field. Morever, some
>> > large monuments may be in the border of two municipalities. Any
>> > suggestion how to handdle this?
>> >
>> > Vicenç
>> >
>> >> Date: Thu, 25 Aug 2011 03:40:12 +0100
>> >> From: nuno.tavares@wikimedia.pt
>> >> To: wikilovesmonuments@lists.wikimedia.org
>> >> Subject: Re: [Wiki Loves Monuments] Spanish database [Was: European
>> > stats and other cool tools]
>> >>
>> >> Dear spanish fellows:
>> >>
>> >> I've notice some quality problems in the municipality names of the
>> >> objects on es/es-ct/es-vc lists. Sometimes they are [[like this]],
>> >> others "like this" and also "{{ordre|like this}}". In my opinion, this
>> >> is half-way for problems in managing the lists.
>> >>
>> >>
>> >
>
http://www.wikilovesmonuments.org.pt/tools/stats-eu/index2.php?format=table&...
>> >>
>> >>
>> >> It's pretty easy to clean that up, but I'm still not sure if Maarten is
>> >> loading the database from wiki lists (thus overriding any fixes we
>> >> perform). Get in touch with me if you need help.
>> >>
>> >> PS: I don't have much opportunities to hang around IRC, but feel
> free to
>> >> ping me in Skype.
>> >>
>> >> PS2: Andorra and France seem to suffer the same symptoms.
>> >>
>> >> --
>> >> Nuno Tavares
>> >> Wikimedia Portugal
>> >>
http://www.wikimedia.pt
>> >>
>> >> Imagine um mundo onde é dada a qualquer pessoa a possibilidade de ter
>> >> livre acesso ao somatório de todo o conhecimento humano. É isso o que
>> >> estamos a fazer.
>> >>
>> >> Participe também:
http://www.wikimedia.pt
>> >>
>> >> Em 25-08-2011 02:49, Nuno Tavares escreveu:
>> >> > Hey again,
>> >> >
>> >> > Didn't I say I love statistics? :-)
>> >> >
>> >> > You can now have your "heat" list adding the option "color=heatpct":
>> >> >
>> >> >
>> >
>
http://www.wikilovesmonuments.org.pt/tools/stats-eu/index2.php?format=table&...
>> >> >
>> >> >
>> >> >
>> >>
>> >> _______________________________________________
>> >> Wiki Loves Monuments mailing list
>> >> WikiLovesMonuments@lists.wikimedia.org
>> >>
https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments
>> >>
http://www.wikilovesmonuments.eu
>> >
>> >
>> > ------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > Wiki Loves Monuments mailing list
>> > WikiLovesMonuments@lists.wikimedia.org
>> >
https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments
>> >
http://www.wikilovesmonuments.eu
>>
>> _______________________________________________
>> Wiki Loves Monuments mailing list
>> WikiLovesMonuments@lists.wikimedia.org
>>
https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments
>>
http://www.wikilovesmonuments.eu
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Wiki Loves Monuments mailing list
> WikiLovesMonuments@lists.wikimedia.org
>
https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments
>
http://www.wikilovesmonuments.eu