Great, if you remove ETL from the way, the task is pretty simple:
* Get a CSV dump of your table(s).
* Use google refine to draw rules for cleaning up the data - I can
provide you some, based on my very old database dump, and we might work
that out together through Skype.
* Reload the table(s) with the exported (fixed) CSV.
It's pretty easy!
The great thing about it is that you get a record of "cleaning actions",
which you can replay later - e.g., if you get a more updated database
dump which didn't get cleaned up.
--
Nuno Tavares
Wikimedia Portugal
Imagine um mundo onde é dada a qualquer pessoa a possibilidade de ter
livre acesso ao somatório de todo o conhecimento humano. É isso o que
estamos a fazer.
Participe também:
Ok, thanks, I think I've got it. It will be better
to clean the
municipality field moving any comment or multiple municipalities to
another field more suitable for notes. The pipped links and sorting keys
can be generated with an auxiliary template-table that can be useful
also for uniformity in other types of lists.
Obtaining the source has been a PITA, so regeneration is not an issue.
We will think later how to check for updates.
Vicenç
Date: Thu, 25 Aug 2011 12:33:35 +0100
From: nuno.tavares(a)wikimedia.pt
To: wikilovesmonuments(a)lists.wikimedia.org
Subject: Re: [Wiki Loves Monuments] Spanish database [Was: European
stats and
other cool tools]
Dear Vicenç [I believe all others should ignore this message],
Yes, I've just noticed, e.g., "[[Ogassa]] i [[Ribes de Freser]]". My
(professional) opinion is that this is a huge mistake, since that field
represents 2 values, which is a clear violation of standard database
rules. Having this in mind, I'd suggest splitting the object in two.
I found a pattern for "multiple municipalities in the same object",
which seems to be "]] i [[", and there seems to be only 6 of them. You
can just wipe out one of them from the field, or manually duplicate the
record them for splitting - the latter being more complex, think of a
good ID disambiguation though. Regarding link disambiguations, you can
just add a column for that - in a "working table", as they are of no use
for a listing like this (alas, that kind of data is only relevant if you
plan to regenerate the lists).
We had similar problems - we even had descriptions names in wiki syntax
already, our lists date back to 2006. As I wanted to preserve the links
and original "names" (designations), I've done some transformations on
the data before it gets on the list - this kind of approach was actually
the reason to have our tools/plist [1] - notice multiple links near the
end of each description, and notice that, if you export to wiki format,
you'll find municipalities disambiguated as well.
[1]
http://www.wikilovesmonuments.org.pt/tools/plist/?region=Braga
That's why you currently can "regenerate" the lists with their original
descriptions and links.
This is, however, a more complex approach which I won't have time to
assist you, but I may point you directions if you know ETL.
--
Nuno Tavares
Wikimedia Portugal
http://www.wikimedia.pt
Imagine um mundo onde é dada a qualquer pessoa a possibilidade de ter
livre acesso ao somatório de todo o conhecimento humano. É isso o que
estamos a fazer.
Participe também:
http://www.wikimedia.pt
Vicenç Riullop escreveu:
> A list may include different municipalities so links, disambiguations
> and sorting keys are included in the municipality field. Morever, some
> large monuments may be in the border of two municipalities. Any
> suggestion how to handdle this?
>
> Vicenç
>
>> Date: Thu, 25 Aug 2011 03:40:12 +0100
>> From: nuno.tavares(a)wikimedia.pt
>> To: wikilovesmonuments(a)lists.wikimedia.org
>> Subject: Re: [Wiki Loves Monuments] Spanish database [Was: European
> stats and other cool tools]
>>
>> Dear spanish fellows:
>>
>> I've notice some quality problems in the municipality names of the
>> objects on es/es-ct/es-vc lists. Sometimes they are [[like this]],
>> others "like this" and also "{{ordre|like this}}". In my
opinion, this
>> is half-way for problems in managing the lists.
>>
>>
>
http://www.wikilovesmonuments.org.pt/tools/stats-eu/index2.php?format=table…
>>
>>
>> It's pretty easy to clean that up, but I'm still not sure if Maarten is
>> loading the database from wiki lists (thus overriding any fixes we
>> perform). Get in touch with me if you need help.
>>
>> PS: I don't have much opportunities to hang around IRC, but feel
free
to
>> ping me in Skype.
>>
>> PS2: Andorra and France seem to suffer the same symptoms.
>>
>> --
>> Nuno Tavares
>> Wikimedia Portugal
>>
http://www.wikimedia.pt
>>
>> Imagine um mundo onde é dada a qualquer pessoa a possibilidade de ter
>> livre acesso ao somatório de todo o conhecimento humano. É isso o que
>> estamos a fazer.
>>
>> Participe também:
http://www.wikimedia.pt
>>
>> Em 25-08-2011 02:49, Nuno Tavares escreveu:
>> > Hey again,
>> >
>> > Didn't I say I love statistics? :-)
>> >
>> > You can now have your "heat" list adding the option
"color=heatpct":
>> >
>> >
>
http://www.wikilovesmonuments.org.pt/tools/stats-eu/index2.php?format=table…
>
>
>
_______________________________________________
Wiki Loves Monuments mailing list
WikiLovesMonuments(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments
http://www.wikilovesmonuments.eu
------------------------------------------------------------------------
_______________________________________________
Wiki Loves Monuments mailing list
WikiLovesMonuments(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments
http://www.wikilovesmonuments.eu
_______________________________________________
Wiki Loves Monuments mailing list
WikiLovesMonuments(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments
http://www.wikilovesmonuments.eu
------------------------------------------------------------------------
_______________________________________________
Wiki Loves Monuments mailing list
WikiLovesMonuments(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments
http://www.wikilovesmonuments.eu