Basically I am with Sabine and support the idea.
Yet I want to warn against doing it now and doing it quickly so as
to avoid certain pitfalls.
I suggest instead to develop and train bots, data, and algorithms
with the test Wikipedia only, for the time being, where specific
situations can easily be (re)created without risk of havoc.
What follows are details and reasons; you can safely stop reading
here if you are not interested.
I've been mass-inserting data into the Ripuarian test Wikipedia in
a semi-automated way, data which I compiled from several small
database-like collections, such as:
- names and ISO codes of languages that have Wikipedias,
- dates and mottos of the carnival parades in the city of Cologne
over the last 185 years,
- redirects for dialectal and spelling variants.
So I have some (limited) experience.
Pitfalls to be avoided.
If we have already inserted data into a WP, and later a refined
version of that data becomes available, we want to pass that on
to the WP. This becomes complicated when an article already
exists for a record. Thus we may strategically choose to export
data as late as possible, in as complete a state as possible,
when general additions and amendments have become unlikely and
the data structure is stable.
We can safely replace articles when we can determine that they
have been unaltered since our own last update; that is, we need to
be able to look at the version history for those cases.
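The version-history check above could be sketched like this; the bot
account name and the shape of the revision list are assumptions for
illustration, not any particular bot framework's API:

```python
# Sketch (hypothetical helper): decide whether an article may be safely
# overwritten by checking who made its most recent revision.

BOT_USER = "DataImportBot"  # assumed bot account name

def safe_to_replace(revisions, bot_user=BOT_USER):
    """revisions: newest-first list of (username, timestamp) tuples.

    Safe only if the most recent edit was made by our own bot, i.e.
    nobody has touched the article since our last update.
    """
    if not revisions:
        return True  # page does not exist yet; creating it is always safe
    latest_user, _ = revisions[0]
    return latest_user == bot_user

# Usage: a history as it might come back from a revision query
history = [("DataImportBot", "2006-05-01"), ("SomeEditor", "2006-04-02")]
print(safe_to_replace(history))  # latest edit is ours -> True
```

A real bot would of course fetch the history from the wiki; the point
is only that the decision itself is a one-line comparison.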
When an article has been conventionally updated by an editor, that
may mean they altered data which we originally supplied, and that
we have to update our source before we may re-export data to
WPs again. It is possible that an update made in one WP should
influence others as well, though this is not necessarily so.
When we say we supply only some specific data to an article, e.g.
an infobox, then we can re-read the infobox, and if it has not
been altered, we can rewrite it for an update.
We can also use such infoboxes to import new data from WPs when
they have been altered, e.g. when someone died. We should, however,
have some protection against collecting errors, garbage, and vandal
drivel.
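The round trip described above could work roughly as follows. The
comment markers are a hypothetical convention of my own, not an
existing standard, and a real bot would need the vandalism checks
mentioned above on top of this:

```python
# Sketch: bot-maintained infoboxes wrapped in (assumed) comment markers,
#   <!-- BOT-DATA-START --> ... <!-- BOT-DATA-END -->
# so the bot can find its own block and tell whether an editor changed it.
import hashlib
import re

START, END = "<!-- BOT-DATA-START -->", "<!-- BOT-DATA-END -->"
MARKED = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.S)

def extract_infobox(wikitext):
    """Return the bot-maintained block, or None if the markers are missing."""
    m = MARKED.search(wikitext)
    return m.group(1) if m else None

def fingerprint(block):
    """Stable hash of the block, stored by the bot at export time."""
    return hashlib.sha1(block.encode("utf-8")).hexdigest()

def classify(wikitext, stored_hash):
    """'update' -> untouched, safe to rewrite with fresh data;
    'import' -> an editor changed it, harvest the change back into
    our source (after review); 'manual' -> markers gone, hands off."""
    block = extract_infobox(wikitext)
    if block is None:
        return "manual"
    return "update" if fingerprint(block) == stored_hash else "import"
```

The markers double as the in-wikicode documentation I argue for below:
an editor who sees them knows the block is machine-maintained.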
Both such uses should imho be documented by comments in the
wikicode of the articles in question. Editors must know the
implications of their edits.
Summarizing all this, I'd suggest carefully planning, and test-
driving, any application that has even the least chance of being
more than a sheer create-articles-and-then-leave-them-alone-forever
project.
Another field needing attention is language.
A pretty huge number of names (of persons, places, languages, etc.)
are identical between languages, are transliterated somehow, or
undergo systematic transformations (e.g. of the kind that Estonian
versions of male names have 'as' appended to them, afaik), etc.
The rule of thumb is that for lesser-known distant things (places,
languages, persons, etc.) the existence of special or irregular
translations is very unlikely.
That may mean we can compile a set of transformation rules and an
exception-lookup mechanism (e.g. in WiktionaryZ), and pretty
safely assume, when no exception is found, that we can use the
regular transformation.
Naturally, when this assumption fails, we need a feedback
path from the respective language community that allows us to
"repair" errors. Since in most Wikipedias there are editors
reviewing all, or most, new articles, we can assume feedback will
be rather quick and reliable.
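The rule-plus-exception idea above could be sketched as follows. Both
the rules and the exception table here are illustrative assumptions
(the Estonian 'as' pattern is only my rough recollection from above,
not checked linguistic data):

```python
# Sketch: localize a name by consulting an exception table first,
# falling back to a regular per-language transformation rule.

EXCEPTIONS = {
    # (name, target language) -> known irregular translation
    ("Köln", "en"): "Cologne",
}

RULES = {
    # target language -> regular transformation (illustrative only)
    "et": lambda name: name + "as",  # assumed pattern for male names
    "en": lambda name: name,         # default: keep unchanged
}

def localize(name, lang):
    """Exception lookup first; otherwise apply the regular rule."""
    if (name, lang) in EXCEPTIONS:
        return EXCEPTIONS[(name, lang)]
    rule = RULES.get(lang, lambda n: n)
    return rule(name)
```

In practice the exception table would live somewhere shared, e.g. in
WiktionaryZ as suggested above, and be extended via the community
feedback path.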
Finding the right grammar, wording, etc. for automatically
generated non-tabular content is quite an interesting task, which
I'll not address any further here ;-)
Wikis not having alert proofreaders should imho not be filled
with much automated content, since this might be a remarkable
hindrance to community buildup.
The amount of newly inserted automated data should be determinable
by wiki admins, and generally it might be wise to make it somehow
related to the number of edits in any given time period, so as not
to overload the community.
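Tying the quota to community activity could be as simple as this;
the fraction and the floor are hypothetical admin-tunable knobs, not
figures I'm proposing:

```python
# Sketch: cap daily bot insertions at a fraction of the average daily
# number of human edits, with a small floor so tiny wikis still get data.

def daily_insert_quota(human_edits_last_week, fraction=0.10, minimum=5):
    """Return the number of bot insertions allowed per day."""
    avg_daily = human_edits_last_week / 7
    return max(minimum, int(avg_daily * fraction))

print(daily_insert_quota(700))  # 100 human edits/day -> quota of 10
```

The point is only that the knob is trivial to implement once the
admins have agreed on the figures.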
How wiki admins find the right figures should imho be left to
them; valid suggestions might come from public voting, or from
experience of how thoroughly the data can be verified.
Also, keeping data up to date needs imho to be negotiated with the
communities. I bet we'll receive several interesting ideas on how
this could be accomplished without interfering too much with
potential human editors.
Greetings to all
-- e-mail: <wikidata-l.mail.wikimedia.org(a)publi.purodha.net>