[Foundation-l] data centralization for the benefit of small (and also bigger) projects

Marcus Buck me at marcusbuck.org
Mon Aug 23 15:55:34 UTC 2010


Several wikis have used bots to increase their article count in the 
past. Examples are the Volapük Wikipedia (vo), with 118,000 articles of 
which about 117,000 are bot-created stubs, and the Aromanian Wikipedia 
(roa-rup), currently at 61,000 articles, up from fewer than 10,000 
before the bot run.

Why do they use bots? Because they have a small userbase and want to 
cover as many topics as possible with little effort. Most of the 
languages that use bots are small languages without much written 
literature, especially when it comes to non-fiction reference works. 
There are no Aromanian encyclopedias, few or no reference books, no 
databases, etc. An Aromanian either has to learn and use foreign 
languages or he will never be able to get information about places in 
China or in America. The bot operator tried to change this by creating 
stubs about places in China, America and elsewhere. (Geographic objects 
are the easiest way to cover large numbers of topics with little 
effort.) But he did a horrible job, producing really bad and 
uninformative articles. I assume the reason for the bad articles is not 
bad intent but simply a lack of the technical skill needed to program a 
more useful bot.

The easiest reaction to this is to just let them do their thing and not 
care about it. The second easiest is to run a delete bot and remove the 
bad articles because of their negative effects. But neither method 
addresses the original motivation of the bot operator: the wish to have 
information about a wide range of entities available in the wiki's 
language.

How can this be addressed?

We need a datawiki. That's not a new proposal; proposals for datawikis 
have a long history. There was never a specific reason not to implement 
one; it's just that, until now, nobody has cared about it enough to get 
it implemented.

Here's my idea about it:
When a search does not yield a matching article on the local wiki, the 
software will look up the name in the central datawiki. If the central 
datawiki contains a matching entry, that entry will be loaded. It 
consists of a template call filled in with information about the 
entity, e.g.:

{{Town
|name=Fab City
|country=Awesomia
|pop=89042
|lat=42.0
|lon=42.0
|elevation=12
|mayor=Adam Sweet
}}

The software will now look for a template called "Town" on the local 
wiki. The local template [[Template:Town]] could, for example, look 
like this:

{| class="infobox"
|-
! Name
|
{{{name|}}}
|-
! Country
|
{{{country|}}}
|-
! Population
|
{{{pop|}}}
|-
! Mayor
|
{{{mayor|}}}
|-
! Elevation
|
{{{elevation|}}} above sea level
|-
! Geographic position
|
{{latlon| {{{lat|}}} | {{{lon|}}} }}
|}
'''{{{name|}}}''' is a place in [[{{countryname| {{{country|}}} }}]] with a population of {{{pop|}}}.

[[Category:{{countryname| {{{country|}}} }}]]
[[Category:Towns]]

Of course this template will be localized in the language of the local 
wiki. ({{latlon}} and {{countryname}} stand for local helper templates 
that would format the coordinates and translate the country identifier 
into the local language.) The resulting page will then be shown to the 
user who entered the name in the search. (The above examples are just, 
well, examples. Real entries would most likely contain much more data.)
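
To make the mechanism concrete, here is a minimal sketch of the 
fallback logic in Python. Everything in it is hypothetical (the real 
implementation would live inside MediaWiki's search code, not in a 
script like this); it only illustrates the flow described above:

# Sketch of the proposed search fallback. All names and data
# structures here are invented for illustration.

local_articles = {}  # the local wiki's articles: title -> wikitext

central_datawiki = {  # central entries: title -> filled template call
    "Fab City": "{{Town|name=Fab City|country=Awesomia|pop=89042}}",
}

def expand_local_templates(wikitext):
    # Stand-in for MediaWiki's template expansion, which would render
    # the entry through the local [[Template:Town]].
    return wikitext

def search(title):
    # 1. Today's behaviour: serve the local article if it exists.
    if title in local_articles:
        return local_articles[title]
    # 2. New behaviour: fall back to the central datawiki.
    entry = central_datawiki.get(title)
    if entry is None:
        return None  # no match anywhere: a dead end, as today
    # 3. The entry is a template call; expanding it against the local
    #    [[Template:Town]] yields an infobox in the local language.
    return expand_local_templates(entry)

print(search("Fab City"))

The important point is step 3: the data lives centrally, but the 
presentation (and thus the language) comes from the local wiki's 
templates.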

The datawiki can be filled with information about any entity that has a 
certain set of recurring features (almost anything that has an infobox 
on Wikipedia), especially geographic objects. These objects also have 
the advantage that their names are usually international (at least 
among Latin-script languages).

The advantages are:
- when the central datawiki is filled with info (most of which can be 
bot-extracted from existing Wikipedia infoboxes; see the sketch after 
this list), every Wikipedia, however small its userbase may be, has 
instant access to information about hundreds of thousands or millions 
of objects; it just needs to implement some infobox templates
- this solution also eliminates the problem of outdated information in 
infoboxes (a problem even en.wp is suffering from). The data only needs 
to be updated in one single place instead of in every single Wikipedia 
separately
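
As for the bot extraction mentioned in the first point, a crude sketch 
in Python follows. The article text and the "Town" infobox name reuse 
the invented example from above (real infoboxes on the big Wikipedias 
have other names), and a real bot would need a proper wikitext parser, 
since infoboxes often contain nested templates, references and comments 
that this naive version would choke on:

import re

# Naive infobox harvester: pulls |key=value pairs out of the first
# occurrence of the named infobox in an article's wikitext.
def extract_infobox(wikitext, infobox_name):
    match = re.search(r"\{\{" + re.escape(infobox_name) + r"(.*?)\}\}",
                      wikitext, re.S)
    if match is None:
        return {}
    params = {}
    for line in match.group(1).splitlines():
        if line.lstrip().startswith("|") and "=" in line:
            key, _, value = line.lstrip(" |").partition("=")
            params[key.strip()] = value.strip()
    return params

article = """'''Fab City''' is a town in Awesomia.
{{Town
|name=Fab City
|country=Awesomia
|pop=89042
}}"""

print(extract_infobox(article, "Town"))
# prints: {'name': 'Fab City', 'country': 'Awesomia', 'pop': '89042'}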

With the work done by Nikola Smolenski on the Interlanguage extension 
(<http://www.mediawiki.org/wiki/Extension:Interlanguage>), it shouldn't 
be too hard to implement.

In view of its potential usefulness, I cannot think of any argument 
against this in general. The prospect of providing at least basic 
information about millions of objects in all the different languages 
seems really great to me.

Many native speakers of smaller languages use foreign-language wikis as 
their default wiki because the chance that their native wiki has an 
article on a given topic is small. If the number of topics for which a 
search on the native wiki yields results rises from "some thousands" to 
"millions", there is a chance that users will finally accept their 
native wiki as their default wiki. The entries will be basic, but if 
interwiki links (to existing articles not generated from the datawiki) 
are included in the info obtained from the datawiki, more extensive 
coverage is just one click away, whereas an unsuccessful search on the 
local wiki (as you get now) is a dead end.

It certainly is worth putting some resources into it.

What do you think?

Marcus Buck
User:Slomox


