Fwd: Support for small languages - Langcom

18 Nov 2013

Hoi,

I have been asked by Erik what can be done to better support small
languages and in particular what we can do to support more small languages
more effectively. I have thought about it for a long time and as far as I
am concerned, try as we might it will not happen as long as there is no
clear benefit we bring. In this text I describe how we can provide more
value to people of *any* language. For new and small languages the emphasis
will be on bold and easy strokes that have a big impact.

Key in all this will be that we have to connect to what is already there.
This makes search key. Two of the most important objectives are finding
pictures and finding information. Wikidata provides the most obvious tool
in this because it takes little effort to connect to the information that
is already there. Half an hour a day on labelling items that are in the
news will swell the most often searched terms rapidly in any language.

When people search for something, they either find it or they do not. When
a search is entered and nothing is found, it may exist either under a
different spelling or in a different language. When something is NOT found,
we should ask if the person knows a synonym in his language or a
translation in another language. With the new term we iterate in the
search. When something is found after one iteration, we ask if this item is
indeed what was intended to be found. One image and a first paragraph of
text should suffice. When it does, we add the search item as a (dirty)
label. Adding labels in this way will quickly swell the number of terms
available in a language for a search. Most importantly, we turn a
failure into a success, one that benefits everyone who seeks the same
information.
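
The search loop described above could be sketched roughly as follows. This
is only an illustration in Python, not an existing tool: the function names,
the in-memory dictionaries standing in for the Wikidata label store, the
item id "ITEM_HORSE" and the two callbacks (one asking the user for a
synonym or translation, one asking for confirmation) are all assumptions
made for the example.

```python
def lookup(index, term):
    """Return the item id whose label matches `term` in any language, or None."""
    wanted = term.casefold()
    for (_lang, label), item_id in index.items():
        if label == wanted:
            return item_id
    return None


def search_with_fallback(index, dirty_labels, lang, term, ask_alternative, confirm):
    """Search for `term` in `lang`; on failure, retry once with a user-supplied term.

    index           -- {(language, casefolded label): item id}, the curated labels
    dirty_labels    -- same shape, labels learned from earlier failed searches
    ask_alternative -- hypothetical callback: term -> synonym/translation or None
    confirm         -- hypothetical callback: item id -> True if it is what was meant
    """
    key = (lang, term.casefold())

    # 1. Direct hit on a curated label or on a previously learned dirty label.
    if key in index:
        return index[key]
    if key in dirty_labels:
        return dirty_labels[key]

    # 2. Nothing found: ask for a synonym or a translation in another language.
    alternative = ask_alternative(term)
    if not alternative:
        return None

    # 3. One iteration with the new term, across all languages.
    item_id = lookup(index, alternative)
    if item_id is None or not confirm(item_id):
        return None

    # 4. Confirmed: remember the original term as a (dirty) label.
    dirty_labels[key] = item_id
    return item_id


# Illustrative data only.
index = {("en", "horse"): "ITEM_HORSE"}
dirty = {}
found = search_with_fallback(
    index, dirty, "fy", "hynder",
    ask_alternative=lambda term: "horse",
    confirm=lambda item_id: True,
)
```

After this call the failed search term is stored as a dirty label, so the
next person searching for the same term in that language gets a direct hit
without being asked anything.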

When an item is found in a language, we can provide information in that
language in the format of an infobox or a Reasonator page. Obviously many
statements may not have a label in that language. These can be flagged, for
instance by blinking or by being shown in another language, so that it is
clear they can still be added in the primary
language. This approach will ensure that a teacher can select the search
terms he is interested in and prepare the information for his students.

Another approach is to learn where we fail to provide information. We do
not know what search terms fail most often. Consequently we do not have the
tool to remedy this in any language. The basis of data driven user
participation is that we KNOW what to ask for and why. When people start to
find pictures because of the link Wikidata has with Commons, we need to
understand it and see it coming before kids in school from all over the
world really start hammering our servers.
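
As a rough sketch of what "knowing what to ask for" could mean in practice,
the snippet below counts failed searches per language. It assumes a search
log is available as (language, term, found) records; that log format and
the function name are assumptions for the example, not something we have
today.

```python
from collections import Counter


def top_failed_terms(search_log, n=10):
    """Return the `n` most frequent failed search terms for each language.

    search_log -- iterable of (language, term, found) tuples (assumed format)
    """
    failures = {}
    for lang, term, found in search_log:
        if not found:
            # Casefold so that "Paard" and "paard" count as the same term.
            failures.setdefault(lang, Counter())[term.casefold()] += 1
    return {lang: counts.most_common(n) for lang, counts in failures.items()}


# Illustrative data only.
log = [
    ("nl", "paard", False),
    ("nl", "Paard", False),
    ("nl", "kat", True),
    ("fy", "hynder", False),
]
report = top_failed_terms(log)
```

A list like this, per language, tells volunteers which labels to add first.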

The objective is to reach the tipping point where we become useful in a
language.

I have been asked to become an advisory board member for the PanLex
Project <http://panlex.org> of The Long Now Foundation. I have accepted
this. What they are interested in is experimenting with one language to see
how their content can make a difference in Wikidata, but equally how
Wikidata can make a difference in PanLex. My take on their objective is
that their work makes no difference if it is not used. An experiment will
see their staff work on leveraging our data and software and vice versa. In
my opinion this will make information useful as explained above.

We have the opportunity to experiment with the Long Now Foundation and at
the same time develop tooling that will help all our languages and will
help us reach the tipping point where Wikidata is useful for all of them.

I also propose to change the criteria for accepting new WMF projects. So
far, for a new Wikipedia we have asked for many well-written articles of
high quality. In practice we have accepted many articles of stub quality.
What I propose is
to have something like 50 articles of a substantial size and complement
this with 250 items that have labels for all the statements. These 250
items cover many domains but are optimised for being what people are likely
to search.

I have been pushing and experimenting along these lines. The results are a
search tool using Wikidata in Wikipedia, a demonstration that Wikidata
knows more items than Wikipedia has articles, visualisation for people and
organisms in the "Reasonator", and a growing personal conviction that this
is how we can grow any language to the fullest of its potential.

My question is: what do you think? How can we implement this? What more
can we do?
Thanks,
     Gerard

PS I fear that when children discover that they can find pictures in THEIR
language, they will be able to bring our servers down. A luxury
problem, I am sure :)

PS-2 A big thank you to Magnus Manske and Lydia Pintscher for the wonderful
work they do.