[Wikimedia-l] Wikidata Stubs: Threat or Menace?

Fri Apr 26 15:34:50 UTC 2013

I have been thinking about how to do this for some time, so I wrote up some
developers' notes at
http://meta.wikimedia.org/wiki/Wikidata/Notes/Article_generation so feel
free to comment if anything is unclear or undesirable.
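
To make the idea concrete, here is a minimal sketch in Python of the
template approach Erik and Andrew describe in the quoted mails below: an
entity type -> template mapping whose sentences are filled in from
structured data, with a small grammar helper for lists. Everything in it
(TEMPLATES, join_list, render_stub, the layout of the item dict) is made up
for illustration and is not how Wikidata or MediaWiki expose data; a real
version would more likely be wiki templates plus Lua.

```python
# Hypothetical sketch: none of these names are real Wikidata or MediaWiki APIs.

# Entity type -> sentence templates, each paired with the claims it requires.
TEMPLATES = {
    "species": [
        ("{label} is a species of {kind} in the {family} family.",
         ("kind", "family")),
        ("It is found in {found_in}.", ("found_in",)),
        ("It grows to {length_cm} cm.", ("length_cm",)),
    ],
}


def join_list(values):
    """Join values as 'A', 'A and B', or 'A, B and C'."""
    values = [str(v) for v in values]
    if len(values) == 1:
        return values[0]
    return ", ".join(values[:-1]) + " and " + values[-1]


def render_stub(item):
    """Render a pseudo-stub for an item dict, or None if nothing applies."""
    templates = TEMPLATES.get(item.get("type"))
    if templates is None:
        return None  # no template defined for this entity type
    claims = item.get("claims", {})
    sentences = []
    for pattern, needed in templates:
        # Skip a sentence whose data is missing rather than emit a gap.
        if not all(claims.get(key) for key in needed):
            continue
        fields = {key: join_list(claims[key]) for key in needed}
        sentences.append(pattern.format(label=item["label"], **fields))
    return " ".join(sentences) or None


# Assumed item layout: each claim maps to a list of already-localised values.
fish = {
    "label": "XXX",
    "type": "species",
    "claims": {
        "kind": ["fish"],
        "family": ["Y"],
        "found_in": ["Laos", "Vietnam"],
        "length_cm": [20],
    },
}
print(render_stub(fish))
```

For Andrew's fish example this produces a three-sentence paragraph. An item
whose type has no template returns None, so the page would stay an ordinary
red link, and a sentence is simply dropped when its data is missing.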

On 26/04/13 14:10, Jane Darnell wrote:
> Well, I am going to come out of the closet here and admit that I for
> one will sometimes want to read that machine-generated text over the
> human-written English one. Sometimes to uncover the real little gems
> of Wikipedia, you need to have a lot of patience with Google translate
> options.
>
> 2013/4/26, Delirium <delirium at hackish.org>:
>> This is a very interesting proposal. I think how well it will work may
>> vary considerably based on the language.
>>
>> The strongest case in favor of machine-generating stubs, imo, is in
>> languages where there are many monolingual speakers and the Wikipedia is
>> already quite large and active. In that case, machine-generated stubs
>> can help promote expansion into not-yet-covered areas, plus provide
>> monolingual speakers with information they would otherwise either not
>> get, or have to get in worse form via a machine-translated article.
>>
>> At the other end of the spectrum you have quite small Wikipedias, and
>> Wikipedias which are both small and read/written mostly/entirely by
>> bilingual readers. In these Wikipedias, article-writing tends to focus
>> on things more specifically relevant to a certain culture and history.
>> Suddenly creating tens or hundreds of thousands of stubs in such
>> languages might serve to dilute a small Wikipedia more than strengthen
>> it: if you take a Wikipedia with 10,000 articles, and it gains 500,000
>> machine-generated stubs, *almost every* article that comes up in search
>> engines will be machine-generated, making it much less obvious what
>> parts of the encyclopedia are actually active and human-written amidst
>> the sea of auto-generated content.
>>
>> Plus, from a reader's perspective, it may not even improve the
>> availability of information. For example, I doubt there are many
>> speakers of Bavarian who would prefer to read a machine-generated
>> bar.wiki article, over a human-written de.wiki article. That may even be
>> true for some less-related languages: most Danes I know would prefer a
>> human-written English article over a machine-generated Danish one.
>>
>> -Mark
>>
>>
>> On 4/25/13 8:16 PM, Erik Moeller wrote:
>>> Millions of Wikidata stubs invade small Wikipedias .. Volapük
>>> Wikipedia now best curated source on asteroids .. new editors flood
>>> small wikis .. Google spokesperson: "This is out of control. We will
>>> shut it down."
>>>
>>> Denny suggested:
>>>
>>>>> II ) develop a feature that blends into Wikipedia's search if an article
>>>>> about a topic does not exist yet, but we have data on Wikidata about
>>>>> that topic
>>> Andrew Gray responded:
>>>
>>>> I think this would be amazing. A software hook that says "we know X
>>>> article does not exist yet, but it is matched to Y topic on Wikidata"
>>>> and pulls out core information, along with a set of localised
>>>> descriptions... we gain all the benefit of having stub articles
>>>> (scope, coverage) without the problems of a small community having to
>>>> curate a million pages. It's not the same as hand-written content, but
>>>> it's immeasurably better than no content, or even an attempt at
>>>> machine-translating free text.
>>>>
>>>> XXX is [a species of: fish] [in the: Y family]. It [is found in: Laos,
>>>> Vietnam]. It [grows to: 20 cm]. (pictures)
>>> This seems very doable. Is it desirable?
>>>
>>> For many languages, it would allow hundreds of thousands of
>>> pseudo-stubs (not real articles stored in the DB, but generated from
>>> Wikidata) to be served to readers and crawlers that would otherwise
>>> not exist in that language.
>>>
>>> Looking back 10 years, User:Ram-Man was one of the first to generate
>>> thousands of en.wp articles from, in this case, US census data. It was
>>> controversial at the time and it stuck. Other Wikipedias have since
>>> then either allowed or prohibited bot-creation of articles on a
>>> project-by-project basis. It tends to lead to frustration when folks
>>> compare article counts and see artificial inflation by bot-created
>>> content.
>>>
>>> Does anyone know if the impact of bot-creation on (new) editor
>>> behavior has been studied? I do know that many of the Rambot articles
>>> were expanded over time, and I suspect many wouldn't have been if they
>>> hadn't turned up in search engines in the first place. On the flip
>>> side, a large "surface area" of content being indexed by search
>>> engines will likely also attract a fair bit of drive-by vandalism that
>>> may not be detected because those pages aren't watched.
>>>
>>> A model like the proposed one might offer a solution to a lot of these
>>> challenges. How I imagine it could work:
>>>
>>> * Templates could be defined for different Wikidata entities. We could
>>> make it possible to let users add links from items in Wikidata to
>>> Wikipedia articles that don't exist yet. (Currently this is
>>> prohibited.) If such a link is added, _and_ a relevant template is
>>> defined for the Wikidata entity type (perhaps through an entity
>>> type->template mapping), WP will render an article using that
>>> template, pulling structured info from Wikidata.
>>>
>>> * A lot of the grammatical rules would be defined in the template
>>> using checks against the Wikidata result. Depending on the complexity
>>> of grammatical variations beyond basics such as singular/plural this
>>> might require Lua scripting.
>>>
>>> * The article is served as a normal HTTP 200 result, cached, and
>>> indexed by search engines. In WP itself, links to the article might
>>> have some special affordance that suggests that they're neither
>>> ordinary red links nor existing articles.
>>>
>>> * When a user tries to edit the article, wikitext (or visual edit
>>> mode) is generated, allowing the user to expand or add to the
>>> automatically generated prose and headings. Such edits are tagged so
>>> they can more easily be monitored (they could also be gated by default
>>> if the vandalism rate is too high).
>>>
>>> * We'd need to decide whether we want these pages to show up in
>>> searches on WP itself.
>>>
>>> Advantages:
>>>
>>> * These pages wouldn't inflate page counts, but they would offer
>>> useful information to readers and be higher quality than machine
>>> translation.
>>>
>>> * They could serve as powerful lures for new editors in languages that
>>> are currently underrepresented on the web.
>>>
>>> Disadvantages/concerns:
>>>
>>> * Depending on implementation, I continue to have some concern about
>>> {{#property}} references ending up in article text (as opposed to
>>> templates); these concerns are consistent with the ones expressed in
>>> the en.wp RFC [1]. This might be mitigated if Visual Editor offers a
>>> super-intuitive in-place editing method. {{#property}} references in
>>> text could also be converted to their plain text representation the
>>> moment a page is edited by a human being (which would have its own set
>>> of challenges, of course).
>>>
>>> * How massive would these sets of auto-generated articles get? I
>>> suspect the technical complexity of setting up the templates and
>>> adding the links in Wikidata itself would act as a bit of a barrier to
>>> entry. But vast pseudo-article sets in tiny languages could pose
>>> operational challenges without adding a lot of value.
>>>
>>> * Would search engines penalize WP for such auto-generated content?
>>>
>>> Overall, I think it's an area where experimentation is merited, as it
>>> could not only expand information in languages that are
>>> underrepresented on the web, but also act as a force multiplier for
>>> new editor entrypoints. It also seems that a proof-of-concept for
>>> experimentation in a limited context should be very doable.
>>>
>>> Erik
>>>
>>> [1]
>>> https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Wikidata_Phase_2#Use_of_Wikidata_in_article_text