[Wikimedia-l] Wikidata Stubs: Threat or Menace?

Fri Apr 26 09:50:37 UTC 2013

This is a very interesting proposal. I think how well it works may
vary considerably by language.

The strongest case in favor of machine-generating stubs, imo, is in 
languages where there are many monolingual speakers and the Wikipedia is 
already quite large and active. In that case, machine-generated stubs 
can help promote expansion into not-yet-covered areas, plus provide 
monolingual speakers with information they would otherwise either not 
get, or have to get in worse form via a machine-translated article.

At the other end of the spectrum you have quite small Wikipedias, and 
Wikipedias which are both small and read/written mostly/entirely by 
bilingual readers. In these Wikipedias, article-writing tends to focus 
on things more specifically relevant to a certain culture and history. 
Suddenly creating tens or hundreds of thousands of stubs in such
languages might dilute a small Wikipedia more than strengthen it: if a
Wikipedia with 10,000 articles gains 500,000 machine-generated stubs,
about 98% of its articles will be machine-generated, and so will
*almost every* article that comes up in search engines, making it much
less obvious which parts of the encyclopedia are actually active and
human-written amidst the sea of auto-generated content.

Plus, from a reader's perspective, it may not even improve the 
availability of information. For example, I doubt there are many 
speakers of Bavarian who would prefer to read a machine-generated
bar.wiki article over a human-written de.wiki article. That may even be
true for some less closely related languages: most Danes I know would
prefer a human-written English article over a machine-generated Danish
one.

-Mark

On 4/25/13 8:16 PM, Erik Moeller wrote:
> Millions of Wikidata stubs invade small Wikipedias .. Volapük
> Wikipedia now best curated source on asteroids .. new editors flood
> small wikis .. Google spokesperson: "This is out of control. We will
> shut it down."
>
> Denny suggested:
>
>>> II ) develop a feature that blends into Wikipedia's search if an article
>>> about a topic does not exist yet, but we have data on Wikidata about that
>>> topic
>
> Andrew Gray responded:
>
>> I think this would be amazing. A software hook that says "we know X
>> article does not exist yet, but it is matched to Y topic on Wikidata"
>> and pulls out core information, along with a set of localised
>> descriptions... we gain all the benefit of having stub articles
>> (scope, coverage) without the problems of a small community having to
>> curate a million pages. It's not the same as hand-written content, but
>> it's immeasurably better than no content, or even an attempt at
>> machine-translating free text.
>>
>> XXX is [a species of: fish] [in the: Y family]. It [is found in: Laos,
>> Vietnam]. It [grows to: 20 cm]. (pictures)
>
> This seems very doable. Is it desirable?
>
> For many languages, it would allow hundreds of thousands of
> pseudo-stubs (not real articles stored in the DB, but generated from
> Wikidata) that would otherwise not exist in that language to be
> served to readers and crawlers.
>
> Looking back 10 years, User:Ram-Man was one of the first to generate
> thousands of en.wp articles from, in this case, US census data. It was
> controversial at the time and it stuck. Other Wikipedias have since
> then either allowed or prohibited bot-creation of articles on a
> project-by-project basis. It tends to lead to frustration when folks
> compare article counts and see artificial inflation by bot-created
> content.
>
> Does anyone know if the impact of bot-creation on (new) editor
> behavior has been studied? I do know that many of the Rambot articles
> were expanded over time, and I suspect many wouldn't have been if they
> hadn't turned up in search engines in the first place. On the flip
> side, a large "surface area" of content being indexed by search
> engines will likely also attract a fair bit of drive-by vandalism that
> may not be detected because those pages aren't watched.
>
> A model like the proposed one might offer a solution to a lot of these
> challenges. How I imagine it could work:
>
> * Templates could be defined for different Wikidata entities. We could
> make it possible to let users add links from items in Wikidata to
> Wikipedia articles that don't exist yet. (Currently this is
> prohibited.) If such a link is added, _and_ a relevant template is
> defined for the Wikidata entity type (perhaps through an entity
> type->template mapping), WP will render an article using that
> template, pulling structured info from Wikidata.
>
> * A lot of the grammatical rules would be defined in the template
> using checks against the Wikidata result. Depending on the complexity
> of grammatical variations beyond basics such as singular/plural this
> might require Lua scripting.
>
> * The article is served as a normal HTTP 200 result, cached, and
> indexed by search engines. In WP itself, links to the article might
> have some special affordance that suggests that they're neither
> ordinary red links nor existing articles.
>
> * When a user tries to edit the article, wikitext (or visual edit
> mode) is generated, allowing the user to expand or add to the
> automatically generated prose and headings. Such edits are tagged so
> they can more easily be monitored (they could also be gated by default
> if the vandalism rate is too high).
>
> * We'd need to decide whether we want these pages to show up in
> searches on WP itself.
>
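
The entity type->template mapping and rendering step described in the
list above could be sketched roughly as follows. This is only an
illustration in Python, not how MediaWiki or Wikidata would implement
it: TEMPLATES, render_pseudo_stub and the layout of the entity dict are
hypothetical names made up for the example, and it ignores the
grammatical variation that the proposal says might need Lua.

```python
from string import Template

# Hypothetical entity type -> template mapping (illustrative only).
TEMPLATES = {
    "species": Template(
        "$label is a species of $kind in the $family family. "
        "It is found in $range. It grows to $length."
    ),
}


def render_pseudo_stub(entity):
    """Return stub text for an entity dict, or None if it cannot be rendered."""
    template = TEMPLATES.get(entity.get("type"))
    if template is None:
        # No template defined for this entity type: serve nothing.
        return None
    # Join multi-valued claims (e.g. several countries) into one string.
    fields = {
        key: ", ".join(value) if isinstance(value, list) else value
        for key, value in entity.get("claims", {}).items()
    }
    fields["label"] = entity["label"]
    try:
        return template.substitute(fields)
    except KeyError:
        # A required claim is missing: skip rather than emit a broken sentence.
        return None


# Example entity, mirroring the fish example quoted earlier in the thread.
example = {
    "type": "species",
    "label": "XXX",
    "claims": {
        "kind": "fish",
        "family": "Y",
        "range": ["Laos", "Vietnam"],
        "length": "20 cm",
    },
}
print(render_pseudo_stub(example))
```

Returning None when the template or a claim is missing is a design
choice for the sketch: the page would simply not be served rather than
shown with holes in it.
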
> Advantages:
>
> * These pages wouldn't inflate page counts, but they would offer
> useful information to readers and be higher quality than machine
> translation.
>
> * They could serve as powerful lures for new editors in languages that
> are currently underrepresented on the web.
>
> Disadvantages/concerns:
>
> * Depending on implementation, I continue to have some concern about
> {{#property}} references ending up in article text (as opposed to
> templates); these concerns are consistent with the ones expressed in
> the en.wp RFC [1]. This might be mitigated if Visual Editor offers a
> super-intuitive in-place editing method. {{#property}} references in
> text could also be converted to their plain text representation the
> moment a page is edited by a human being (which would have its own set
> of challenges, of course).
>
> * How massive would these sets of auto-generated articles get? I
> suspect the technical complexity of setting up the templates and
> adding the links in Wikidata itself would act as a bit of a barrier to
> entry. But vast pseudo-article sets in tiny languages could pose
> operational challenges without adding a lot of value.
>
> * Would search engines penalize WP for such auto-generated content?
>
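
The idea above of converting {{#property}} references to their plain
text representation when a human first edits the page could look
something like this. It is a rough sketch under assumptions: the regex
is a simplification of the parser function syntax,
flatten_property_refs is a hypothetical helper, the property ID in the
example is a placeholder, and the values dict stands in for whatever
lookup against Wikidata would really be used.

```python
import re

# Simplified pattern for references such as {{#property:P123}}.
PROPERTY_REF = re.compile(r"\{\{#property:([^}|]+)\}\}")


def flatten_property_refs(wikitext, values):
    """Replace each {{#property:...}} reference with its value from `values`.

    References with no known value are left untouched.
    """
    def replace(match):
        key = match.group(1).strip()
        return values.get(key, match.group(0))

    return PROPERTY_REF.sub(replace, wikitext)


# "P123" is a placeholder ID, not a claim about any real Wikidata property.
text = "It grows to {{#property:P123}}."
print(flatten_property_refs(text, {"P123": "20 cm"}))
```

Leaving unknown references in place, rather than blanking them, keeps
the text recoverable if the lookup fails.
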
> Overall, I think it's an area where experimentation is merited, as it
> could not only expand information in languages that are
> underrepresented on the web, but also act as a force multiplier for
> new editor entrypoints. It also seems that a proof-of-concept for
> experimentation in a limited context should be very doable.
>
> Erik
>
> [1] https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Wikidata_Phase_2#Use_of_Wikidata_in_article_text
> --
> Erik Möller
> VP of Engineering and Product Development, Wikimedia Foundation