[Wikimedia-l] Wikidata Stubs: Threat or Menace?

Jane Darnell jane023 at gmail.com
Fri Apr 26 12:10:22 UTC 2013


Well, I am going to come out of the closet here and admit that I, for
one, will sometimes want to read that machine-generated text rather
than the human-written English one. Sometimes, to uncover the real
little gems of Wikipedia, you need a lot of patience with the Google
Translate options.

2013/4/26, Delirium <delirium at hackish.org>:
> This is a very interesting proposal. I think how well it will work may
> vary considerably based on the language.
>
> The strongest case in favor of machine-generating stubs, imo, is in
> languages where there are many monolingual speakers and the Wikipedia is
> already quite large and active. In that case, machine-generated stubs
> can help promote expansion into not-yet-covered areas, plus provide
> monolingual speakers with information they would otherwise either not
> get, or have to get in worse form via a machine-translated article.
>
> At the other end of the spectrum you have quite small Wikipedias, and
> Wikipedias which are both small and read/written mostly/entirely by
> bilingual readers. In these Wikipedias, article-writing tends to focus
> on things more specifically relevant to a certain culture and history.
> Suddenly creating tens or hundreds of thousands of stubs in such
> languages might serve to dilute a small Wikipedia more than strengthen
> it: if you take a Wikipedia with 10,000 articles, and it gains 500,000
> machine-generated stubs, *almost every* article that comes up in search
> engines will be machine-generated, making it much less obvious what
> parts of the encyclopedia are actually active and human-written amidst
> the sea of auto-generated content.
>
> Plus, from a reader's perspective, it may not even improve the
> availability of information. For example, I doubt there are many
> speakers of Bavarian who would prefer to read a machine-generated
> bar.wiki article, over a human-written de.wiki article. That may even be
> true for some less-related languages: most Danes I know would prefer a
> human-written English article over a machine-generated Danish one.
>
> -Mark
>
>
> On 4/25/13 8:16 PM, Erik Moeller wrote:
>> Millions of Wikidata stubs invade small Wikipedias .. Volapük
>> Wikipedia now best curated source on asteroids .. new editors flood
>> small wikis .. Google spokesperson: "This is out of control. We will
>> shut it down."
>>
>> Denny suggested:
>>
>>>> II) develop a feature that blends into Wikipedia's search if an article
>>>> about a topic does not exist yet, but we have data on Wikidata about that
>>>> topic
>> Andrew Gray responded:
>>
>>> I think this would be amazing. A software hook that says "we know X
>>> article does not exist yet, but it is matched to Y topic on Wikidata"
>>> and pulls out core information, along with a set of localised
>>> descriptions... we gain all the benefit of having stub articles
>>> (scope, coverage) without the problems of a small community having to
>>> curate a million pages. It's not the same as hand-written content, but
>>> it's immeasurably better than no content, or even an attempt at
>>> machine-translating free text.
>>>
>>> XXX is [a species of: fish] [in the: Y family]. It [is found in: Laos,
>>> Vietnam]. It [grows to: 20 cm]. (pictures)
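
(To make Andrew's slot-filling example concrete: a rough Python sketch of
how such a sentence pattern might be filled from Wikidata-style claims.
The claim keys and values below are invented for illustration; real code
would work from property IDs and localised labels.)

    # Sketch only: fill a fixed sentence pattern from a dict of
    # Wikidata-style claims. Claim keys and values are placeholders.
    def render_species_stub(label, claims):
        sentences = []
        if "taxon" in claims and "family" in claims:
            sentences.append("%s is a species of %s in the %s family."
                             % (label, claims["taxon"], claims["family"]))
        if "found in" in claims:
            sentences.append("It is found in %s."
                             % ", ".join(claims["found in"]))
        if "length" in claims:
            sentences.append("It grows to %s." % claims["length"])
        return " ".join(sentences)

    print(render_species_stub("Examplefish", {
        "taxon": "fish",
        "family": "Y",
        "found in": ["Laos", "Vietnam"],
        "length": "20 cm",
    }))
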
>> This seems very doable. Is it desirable?
>>
>> For many languages, it would allow hundreds of thousands of
>> pseudo-stubs (not real articles stored in the DB, but pages generated
>> from Wikidata) that would otherwise not exist in that language to be
>> served to readers and crawlers.
>>
>> Looking back 10 years, User:Ram-Man was one of the first to generate
>> thousands of en.wp articles, in that case from US census data. It was
>> controversial at the time, but it stuck. Other Wikipedias have since
>> either allowed or prohibited bot-creation of articles on a
>> project-by-project basis. That tends to lead to frustration when folks
>> compare article counts and see artificial inflation from bot-created
>> content.
>>
>> Does anyone know if the impact of bot-creation on (new) editor
>> behavior has been studied? I do know that many of the Rambot articles
>> were expanded over time, and I suspect many wouldn't have been if they
>> hadn't turned up in search engines in the first place. On the flip
>> side, a large "surface area" of content being indexed by search
>> engines will likely also attract a fair bit of drive-by vandalism that
>> may not be detected because those pages aren't watched.
>>
>> A model like the proposed one might offer a solution to a lot of these
>> challenges. How I imagine it could work:
>>
>> * Templates could be defined for different Wikidata entity types. We
>> could allow users to add links from items in Wikidata to Wikipedia
>> articles that don't exist yet. (Currently this is prohibited.) If such
>> a link is added, _and_ a relevant template is defined for the Wikidata
>> entity type (perhaps through an entity type->template mapping), WP
>> would render an article using that template, pulling structured info
>> from Wikidata (see the sketch after this list).
>>
>> * A lot of the grammatical rules would be defined in the template
>> using checks against the Wikidata result. Depending on the complexity
>> of grammatical variations beyond basics such as singular/plural, this
>> might require Lua scripting.
>>
>> * The article is served as a normal HTTP 200 result, cached, and
>> indexed by search engines. In WP itself, links to the article might
>> have some special affordance suggesting that they're neither ordinary
>> red links nor existing articles.
>>
>> * When a user tries to edit the article, wikitext (or visual edit
>> mode) is generated, allowing the user to expand or add to the
>> automatically generated prose and headings. Such edits are tagged so
>> they can more easily be monitored (they could also be gated by default
>> if the vandalism rate is too high).
>>
>> * We'd need to decide whether we want these pages to show up in
>> searches on WP itself.
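
(A rough sketch, in Python rather than the Lua/Scribunto a real MediaWiki
implementation would use, of the entity type->template mapping and the
simple singular/plural checks described in the list above. The entity
types, template strings and item data are all invented placeholders.)

    # Sketch only: pick a sentence template by entity type and fill it
    # from item data. A real version would run inside MediaWiki and pull
    # values through the Wikidata client, not from a local dict.
    TEMPLATES = {
        "taxon": "%(label)s is a species of %(parent)s in the %(family)s family.",
        "asteroid": "%(label)s is an asteroid discovered by %(discoverer)s in %(year)s.",
    }

    def plural(count, singular, plural_form):
        # toy example of the kind of grammatical check a template needs
        return singular if count == 1 else plural_form

    def render_pseudo_stub(entity_type, data):
        template = TEMPLATES.get(entity_type)
        if template is None:
            return None  # no template defined -> no pseudo-article served
        return template % data

    print(render_pseudo_stub("asteroid", {
        "label": "(99999) Examplia",
        "discoverer": "A. Observer",
        "year": 1998,
    }))
    moons = 2
    print("It has %d known %s." % (moons, plural(moons, "moon", "moons")))
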
>>
>> Advantages:
>>
>> * These pages wouldn't inflate page counts, but they would offer
>> useful information to readers and be of higher quality than machine
>> translation.
>>
>> * They could serve as powerful lures for new editors in languages that
>> are currently underrepresented on the web.
>>
>> Disadvantages/concerns:
>>
>> * Depending on implementation, I continue to have some concern about
>> {{#property}} references ending up in article text (as opposed to
>> templates); these concerns are consistent with the ones expressed in
>> the en.wp RFC [1]. This might be mitigated if VisualEditor offers a
>> super-intuitive in-place editing method. {{#property}} references in
>> text could also be converted to their plain-text representation the
>> moment a page is edited by a human being (which would have its own set
>> of challenges, of course; see the sketch after this list).
>>
>> * How massive would these sets of auto-generated articles get? I
>> suspect the technical complexity of setting up the templates and
>> adding the links in Wikidata itself would act as a bit of a barrier to
>> entry. But vast pseudo-article sets in tiny languages could pose
>> operational challenges without adding a lot of value.
>>
>> * Would search engines penalize WP for such auto-generated content?
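
(On the {{#property}} concern above: a toy Python sketch of what
"convert {{#property}} calls to plain text on the first human edit"
could look like. The property IDs and values here are arbitrary
placeholders; a real version would resolve them through Wikidata and
run server-side.)

    # Sketch only: replace {{#property:Pnnn}} parser-function calls with
    # their current plain-text values, so that a human editor works on
    # ordinary wikitext afterwards. PROPERTY_VALUES stands in for a real
    # Wikidata lookup.
    import re

    PROPERTY_VALUES = {"P123": "Examplefish", "P456": "20 cm"}

    def freeze_property_calls(wikitext):
        def substitute(match):
            return PROPERTY_VALUES.get(match.group(1), match.group(0))
        return re.sub(r"\{\{#property:(P\d+)\}\}", substitute, wikitext)

    print(freeze_property_calls(
        "'''{{#property:P123}}''' grows to {{#property:P456}}."))
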
>>
>> Overall, I think it's an area where experimentation is merited, as it
>> could not only expand information in languages that are
>> underrepresented on the web, but also act as a force multiplier for
>> new-editor entry points. It also seems that a proof of concept in a
>> limited context should be very doable.
>>
>> Erik
>>
>> [1]
>> https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Wikidata_Phase_2#Use_of_Wikidata_in_article_text
>> --
>> Erik Möller
>> VP of Engineering and Product Development, Wikimedia Foundation
>>


