[Wikimedia-l] Wikidata Stubs: Threat or Menace?

Erik Moeller erik at wikimedia.org
Thu Apr 25 18:16:05 UTC 2013


Millions of Wikidata stubs invade small Wikipedias .. Volapük
Wikipedia now best curated source on asteroids .. new editors flood
small wikis .. Google spokesperson: "This is out of control. We will
shut it down."

Denny suggested:

>> II ) develop a feature that blends into Wikipedia's search if an article
>> about a topic does not exist yet, but we have data on Wikidata about that
>> topic

Andrew Gray responded:

> I think this would be amazing. A software hook that says "we know X
> article does not exist yet, but it is matched to Y topic on Wikidata"
> and pulls out core information, along with a set of localised
> descriptions... we gain all the benefit of having stub articles
> (scope, coverage) without the problems of a small community having to
> curate a million pages. It's not the same as hand-written content, but
> it's immeasurably better than no content, or even an attempt at
> machine-translating free text.
>
> XXX is [a species of: fish] [in the: Y family]. It [is found in: Laos,
> Vietnam]. It [grows to: 20 cm]. (pictures)

This seems very doable. Is it desirable?
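
To make the quoted pattern concrete, here is a rough Python sketch of
the fill-in-the-blanks rendering (illustration only, not actual
MediaWiki or Wikidata code; the item dictionary is a deliberately
simplified stand-in for a real Wikidata item, which keys its claims by
property IDs such as P31 "instance of"):

    def join_list(values):
        """Join values with commas and a final 'and'."""
        if len(values) == 1:
            return values[0]
        return ", ".join(values[:-1]) + " and " + values[-1]

    def render_species_stub(item):
        """Fill a fixed sentence template from a simplified claim dict."""
        sentences = [
            f"{item['label']} is a species of {item['group']} "
            f"in the {item['family']} family."
        ]
        if item.get("found in"):
            sentences.append(f"It is found in {join_list(item['found in'])}.")
        if item.get("max length cm"):
            sentences.append(f"It grows to {item['max length cm']} cm.")
        return " ".join(sentences)

    # Placeholder values taken from the quoted example above.
    example_item = {
        "label": "XXX",
        "group": "fish",
        "family": "Y",
        "found in": ["Laos", "Vietnam"],
        "max length cm": 20,
    }

    print(render_species_stub(example_item))
    # -> XXX is a species of fish in the Y family. It is found in Laos
    #    and Vietnam. It grows to 20 cm.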

For many languages, it would allow hundreds of thousands of
pseudo-stubs (not real articles stored in the DB, but pages generated
from Wikidata) that would otherwise not exist in those languages to be
served to readers and crawlers.

Looking back 10 years, User:Ram-Man was one of the first to generate
thousands of en.wp articles from external data, in that case US census
data. It was controversial at the time, but it stuck. Other Wikipedias
have since either allowed or prohibited bot-creation of articles on a
project-by-project basis. It tends to lead to frustration when folks
compare article counts and see artificial inflation by bot-created
content.

Does anyone know if the impact of bot-creation on (new) editor
behavior has been studied? I do know that many of the Rambot articles
were expanded over time, and I suspect many wouldn't have been if they
hadn't turned up in search engines in the first place. On the flip
side, a large "surface area" of content being indexed by search
engines will likely also attract a fair bit of drive-by vandalism that
may not be detected because those pages aren't watched.

A model like the proposed one might offer a solution to a lot of these
challenges. How I imagine it could work:

* Templates could be defined for different Wikidata entity types. We
could allow users to add links from items in Wikidata to Wikipedia
articles that don't exist yet. (Currently this is prohibited.) If such
a link is added, _and_ a relevant template is defined for the Wikidata
entity type (perhaps through an entity type->template mapping), WP
would render an article using that template, pulling structured
information from Wikidata (see the sketch after this list).

* A lot of the grammatical rules would be defined in the template
using checks against the Wikidata result. Depending on the complexity
of grammatical variations beyond basics such as singular/plural, this
might require Lua scripting.

* The article is served as a normal HTTP 200 result, cached, and
indexed by search engines. Within WP itself, links to such an article
might have some special affordance suggesting that they are neither
ordinary red links nor links to existing articles.

* When a user tries to edit the article, wikitext (or visual edit
mode) is generated, allowing the user to expand or add to the
automatically generated prose and headings. Such edits are tagged so
they can more easily be monitored (they could also be gated by default
if the vandalism rate is too high).

* We'd need to decide whether we want these pages to show up in
searches on WP itself.
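
As a rough sketch of the dispatch step from the first bullet above
(again Python for illustration; TEMPLATE_BY_TYPE, the lookup and
render callables, and the "entity type" field are all invented names,
and a real implementation would live in MediaWiki/Lua against the
actual Wikidata claim structure):

    # Hypothetical per-wiki mapping from entity type to rendering template.
    TEMPLATE_BY_TYPE = {
        "taxon": "Template:Autostub/species",
        "asteroid": "Template:Autostub/asteroid",
    }

    def pseudo_stub_for(title, wikidata_lookup, render):
        """Return pseudo-stub HTML for a missing article, or None.

        wikidata_lookup(title) -> simplified item dict, or None
        render(template, item) -> HTML string
        """
        item = wikidata_lookup(title)   # item linked to the not-yet-written article
        if item is None:
            return None                 # no Wikidata item: ordinary red link
        template = TEMPLATE_BY_TYPE.get(item.get("entity type"))
        if template is None:
            return None                 # no template defined for this entity type
        return render(template, item)   # serve as a cached HTTP 200 page

    # Toy usage with stubbed-out lookup and rendering.
    html = pseudo_stub_for(
        "Some missing article",
        wikidata_lookup=lambda title: {"entity type": "taxon", "label": title},
        render=lambda template, item: f"<p>{item['label']} ({template})</p>",
    )
    print(html)  # -> <p>Some missing article (Template:Autostub/species)</p>

The interesting policy questions (which entity types get a template,
who maintains the mapping) would live in that mapping, not in code.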

Advantages:

* These pages wouldn't inflate article counts, but they would offer
useful information to readers and be of higher quality than machine
translation.

* They could serve as powerful lures for new editors in languages that
are currently underrepresented on the web.

Disadvantages/concerns:

* Depending on the implementation, I continue to have some concern
about {{#property}} references ending up in article text (as opposed
to templates); these concerns are consistent with the ones expressed
in the en.wp RFC [1]. This might be mitigated if VisualEditor offers a
super-intuitive in-place editing method. {{#property}} references in
text could also be converted to their plain-text representation the
moment a page is edited by a human being (which would have its own
set of challenges, of course; see the sketch after this list).

* How massive would these sets of auto-generated articles get? I
suspect the technical complexity of setting up the templates and
adding the links in Wikidata itself would act as a bit of a barrier to
entry. But vast pseudo-article sets in tiny languages could pose
operational challenges without adding a lot of value.

* Would search engines penalize WP for such auto-generated content?
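
On the {{#property}} conversion idea mentioned above, a very
simplified sketch of the "substitute on first human edit" step (the
regex only covers the plain {{#property:Pnnn}} form, resolve() is an
assumed helper that would look the current value up on Wikidata, and
P1234 is a made-up property ID):

    import re

    PROPERTY_REF = re.compile(r"\{\{#property:(P\d+)\}\}")

    def substitute_properties(wikitext, resolve):
        """Replace each {{#property:Pnnn}} with resolve('Pnnn')."""
        return PROPERTY_REF.sub(lambda m: resolve(m.group(1)), wikitext)

    example = "It grows to {{#property:P1234}} in length."
    print(substitute_properties(example, resolve=lambda pid: "20 cm"))
    # -> It grows to 20 cm in length.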

Overall, I think this is an area where experimentation is merited, as
it could not only expand information in languages that are
underrepresented on the web, but also act as a force multiplier for
new editor entry points. A proof-of-concept in a limited context also
seems very doable.

Erik

[1] https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/Wikidata_Phase_2#Use_of_Wikidata_in_article_text
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation


