You _really_ need to exclude markup and include only body text when
measuring stubs. It's not uncommon for mass-produced articles with only
one or two sentences of text to approach 1K characters once you include
maintenance templates, content templates, categories, infoboxes, references,
etc.
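To illustrate the gap between raw length and body-text length, here is a minimal sketch that strips the most common wikitext markup before counting characters. The function name `prose_length` and the regular expressions are my own assumptions, not what DYKcheck or any other tool does; the namespace prefixes are English-only, and a real wikitext parser would be more reliable.

```python
import re

def prose_length(wikitext: str) -> int:
    """Rough count of body-text characters in a wikitext string (a heuristic)."""
    text = wikitext
    # Drop HTML comments, then self-closing refs, then <ref>...</ref> footnotes.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove templates innermost-first so nested {{...}} are handled.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # Drop category and file links entirely (English namespace names only).
    text = re.sub(r"\[\[(?:Category|File|Image):[^\]]*\]\]", "", text)
    # Keep only the visible label of ordinary links.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Strip bold/italic quote marks and collapse whitespace.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text)

# Made-up article text, for illustration only.
sample = (
    "{{Infobox settlement|name=Foo|pop={{formatnum:12}}}}\n"
    "'''Foo''' is a [[village|small village]] in [[Bar]]."
    "<ref>{{cite web|url=http://example.org}}</ref>\n"
    "{{Foo-stub}}\n"
    "[[Category:Villages]]"
)
print(len(sample), prose_length(sample))
```

On this made-up sample the body text is a small fraction of the raw length, which is the effect described above.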
cheers
stuart
--
...let us be heard from red core to black sky
On Wed, Sep 21, 2016 at 5:01 AM, Morten Wang <nettrom(a)gmail.com> wrote:
I don't know of a clean, language-independent way
of grabbing all stubs.
Stuart's suggestion is quite sensible, at least for English Wikipedia. When
I last checked a few years ago, the mean length of an English-language stub
(on a log scale) was around 1 kB (including all markup), and stubs were
considerably smaller than articles in any other class.
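For anyone wanting to reproduce that kind of figure: I read "mean on a log scale" as the mean of the log lengths, i.e. the geometric mean. That reading is an assumption, and the function name and numbers below are made up for illustration.

```python
import math

def log_scale_mean(lengths):
    """Geometric mean: exponentiate the average of the log lengths."""
    logs = [math.log(n) for n in lengths if n > 0]  # skip empty pages
    if not logs:
        raise ValueError("need at least one positive length")
    return math.exp(sum(logs) / len(logs))

# Made-up byte lengths, not measured data.
sample_lengths = [500, 1000, 2000]
print(round(log_scale_mean(sample_lengths)))
```

Article lengths are heavily right-skewed, so this is less sensitive to a few very long pages than the arithmetic mean.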
I'd also see if the category system allows for some straightforward
retrieval. English has
https://en.wikipedia.org/wiki/Category:Stub_categories and
https://en.wikipedia.org/wiki/Category:Stubs, with quite a lot of links to
other languages, which could be a good starting point. For some of the
research we've done on quality, exploiting regularities in the category
system using database access (in other words, LIKE queries) is a quick way
to grab most articles.
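A sketch of what such a LIKE query could look like, run here against an in-memory SQLite database so it is self-contained. The table and column names are modelled loosely on MediaWiki's `page` and `categorylinks` tables; the toy rows, the helper names, and the `%stubs` pattern (English stub categories tend to end in "stubs") are assumptions, and other languages would need their own pattern.

```python
import sqlite3

def stub_titles(conn, pattern="%stubs"):
    """Titles of main-namespace pages in any category matching the pattern."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.page_title
        FROM page AS p
        JOIN categorylinks AS cl ON cl.cl_from = p.page_id
        WHERE p.page_namespace = 0 AND cl.cl_to LIKE ?
        ORDER BY p.page_title
        """,
        (pattern,),
    )
    return [title for (title,) in rows]

def demo_connection():
    """Build a toy database; the rows are invented for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
        INSERT INTO page VALUES (1, 0, 'Foo'), (2, 0, 'Bar'), (3, 0, 'Qux');
        INSERT INTO categorylinks VALUES
            (1, 'Norway_geography_stubs'),
            (1, 'Villages_in_Norway'),
            (2, 'Villages_in_Norway'),
            (3, 'Physics_stubs');
        """
    )
    return conn

print(stub_titles(demo_connection()))
```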
A combination of both approaches might be a good way. If you're looking
for even more thorough classification, grabbing a set and training a
classifier might be the way to go.
Cheers,
Morten
On 20 September 2016 at 02:40, Stuart A. Yeates <syeates(a)gmail.com> wrote:
en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful
cutoff. There is weaponised JavaScript to measure that at en:WP:Did you
know/DYKcheck.
It probably doesn't translate to CJK languages, which have radically
different information content per character.
cheers
stuart
--
...let us be heard from red core to black sky
On Tue, Sep 20, 2016 at 9:26 PM, Robert West <west(a)cs.stanford.edu> wrote:
Hi everyone,
Does anyone know if there's a straightforward (ideally
language-independent) way of identifying stub articles in Wikipedia?
Whatever works is ok, whether it's publicly available data or data
accessible only on the WMF cluster.
I've found lists for various languages (e.g., Italian
<https://it.wikipedia.org/wiki/Categoria:Stub> or English
<https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the
lists are in different formats, so separate code is required for each
language, which doesn't scale.
I guess in the worst case, I'll have to grep for the respective stub
templates in the respective wikitext dumps, but even this requires knowing,
for each language, what the respective template is. So if anyone could point
me to a list of stub templates in different languages, that would also be
appreciated.
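The grep fallback might be sketched as below, with one regular expression per language. The mapping and the function name are hypothetical: the English pattern assumes stub templates are named "stub" or end in "-stub", and the Italian pattern assumes stubs are tagged with an {{S|...}} template. Both should be verified against the wikis, and every other language would need its own entry, which is exactly the scaling problem.

```python
import re

# Hypothetical per-language patterns; verify each against the wiki before use.
STUB_TEMPLATE_PATTERNS = {
    # {{stub}} or {{Something-stub}}, optionally with parameters.
    "en": re.compile(r"\{\{\s*(?:[^{}|]*-)?stub\s*(?:\|[^{}]*)?\}\}", re.IGNORECASE),
    # {{S}} or {{S|topic}} (assumed Italian convention).
    "it": re.compile(r"\{\{\s*[Ss]\s*(?:\|[^{}]*)?\}\}"),
}

def has_stub_template(wikitext, lang):
    """True if the wikitext contains a stub template for the given language."""
    pattern = STUB_TEMPLATE_PATTERNS.get(lang)
    if pattern is None:
        raise KeyError(f"no stub template pattern known for {lang!r}")
    return pattern.search(wikitext) is not None

print(has_stub_template("Foo is a village. {{Norway-geo-stub}}", "en"))
```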
Thanks!
Bob
--
Up for a little language game? --
http://www.unfun.me
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l