You _really_ need to exclude markup and include only body text when measuring stubs. It's not uncommon for mass-produced articles with only one or two sentences of text to approach 1K characters once you include maintenance templates, content templates, categories, infoboxes, references, etc.
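
To make that concrete, here's a quick, untested sketch of the kind of comparison I mean, assuming the mwparserfromhell Python library (the article text is invented):

    import mwparserfromhell  # pip install mwparserfromhell

    wikitext = ("{{Infobox settlement|name=Example|country=Examplestan}}\n"
                "'''Example''' is a village in [[Examplestan]].\n"
                "{{Examplestan-geo-stub}}\n"
                "[[Category:Villages in Examplestan]]")

    # strip_code() drops templates and markup but keeps link text, so
    # Category: lines may still need to be filtered out separately.
    body = mwparserfromhell.parse(wikitext).strip_code().strip()
    print(len(wikitext), len(body))  # raw count is dominated by template markup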

cheers
stuart

--
...let us be heard from red core to black sky

On Wed, Sep 21, 2016 at 5:01 AM, Morten Wang <nettrom@gmail.com> wrote:
I don't know of a clean, language-independent way of grabbing all stubs. Stuart's suggestion is quite sensible, at least for English Wikipedia. When I last checked a few years ago, the mean length of an English-language stub (on a log scale) was around 1 kB (including all markup), and stubs are much smaller than any other class.

I'd also see if the category system allows for some straightforward retrieval. English has https://en.wikipedia.org/wiki/Category:Stub_categories and https://en.wikipedia.org/wiki/Category:Stubs, with quite a lot of links to other languages, which could be a good starting point. For some of the research we've done on quality, exploiting regularities in the category system through database access (in other words, LIKE queries) is a quick way to grab most articles.
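
For instance, a rough, untested sketch of such a LIKE query, run here from Python against a database replica (the host is a placeholder, but page and categorylinks are the standard MediaWiki tables):

    import os
    import pymysql

    conn = pymysql.connect(host="enwiki.analytics.db.example",  # placeholder host
                           database="enwiki_p",
                           read_default_file=os.path.expanduser("~/.my.cnf"))
    query = """
        SELECT DISTINCT p.page_id, p.page_title
        FROM page p
        JOIN categorylinks c ON c.cl_from = p.page_id
        WHERE p.page_namespace = 0      -- articles only
          AND c.cl_to LIKE '%stubs'     -- e.g. 'Asia_stubs', 'France_geography_stubs'
    """
    with conn.cursor() as cur:
        cur.execute(query)
        stub_titles = {title for _, title in cur.fetchall()}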

A combination of both approaches might work well. If you're looking for an even more thorough classification, grabbing a labeled set and training a classifier might be the way to go.
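
Not the place for a full recipe, but a toy sketch of what I mean (scikit-learn, with made-up numbers; in practice the labels would come from known stub templates or manual tagging):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def features(raw_len, body_len):
        # Hypothetical feature set: log body length and markup-to-body ratio.
        return [np.log1p(body_len), raw_len / max(body_len, 1)]

    # Toy data: (raw wikitext length, body-text length, is_stub label).
    sample = [(900, 150, 1), (1200, 250, 1), (8000, 6000, 0), (15000, 12000, 0)]
    X = np.array([features(r, b) for r, b, _ in sample])
    y = np.array([s for _, _, s in sample])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict(np.array([features(1000, 180)])))  # likely [1], i.e. stub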


Cheers,
Morten


On 20 September 2016 at 02:40, Stuart A. Yeates <syeates@gmail.com> wrote:
en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful cutoff. There is weaponised JavaScript to measure that at en:WP:Did you know/DYKcheck.

Probably doesn't translate to CJK languages, which have radically different information content per character.
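
In Python terms the cutoff is roughly this (DYKcheck's exact counting rules differ a bit, so treat it only as an approximation; mwparserfromhell assumed):

    import mwparserfromhell

    def meets_dyk_cutoff(wikitext, threshold=1500):
        # Strip templates and markup first, then count characters of prose.
        prose = mwparserfromhell.parse(wikitext).strip_code().strip()
        return len(prose) >= threshold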

cheers
stuart

--
...let us be heard from red core to black sky

On Tue, Sep 20, 2016 at 9:26 PM, Robert West <west@cs.stanford.edu> wrote:
Hi everyone,

Does anyone know if there's a straightforward (ideally language-independent) way of identifying stub articles in Wikipedia?

Whatever works is ok, whether it's publicly available data or data accessible only on the WMF cluster.

I've found lists for various languages (e.g., Italian or English), but the lists are in different formats, so separate code is required for each language, which doesn't scale.

I guess in the worst case I'll have to grep for the respective stub templates in the respective wikitext dumps, but even that requires knowing, for each language, what the respective template is. So if anyone could point me to a list of stub templates in different languages, that would also be appreciated.
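
For what it's worth, that fallback might look roughly like this for English, where stub templates conventionally end in "-stub" (untested sketch; the regex and dump filename would need adjusting per language):

    import bz2
    import re
    import mwxml  # pip install mwxml

    STUB_RE = re.compile(r"\{\{[^{}]*-stub\s*\}\}", re.IGNORECASE)

    dump = mwxml.Dump.from_file(
        bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt"))
    stub_titles = [page.title
                   for page in dump
                   for rev in page
                   if rev.text and STUB_RE.search(rev.text)]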

Thanks!
Bob

--
Up for a little language game? -- http://www.unfun.me

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l