Re: [Wiki-research-l] Identifying Wikipedia stubs in various languages

20 Sep 2016

Hi all,

I'd strongly caution against using the stub categories without *also*
doing some kind of filtering on size. There's a real problem with
"stub lag" - articles get tagged, incrementally improve, no-one thinks
they've done enough to justify removing the tag (or notices the tag is
there, or thinks they're allowed to remove it)... and you end up with
a lot of multi-section pages with a good hundred words of text still
labelled "stub"....

(Talkpage ratings are even worse for this, but that's another issue.)

Andrew.

On 20 September 2016 at 18:01, Morten Wang <nettrom(a)gmail.com> wrote:
...
 I don't know of a clean, language-independent way of grabbing all stubs.
 Stuart's suggestion is quite sensible, at least for English Wikipedia. When
 I last checked a few years ago, the mean length of an English-language stub
 (on a log scale) was around 1 kB (including all markup), and stubs were
 considerably smaller than articles in any other class.
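
 A minimal sketch of what a mean "on a log scale" amounts to: average the
 logarithms of the page lengths and exponentiate, which is the geometric
 mean. The function name and the sample sizes are hypothetical and are not
 taken from the measurements mentioned above.

```python
import math

def log_scale_mean(lengths):
    """Mean taken on a log scale, i.e. the geometric mean of the lengths."""
    # log() is undefined for zero-length pages, so they are skipped here.
    positive = [n for n in lengths if n > 0]
    if not positive:
        raise ValueError("need at least one positive length")
    return math.exp(sum(math.log(n) for n in positive) / len(positive))

# Hypothetical page sizes in bytes (including markup).
sizes = [500, 1000, 2000, 1000]
print(round(log_scale_mean(sizes)))
```

 Compared with an arithmetic mean, this is less sensitive to a handful of
 very long pages, which is presumably why a log scale suits page lengths.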

 I'd also see if the category system allows for some straightforward
 retrieval. English has
 https://en.wikipedia.org/wiki/Category:Stub_categories and
 https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to
 other languages, which could be a good starting point. For some of the
 research we've done on quality, exploiting regularities in the category
 system using database access (in other words, LIKE queries) is a quick way
 to grab most articles.
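
 As a rough illustration of the LIKE-query idea, the sketch below runs
 against a toy in-memory SQLite table. The table and column names
 (categorylinks, cl_from, cl_to) mimic the MediaWiki schema as I remember it,
 but the real schema has changed over time, so treat them as assumptions; the
 category names and page ids are invented.

```python
import sqlite3

# Toy stand-in for a category-membership table (page id -> category name).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [
        (1, "Physics_stubs"),
        (2, "French_history_stubs"),
        (3, "Physics"),
        (4, "Physics_stubs"),
        (4, "Physicist_stubs"),
    ],
)

def pages_in_matching_categories(conn, pattern):
    """Return sorted page ids in any category whose name matches the LIKE pattern."""
    rows = conn.execute(
        "SELECT DISTINCT cl_from FROM categorylinks "
        "WHERE cl_to LIKE ? ESCAPE '\\' ORDER BY cl_from",
        (pattern,),
    )
    return [row[0] for row in rows]

# "_" is a single-character wildcard in LIKE, so it is escaped to match literally.
stub_page_ids = pages_in_matching_categories(conn, r"%\_stubs")
print(stub_page_ids)
```

 The suffix pattern is specific to English category naming; other languages
 would need their own patterns.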

 A combination of both approaches might be a good way. If you're looking for
 even more thorough classification, grabbing a set and training a classifier
 might be the way to go.
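
 One way to sketch the combined approach is to keep only pages that are both
 tagged as stubs and still short. The Page fields and the 1500-byte threshold
 below are hypothetical choices for illustration, not values anyone in this
 thread has validated.

```python
from dataclasses import dataclass

@dataclass
class Page:
    title: str
    size_bytes: int     # page length including markup
    tagged_stub: bool   # True if the page sits in a stub category

def filter_stubs(pages, max_bytes=1500):
    """Keep titles of pages that are tagged as stubs and are also still short."""
    return [p.title for p in pages if p.tagged_stub and p.size_bytes <= max_bytes]

pages = [
    Page("Tiny tagged article", 800, True),
    Page("Improved but still tagged", 9000, True),  # tagged, but too long to count
    Page("Short untagged article", 700, False),
]
print(filter_stubs(pages))
```

 Whether short untagged pages should also count as stubs is a separate
 decision; this sketch deliberately excludes them.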

 Cheers,
 Morten

 On 20 September 2016 at 02:40, Stuart A. Yeates <syeates(a)gmail.com> wrote:

 en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful
 cutoff. There is weaponised javascript to measure that at en:WP:Did you
 know/DYKcheck

 Probably doesn't translate to CJK languages which have radically different
 information content per character.
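
 A small sketch of a character-count cutoff, and of why it may not carry
 over across scripts: the same check gives very different results depending
 on whether characters or UTF-8 bytes are counted. It assumes the prose has
 already been extracted as plain text; it does not reproduce what DYKcheck
 does, and the function names are hypothetical.

```python
def prose_stats(prose):
    """Return the length of the prose in characters and in UTF-8 bytes."""
    return {"chars": len(prose), "utf8_bytes": len(prose.encode("utf-8"))}

def meets_cutoff(prose, min_chars=1500):
    """True if the prose has at least min_chars characters (code points)."""
    return len(prose) >= min_chars

print(prose_stats("Wikipedia"))  # ASCII: one byte per character
print(prose_stats("日本語"))      # CJK: three UTF-8 bytes per character
```

 A per-language threshold, or a measure other than raw characters, would
 probably be needed before comparing languages directly.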

 cheers
 stuart

 --
 ...let us be heard from red core to black sky

 On Tue, Sep 20, 2016 at 9:26 PM, Robert West <west(a)cs.stanford.edu> wrote:

 Hi everyone,

 Does anyone know if there's a straightforward (ideally
 language-independent) way of identifying stub articles in Wikipedia?

 Whatever works is ok, whether it's publicly available data or data
 accessible only on the WMF cluster.

 I've found lists for various languages (e.g., Italian or English), but
 the lists are in different formats, so separate code is required for each
 language, which doesn't scale.

 I guess in the worst case, I'll have to grep for the respective stub
 templates in the respective wikitext dumps, but even this requires knowing
 for each language what the respective template is. So if anyone could point
 me to a list of stub templates in different languages, that would also be
 appreciated.
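
 The worst-case approach could look roughly like the sketch below: one
 regular expression per language, applied to each page's wikitext. Only the
 English pattern is filled in, based on the convention that English stub
 templates are named "stub" or end in "-stub"; entries for other languages
 are exactly the missing list asked about above, so the mapping is left
 incomplete on purpose. All names here are hypothetical.

```python
import re

# Per-language patterns for stub templates. Only "en" is filled in; other
# languages would have to be added by hand once their templates are known.
STUB_TEMPLATE_PATTERNS = {
    "en": re.compile(
        r"\{\{\s*(?:[^{}|]*-)?stub\s*(?:\|[^{}]*)?\}\}",
        re.IGNORECASE,
    ),
}

def has_stub_template(wikitext, lang):
    """True if the wikitext contains a stub template known for this language."""
    pattern = STUB_TEMPLATE_PATTERNS.get(lang)
    if pattern is None:
        raise KeyError(f"no stub template pattern known for language {lang!r}")
    return pattern.search(wikitext) is not None

print(has_stub_template("Some text.\n{{physics-stub}}", "en"))
print(has_stub_template("{{Infobox person|name=X}}", "en"))
```

 This ignores redirects between templates and templates that transclude a
 stub notice indirectly, so it would undercount in practice.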

 Thanks!
 Bob

 --
 Up for a little language game? -- http://www.unfun.me

 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

-- 
- Andrew Gray
  andrew.gray(a)dunelm.org.uk
