Hi everyone,
Does anyone know if there's a straightforward (ideally language-independent) way of identifying stub articles in Wikipedia?
Whatever works is fine, whether it's publicly available data or data accessible only on the WMF cluster.
I've found lists for various languages (e.g., Italian https://it.wikipedia.org/wiki/Categoria:Stub or English https://en.wikipedia.org/wiki/Category:All_stub_articles), but the lists are in different formats, so separate code is required for each language, which doesn't scale.
I guess in the worst case, I'll have to grep for the stub templates in the wikitext dumps, but even this requires knowing, for each language, what the respective template is. So if anyone could point me to a list of stub templates in different languages, that would also be appreciated.
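For what it's worth, that worst-case grep approach might look roughly like this in Python. The per-language template names in the table below are illustrative guesses, not an authoritative list:

```python
import re

# Hypothetical per-language stub template names -- illustrative only;
# the real names would have to be collected per wiki.
STUB_TEMPLATES = {
    "en": ["stub"],
    "it": ["stub", "S"],
}

def has_stub_template(wikitext, lang):
    """Return True if the wikitext transcludes one of the known stub
    templates for this language (case-insensitive, crude match)."""
    for name in STUB_TEMPLATES.get(lang, []):
        # Match {{name}} or {{name|...}} at a template boundary.
        if re.search(r"\{\{\s*%s\s*(\||\}\})" % re.escape(name),
                     wikitext, re.IGNORECASE):
            return True
    return False
```

Run over a dump, this only works as well as the template table you feed it, which is exactly the problem described above.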
Thanks! Bob
en:WP:DYK has a measure of 1,500+ characters of prose, which is a useful cutoff. There is weaponised JavaScript to measure that at en:WP:Did you know/DYKcheck.
It probably doesn't translate to CJK languages, which have radically different information content per character.
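For dumps rather than rendered pages, a crude approximation of that prose measure can be sketched in Python. This is only a rough stand-in for DYKcheck, which works on the rendered page; the stripping rules here are simplified assumptions:

```python
import re

def prose_length(wikitext):
    """Very crude approximation of 'characters of prose': strip
    templates, refs, link markup, and headings before counting."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # innermost templates
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # keep link label
    text = re.sub(r"==+[^=]+==+", "", text)                        # headings
    return len(text.strip())

def is_stub_by_dyk_cutoff(wikitext, cutoff=1500):
    """Apply the en:WP:DYK-style 1,500-character rule of thumb."""
    return prose_length(wikitext) < cutoff
```

As Stuart notes, the 1,500 figure itself would need per-script calibration for CJK and similar languages.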
cheers stuart
-- ...let us be heard from red core to black sky
On Tue, Sep 20, 2016 at 9:26 PM, Robert West west@cs.stanford.edu wrote:
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
I don't know of a clean, language-independent way of grabbing all stubs. Stuart's suggestion is quite sensible, at least for English Wikipedia. When I last checked a few years ago, the mean length of an English-language stub (on a log scale) was around 1 kB (including all markup), and stubs were much smaller than articles in any other class.
I'd also see if the category system allows for some straightforward retrieval. English has https://en.wikipedia.org/wiki/Category:Stub_categories and https://en.wikipedia.org/wiki/Category:Stubs with quite a lot of links to other languages, which could be a good starting point. For some of the research we've done on quality, exploiting regularities in the category system using database access (in other words, LIKE queries) is a quick way to grab most articles.
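As a sketch of the LIKE-query idea, here it is against an in-memory SQLite stand-in for the categorylinks table on the replicas (the real table has more columns, and the "_stubs" naming regularity is an English-Wikipedia convention):

```python
import sqlite3

# Miniature stand-in for MediaWiki's categorylinks table
# (cl_from = page id, cl_to = category name).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [
        (1, "Physics_stubs"),
        (2, "France_geography_stubs"),
        (3, "Living_people"),
    ],
)

# Exploit the naming regularity: English stub categories end in "_stubs".
stub_pages = [row[0] for row in conn.execute(
    "SELECT DISTINCT cl_from FROM categorylinks WHERE cl_to LIKE '%stubs'"
)]
```

Each language would need its own LIKE pattern, but that's one short string per wiki rather than a parser per wiki.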
A combination of both approaches might be a good way to start. If you're looking for an even more thorough classification, grabbing a labelled set and training a classifier might be the way to go.
Cheers, Morten
On 20 September 2016 at 02:40, Stuart A. Yeates syeates@gmail.com wrote:
You _really_ need to exclude markup and include only body text when measuring stubs. It's not uncommon for mass-produced articles with only one or two sentences of text to approach 1K characters once you include maintenance templates, content templates, categories, infoboxes, references, etc.
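To illustrate, here's a toy one-sentence "article" (all names invented) padded out with typical maintenance markup; even crude stripping shows the markup dominating the raw byte count:

```python
import re

# Invented one-sentence article with infobox, maintenance tag,
# reference, and category boilerplate.
article = (
    "{{Infobox settlement|name=Exampleville|population=312}}\n"
    "{{Unreferenced|date=September 2016}}\n"
    "'''Exampleville''' is a village in Examplestan."
    "<ref>{{cite web|url=http://example.org|title=Example}}</ref>\n"
    "==References==\n{{Reflist}}\n"
    "[[Category:Villages in Examplestan]]\n"
    "[[Category:Examplestan geography stubs]]\n"
)

# Crude stripping of templates, refs, categories, and headings.
body = re.sub(
    r"\{\{[^{}]*\}\}|\[\[Category:[^\]]*\]\]|<ref[^>]*>.*?</ref>|==+[^=]+==+",
    "", article, flags=re.DOTALL)
markup_share = 1 - len(body.strip()) / len(article)
```

Here the single sentence of prose is under 50 characters while markup accounts for well over three quarters of the raw length, which is exactly why raw page size is misleading.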
cheers stuart
-- ...let us be heard from red core to black sky
On Wed, Sep 21, 2016 at 5:01 AM, Morten Wang nettrom@gmail.com wrote:
Hi all,
I'd strongly caution against using the stub categories without *also* doing some kind of filtering on size. There's a real problem with "stub lag": articles get tagged, incrementally improve, and no-one thinks they've done enough to justify removing the tag (or notices the tag is there, or thinks they're allowed to remove it)... and you end up with a lot of multi-section pages with a good few hundred words of text still labelled "stub".
(Talkpage ratings are even worse for this, but that's another issue.)
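A minimal sketch of that combined filter, with a made-up cutoff and sample data (the function names and numbers are illustrative, not from any existing tool):

```python
# Treat a page as a stub only if it is BOTH tagged as a stub
# AND still short, to filter out "stub lag" survivors.
PROSE_CUTOFF = 1500  # characters, borrowed from the en:WP:DYK rule of thumb

def confirmed_stubs(tagged_titles, prose_lengths, cutoff=PROSE_CUTOFF):
    """tagged_titles: titles found via stub categories/templates;
    prose_lengths: mapping of title -> measured prose length."""
    return {t for t in tagged_titles if prose_lengths.get(t, 0) < cutoff}

# A tagged page that has since grown past the cutoff is filtered out.
tagged = {"Short article", "Grown article"}
lengths = {"Short article": 420, "Grown article": 5200}
```

The choice of cutoff matters less than having one at all; anything that drops the multi-section "stubs" helps.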
Andrew.
On 20 September 2016 at 18:01, Morten Wang nettrom@gmail.com wrote: