[WikiEN-l] What proportion of articles are stubs?
Carl (CBM)
cbm.wikipedia at gmail.com
Mon Nov 29 21:37:45 UTC 2010
On Mon, Nov 29, 2010 at 4:22 PM, Carcharoth <carcharothwp at googlemail.com> wrote:
> Is it possible to have a breakdown of the high-end of that? i.e.
> Number of articles from 10,000 bytes upwards in steps of 5,000 bytes?
Sure, the table is below. The number shown under "len" is the lower
bound, in bytes, of each 5,000-byte length range.
> Also, have you
> looked at the byte size and word count of some actual articles, to see
> how accurate your "4.5-bytes-per-word" estimate is?
No, that was just a back-of-the-napkin calculation, based on a Google
search. Take it with a grain of salt.
- Carl
+--------+----------------+
|    len |          count |
+--------+----------------+
|  10000 |         167362 |
|  15000 |          73821 |
|  20000 |          40156 |
|  25000 |          25163 |
|  30000 |          16405 |
|  35000 |          11474 |
|  40000 |           8383 |
|  45000 |           6169 |
|  50000 |           4754 |
|  55000 |           3672 |
|  60000 |           2895 |
|  65000 |           2223 |
|  70000 |           1759 |
|  75000 |           1508 |
|  80000 |           1235 |
|  85000 |            960 |
|  90000 |            809 |
|  95000 |            669 |
| 100000 |            531 |
| 105000 |            450 |
| 110000 |            345 |
| 115000 |            268 |
| 120000 |            270 |
| 125000 |            211 |
| 130000 |            210 |
| 135000 |            143 |
| 140000 |            141 |
+--------+----------------+
There are 765 articles longer than 140,000 bytes, almost all of which
seem to be lists.
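
For anyone who wants to play with these numbers, here is a rough Python
sketch. It copies the counts from the table above, sums the rows at or
above a given length, and converts byte lengths to word counts using the
4.5-bytes-per-word napkin estimate from this thread. The names COUNTS,
articles_at_least and approx_words are made up for illustration, and the
765 very long articles are left out because the table does not say which
range they fall in.

```python
# Counts copied from the table above:
# lower bound of each 5,000-byte range -> number of articles.
COUNTS = {
    10000: 167362, 15000: 73821, 20000: 40156, 25000: 25163,
    30000: 16405, 35000: 11474, 40000: 8383, 45000: 6169,
    50000: 4754, 55000: 3672, 60000: 2895, 65000: 2223,
    70000: 1759, 75000: 1508, 80000: 1235, 85000: 960,
    90000: 809, 95000: 669, 100000: 531, 105000: 450,
    110000: 345, 115000: 268, 120000: 270, 125000: 211,
    130000: 210, 135000: 143, 140000: 141,
}

# Assumption: the napkin estimate from this thread, not a measured value.
BYTES_PER_WORD = 4.5


def articles_at_least(min_len):
    """Sum the table rows whose lower bound is at least min_len bytes."""
    return sum(n for low, n in COUNTS.items() if low >= min_len)


def approx_words(nbytes, bytes_per_word=BYTES_PER_WORD):
    """Very rough word count for an article of nbytes bytes."""
    return nbytes / bytes_per_word


if __name__ == "__main__":
    for low in (10000, 50000, 100000):
        print(f"{low:>6} bytes ~ {approx_words(low):,.0f} words; "
              f"{articles_at_least(low):,} articles in table rows from {low} up")
```

The word figures are only as good as the bytes-per-word guess, so treat
them as order-of-magnitude numbers.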