Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

15 Mar 2014

Aaron,

this seems somewhat redundant, as I already agreed that there is a high overall correlation,
and you posted this (almost) identical analysis 7 months ago. I don't know if you
missed my later emails on the topic, but I already wrote that this "mistake", as
you repeatedly put it, was a result of the selective sampling between 5800 and 6000 bytes.
Hence, as I already said, my initial observations cannot be transferred to the general
population of articles.

Not surprisingly, and congruent with Aaron's results, I also get a high linear correlation of
0.96 (random sample of 5000 articles) outside the 5800-6000 sample, even if I filter out
disambiguation articles.

But, as I also explained, there are some indications that this correlation is not as
strong for smaller articles.

I split the random 5000-article sample I posted last time at the median (3709 bytes)
into two parts of 2500 articles each.
For the "higher byte size" part (>3709 bytes) the correlation is 0.964.
For the "lesser byte size" part (<3710 bytes) the correlation is only 0.295.
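To make the split above concrete, here is a minimal Python sketch of the procedure: split (byte size, readable characters) pairs at the median byte size and compute Pearson's r in each half. It is not the script used for the numbers above. The data is synthetic and invented purely for illustration, and `pearson` and `split_at_median` are hypothetical helper names. Under these assumptions the lower half shows a weaker correlation simply because its byte sizes span a narrower range relative to the noise; whether that explains the real sample is a separate question.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def split_at_median(pairs):
    """Split (byte_size, readable_chars) pairs at the median byte size."""
    median = statistics.median(b for b, _ in pairs)
    lower = [p for p in pairs if p[0] <= median]
    upper = [p for p in pairs if p[0] > median]
    return median, lower, upper


# Synthetic data (an assumption, NOT the Wikipedia sample): readable length
# grows with byte size, plus noise whose spread does not depend on byte size.
rng = random.Random(42)
pairs = []
for _ in range(5000):
    size = rng.lognormvariate(8.2, 1.0)
    readable = max(0.0, 0.8 * size + rng.gauss(0, 1500))  # no negative lengths
    pairs.append((size, readable))

median, lower, upper = split_at_median(pairs)
r_all = pearson(*zip(*pairs))
r_lower = pearson(*zip(*lower))
r_upper = pearson(*zip(*upper))
print(f"median={median:.0f} all={r_all:.3f} lower={r_lower:.3f} upper={r_upper:.3f}")
```

The same two helpers could be applied to cut-offs other than the median (for example 5000, 3000, or 1000 bytes) by filtering `pairs` before calling `pearson`.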

You will of course not see that in your example if you just take all data (of all article
sizes) and draw a straight regression line through them. The "blob" at the
bottom left might need some further investigation. Maybe you could look only at articles
under 5000, 3000, and 1000 bytes and see whether the correlation changes. My guess is that it
will be less strong.

BTW: did you try to fit nonlinear models?
I did not, and one reason for the bad fit in the smaller articles could also be that
there is a strong correlation, just not a linear one.
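One quick way to probe for a strong but non-linear (monotone) relationship is to compare Pearson's r with Spearman's rank correlation, which is Pearson's r computed on ranks. The sketch below is an illustration only: the data is a made-up exponential curve, not article data, and `pearson`, `ranks`, and `spearman` are hypothetical helper names.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson's r on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up monotone but strongly non-linear relationship (an assumption).
xs = list(range(1, 51))
ys = [math.exp(x / 5) for x in xs]

r_linear = pearson(xs, ys)
r_rank = spearman(xs, ys)
print(f"pearson={r_linear:.3f} spearman={r_rank:.3f}")
```

If the rank correlation for the small articles turned out to be much higher than the linear one, that would point towards a non-linear relationship; if both are low, the relationship is simply weak in that range.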

Best,

Fabian

On 04.08.2013, at 11:43, Aaron Halfaker
<aaron.halfaker@gmail.com> wrote:

I just replicated this analysis.  I think you might have made some mistakes.

I took a random sample of non-redirect articles from English Wikipedia and compared the
byte_length (from database) to the content_length (from API, tags and comments stripped).

I get a Pearson correlation coefficient of 0.9514766.

See the attached scatter plot including a linear regression line.  See also the regression
output below.

Call:
lm(formula = page_len ~ content_length, data = pages)

Residuals:
   Min     1Q Median     3Q    Max
-38263   -419     82    592  37605

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    -97.40412   72.46523  -1.344    0.179
content_length   1.14991    0.00832 138.210   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2722 on 1998 degrees of freedom
Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053
F-statistic: 1.91e+04 on 1 and 1998 DF,  p-value: < 2.2e-16
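As a sanity check on the output above: in a simple regression with a single predictor, R-squared equals the squared Pearson coefficient, the t value is the estimate divided by its standard error, the F-statistic equals the squared t value of the slope, and the residual degrees of freedom are n - 2. A short Python sketch using only the figures quoted in this message (the variable names are illustrative):

```python
# Figures copied from the message above.
r = 0.9514766            # Pearson correlation coefficient
estimate = 1.14991       # slope for content_length
std_error = 0.00832      # standard error of the slope (rounded in the output)
residual_df = 1998       # residual degrees of freedom

r_squared = r ** 2                # should match "Multiple R-squared"
t_value = estimate / std_error    # should match the reported t value
f_stat = t_value ** 2             # should match the F-statistic
n_pages = residual_df + 2         # intercept and slope use two df

print(round(r_squared, 4), round(t_value, 2), f"{f_stat:.3g}", n_pages)
```

Small deviations from the reported values are expected, because the standard error in the printed output is itself rounded.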

On Fri, Aug 2, 2013 at 12:24 PM, Floeck, Fabian (AIFB)
<fabian.floeck@kit.edu> wrote:
Hi,
to whoever is interested in this (and I hope I didn't just repeat someone else's
experiments on this):

I wanted to know whether the "length" of an article in terms of how much
readable material (excluding pictures) is presented to the reader in the front-end is
correlated with the byte size of the Wikisyntax, which can be obtained from the DB or API,
as people often define the "length" of an article by its length in bytes.

TL;DR: Turns out size in bytes is a really, really bad indicator for the actual, readable
content of a Wikipedia article, even worse than I thought.

We "curl"ed the front-end HTML of all articles of the English Wikipedia (ns=0,
no disambiguation, no redirects) between 5800 and 6000 bytes (as around 5900 bytes is the
total en.wiki average for these articles), which gave 41981 articles.
Results for size in characters (with whitespace) after stripping the HTML:
Min=95 Max=49441 Mean=4794.41 Std. Deviation=1712.748

The gap between Min and Max was especially interesting, but templates make it possible.
(See e.g. "Veer Teja Vidhya Mandir School", "Martin Callanan" --
although for the latter you could argue that expandable template listings are not really
main "reading" content.)

Effectively, the correlation of readable character size with byte size is 0.04 (i.e. none)
in the sample.

If someone already did this or a similar analysis, I'd appreciate pointers.

Best,

Fabian

--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods

Dipl.-Medwiss. Fabian Flöck
Research Associate

Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe

Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: fabian.floeck@kit.edu
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

[Attachment: bytes.content_length.scatter.png]

--
Dipl.-Medwiss. Fabian Flöck
Research Associate

Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods

Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe

Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: floeck@kit.edu
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
