Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

15 Mar 2014

Hi,
Aaron, I tend to agree with your conclusion, and personally have little
interest in the relationship between actual size and readable size.
But from a technical point of view, I guess you should plot your scatter plot
on a log-log scale and also calculate the correlation between the logarithms
of the variables.
The sizes are not normally distributed but log-normally [1], and linear
statistics on heavy-tailed distributions are usually spurious.

[1]
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/jour…
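
A minimal sketch of the log-log check suggested above, in Python with only the standard library. The data are synthetic (log-normal draws with made-up parameters), and the names `pearson`, `log_pearson`, `byte_len` and `content_len` are illustrative assumptions, not the datasets discussed in this thread.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_pearson(xs, ys):
    """Pearson correlation of log(x) and log(y); non-positive pairs are dropped."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x > 0 and y > 0]
    log_x = [math.log(x) for x, _ in pairs]
    log_y = [math.log(y) for _, y in pairs]
    return pearson(log_x, log_y)


# Synthetic stand-in data (assumption): log-normal byte sizes, and content
# lengths equal to the byte size times log-normal multiplicative noise.
rng = random.Random(0)
byte_len = [rng.lognormvariate(8.0, 1.2) for _ in range(2000)]
content_len = [b * rng.lognormvariate(-0.15, 0.4) for b in byte_len]

print("linear r: ", round(pearson(byte_len, content_len), 3))
print("log-log r:", round(log_pearson(byte_len, content_len), 3))
```

On real data one would replace the two synthetic lists with the measured byte and content lengths; a scatter plot on log-scaled axes shows the same relationship visually.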

Take care,

Taha

On 15 Mar 2014 18:21, "Aaron Halfaker" <aaron.halfaker(a)gmail.com> wrote:

...
  Hi Fabian,

 I think the primary reason that articles with smaller byte counts show
 less consistency is templates.  A lot of stubs and starts are
 created with a collection of templates that consume few bytes of wikitext,
 but balloon into lots of HTML/content.  Regardless, there doesn't seem to
 be much cause for concern, so I saw the issue as resolved.

 FWIW, I originally showed up in this conversation because I was skeptical
 of your initial conclusion: "size in bytes is a really, really bad
 indicator for the actual, readable content of a Wikipedia article".  Now
 that we've worked out the strong correlation between wikitext length and
 readable content length for nearly all articles, I have little interest in
 looking into the data further.

 -Aaron

 On Sat, Mar 15, 2014 at 12:47 PM, Floeck, Fabian (AIFB) <
 fabian.floeck(a)kit.edu> wrote:

  Aaron,

 this seems kind of redundant as I already agreed that there is an overall
 high correlation and you posted this (almost) identical analysis 7 months
 ago. I don't know if you missed my later emails on the topic, but I already
 wrote that this "mistake" as you repeatedly put it, was a result of the
 selective sampling between 5800 and 6000 bytes. Hence, as I already said,
 my initial observations cannot be transferred to the general population of
 articles.

 Not surprisingly, and congruent with Aaron's results, I also get a high
 linear correlation of 0.96 (random sample of 5000 articles) outside the
 5800-6000 sample, even if I filter out disambiguation articles.

 But, as I also explained, there seem to be some indications that in
 smaller articles this correlation is not as strong.

 I split the random 5000-article sample I posted last time at the
 median (3709 bytes) into two parts of 2500 articles each.
 For the "higher byte size" part (>3709 bytes) the correlation is 0.964.
 For the "lesser byte size" part (<3710 bytes) the correlation is only
 0.295.
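
The median split described above could be computed along these lines. This is only a sketch under stated assumptions: the function names `pearson` and `correlation_by_median_split` are hypothetical, and the numbers are a tiny toy dataset, not the 5000-article sample.

```python
import math
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_by_median_split(byte_sizes, content_sizes):
    """Split the pairs at the median byte size and correlate each half.

    Returns (median, r_lower, r_upper). Pairs with a byte size equal to the
    median go into the lower half.
    """
    median = statistics.median(byte_sizes)
    pairs = list(zip(byte_sizes, content_sizes))
    lower = [(b, c) for b, c in pairs if b <= median]
    upper = [(b, c) for b, c in pairs if b > median]
    r_lower = pearson([b for b, _ in lower], [c for _, c in lower])
    r_upper = pearson([b for b, _ in upper], [c for _, c in upper])
    return median, r_lower, r_upper


# Toy data (assumption): no linear trend in the small half, a perfect one
# in the large half.
byte_sizes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
content_sizes = [3, 1, 3, 1, 3, 6, 7, 8, 9, 10]

median, r_lower, r_upper = correlation_by_median_split(byte_sizes, content_sizes)
print(median, round(r_lower, 3), round(r_upper, 3))
```

Note that `pearson` divides by zero if either half has no variance, so a real analysis would need to guard against degenerate halves.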

 You will of course not see that in your example if you just take all data
 (of all article sizes) and draw a straight regression line through them.
 The "blob" on the bottom left might need some further investigation. Maybe
 you could look at only articles under 5000, 3000, 1000 bytes and see if the
 correlation changes somehow. My guess is it will be less strong.

 BTW: did you try to fit nonlinear models?
 I did not, and one reason for the bad fit in the lesser size articles
 could also be that there's a high correlation but not a linear one.

 Best,

 Fabian

 On 04.08.2013, at 11:43, Aaron Halfaker <aaron.halfaker(a)gmail.com> wrote:

 I just replicated this analysis.  I think you might have made some
 mistakes.

 I took a random sample of non-redirect articles from English Wikipedia
 and compared the byte_length (from database) to the content_length (from
 API, tags and comments stripped).

 I get a Pearson correlation coefficient of *0.9514766*.

 See the attached scatter plot including a linear regression line.  See
 also the regression output below.

 Call:
 lm(formula = page_len ~ content_length, data = pages)

 Residuals:
    Min     1Q Median     3Q    Max
 -38263   -419     82    592  37605

 Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
 (Intercept)    -97.40412   72.46523  -1.344    0.179
 content_length   1.14991    0.00832 138.210   <2e-16 ***
 ---
 Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Residual standard error: 2722 on 1998 degrees of freedom
 Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053
 F-statistic: 1.91e+04 on 1 and 1998 DF,  p-value: < 2.2e-16
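
As a consistency check on the output above: in a regression with a single predictor, R-squared equals the square of the Pearson correlation, and 0.9514766 squared is about 0.9053, which matches the reported Multiple R-squared. The sketch below illustrates that identity with an ordinary least-squares fit written in plain Python; `fit_line` and the toy numbers are assumptions for illustration and are unrelated to the `pages` data.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared, r), where r is the Pearson
    correlation between xs and ys.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / syy
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r_squared, r


# Toy data (assumption), roughly linear with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

slope, intercept, r_squared, r = fit_line(xs, ys)
print("R-squared:", r_squared)
print("r squared:", r ** 2)
print("reported r squared:", round(0.9514766 ** 2, 4))
```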

 On Fri, Aug 2, 2013 at 12:24 PM, Floeck, Fabian (AIFB) <
 fabian.floeck(a)kit.edu> wrote:

  Hi,
 to whoever is interested in this (and I hope I didn't just repeat
 someone else's experiments on this):

 I wanted to know whether the length of an article, in terms of how much
 readable material (excluding pictures) is presented to the reader in the
 front-end, is correlated with the byte size of the Wikisyntax, which can be
 obtained from the DB or API, since people often define the "length" of an
 article by its length in bytes.

 TL;DR: Turns out size in bytes is a really, really bad indicator for the
 actual, readable content of a Wikipedia article, even worse than I thought.

 We "curl"ed the front-end HTML of all articles of the English Wikipedia
 (ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as
 around 5900 bytes is the total en.wiki average for these articles), which
 gave 41981 articles.
 Results for size in characters (w/ whitespaces) after cleaning the HTML
 out:
 Min= 95 Max= 49441 Mean=4794.41 Std. Deviation=1712.748
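
One way the "cleaning the HTML out" step could look is sketched below with Python's standard `html.parser`. This is an assumption about the general approach, not the script actually used for these numbers; the class `ReadableText`, the function `readable_length` and the choice to skip `script` and `style` content are all illustrative.

```python
from html.parser import HTMLParser


class ReadableText(HTMLParser):
    """Collects text nodes, ignoring tags, comments, scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach this method; they go to handle_comment,
        # which is left as the default no-op.
        if self._skip_depth == 0:
            self.parts.append(data)


def readable_length(html):
    """Number of characters (whitespace included) left after stripping markup."""
    parser = ReadableText()
    parser.feed(html)
    parser.close()
    return len("".join(parser.parts))


sample = "<p>Hello <b>world</b></p><!-- note --><script>var x = 1;</script>"
print(readable_length(sample))
```

A real front-end page would also contain navigation, sidebars and footers, so an actual measurement would first have to select the article body before counting.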

 The gap between Min and Max was especially interesting, but templates
 make it possible.
 (See e.g. "Veer Teja Vidhya Mandir School", "Martin Callanan"; although
 for the latter you could argue that expandable template listings are not
 really main "reading" content.)

 Effectively, the correlation between readable character size and byte
 size is 0.04 (i.e. none) in the sample.

 If someone already did this or a similar analysis, I'd appreciate
 pointers.

 Best,

 Fabian

 --
 Karlsruhe Institute of Technology (KIT)
 Institute of Applied Informatics and Formal Description Methods

 Dipl.-Medwiss. Fabian Flöck
 Research Associate

 Building 11.40, Room 222
 KIT-Campus South
 D-76128 Karlsruhe

 Phone: +49 721 608 4 6584
 Fax: +49 721 608 4 6580
 Skype: f.floeck_work
 E-Mail: fabian.floeck(a)kit.edu
 WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

 KIT - University of the State of Baden-Wuerttemberg and
 National Research Center of the Helmholtz Association

 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

 <bytes.content_length.scatter.png><ATT00001.c>


