[Wiki-research-l] Readable characters vs. size in bytes of articles

2 Aug 2013

Hi, 
to whoever is interested in this (and I hope I didn't just repeat someone else's
experiments on this):

I wanted to know whether an article being "long" or "short", in terms of how much
readable material (excluding pictures) is presented to the reader in the front-end, is
correlated with the byte size of the Wikisyntax, which can be obtained from the DB or API,
since people often define the "length" of an article by its length in bytes. 

TL;DR: Turns out size in bytes is a really, really bad indicator of the actual, readable
content of a Wikipedia article, even worse than I thought. 

We "curl"ed the front-end HTML of all articles of the English Wikipedia (ns=0,
no disambiguation, no redirects) with a size between 5800 and 6000 bytes (around 5900 bytes
is the overall en.wiki average for these articles), which gave 41981 articles. 
Results for size in characters (including whitespace) after cleaning the HTML out: 
Min = 95, Max = 49441, Mean = 4794.41, Std. Deviation = 1712.748
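
For anyone who wants to try something similar, here is a minimal sketch of what such a
cleaning step could look like. It is not the script we used; it assumes Python's standard
html.parser, drops <script> and <style> contents, and collapses runs of whitespace to a
single space. The names TextExtractor and readable_char_count are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style elements.
        if self._skip_depth == 0:
            self.parts.append(data)


def readable_char_count(html: str) -> int:
    """Number of characters (whitespace included) left after removing markup.

    Assumption: runs of whitespace are collapsed to one space before counting.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return len(text)
```

Note that this counts everything in the page, including navigation and footer text; a
real run would probably restrict the input to the article body first.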

The gap between Min and Max was especially interesting, but templates make it possible.
(See e.g. "Veer Teja Vidhya Mandir School", "Martin Callanan" --
although for the latter you could argue that expandable template listings are not really
main "reading" content.) 

Effectively, the correlation between readable character size and byte size is 0.04 (i.e.
none) in the sample. 
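
In case it helps with reproducing the number, here is a small sketch of a correlation
computation over (byte size, readable characters) pairs. It assumes a plain Pearson
coefficient, which is an assumption for illustration and not necessarily the exact
statistic behind the figure above; pearson_r is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here xs would be the byte sizes from the DB or API and ys the character counts of the
cleaned front-end text, one pair per article.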

If someone already did this or a similar analysis, I'd appreciate pointers.

Best, 

Fabian

-- 
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods

Dipl.-Medwiss. Fabian Flöck
Research Associate

Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe

Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: fabian.floeck(a)kit.edu
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
