Hi Fabian,
I can honestly say I had never seen an article like "Timeline of
architectural styles 1000–present". But even with that one, after
removing everything I could interpret as hidden or code-generated, I
wound up with a lot more than 95 bytes:
6000BC–1000AD • 1000–1750 • 1750–1900 • 1900–Present
Architectural style Architecture timeline
Julian calendar Gregorian calendar Neoclassical Georgian
Sicilian Baroque
English Baroque
Rococo
Palladianism
Jacobean
Baroque
Elizabethan
Mannerism
Spanish Colonial
Manueline
Tudor
High Renaissance
Renaissance
Perpendicular Period
Brick Gothic
Decorated Period
Early English Period
Gothic
Norman
Romanesque
Byzantine
Roman
Ancient Greek
Ancient Egyptian
Sumerian
Neolithic
So my suspicion is that part of the reason you and Aaron are getting
different results is that your methods of extracting display bytes
differ. To get just 95 bytes from this article, I think the program you
used would have had to strip out at least some of the linked words.
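As a minimal illustration of why extraction methods matter: even on the very same displayed text, the "size" depends on whether you count Unicode characters or UTF-8 bytes. The title string is from the article above; the counting code is my own sketch, not either of your scripts.

```python
# Sketch: the same displayed text has different "sizes" depending on
# whether you count Unicode characters or UTF-8 bytes. The en dash in
# the title below is one character but three bytes in UTF-8.
title = "Timeline of architectural styles 1000–present"

chars = len(title)                       # Unicode code points
utf8_bytes = len(title.encode("utf-8"))  # bytes on disk / over the wire
```

Multiply small per-character differences like this across links, templates, and non-ASCII punctuation, and two extraction pipelines can easily disagree.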
Regards
Jonathan
On 6 August 2013 14:55, Floeck, Fabian (AIFB) <fabian.floeck(a)kit.edu> wrote:
@Jonathan: Good point, but I'm actually not stripping the content of
tables, just the mark-up of the tables. (Also, I leave the whitespace in
and count it; I just remove line breaks, as the cleaning leaves a lot of
empty lines.) I checked the results manually in over 50 cases, and what
my script outputs is almost exactly what you get when you copy and paste
the article text of a page from the browser into a text editor or Word
by hand, including tables and infoboxes.
So it finds exactly what I wanted: the readable, displayed text portion
of an article. Remember that I said I also remove "disambiguation"
articles (indicated by "disambiguation" in the article name or category
name; the reason being that I wanted articles with running text). They
probably have a higher correlation, as I think they don't use templates
very much. As for the remaining difference in the correlation
coefficients, it could also be caused by the manner of cleaning.
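To make the cleaning step concrete, here is a simplified sketch of that kind of table-aware stripping: drop the table mark-up lines, keep the cell text, resolve links to their displayed labels, keep whitespace, and drop line breaks. The regexes and the function name are my own illustration, not the actual script, which handles far more of the syntax.

```python
import re

def display_chars(wikitext: str) -> int:
    """Count displayed characters: strip mark-up, keep table cell text
    and linked words, keep whitespace, drop line breaks entirely."""
    cleaned = []
    for line in wikitext.splitlines():
        # drop pure table-structure lines: "{|", "|}", "|-"
        if re.match(r"\s*(\{\||\|\}|\|-)", line):
            continue
        # keep table cell text, dropping the leading | or ! marker
        line = re.sub(r"^\s*[|!]+\s*", "", line)
        # [[target|label]] -> label, [[word]] -> word (keep linked words)
        line = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", line)
        # bold/italic quote mark-up
        line = line.replace("'''", "").replace("''", "")
        if line:  # removing mark-up leaves empty lines; skip them
            cleaned.append(line)
    # join without line breaks, as described above, then count
    return len("".join(cleaned))
```

For example, a two-cell table containing only "[[Gothic]]" and "'''Norman'''" counts 12 characters: the cell contents survive, the table syntax does not.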
The shortest I got was "Timeline of architectural styles 1000–present",
with 95 chars, but this example reveals that sometimes you would have to
include characters inside pictures (can you tell me, by chance, how
frequent these types of code-generated pictures are?).
But there are also examples like the "Veer Teja Vidhya Mandir School" I
mentioned (chars = 404), where the template is simply highly underused
and bloats the syntax.
@Federico: You are completely right, size in bytes is a good indicator for
many things; you could for example argue it accurately measures the work
put into an article by the editors, as constructing the Wikisyntax can be a
big part of a good article.
@Aaron: "You've severely limited the range of your regressor and therefor
invalidated a set of assumptions for the correlation."
You seem to be very confused about some statistical concepts:
1. You mix up inferential statistics with descriptive statistical
analysis when you tell me that reporting the result of an experiment on
a sample (and it was declared as nothing more) is a "mistake". All I
said was that in this sample, with my (ad-hoc!) method, this is the
result; no inference about the rest of the articles beyond that sample.
Turns out that I was correct, no mistake whatsoever. For me it was
interesting enough to post to the list that there is no correlation
between the two variables in *this* sample. Which is still a very
interesting result, as obviously, at least in this byte-size range
(maybe others?), there is no or just a tiny correlation with the
displayed character count. I'm happy that you took the time to
investigate articles outside this sample; that's the kind of input for
which I turned to the research list.
2. I didn't "invalidate" anything; I ran a completely appropriate
Pearson correlation over a sample I chose, however unrepresentative that
sample may be (again: inferential vs. descriptive statistics). FYI: a
correlation doesn't have a *regressor*, as you don't have to decide
which is the independent and which the dependent variable. That's
regression, which adds no substantial information here imho (you can
draw a fitted line with its R^2 on a scatterplot just fine without
framing it as a regression).
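The symmetry point is easy to demonstrate: Pearson's r gives the same value whichever way you order the variables, while an OLS slope does not, because regression has to pick a regressor. The helper functions and the toy numbers below are purely illustrative, not anyone's actual data.

```python
# Pearson's r is symmetric in its two variables; an OLS slope is not.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def slope(xs, ys):
    """OLS slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sum((x - mx) ** 2 for x in xs)

byte_size = [5800, 5850, 5900, 5950, 6000]  # hypothetical wikitext bytes
char_size = [95, 2100, 404, 3300, 1500]     # hypothetical display chars

r = pearson_r(byte_size, char_size)
assert abs(r - pearson_r(char_size, byte_size)) < 1e-12  # symmetric
# slope(x, y) and slope(y, x) differ; their product equals r**2
assert abs(slope(byte_size, char_size) * slope(char_size, byte_size) - r**2) < 1e-9
```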
Moreover, you repeatedly ignored in your replication that I also
filtered out "disambiguation" articles. Of course you won't get exactly
the same results as me then.
As soon as I find the time, I will also run my analysis over a sample
outside the limited 5,800–6,000 byte range to see what comes out.
Best,
Fabian
On 06.08.2013, at 10:53, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
Ziko van Dijk, 06/08/2013 02:12:
Hello,
When in 2008 I made some observations on language versions, it struck me
that in some cases the wikisyntax and the "meta article information"
took up more KB than the whole encyclopedic content of an article. For
example, more than 50% of the characters in the wikicode of the article
"Berlin" in Upper Sorbian were categories, interwiki links, etc. This
made me largely disregard the corresponding figures in the Wikimedia
statistics.
You'd better not disregard it completely, as it is used as a key metric
for evaluating e.g. the WMF university programs (whether a good or a bad
thing). ;-) I don't know how sophisticated a variant of the metric they
use; probably whatever the new
metrics.wmflabs.org uses.
Personally, I often find the database size in the WikiStats tables a
useful one for checking the evolution of a single wiki, as it fluctuates
less and is harder to game than other metrics, short of huge bot
imports. It requires greater care in cross-wiki comparisons, of course.
Nemo
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck
Research Associate
Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: fabian.floeck(a)kit.edu
WWW:
http://www.aifb.kit.edu/web/Fabian_Flöck
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association