Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

6 Aug 2013

Hi Aaron,

I'm not sure how Fabian limiting his byte length to 5,500-6,000 would make
a difference. But as you've confirmed that your formula includes both the
whitespace and the contents of tables, I suspect we just need Fabian to
confirm that he ignores both and we have an explanation for the difference
between your approaches. And since Fabian's method reduced one article by
over 98% to just 95 bytes, I would be very surprised if he is including the
text contents of tables. What was your shortest, did you get any with an
80-90% reduction? I'd be surprised if your smallest was under 580  bytes.

Jonathan

On 6 August 2013 00:39, Aaron Halfaker &lt;aaron.halfaker(a)gmail.com&gt; wrote:

...
  I am removing all HTML tags and comments to include
only those characters
 that are shown on the screen.  This will include the content of tables
 without including the markup contained within.  In other words, I stripped
 anything out of the HTML that looked like a tag (e.g. "<foo>" and
"</bar>")
 or a comment ("") but kept the in-between characters,
 whitespace and all.

 It seems much more reasonable to me that the difference is due to the fact
 that Fabian's dataset is limited to a very narrow range of bytes.  To check
 this hypothesis, I drew a new sample of pages with byte length between 5800
 and 6000.

 The pearson correlation that I found for that sample is* 0.06466406.  *This
 corresponds nicely to the poor correlation that Fabian found.
 *
 *
 I've update the plot[1] to show the difference visually.

 -Aaron

 1.

http://commons.wikimedia.org/wiki/File:Bytes.content_length.scatter.correla…

 On Tue, Aug 6, 2013 at 6:04 AM, WereSpielChequers <
 werespielchequers(a)gmail.com&gt; wrote:

  Thanks both of you,

 I suspect that you two are using very different rules to define "readable
 characters", and for Aaron to get a close correlation and Fabian not to get
 any correlation implies to me that Fabian is stripping out the things that
 are not linked to article size, and that Aaron may be leaving such things
 in.

 For reasons that I'm going to pretend I don't understand, we have some
 articles with a lot of redundant spaces. Others with so few you'd be
 correct in thinking that certain editors have been making semiautomated
 edits to strip out those spaces. I suspect that Fabian's formulae ignores
 redundant spaces, and that Aaron's does not.

 I picked on alt text because it is very patchy across the pedia, but
 usually consistent at article level. I.e if someone has written a whole
 paragraph of alt text for one picture they have probably done so for every
 picture in an article, and conversely many articles will have no alt text
 at all.

 Similarly we have headings, and counterintuitively it is the subheadings
 that add most non display characters. So an article like Peasant's revolt
 will have 32 equals signs for its 8 headings, but 60 equal signs for its 10
 subheadings. 92 bytes which I suspect one or both of you will have stripped
 out. The actual display text of course omits all 92 of those bytes, but
 repeats the content of those headings and subheadings in the contents
 section.

 The size of sections varies enormously  from one article to another, and
 if there are three or fewer sections the contents section is not generated
 at all. I suspect that the average length of section headings also has
 quite a bit of variance as it is a stylistic choice. So I would expect that
 a "display bytes" count that simply stripped out the multiple equal signs
 would still be a pretty good correlation with article size, but a display
 bytes count that factored in the complication that headings and subheadings
 are displayed twice as they are repeated in the contents field, would have
 another factor drifting it away from a good correlation with raw byte count.

 But probably the biggest variance will be over infoboxes, tables,
 picture  captions, hidden comments and the like. If you strip all of them
 out, including perhaps even the headings, captions and table contents, then
 you are going to get a very poor fit between article length and readable
 byte size. But I would be surprised if you could get Fabian's minimum
 display size of 95 bytes from 6,000 byte articles without having at least
 one article that consisted almost entirely of tables and which had been
 reduced to a sentence or two of narrative. So my suspicion is that Aaron's
 plot is at least including the displayed contents of tables et al whilst
 Fabian is only measuring the prose sections and completely stripping out
 anything in a table.

 Both approaches of course have their merits, and there are even some
 editors who were recent edit warring to keep articles they cared about free
 from clutter by infoboxes and tables.

 Regards

 Jonathan

 On 5 August 2013 21:16, Floeck, Fabian (AIFB) &lt;fabian.floeck(a)kit.edu&gt;wrote;wrote:

  Hi,

 thanks for your feedback Jonathan and Aaron.

 @Jonathan: You are rightfully pointing at some things that could have
 been done differently, as this was just an ad-hoc experiment.  What I did
 was getting the curl result of "
 http://en.wikipedia.org/w/api.php?action=parse&prop=text&pageid=X&q…  and
 running it through BeautifulSoup [1] in Python.
 Regarding references: yes, all the markup was stripped away which you
 cannot see in form of readable characters as a human when you look at an
 article. Take as an example [2]: in the final output (which was the base
 for counting chars) what is left in characters of this reference is the
 readable "[1]" and " ^ William Goldenberg at the Internet Movie
Database".
 Regarding alt text: it was completely stripped out. This can arguably be
 done different, if you see it as "readable main article text" as well.
 You are sure right that including these would lead to a higher
 correlation. Looking at samples from the output, the increase in
 correlation will however not be very big, but that's a mere hunch. Anyway,
 this was not what I was looking for. I wanted to compare really only the
 readable text you see directly when scrolling through the article.
 What is another issue is the inclusion of expandable template listings
 as I mentioned in my first mail. Are the long listings of related articles
 "main, readable article text"?  I suppose not, but we did not filter them
 out yet.

 @Aaron, I'm pretty sure I didn't make a mistake, but before I can answer
 your mail: What exactly does this content_length API call give you back
 (I'm not aware of that). Takes the Wikisyntax and strips it of tags and
 comments? Or the HTML shown in the front-end including all content
 generated by templates minus all mark-up? Only in the ladder case would
 this be comparable in any way to what I have done. Please clarify and send
 me the concrete API call. I don't think your content_length is the length
 of the readable front-end text as I used it.
 (On a side note: I'm unsure why you paste the complete results of a
 linear regression, as a Pearson correlation will perfectly suffice in such
 a simple bivariate case. They - due to the nature of these statistical
 methods - of course yield the same results in this case. Or was there any
 important extra information that I missed in these regression results?).

 Best,

 Fabian

 [1] http://www.crummy.com/software/BeautifulSoup/bs4/doc/
 [2] http://en.wikipedia.org/wiki/William_Goldenberg#cite_note-1

 On 05.08.2013, at 01:15, Aaron Halfaker &lt;aaron.halfaker(a)gmail.com&gt;
 wrote:

  (note that I posted this yesterday, but the
message bounced due to the  attached scatter plot.  I just uploaded the plot to
commons and re-sent)

 I just replicated this analysis.  I think you might have made some  mistakes.

 I took a random sample of non-redirect articles from English Wikipedia  and
compared the byte_length (from database) to the content_length (from
 API, tags and comments stripped).c

 I get a pearson correlation coef of 0,9514766.

 See the scatter plot including a linear regression line.  See also the  regress
output below.

 Call:
 lm(formula = byte_len ~ content_length, data = pages)

 Residuals:
    Min     1Q Median     3Q    Max
 -38263   -419     82    592  37605

 Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
 (Intercept)    -97.40412   72.46523  -1.344    0.179
 content_length   1.14991    0.00832 138.210   <2e-16 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Residual standard error: 2722 on 1998 degrees of freedom
 Multiple R-squared: 0.9053,   Adjusted R-squared: 0.9053
 F-statistic: 1.91e+04 on 1 and 1998 DF,  p-value: < 2.2e-16

 On Mon, Aug 5, 2013 at 12:59 AM, WereSpielChequers < 
werespielchequers(a)gmail.com&gt; wrote:
  Hi Fabian,

 That's interesting. When you say you stripped out the html did you  also strip
out the other parts of the references? Some citation styles will
 take up more bytes than others, and citation style is supposed to be
 consistent at the article level.

 It would also make a difference whether you included or excluded alt  text from
readable material as I suspect it is non granular - ie if someone
 is going to create alt text for one picture in an article they will do so
 for all pictures.

 More significantly there is a big difference in standards of  referencing , broadly
the higher the assessed quality and or the more
 contentious the article the more references there will be.

 I would expect that if you factored that in there would be some  correlation
between readable length and bytes within assessed classes of
 quality, and the outliers would include some of the controversial articles
 like Jerusalem (353 references)

 Hope that helps.

 Jonathan

 On 2 August 2013 18:24, Floeck, Fabian (AIFB) &lt;fabian.floeck(a)kit.edu&gt; 
wrote:
  Hi,
 to whoever is interested in this (and I hope I didn't just repeat  someone
else's experiments on this):

 I wanted to know if a "long" or "short" article in terms of how much
 readable material (excluding pictures) is presented to the reader in the
 front-end is correlated to the byte size of the Wikisyntax which can be
 obtained from the DB or API; as people often define the "length" of an
 article by its length in bytes.

 TL;DR: Turns out size in bytes is a really, really bad indicator for  the actual,
readable content of a Wikipedia article, even worse than I
 thought.

 We "curl"ed the front-end HTML of all articles of the English  Wikipedia
(ns=0, no disambiguation, no redirects) between 5800 and 6000
 bytes (as around 5900 bytes is the total en.wiki average for these
 articles). = 41981 articles.
  Results for size in characters (w/ whitespaces)
after cleaning the  HTML out:
  Min= 95 Max= 49441 Mean=4794.41 Std.
Deviation=1712.748

 Especially the gap between Min and Max was interesting. But templates  make it
possible.
  (See e.g. "Veer Teja Vidhya Mandir
School", "Martin Callanan" --  Allthough for the ladder you could
argue that expandable template listings
 are not really main "reading" content..)

 Effectively, correlation for readable character size with byte size =  0.04 (i.e.
none) in the sample.

 If someone already did this or a similar analysis, I'd appreciate  pointers.

 Best,

 Fabian

 --
 Karlsruhe Institute of Technology (KIT)
 Institute of Applied Informatics and Formal Description Methods

 Dipl.-Medwiss. Fabian Flöck
 Research Associate

 Building 11.40, Room 222
 KIT-Campus South
 D-76128 Karlsruhe

 Phone: +49 721 608 4 6584
 Fax: +49 721 608 4 6580
 Skype: f.floeck_work
 E-Mail: fabian.floeck(a)kit.edu
 WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

 KIT – University of the State of Baden-Wuerttemberg and
 National Research Center of the Helmholtz Association

 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

 <ATT00001.c> 

 --
 Karlsruhe Institute of Technology (KIT)
 Institute of Applied Informatics and Formal Description Methods

 Dipl.-Medwiss. Fabian Flöck
 Research Associate

 Building 11.40, Room 222
 KIT-Campus South
 D-76128 Karlsruhe

 Phone: +49 721 608 4 6584
 Fax: +49 721 608 4 6580
 Skype: f.floeck_work
 E-Mail: fabian.floeck(a)kit.edu
 WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

 KIT – University of the State of Baden-Wuerttemberg and
 National Research Center of the Helmholtz Association

 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

 _______________________________________________
 Wiki-research-l mailing list
 Wiki-research-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

Re: [Wiki-research-l] Readable characters vs. size in bytes of articles