Hi Ziko,
You'll find that articles like that changed radically at the beginning of
this year. At that point we moved from a system where all 200 or more
articles on Berlin contained 200 or more interwiki links to the other 200
articles on Berlin, to one where the interwiki links are all held on Wikidata.
That had a very dramatic effect on very stubby articles: the Aceh article
on Berlin dropped from 3,716 bytes to just 110. Many minor and poorly served
languages are likely to have very short articles on Berlin, and dozens still
don't have one at all.
I doubt if this accounts for the differences that Fabian and Aaron are
experiencing as I've been assuming that they are both looking at current
data and I think Fabian mentioned EN.
The change in the way we hold interwiki links also had a radical effect on
bot editing numbers, as it used to be that each time another language
version of the Berlin article was created, over 200 other language versions
would get a bot edit adding that interwiki link. I'm assuming that someone,
sometime, is going to pick up on this and report it as a radical slump in
editing of Wikipedia's minor languages. In reality it is just a cosmetic
and misleading side effect of a change in the way we automate things, much
like measuring raw edit counts on EN Wikipedia since the edit filters were
introduced in 2009 and concluding that, because we now stop most vandalism
from reaching the wiki, we have a fall in edit numbers.
Jonathan
On 6 August 2013 01:12, Ziko van Dijk <zvandijk@gmail.com> wrote:
Hello,
When I made some observations on language versions in 2008, it struck me
that in some cases the wiki syntax and the "meta article information" took
up more kilobytes than the whole encyclopedic content of an article. For
example, more than 50% of the characters in the wikicode of the article
"Berlin" in Upper Sorbian were for categories, interwiki links, etc. This
made me largely disregard the corresponding features of the Wikimedia
statistics.
Kind regards
Ziko
On Tuesday, 6 August 2013, Aaron Halfaker wrote:
I am removing all HTML tags and comments to include only those characters
that are shown on the screen. This will include the content of tables
without including the markup contained within. In other words, I stripped
anything out of the HTML that looked like a tag (e.g. "<foo>" and "</bar>")
or a comment ("<!-- [...] -->") but kept the in-between characters,
whitespace and all.
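Schematically, the stripping amounts to something like this (an
illustrative sketch, not the exact code I ran; the regexes and function
name are simplified):

    import re

    COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)  # HTML comments, including multi-line ones
    TAG_RE = re.compile(r'</?[^>]+>')                   # anything that looks like a tag, e.g. <foo> or </bar>

    def visible_characters(html):
        # Drop comments first (they can contain tags), then tags,
        # keeping the in-between characters, whitespace and all.
        without_comments = COMMENT_RE.sub('', html)
        return TAG_RE.sub('', without_comments)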
It seems much more reasonable to me that the difference is due to the
fact that Fabian's dataset is limited to a very narrow range of bytes. To
check this hypothesis, I drew a new sample of pages with byte length
between 5800 and 6000.
The Pearson correlation that I found for that sample is 0.06466406. This
corresponds nicely to the poor correlation that Fabian found.
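The check itself is nothing fancy; roughly (a sketch, with made-up variable
names rather than my actual script):

    from scipy.stats import pearsonr

    # pages: an iterable of (raw_byte_length, readable_char_count) pairs from the sample
    narrow = [(b, c) for b, c in pages if 5800 <= b <= 6000]
    r, p = pearsonr([b for b, _ in narrow], [c for _, c in narrow])
    print(r)  # ~0.065 on the sample described above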
I've updated the plot[1] to show the difference visually.
-Aaron
1.
http://commons.wikimedia.org/wiki/File:Bytes.content_length.scatter.correla…
On Tue, Aug 6, 2013 at 6:04 AM, WereSpielChequers <
werespielchequers(a)gmail.com> wrote:
Thanks both of you,
I suspect that you two are using very different rules to define "readable
characters", and for Aaron to get a close correlation and Fabian not to get
any correlation implies to me that Fabian is stripping out the things that
are not linked to article size, and that Aaron may be leaving such things
in.
For reasons that I'm going to pretend I don't understand, we have some
articles with a lot of redundant spaces. Others have so few that you'd be
correct in thinking that certain editors have been making semi-automated
edits to strip those spaces out. I suspect that Fabian's formula ignores
redundant spaces, and that Aaron's does not.
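If so, the difference could be as trivial as whether whitespace runs are
collapsed before counting; purely for illustration:

    import re

    text = "Berlin  is   the capital\n\n\nof Germany."
    raw_count = len(text)                               # 38: every space and newline counted
    collapsed = re.sub(r'\s+', ' ', text).strip()
    collapsed_count = len(collapsed)                    # 33: redundant whitespace ignored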
I picked on alt text because it is very patchy across the pedia, but
usually consistent at article level, i.e. if someone has written a whole
paragraph of alt text for one picture, they have probably done so for every
picture in that article, and conversely many articles will have no alt text
at all.
Similarly we have headings, and counterintuitively it is the subheadings
that add the most non-display characters. So an article like Peasants'
Revolt will have 32 equals signs for its 8 headings, but 60 equals signs
for its 10 subheadings. That's 92 bytes which I suspect one or both of you
will have stripped out. The actual display text of course omits all 92 of
those bytes, but repeats the content of those headings and subheadings in
the contents section.
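To make the arithmetic concrete, a throwaway count of heading markup could
look like this (a sketch in Python, not either of your measures):

    import re

    HEADING_RE = re.compile(r'^(={2,})\s*(.+?)\s*\1\s*$', re.MULTILINE)

    def heading_markup_bytes(wikitext):
        # Each ==Heading== costs 4 equals signs, each ===Subheading=== costs 6.
        return sum(2 * len(m.group(1)) for m in HEADING_RE.finditer(wikitext))

    # 8 headings and 10 subheadings: 8 * 4 + 10 * 6 = 92 bytes of equals signs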
The size of sections varies enormously from one article to another, and
if there are three or fewer sections the contents section is not generated
at all. I suspect that the average length of section headings also has
quite a bit of variance, as it is a stylistic choice. So I would expect that
a "display bytes" count that simply stripped out the multiple equals signs
would still correlate pretty well with article size, but a display bytes
count that factored in the complication that headings and subheadings are
displayed twice, because they are repeated in the contents section, would
have another factor drifting it away from a good correlation with raw byte
count.
But probably the biggest variance will be over infoboxes, tables,
picture captions, hidden comments and the like. If you strip all of them
out, including perhaps even the headings, captions and table contents, then
you are going to get a very poor fit between article length and readable
byte size. But I would be surprised if you could get Fabian's minimum
display size of 95 bytes from 6,000 byte articles without having at least
one article that consisted almost entirely of tables and which had been
reduced to a sentence or two of narrative. So my suspicion is that Aaron's
plot is at least including the displayed contents of tables et al., whilst
Fabian is only measuring the prose sections.
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l