Thanks, both of you,
I suspect that you two are using very different rules to define "readable
characters"; that Aaron gets a close correlation while Fabian gets none
implies to me that Fabian is stripping out the things that are not linked
to article size, and that Aaron may be leaving such things in.
For reasons that I'm going to pretend I don't understand, we have some
articles with a lot of redundant spaces, and others with so few that you'd
be correct in thinking that certain editors have been making semi-automated
edits to strip those spaces out. I suspect that Fabian's formula ignores
redundant spaces, and that Aaron's does not.
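To make the difference concrete, a count that normalises whitespace before
measuring (a sketch of the kind of rule I mean, not anyone's actual script)
would diverge from a raw count on exactly such articles:

    import re

    def collapsed_length(text):
        # Collapse every run of whitespace to a single space before counting,
        # so redundant spaces don't inflate the readable-character figure.
        return len(re.sub(r"\s+", " ", text).strip())

    raw = "Some  article   text  with   redundant    spaces."
    print(len(raw), collapsed_length(raw))  # raw count vs. normalised count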
I picked on alt text because it is very patchy across the pedia, but
usually consistent at article level: if someone has written a whole
paragraph of alt text for one picture they have probably done so for every
picture in the article, and conversely many articles will have no alt text
at all.
Similarly we have headings, and counterintuitively it is the subheadings
that add the most non-display characters. An article like Peasants' Revolt
will have 32 equals signs for its 8 headings, but 60 equals signs for its
10 subheadings: 92 bytes which I suspect one or both of you will have
stripped out. The actual display text of course omits all 92 of those
bytes, but repeats the content of those headings and subheadings in the
contents section.
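As a sketch of that arithmetic (my own regex, not what either of you ran):
a level-2 heading carries four equals signs and a level-3 subheading six,
so 8 headings and 10 subheadings give 8*4 + 10*6 = 92 bytes of pure markup.

    import re

    def heading_markup_bytes(wikitext):
        # Count the equals signs that delimit headings in raw wikitext;
        # these bytes never appear in the rendered article text.
        return sum(2 * len(m.group(1))
                   for m in re.finditer(r"^(=+)[^=\n]+\1\s*$",
                                        wikitext, re.MULTILINE))

    print(heading_markup_bytes("== History ==\n=== Background ===\n"))  # 4 + 6 = 10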
The size of sections varies enormously from one article to another, and
if there are three or fewer sections the contents section is not generated
at all. I suspect that the average length of section headings also has
quite a bit of variance, as it is a stylistic choice. So I would expect
that a "display bytes" count that simply stripped out the multiple equals
signs would still correlate pretty well with article size, but a display
bytes count that factored in the complication that headings and subheadings
are displayed twice, because they are repeated in the contents section,
would have another factor drifting it away from a good correlation with
raw byte count.
But probably the biggest variance will be over infoboxes, tables,
picture captions, hidden comments and the like. If you strip all of them
out, including perhaps even the headings, captions and table contents, then
you are going to get a very poor fit between article length and readable
byte size. But I would be surprised if you could get Fabian's minimum
display size of 95 bytes from 6,000-byte articles without having at least
one article that consisted almost entirely of tables and had been reduced
to a sentence or two of narrative. So my suspicion is that Aaron's plot is
at least including the displayed contents of tables et al., whilst Fabian
is only measuring the prose sections and completely stripping out anything
in a table.
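For illustration, the difference I am guessing at would look something like
this in BeautifulSoup terms (my guess at the two pipelines, not either of
your actual scripts):

    from bs4 import BeautifulSoup

    def prose_only_length(html):
        # Drop tables (which includes infoboxes, as they are tables too)
        # before counting, as I suspect Fabian's measurement does.
        soup = BeautifulSoup(html, "html.parser")
        for table in soup.find_all("table"):
            table.decompose()
        return len(soup.get_text())

    def with_tables_length(html):
        # Keep table contents in the count, as I suspect Aaron's figure does.
        return len(BeautifulSoup(html, "html.parser").get_text())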
Both approaches of course have their merits, and there are even some
editors who were recently edit warring to keep articles they cared about
free from the clutter of infoboxes and tables.
Regards
Jonathan
On 5 August 2013 21:16, Floeck, Fabian (AIFB) <fabian.floeck(a)kit.edu> wrote:
Hi,
thanks for your feedback, Jonathan and Aaron.
@Jonathan: You are rightly pointing at some things that could have been
done differently, as this was just an ad-hoc experiment. What I did was to
get the curl result of
"http://en.wikipedia.org/w/api.php?action=parse&prop=text&pageid=X&q…" and
run it through BeautifulSoup [1] in Python.
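In code, roughly the following (a simplified sketch; the real run used curl
and had a few extra cleaning steps, e.g. for alt text):

    import requests
    from bs4 import BeautifulSoup

    def readable_char_count(pageid):
        # Fetch the rendered HTML of one article from the parse API ...
        r = requests.get("http://en.wikipedia.org/w/api.php",
                         params={"action": "parse", "prop": "text",
                                 "pageid": pageid, "format": "json"})
        html = r.json()["parse"]["text"]["*"]
        # ... then strip all tags and count the visible characters.
        return len(BeautifulSoup(html, "html.parser").get_text())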
Regarding references: yes, all the markup that you cannot see as readable
characters when you look at an article was stripped away. Take [2] as an
example: in the final output (which was the base for counting chars), what
is left of this reference in characters is the readable "[1]" and " ^
William Goldenberg at the Internet Movie Database".
Regarding alt text: it was completely stripped out. This can arguably be
done differently, if you see it as "readable main article text" as well.
You are surely right that including it would lead to a higher correlation.
Looking at samples from the output, the increase in correlation would
however not be very big, but that's a mere hunch. Anyway, this was not what
I was looking for: I wanted to compare really only the readable text you
see directly when scrolling through the article.
Another issue is the inclusion of expandable template listings, as I
mentioned in my first mail. Are the long listings of related articles
"main, readable article text"? I suppose not, but we have not filtered them
out yet.
@Aaron: I'm pretty sure I didn't make a mistake, but before I can answer
your mail: what exactly does this content_length API call give you back
(I'm not aware of it)? Does it take the Wikisyntax and strip it of tags and
comments? Or is it the HTML shown in the front-end, including all content
generated by templates, minus all mark-up? Only in the latter case would
this be comparable in any way to what I have done. Please clarify and send
me the concrete API call. I don't think your content_length is the length
of the readable front-end text as I used it.
(On a side note: I'm unsure why you pasted the complete results of a
linear regression, as a Pearson correlation will perfectly suffice in such
a simple bivariate case. Due to the nature of these statistical methods,
they of course yield the same result in this case. Or was there any
important extra information that I missed in these regression results?)
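To illustrate the equivalence with scipy and made-up numbers:

    from scipy import stats

    x = [5800, 5850, 5900, 5950, 6000]  # made-up byte sizes
    y = [95, 2400, 4800, 7100, 49441]   # made-up readable character counts

    r, p = stats.pearsonr(x, y)
    fit = stats.linregress(x, y)
    print(r, fit.rvalue)    # the same number
    print(fit.rvalue ** 2)  # the R-squared the regression reports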
Best,
Fabian
[1]
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
[2]
http://en.wikipedia.org/wiki/William_Goldenberg#cite_note-1
On 05.08.2013, at 01:15, Aaron Halfaker <aaron.halfaker(a)gmail.com>
wrote:
(note that I posted this yesterday, but the
message bounced due to the
attached scatter plot. I just uploaded the plot to
commons and re-sent)
I just replicated this analysis. I think you might have made some
mistakes.
I took a random sample of non-redirect articles from English Wikipedia
and compared the byte_length (from the database) to the content_length
(from the API, tags and comments stripped).
I get a Pearson correlation coefficient of 0.9514766.
See the scatter plot including a linear regression line, and the
regression output below.
Call:
lm(formula = byte_len ~ content_length, data = pages)
Residuals:
Min 1Q Median 3Q Max
-38263 -419 82 592 37605
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -97.40412 72.46523 -1.344 0.179
content_length 1.14991 0.00832 138.210 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2722 on 1998 degrees of freedom
Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053
F-statistic: 1.91e+04 on 1 and 1998 DF, p-value: < 2.2e-16
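As a quick sanity check, the Pearson coefficient and the R-squared above
agree:

    import math

    print(math.sqrt(0.9053))  # ~0.9515, matching r = 0.9514766
    print(0.9514766 ** 2)     # ~0.9053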
On Mon, Aug 5, 2013 at 12:59 AM, WereSpielChequers <
werespielchequers(a)gmail.com> wrote:
Hi Fabian,
That's interesting. When you say you stripped out the HTML, did you also
strip out the other parts of the references? Some citation styles will
take up more bytes than others, and citation style is supposed to be
consistent at the article level.
It would also make a difference whether you included or excluded alt
text from readable material, as I suspect it is non-granular, i.e. if
someone is going to create alt text for one picture in an article they
will do so for all pictures.
More significantly, there is a big difference in standards of
referencing; broadly, the higher the assessed quality and/or the more
contentious the article, the more references there will be.
I would expect that if you factored that in there would be some
correlation between readable length and bytes within assessed classes of
quality, and that the outliers would include some of the controversial
articles like Jerusalem (353 references).
Hope that helps.
Jonathan
On 2 August 2013 18:24, Floeck, Fabian (AIFB) <fabian.floeck(a)kit.edu>
wrote:
Hi,
to whoever is interested in this (and I hope I didn't just repeat
someone
else's experiments on this):
I wanted to know if a "long" or "short" article, in terms of how much
readable material (excluding pictures) is presented to the reader in the
front-end, is correlated with the byte size of the Wikisyntax that can be
obtained from the DB or API, as people often define the "length" of an
article by its length in bytes.
TL;DR: Turns out size in bytes is a really, really bad indicator of the
actual, readable content of a Wikipedia article, even worse than I
thought.
We "curl"ed the front-end HTML of all articles of the English
Wikipedia
(ns=0, no disambiguation, no redirects) between 5800 and 6000
bytes (as around 5900 bytes is the total en.wiki average for these
articles). = 41981 articles.
Results for size in characters (w/ whitespaces) after cleaning the HTML out:
Min = 95, Max = 49441, Mean = 4794.41, Std. Deviation = 1712.748
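These are just the usual descriptives over the per-article character
counts, e.g.:

    import statistics

    # One readable-character count per article; tiny made-up sample here,
    # the real run covered 41,981 articles.
    char_counts = [95, 3200, 4800, 5100, 49441]
    print(min(char_counts), max(char_counts),
          statistics.mean(char_counts), statistics.stdev(char_counts))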
Especially the gap between Min and Max was interesting, but templates make
it possible. (See e.g. "Veer Teja Vidhya Mandir School" and "Martin
Callanan", although for the latter you could argue that expandable template
listings are not really main "reading" content.)
Effectively, the correlation of readable character size with byte size =
0.04 (i.e. none) in the sample.
If someone already did this or a similar analysis, I'd appreciate
pointers.
Best,
Fabian
--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck
Research Associate
Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: fabian.floeck(a)kit.edu
WWW:
http://www.aifb.kit.edu/web/Fabian_Flöck
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l