Update:
Not surprisingly, and congruent with Aaron's results, I also get a high linear correlation of
0.96 (random sample of 5000 articles) outside the 5800-6000 byte sample, even if I filter out
disambiguation articles. See scatterplot [1].
So first of all, it can be concluded with fair certainty that our methods of cleaning are
quite similar and are not the cause of major differences in the measurements.
Secondly, this seems to be a very interesting distribution where the overall correlation is
very high, but the correlation in certain sections (I'm using a speculative plural here) of the
distribution is very low.
That means we can make the statement "In general, byte size and display char length
of an article are highly correlated". This is, however, not automatically valid once
you limit the byte size of the articles you look at in any way (for example, by just
looking at Featured Articles, Stubs, etc.; I have yet to check these myself). Then you
will have to check again whether the statement holds true for the given subsample.
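This range-restriction effect can be illustrated with synthetic data. The following is a toy model, not the Wikipedia sample: I simply assume display length is roughly proportional to byte size plus noise, then restrict to a narrow byte-size band analogous to the 5800-6000 one.

```python
import random

random.seed(42)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy model: display char length is proportional to byte size, plus noise.
byte_size = [random.randint(500, 50000) for _ in range(5000)]
char_len = [0.8 * b + random.gauss(0, 2000) for b in byte_size]

r_full = pearson(byte_size, char_len)

# Restrict the byte size to a narrow band, as in the 5800-6000 sample:
band = [(b, c) for b, c in zip(byte_size, char_len) if 5800 <= b <= 6000]
r_band = pearson([b for b, _ in band], [c for _, c in band])

print(f"full range: r = {r_full:.2f}, band: r = {r_band:.2f}")
```

Over the full range the noise is small relative to the spread of byte sizes, so r comes out close to 1; inside the 200-byte band the noise dominates and r drops sharply, even though the underlying relationship is unchanged.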
These results show very nicely that random sampling from a population is not an infallible
"universal weapon" for inferring information about all subsamples of that
population, and it should, where possible, always be cross-checked with a non-random, selective
sample analysis.
The next step I'd like to take is to look at articles above or below a certain size to
see if the correlation differs (maybe because a different proportion of their content is
made up of templates).
I'll give an update asap.
Best,
Fabian
[1]
https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_NOdisamb1.png
On 06.08.2013, at 15:55, Fabian Flöck <fabian.floeck@kit.edu> wrote:
@Jonathan: Good point, but I'm actually not stripping the content of tables, just the
mark-up of the tables. (I also leave the whitespace in and count it; I only remove line
breaks, as the cleaning leaves a lot of empty lines.) I checked the results manually in
over 50 cases, and what my script outputs is almost exactly what you get when you take the
article text of a page in the browser and copy and paste it into a text editor or Word
by hand, including tables and infoboxes.
So it finds exactly what I wanted: the readable, displayed text portion of an article.
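For reference, a minimal sketch of this kind of cleaning. This is not the actual script discussed here, just an illustration of the approach with a few hand-written regexes: strip templates and table/link mark-up, keep the displayed text and its whitespace, and drop the empty lines the cleaning leaves behind.

```python
import re

def visible_text(wikitext):
    """Rough sketch of stripping wiki markup while keeping displayed text."""
    text = wikitext
    # Remove templates ({{...}}), one nesting level per pass.
    for _ in range(5):
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Drop table structure lines ({| , |- , |}) but keep cell text.
    text = re.sub(r"^\s*\{\|.*$|^\s*\|\}.*$|^\s*\|-.*$", "", text, flags=re.M)
    text = re.sub(r"^\s*[|!]+\s*", "", text, flags=re.M)
    # Keep the label of piped links, the target of plain links.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Bold/italic markers.
    text = text.replace("'''", "").replace("''", "")
    # Remove the empty lines the cleaning leaves behind.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return "\n".join(lines)

sample = "'''Berlin''' is the [[Germany|German]] capital.\n{{Infobox city|name=Berlin}}\n"
print(visible_text(sample))
```

A production version would of course need to handle references, HTML tags, deeper template nesting, and code-generated images, which is exactly where such scripts start to diverge.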
Remember that I said I also remove "Disambiguation" articles (indicated by
"disambiguation" in the article name or a category name; the reason being that I
wanted articles with running text). They probably have a higher correlation, as I think they
don't use templates very much. As for the remaining difference in the correlation
coefficients, it could also be caused by the manner of cleaning.
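A minimal sketch of such a filter, assuming the criterion is a simple case-insensitive substring match as described (the function name is mine, not from any actual script):

```python
def is_disambiguation(title, category_names):
    """Drop articles with "disambiguation" in the article name
    or in any category name."""
    needle = "disambiguation"
    return (needle in title.lower()
            or any(needle in c.lower() for c in category_names))

print(is_disambiguation("Mercury (disambiguation)", []))    # True
print(is_disambiguation("Berlin", ["Capitals in Europe"]))  # False
```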
The shortest article I got was "Timeline of architectural styles 1000–present" with 95
chars, but this example reveals that sometimes you would have to include the characters
inside pictures (can you tell me, by chance, how frequent these types of code-generated
pictures are?).
But there are also examples like the "Veer Teja Vidhya Mandir School" article I
mentioned (chars = 404), where the template is simply highly underused and bloats the
syntax.
@Federico: You are completely right, size in bytes is a good indicator for many things;
you could, for example, argue that it accurately measures the work put into an article by the
editors, as constructing the wiki syntax can be a big part of a good article.
@Aaron: "You've severely limited the range of your regressor and therefor
invalidated a set of assumptions for the correlation."
You seem to be very confused about some statistical concepts:
1. You mix up the concept of inferential statistics with a descriptive statistical analysis
when you tell me that reporting the result of an experiment on a sample (and it was declared
as nothing more) is a "mistake". All I said was that in this sample, with
my (ad-hoc!) method, this is the result; no inference about the articles
beyond that sample. It turns out I was correct, no mistake whatsoever. For me it was
interesting enough to post to the list that there is no correlation between the two
variables in *this* sample. That is still a very interesting result, as evidently, at
least in this byte-size range (maybe in others?), there is no or only a tiny correlation with
the display char size. I'm happy that you took the time to investigate articles
outside this sample; that's the kind of input for which I turned to the research
list.
2. I didn't "invalidate" anything; I ran a completely appropriate Pearson
correlation over a sample I chose, however unrepresentative that sample may be (again:
inferential vs. descriptive statistics). FYI: a correlation doesn't have a regressor, as
you don't have to decide which is the independent and which is the dependent variable.
That's regression, which adds no substantial information here, imho (you can draw a
fitted line and its R^2 on a scatterplot just fine without doing a regression).
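Both halves of that point can be checked numerically on synthetic data: Pearson's r is symmetric in its two arguments, and for a simple linear fit of y on x, R^2 is exactly r^2, so the regression adds nothing beyond r.

```python
import random

random.seed(0)

def pearson(a, b):
    """Pearson correlation; note it is symmetric in a and b."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

x = [random.random() for _ in range(200)]
y = [2 * xi + random.gauss(0, 0.3) for xi in x]

# No independent/dependent distinction is needed for r:
r_xy, r_yx = pearson(x, y), pearson(y, x)

# Ordinary least-squares fit of y on x, and its R^2:
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx
ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - my) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(abs(r_xy - r_yx) < 1e-9, abs(r_xy ** 2 - r_squared) < 1e-9)
```

Both comparisons come out True: swapping the variables leaves r unchanged, and the fit's R^2 coincides with r^2 up to floating-point error.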
Moreover, in your replication you repeatedly ignored the fact that I also filtered out
"Disambiguation" articles. Of course you won't then get the exact same results
as me.
As soon as I find the time, I will also run my analysis over a sample outside the limited
5800-6000 byte range to see what comes out.
Best,
Fabian
On 06.08.2013, at 10:53, Federico Leva (Nemo) <nemowiki@gmail.com> wrote:
Ziko van Dijk, 06/08/2013 02:12:
Hello,
When I made some observations on language versions in 2008, it struck me
that in some cases the wiki syntax and the "meta article information" took up
more KB than the whole encyclopedic content of an article. For example,
the wikicode of the article "Berlin" in Upper Sorbian consisted of more
than 50% characters for categories, interwiki links, etc. This made me
largely disregard the corresponding features of the Wikimedia statistics.
You'd better not disregard it completely, as it is used as a key metric
for evaluating e.g. the WMF university programs (whether that is a good or a bad
thing). ;-) I don't know how sophisticated a variant of the metric they
use; probably whatever the new metrics.wmflabs.org uses.
Personally, I often find the database size in the WikiStats tables a useful
metric for checking the evolution of a single wiki, as it fluctuates less
and is harder to cheat than other metrics, short of huge bot imports. It
requires greater care in cross-wiki comparisons, of course.
Nemo
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods
Dipl.-Medwiss. Fabian Flöck
Research Associate
Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe
Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: fabian.floeck@kit.edu
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association