Re: [Wiki-research-l] Readable characters vs. size in bytes of articles - Wiki-research-l

7 Aug 2013


      Ok, I don't want to spam, but...
I just finished downloading another random sample of 5000 articles with sizes of 5000 bytes and below (this time not excluding disambiguation articles, as the respective DB query is quite slow...). For these, the correlation is 0.514. See plot [1].
This seems to support my earlier observation that articles with a smaller byte size *might* show a weaker correlation.
Very interesting topic, too bad I have to get back to project work now.. :)
Best,
Fabian
[1] https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_UNDER5001byte...
________________________________________
From: Giovanni Luca Ciampaglia [glciampagl@gmail.com]
Sent: Wednesday, August 07, 2013 3:55 PM
To: Floeck, Fabian (AIFB)
Subject: Re: [Wiki-research-l] Readable characters vs. size in bytes of articles
Hi Fabian,
in principle you should be able to recover the same correlation also in the
range 5800-6000 Kb, provided that you control for the noise in the data. From
your scatterplot it looks like the variance of the residuals is constant (a
scatter plot of the residuals should be enough to confirm this), so if you
standardize the residuals by the standard deviation of the residuals you should
be able to recover the correlation, though the significance of such an estimate
might be at risk if the sample size is small.
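
The constant-variance check mentioned above can also be done numerically instead of by eye. The following is a minimal sketch on synthetic data, not the data from this thread: the slope of 0.6, the noise level, and the function names fit_line and residual_sd_by_bin are all assumptions made for illustration. It fits a least-squares line, sorts the residuals by byte size, and compares their standard deviation across equally filled bins.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_sd_by_bin(xs, ys, n_bins=5):
    """Standard deviation of the residuals in n_bins equally filled size bins."""
    slope, intercept = fit_line(xs, ys)
    # Sort (size, residual) pairs by size so each bin covers a size range.
    pairs = sorted((x, y - (slope * x + intercept)) for x, y in zip(xs, ys))
    per_bin = len(pairs) // n_bins
    return [
        statistics.pstdev([r for _, r in pairs[i * per_bin:(i + 1) * per_bin]])
        for i in range(n_bins)
    ]

# Synthetic stand-in for (byte size, readable characters); all numbers assumed.
rng = random.Random(42)
byte_sizes = [rng.uniform(1, 100_000) for _ in range(5000)]
readable_chars = [0.6 * b + rng.gauss(0, 1000) for b in byte_sizes]

sds = residual_sd_by_bin(byte_sizes, readable_chars)
print([round(sd) for sd in sds])
```

If the per-bin values are roughly equal, the residual variance does not depend on article size; a clear trend across bins would suggest otherwise.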
Thanks for the interesting discussion.
Best,
Giovanni
On Wed 07 Aug 2013 08:02:21 AM EDT, Floeck, Fabian (AIFB) wrote:
...
Update:
Not surprisingly, and congruent with Aaron's results, I also get a high linear
correlation of 0.96 (random sample of 5000 articles) outside the 5800-6000
sample even if I filter out disambiguation articles. See scatterplot [1].
So first of all, we can conclude with fair certainty that our methods of
cleaning are quite similar and are not the cause of major differences in the
measurements.
Secondly, this seems to be a very interesting distribution where the overall
correlation is very high but in certain sections (I'm using a speculative
plural here) of the distribution it is very low.
That means we can make the statement "In general, byte size and display char
length of an article are highly correlated". This is however not automatically
valid once you limit the byte size of the articles you look at in any way (for
example, by just looking at Featured Articles, Stubs, etc. (I have yet to
check these myself)). Then, you will have to check again if the statement
holds true for the given subsample.
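
The effect described above, a high overall correlation that drops once the byte size is restricted to a narrow window, can be reproduced with a toy simulation. This is only a sketch under stated assumptions: the linear model (readable characters = 0.6 * bytes plus constant-variance Gaussian noise), the size ranges, and the helper names pearson and simulate are invented for illustration and are not fitted to the Wikipedia samples discussed in this thread, so the resulting values will not match 0.96 or 0.514.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(lo, hi, n=5000, seed=0):
    """Draw n synthetic articles with byte sizes uniform in [lo, hi].

    Assumed toy model: readable chars = 0.6 * bytes + noise with sd 1000.
    """
    rng = random.Random(seed)
    sizes = [rng.uniform(lo, hi) for _ in range(n)]
    chars = [0.6 * b + rng.gauss(0, 1000) for b in sizes]
    return sizes, chars

r_all = pearson(*simulate(1, 100_000))     # wide range of sizes
r_small = pearson(*simulate(1, 5000))      # only articles up to 5000 bytes
r_narrow = pearson(*simulate(5800, 6000))  # only a 200-byte window

print(f"all sizes:       r = {r_all:.3f}")
print(f"<= 5000 bytes:   r = {r_small:.3f}")
print(f"5800-6000 bytes: r = {r_narrow:.3f}")
```

The underlying relationship and the noise are identical in all three runs; only the spread of byte sizes changes. The narrower the window, the smaller the share of variance explained by size, and the lower the correlation.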
These results show very nicely that random sampling in a population is not an
infallible "universal weapon" for inferring information about all subsamples
of a population and should, where possible, always be cross-checked with
non-random selective sample analysis.
The next step I'd like to take is to look at articles above or below a certain
size to see if the correlation differs (maybe because a different proportion
of their content is made up of templates).
I'll give an update asap.
Best,
Fabian
[1] https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_NOdisamb1.png
--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gciampag@indiana.edu