Re: [Wiki-research-l] Readable characters vs. size in bytes of articles - Wiki-research-l

7 Aug 2013

Ok, I don't want to spam, but...

I just finished downloading another random sample of 5000 articles with sizes of 5000
bytes and below (this time not excluding disambiguation articles, as the respective DB
query is quite slow...). For these, the correlation is 0.514; see plot [1].

This seems to support my earlier observation that articles with a smaller byte
size *might* show a weaker correlation.

Very interesting topic, too bad I have to get back to project work now.. :)

Best, 

Fabian

[1] https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_UNDER5001byt…

________________________________________
From: Giovanni Luca Ciampaglia [glciampagl(a)gmail.com]
Sent: Wednesday, August 07, 2013 3:55 PM
To: Floeck, Fabian (AIFB)
Subject: Re: [Wiki-research-l] Readable characters vs. size in bytes of articles

Hi Fabian,

in principle you should be able to recover the same correlation also in the
range 5800-6000 Kb, provided that you control for the noise in the data. From
your scatterplot it looks like the variance of the residuals is constant (a
scatter plot of the residuals should be enough to confirm this), so if you
standardize the residuals by their standard deviation you should be able to
recover the correlation, though the significance of such an estimate might be
at risk if the sample size is small.
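
The residual check suggested above could be sketched roughly as follows. This is only an illustration on synthetic data, not the analysis either of us actually ran: the slope of 0.6, the noise level of 1500, the bin edges, and the helper names `fit_line` and `residual_sd_by_bin` are all made-up assumptions. It fits a least-squares line and compares the standard deviation of the residuals across byte-size bins; roughly equal values would be consistent with constant residual variance.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_sd_by_bin(xs, ys, edges):
    """Return (lo, hi, count, residual SD) for each half-open bin [lo, hi)."""
    slope, intercept = fit_line(xs, ys)
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        res = [y - (slope * x + intercept) for x, y in zip(xs, ys) if lo <= x < hi]
        if len(res) > 1:
            mean = sum(res) / len(res)
            sd = math.sqrt(sum((r - mean) ** 2 for r in res) / (len(res) - 1))
        else:
            sd = float("nan")  # not enough points to estimate a spread
        rows.append((lo, hi, len(res), sd))
    return rows


# Synthetic stand-in for (byte size, readable characters); all values are assumed.
rng = random.Random(1)
byte_sizes = [rng.uniform(0, 50000) for _ in range(5000)]
readable_chars = [0.6 * b + rng.gauss(0, 1500) for b in byte_sizes]

table = residual_sd_by_bin(byte_sizes, readable_chars,
                           [0, 10000, 20000, 30000, 40000, 50000])
for lo, hi, count, sd in table:
    print(f"{lo:>6}-{hi:<6} n={count:<5} residual SD={sd:.1f}")
```

With real data, the two lists would be replaced by the measured byte sizes and readable-character counts of the sampled articles.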

Thanks for the interesting discussion.

Best,

Giovanni

On Wed 07 Aug 2013 08:02:21 AM EDT, Floeck, Fabian (AIFB) wrote:
...

 Update:

 Not surprisingly, and congruent with Aaron's results, I also get a high linear
 correlation of 0.96 (random sample of 5000 articles) outside the 5800-6000
 sample, even if I filter out disambiguation articles. See scatterplot [1].

 So first of all, it can be concluded with fair certainty that our methods of
 cleaning are quite similar and are not the cause of major differences in the
 measurements.
 Secondly, this seems to be a very interesting distribution where the overall
 correlation is very high, but in certain sections (I'm using a speculative
 plural here) of the distribution it is very low.
 That means we can make the statement "In general, byte size and display char
 length of an article are highly correlated". This is however not automatically
 valid once you limit the byte size of the articles you look at in any way (for
 example, by just looking at Featured Articles, Stubs, etc. (I have yet to
 check these myself)). Then, you will have to check again if the statement
 holds true for the given subsample.
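
The effect described above (high overall correlation, low correlation inside a narrow byte-size band) can be reproduced on purely synthetic data. The following is a minimal sketch, not the actual measurement: the linear relation with slope 0.6, the noise level of 1500, the uniform size range, and the function names are all assumptions chosen for illustration. Only the 5800-6000 band is taken from the discussion.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, seed=0):
    """Synthetic (byte size, readable chars) pairs; every parameter is assumed."""
    rng = random.Random(seed)
    sizes = [rng.uniform(500, 50000) for _ in range(n)]
    chars = [0.6 * b + rng.gauss(0, 1500) for b in sizes]
    return sizes, chars


def correlation_in_band(sizes, chars, lo, hi):
    """Correlation restricted to items with lo <= byte size <= hi."""
    pairs = [(b, c) for b, c in zip(sizes, chars) if lo <= b <= hi]
    band_sizes, band_chars = zip(*pairs)
    return pearson(band_sizes, band_chars), len(pairs)


sizes, chars = simulate()
r_all = pearson(sizes, chars)
r_band, n_band = correlation_in_band(sizes, chars, 5800, 6000)
print(f"all articles:    r = {r_all:.3f} (n = {len(sizes)})")
print(f"5800-6000 bytes: r = {r_band:.3f} (n = {n_band})")
```

In this toy setup the same linear relation and the same noise hold everywhere, yet inside the band the spread in byte size is tiny compared to the noise, so the within-band correlation is close to zero. Whether this alone explains the real data is an open question.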

 These results show very nicely that random sampling in a population is not an
 infallible "universal weapon" for inferring information about all subsamples
 of a population and should, where possible, always be cross-checked with
 non-random selective sample analysis.

 The next step I'd like to take is to look at articles above or below a certain
 size to see if the correlation differs (maybe because a different proportion
 of their content is made up of templates).

 I'll give an update asap.

 Best,

 Fabian

 [1] https://dl.dropboxusercontent.com/u/3021002/scatter_random5000_NOdisamb1.png

--
Giovanni Luca Ciampaglia

Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University

✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gciampag(a)indiana.edu