Re: [Wiki-research-l] Request for statistics: effectiveness of WikiLove

List overview All Threads
Download

newer

older

Maps based on Wikidata data

2014 ACM Web Science Conference...

ENWP Pine

26 Jul 2013 26 Jul '13

6:18 a.m.

Also remember that barnstars and PUAs existed before Wikilove and continue to be awarded without using Wikilove. Pine

Attachments:

attachment.htm (text/html — 350 bytes)

Show replies by date

Floeck, Fabian (AIFB)

2 Aug 2 Aug

5:24 p.m.

New subject: Readable characters vs. size in bytes of articles

Hi, to whoever is interested in this (and I hope I didn't just repeat someone else's experiments on this): I wanted to know if a "long" or "short" article in terms of how much readable material (excluding pictures) is presented to the reader in the front-end is correlated to the byte size of the Wikisyntax which can be obtained from the DB or API; as people often define the "length" of an article by its length in bytes. TL;DR: Turns out size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article, even worse than I thought. We "curl"ed the front-end HTML of all articles of the English Wikipedia (ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as around 5900 bytes is the total en.wiki average for these articles). = 41981 articles. Results for size in characters (w/ whitespaces) after cleaning the HTML out: Min= 95 Max= 49441 Mean=4794.41 Std. Deviation=1712.748 Especially the gap between Min and Max was interesting. But templates make it possible. (See e.g. "Veer Teja Vidhya Mandir School", "Martin Callanan" -- Allthough for the ladder you could argue that expandable template listings are not really main "reading" content..) Effectively, correlation for readable character size with byte size = 0.04 (i.e. none) in the sample. If someone already did this or a similar analysis, I'd appreciate pointers. Best, Fabian -- Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods Dipl.-Medwiss. Fabian Flöck Research Associate Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: fabian.floeck(a)kit.edu WWW: http://www.aifb.kit.edu/web/Fabian_Flöck KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Aaron Halfaker

4 Aug 4 Aug

9:43 a.m.

New subject: Readable characters vs. size in bytes of articles

I just replicated this analysis. I think you might have made some mistakes. I took a random sample of non-redirect articles from English Wikipedia and compared the byte_length (from database) to the content_length (from API, tags and comments stripped). I get a pearson correlation coef of *0.9514766*. See the attached scatter plot including a linear regression line. See also the regress output below. Call: lm(formula = page_len ~ content_length, data = pages) Residuals: Min 1Q Median 3Q Max -38263 -419 82 592 37605 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -97.40412 72.46523 -1.344 0.179 content_length 1.14991 0.00832 138.210 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2722 on 1998 degrees of freedom Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053 F-statistic: 1.91e+04 on 1 and 1998 DF, p-value: < 2.2e-16 On Fri, Aug 2, 2013 at 12:24 PM, Floeck, Fabian (AIFB) < fabian.floeck(a)kit.edu> wrote:

...

Floeck, Fabian (AIFB)

15 Mar 15 Mar

5:47 p.m.

New subject: Readable characters vs. size in bytes of articles

Aaron, this seems kind of redundant as I already agreed that there is an overall high correlation and you posted this (almost) identical analysis 7 months ago. I don't know if you missed my later emails on the topic, but I already wrote that this "mistake" as you repeatedly put it, was a result of the selective sampling between 5000 and 6000 bytes. Hence, as I already said, my initial observations cannot be transferred to the general population of articles. Not surprising and congruent with Aarons results, I also get a high linear correlation of 0.96 (random sample of 5000 articles) outside the 5800-6000 sample even if I filter out Disamb articles. But, as I as well explained, there seem to be some indicators that in smaller size articles, this correlation is not as strong. I split up the random 5000 article sample I posted last time at the median (3709 bytes) into two parts, each 2500 articles big. For the "higher byte size" part (>3709 bytes) the correlation is 0.964 For the "lesser byte size" part (<3710 bytes ) the correlation is only 0.295 You will of course not see that in your example if you just take all data (of all article sizes) and draw a straight regression line through them. The "blob" on the bottom left might need some further investigation. Maybe you could look at only articles under 5000, 3000, 1000 bytes and see if the correlation changes somehow. My guess is it will be less strong. BTW: did you try to fit nonlinear models? I did not, and one reason for the bad fit in the lesser size articles could also be that there's a high correlation but not a linear one. Best, Fabian On 04.08.2013, at 11:43, Aaron Halfaker <aaron.halfaker@gmail.com<mailto:aaron.halfaker@gmail.com>> wrote: I just replicated this analysis. I think you might have made some mistakes. I took a random sample of non-redirect articles from English Wikipedia and compared the byte_length (from database) to the content_length (from API, tags and comments stripped). I get a pearson correlation coef of 0.9514766. See the attached scatter plot including a linear regression line. See also the regress output below. Call: lm(formula = page_len ~ content_length, data = pages) Residuals: Min 1Q Median 3Q Max -38263 -419 82 592 37605 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -97.40412 72.46523 -1.344 0.179 content_length 1.14991 0.00832 138.210 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2722 on 1998 degrees of freedom Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053 F-statistic: 1.91e+04 on 1 and 1998 DF, p-value: < 2.2e-16 On Fri, Aug 2, 2013 at 12:24 PM, Floeck, Fabian (AIFB) <fabian.floeck@kit.edu<mailto:fabian.floeck@kit.edu>> wrote: Hi, to whoever is interested in this (and I hope I didn't just repeat someone else's experiments on this): I wanted to know if a "long" or "short" article in terms of how much readable material (excluding pictures) is presented to the reader in the front-end is correlated to the byte size of the Wikisyntax which can be obtained from the DB or API; as people often define the "length" of an article by its length in bytes. TL;DR: Turns out size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article, even worse than I thought. We "curl"ed the front-end HTML of all articles of the English Wikipedia (ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as around 5900 bytes is the total en.wiki average for these articles). = 41981 articles. Results for size in characters (w/ whitespaces) after cleaning the HTML out: Min= 95 Max= 49441 Mean=4794.41 Std. Deviation=1712.748 Especially the gap between Min and Max was interesting. But templates make it possible. (See e.g. "Veer Teja Vidhya Mandir School", "Martin Callanan" -- Allthough for the ladder you could argue that expandable template listings are not really main "reading" content..) Effectively, correlation for readable character size with byte size = 0.04 (i.e. none) in the sample. If someone already did this or a similar analysis, I'd appreciate pointers. Best, Fabian -- Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods Dipl.-Medwiss. Fabian Flöck Research Associate Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: fabian.floeck@kit.edu<mailto:fabian.floeck@kit.edu> WWW: http://www.aifb.kit.edu/web/Fabian_Flöck<http://www.aifb.kit.edu/web/Fab… KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association _______________________________________________ Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org<mailto:Wiki-research-l@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l <bytes.content_length.scatter.png><ATT00001.c> -- Dipl.-Medwiss. Fabian Flöck Research Associate Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: floeck@kit.edu<mailto:floeck@kit.edu> WWW: http://www.aifb.kit.edu/web/Fabian_Flöck KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Aaron Halfaker

6:20 p.m.

New subject: Readable characters vs. size in bytes of articles

Hi Fabian, I think that the primary reason that articles with smaller byte counts show less consistency is due to templates. A lot of stubs and starts are created with a collection of templates that consume few bytes of wikitext, but balloon into lots of HTML/content. Regardless, there doesn't seem to be much cause for concern, so I saw the issue as resolved. FWIW, I originally showed up in this conversation because I was skeptical of your initial conclusion: "size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article". Now that we've worked out the strong correlation between wikitext length and readable content length for nearly all articles, I have little interest in looking into the data further. -Aaron On Sat, Mar 15, 2014 at 12:47 PM, Floeck, Fabian (AIFB) < fabian.floeck(a)kit.edu> wrote:

...

Hi, to whoever is interested in this (and I hope I didn't just repeat someone else's experiments on this): I wanted to know if a "long" or "short" article in terms of how much readable material (excluding pictures) is presented to the reader in the front-end is correlated to the byte size of the Wikisyntax which can be obtained from the DB or API; as people often define the "length" of an article by its length in bytes. TL;DR: Turns out size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article, even worse than I thought. We "curl"ed the front-end HTML of all articles of the English Wikipedia (ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as around 5900 bytes is the total en.wiki average for these articles). = 41981 articles. Results for size in characters (w/ whitespaces) after cleaning the HTML out: Min= 95 Max= 49441 Mean=4794.41 Std. Deviation=1712.748 Especially the gap between Min and Max was interesting. But templates make it possible. (See e.g. "Veer Teja Vidhya Mandir School", "Martin Callanan" -- Allthough for the ladder you could argue that expandable template listings are not really main "reading" content..) Effectively, correlation for readable character size with byte size = 0.04 (i.e. none) in the sample. If someone already did this or a similar analysis, I'd appreciate pointers. Best, Fabian -- Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods Dipl.-Medwiss. Fabian Flöck Research Associate Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: fabian.floeck(a)kit.edu WWW: http://www.aifb.kit.edu/web/Fabian_Flöck KIT - University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

<bytes.content_length.scatter.png><ATT00001.c> -- Dipl.-Medwiss. Fabian Flöck Research Associate Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: floeck(a)kit.edu WWW: http://www.aifb.kit.edu/web/Fabian_Flöck KIT - University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Taha Yasseri

9:11 p.m.

New subject: Readable characters vs. size in bytes of articles

Hi, Aaron, I tend to agree with your conclusion, and personally have little interest in the relationship between actual size and readable size. But from technical point of view, I guess you should plot your scatter plot in log-log scale and also calculate the correlation between the logarithm of the variables. The sizes are not normally distributed but log-normally [1], and linear statistics on heavy-tailed distributions are usually spurious. [1] http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/jour… Take care, Taha On 15 Mar 2014 18:21, "Aaron Halfaker" <aaron.halfaker(a)gmail.com> wrote:

...

Hi, to whoever is interested in this (and I hope I didn't just repeat someone else's experiments on this): I wanted to know if a "long" or "short" article in terms of how much readable material (excluding pictures) is presented to the reader in the front-end is correlated to the byte size of the Wikisyntax which can be obtained from the DB or API; as people often define the "length" of an article by its length in bytes. TL;DR: Turns out size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article, even worse than I thought. We "curl"ed the front-end HTML of all articles of the English Wikipedia (ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as around 5900 bytes is the total en.wiki average for these articles). = 41981 articles. Results for size in characters (w/ whitespaces) after cleaning the HTML out: Min= 95 Max= 49441 Mean=4794.41 Std. Deviation=1712.748 Especially the gap between Min and Max was interesting. But templates make it possible. (See e.g. "Veer Teja Vidhya Mandir School", "Martin Callanan" -- Allthough for the ladder you could argue that expandable template listings are not really main "reading" content..) Effectively, correlation for readable character size with byte size = 0.04 (i.e. none) in the sample. If someone already did this or a similar analysis, I'd appreciate pointers. Best, Fabian -- Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods Dipl.-Medwiss. Fabian Flöck Research Associate Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: fabian.floeck(a)kit.edu WWW: http://www.aifb.kit.edu/web/Fabian_Flöck KIT - University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

_______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

WereSpielChequers

4 Aug 4 Aug

4:59 p.m.

New subject: Readable characters vs. size in bytes of articles

Hi Fabian, That's interesting. When you say you stripped out the html did you also strip out the other parts of the references? Some citation styles will take up more bytes than others, and citation style is supposed to be consistent at the article level. It would also make a difference whether you included or excluded alt text from readable material as I suspect it is non granular - ie if someone is going to create alt text for one picture in an article they will do so for all pictures. More significantly there is a big difference in standards of referencing , broadly the higher the assessed quality and or the more contentious the article the more references there will be. I would expect that if you factored that in there would be some correlation between readable length and bytes within assessed classes of quality, and the outliers would include some of the controversial articles like Jerusalem (353 references) Hope that helps. Jonathan On 2 August 2013 18:24, Floeck, Fabian (AIFB) <fabian.floeck(a)kit.edu> wrote:

...

Aaron Halfaker

11:15 p.m.

New subject: Readable characters vs. size in bytes of articles

...

_______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Floeck, Fabian (AIFB)

5 Aug 5 Aug

8:16 p.m.

New subject: Readable characters vs. size in bytes of articles

Hi, thanks for your feedback Jonathan and Aaron. @Jonathan: You are rightfully pointing at some things that could have been done differently, as this was just an ad-hoc experiment. What I did was getting the curl result of "http://en.wikipedia.org/w/api.php?action=parse&prop=text&pageid=X" and running it through BeautifulSoup [1] in Python. Regarding references: yes, all the markup was stripped away which you cannot see in form of readable characters as a human when you look at an article. Take as an example [2]: in the final output (which was the base for counting chars) what is left in characters of this reference is the readable "[1]" and " ^ William Goldenberg at the Internet Movie Database". Regarding alt text: it was completely stripped out. This can arguably be done different, if you see it as "readable main article text" as well. You are sure right that including these would lead to a higher correlation. Looking at samples from the output, the increase in correlation will however not be very big, but that's a mere hunch. Anyway, this was not what I was looking for. I wanted to compare really only the readable text you see directly when scrolling through the article. What is another issue is the inclusion of expandable template listings as I mentioned in my first mail. Are the long listings of related articles "main, readable article text"? I suppose not, but we did not filter them out yet. @Aaron, I'm pretty sure I didn't make a mistake, but before I can answer your mail: What exactly does this content_length API call give you back (I'm not aware of that). Takes the Wikisyntax and strips it of tags and comments? Or the HTML shown in the front-end including all content generated by templates minus all mark-up? Only in the ladder case would this be comparable in any way to what I have done. Please clarify and send me the concrete API call. I don't think your content_length is the length of the readable front-end text as I used it. (On a side note: I'm unsure why you paste the complete results of a linear regression, as a Pearson correlation will perfectly suffice in such a simple bivariate case. They - due to the nature of these statistical methods - of course yield the same results in this case. Or was there any important extra information that I missed in these regression results?). Best, Fabian [1] http://www.crummy.com/software/BeautifulSoup/bs4/doc/ [2] http://en.wikipedia.org/wiki/William_Goldenberg#cite_note-1 On 05.08.2013, at 01:15, Aaron Halfaker <aaron.halfaker(a)gmail.com> wrote:

...

(note that I posted this yesterday, but the message bounced due to the attached scatter plot. I just uploaded the plot to commons and re-sent) I just replicated this analysis. I think you might have made some mistakes. I took a random sample of non-redirect articles from English Wikipedia and compared the byte_length (from database) to the content_length (from API, tags and comments stripped).c I get a pearson correlation coef of 0,9514766. See the scatter plot including a linear regression line. See also the regress output below. Call: lm(formula = byte_len ~ content_length, data = pages) Residuals: Min 1Q Median 3Q Max -38263 -419 82 592 37605 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -97.40412 72.46523 -1.344 0.179 content_length 1.14991 0.00832 138.210 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2722 on 1998 degrees of freedom Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053 F-statistic: 1.91e+04 on 1 and 1998 DF, p-value: < 2.2e-16 On Mon, Aug 5, 2013 at 12:59 AM, WereSpielChequers <werespielchequers(a)gmail.com> wrote: Hi Fabian, That's interesting. When you say you stripped out the html did you also strip out the other parts of the references? Some citation styles will take up more bytes than others, and citation style is supposed to be consistent at the article level. It would also make a difference whether you included or excluded alt text from readable material as I suspect it is non granular - ie if someone is going to create alt text for one picture in an article they will do so for all pictures. More significantly there is a big difference in standards of referencing , broadly the higher the assessed quality and or the more contentious the article the more references there will be. I would expect that if you factored that in there would be some correlation between readable length and bytes within assessed classes of quality, and the outliers would include some of the controversial articles like Jerusalem (353 references) Hope that helps. Jonathan On 2 August 2013 18:24, Floeck, Fabian (AIFB) <fabian.floeck(a)kit.edu> wrote: Hi, to whoever is interested in this (and I hope I didn't just repeat someone else's experiments on this): I wanted to know if a "long" or "short" article in terms of how much readable material (excluding pictures) is presented to the reader in the front-end is correlated to the byte size of the Wikisyntax which can be obtained from the DB or API; as people often define the "length" of an article by its length in bytes. TL;DR: Turns out size in bytes is a really, really bad indicator for the actual, readable content of a Wikipedia article, even worse than I thought. We "curl"ed the front-end HTML of all articles of the English Wikipedia (ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as around 5900 bytes is the total en.wiki average for these articles). = 41981 articles. Results for size in characters (w/ whitespaces) after cleaning the HTML out: Min= 95 Max= 49441 Mean=4794.41 Std. Deviation=1712.748 Especially the gap between Min and Max was interesting. But templates make it possible. (See e.g. "Veer Teja Vidhya Mandir School", "Martin Callanan" -- Allthough for the ladder you could argue that expandable template listings are not really main "reading" content..) Effectively, correlation for readable character size with byte size = 0.04 (i.e. none) in the sample. If someone already did this or a similar analysis, I'd appreciate pointers. Best, Fabian -- Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods Dipl.-Medwiss. Fabian Flöck Research Associate Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: fabian.floeck(a)kit.edu WWW: http://www.aifb.kit.edu/web/Fabian_Flöck KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l <ATT00001.c>

-- Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods Dipl.-Medwiss. Fabian Flöck Research Associate Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: fabian.floeck(a)kit.edu WWW: http://www.aifb.kit.edu/web/Fabian_Flöck KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association

Aaron Halfaker

10:01 p.m.

New subject: Readable characters vs. size in bytes of articles

Fabian, I suspect that your primary mistake is in only looking at pages with byte length between 5800 and 6000 bytes. You've severely limited the range of your regressor and therefor invalidated a set of assumptions for the correlation. If you still don't think that you made a mistake, I suggest that you try a scatter plot like the one that I have provided. Yes, of course I got the HTML from the API in a similar way as you have. Here's an example that gets the parsed content of the most recent revision to Anachronism (at the time of writing this email): http://en.wikipedia.org/w/api.php?action=query&prop=revisions&revid… Note that I am looking up page content by revisions ID. When I generated my page sample, I also gathered the most recent revision ID from the page history so that when I submitted a follow-up call to get the page's content, I would gather the parsed content at the same point in time. As for the regression, it's more informative than a simple pearson correlation because it separates the fitted slope (beta coef) from the fitness of the model (R^2). Without the regression, I could not have plotted the fitted line over the scatter plot. I've uploaded my code to bitbucket for your reference. Specifically, see: - Page sample query<https://bitbucket.org/halfak/byte_length/src/bba477bd1dcd1311498de… - API querying code<https://bitbucket.org/halfak/byte_length/src/bba477bd1dcd1311498de8… (also strips HTML & comments based on regexs) - R code for generating plot and stats<https://bitbucket.org/halfak/byte_length/src/bba477bd1dcd1311498de… -Aaron On Aug 6, 2013 4:17 AM, "Floeck, Fabian (AIFB)" <fabian.floeck(a)kit.edu> wrote:

...

Hi, thanks for your feedback Jonathan and Aaron. @Jonathan: You are rightfully pointing at some things that could have been done differently, as this was just an ad-hoc experiment. What I did was getting the curl result of " http://en.wikipedia.org/w/api.php?action=parse&prop=text&pageid=X&q… and running it through BeautifulSoup [1] in Python. Regarding references: yes, all the markup was stripped away which you cannot see in form of readable characters as a human when you look at an article. Take as an example [2]: in the final output (which was the base for counting chars) what is left in characters of this reference is the readable "[1]" and " ^ William Goldenberg at the Internet Movie Database". Regarding alt text: it was completely stripped out. This can arguably be done different, if you see it as "readable main article text" as well. You are sure right that including these would lead to a higher correlation. Looking at samples from the output, the increase in correlation will however not be very big, but that's a mere hunch. Anyway, this was not what I was looking for. I wanted to compare really only the readable text you see directly when scrolling through the article. What is another issue is the inclusion of expandable template listings as I mentioned in my first mail. Are the long listings of related articles "main, readable article text"? I suppose not, but we did not filter them out yet. @Aaron, I'm pretty sure I didn't make a mistake, but before I can answer your mail: What exactly does this content_length API call give you back (I'm not aware of that). Takes the Wikisyntax and strips it of tags and comments? Or the HTML shown in the front-end including all content generated by templates minus all mark-up? Only in the ladder case would this be comparable in any way to what I have done. Please clarify and send me the concrete API call. I don't think your content_length is the length of the readable front-end text as I used it. (On a side note: I'm unsure why you paste the complete results of a linear regression, as a Pearson correlation will perfectly suffice in such a simple bivariate case. They - due to the nature of these statistical methods - of course yield the same results in this case. Or was there any important extra information that I missed in these regression results?). Best, Fabian [1] http://www.crummy.com/software/BeautifulSoup/bs4/doc/ [2] http://en.wikipedia.org/wiki/William_Goldenberg#cite_note-1 On 05.08.2013, at 01:15, Aaron Halfaker <aaron.halfaker(a)gmail.com> wrote:

(note that I posted this yesterday, but the message bounced due to the

attached scatter plot. I just uploaded the plot to commons and re-sent)

I just replicated this analysis. I think you might have made some

mistakes.

I took a random sample of non-redirect articles from English Wikipedia

and compared the byte_length (from database) to the content_length (from API, tags and comments stripped).c

I get a pearson correlation coef of 0,9514766. See the scatter plot including a linear regression line. See also the

regress output below.

Call: lm(formula = byte_len ~ content_length, data = pages) Residuals: Min 1Q Median 3Q Max -38263 -419 82 592 37605 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -97.40412 72.46523 -1.344 0.179 content_length 1.14991 0.00832 138.210 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2722 on 1998 degrees of freedom Multiple R-squared: 0.9053, Adjusted R-squared: 0.9053 F-statistic: 1.91e+04 on 1 and 1998 DF, p-value: < 2.2e-16 On Mon, Aug 5, 2013 at 12:59 AM, WereSpielChequers <

werespielchequers(a)gmail.com> wrote:

Hi Fabian, That's interesting. When you say you stripped out the html did you also

strip out the other parts of the references? Some citation styles will take up more bytes than others, and citation style is supposed to be consistent at the article level.

It would also make a difference whether you included or excluded alt

text from readable material as I suspect it is non granular - ie if someone is going to create alt text for one picture in an article they will do so for all pictures.

More significantly there is a big difference in standards of referencing

, broadly the higher the assessed quality and or the more contentious the article the more references there will be.

I would expect that if you factored that in there would be some

correlation between readable length and bytes within assessed classes of quality, and the outliers would include some of the controversial articles like Jerusalem (353 references)

Hope that helps. Jonathan On 2 August 2013 18:24, Floeck, Fabian (AIFB) <fabian.floeck(a)kit.edu>

wrote:

Hi, to whoever is interested in this (and I hope I didn't just repeat

someone else's experiments on this):

I wanted to know if a "long" or "short" article in terms of how much

readable material (excluding pictures) is presented to the reader in the front-end is correlated to the byte size of the Wikisyntax which can be obtained from the DB or API; as people often define the "length" of an article by its length in bytes.

TL;DR: Turns out size in bytes is a really, really bad indicator for the

actual, readable content of a Wikipedia article, even worse than I thought.

We "curl"ed the front-end HTML of all articles of the English Wikipedia

(ns=0, no disambiguation, no redirects) between 5800 and 6000 bytes (as around 5900 bytes is the total en.wiki average for these articles). = 41981 articles.

Results for size in characters (w/ whitespaces) after cleaning the HTML

out:

Min= 95 Max= 49441 Mean=4794.41 Std. Deviation=1712.748 Especially the gap between Min and Max was interesting. But templates

make it possible.

(See e.g. "Veer Teja Vidhya Mandir School", "Martin Callanan" --

Allthough for the ladder you could argue that expandable template listings are not really main "reading" content..)

Effectively, correlation for readable character size with byte size =

0.04 (i.e. none) in the sample.

If someone already did this or a similar analysis, I'd appreciate

pointers.

Best, Fabian -- Karlsruhe Institute of Technology (KIT) Institute of Applied Informatics and Formal Description Methods Dipl.-Medwiss. Fabian Flöck Research Associate Building 11.40, Room 222 KIT-Campus South D-76128 Karlsruhe Phone: +49 721 608 4 6584 Fax: +49 721 608 4 6580 Skype: f.floeck_work E-Mail: fabian.floeck(a)kit.edu WWW: http://www.aifb.kit.edu/web/Fabian_Flöck KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l _______________________________________________ Wiki-research-l mailing list Wiki-research-l(a)lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l <ATT00001.c>

WereSpielChequers

10:04 p.m.

New subject: Readable characters vs. size in bytes of articles

Thanks both of you, I suspect that you two are using very different rules to define "readable characters", and for Aaron to get a close correlation and Fabian not to get any correlation implies to me that Fabian is stripping out the things that are not linked to article size, and that Aaron may be leaving such things in. For reasons that I'm going to pretend I don't understand, we have some articles with a lot of redundant spaces. Others with so few you'd be correct in thinking that certain editors have been making semiautomated edits to strip out those spaces. I suspect that Fabian's formulae ignores redundant spaces, and that Aaron's does not. I picked on alt text because it is very patchy across the pedia, but usually consistent at article level. I.e if someone has written a whole paragraph of alt text for one picture they have probably done so for every picture in an article, and conversely many articles will have no alt text at all. Similarly we have headings, and counterintuitively it is the subheadings that add most non display characters. So an article like Peasant's revolt will have 32 equals signs for its 8 headings, but 60 equal signs for its 10 subheadings. 92 bytes which I suspect one or both of you will have stripped out. The actual display text of course omits all 92 of those bytes, but repeats the content of those headings and subheadings in the contents section. The size of sections varies enormously from one article to another, and if there are three or fewer sections the contents section is not generated at all. I suspect that the average length of section headings also has quite a bit of variance as it is a stylistic choice. So I would expect that a "display bytes" count that simply stripped out the multiple equal signs would still be a pretty good correlation with article size, but a display bytes count that factored in the complication that headings and subheadings are displayed twice as they are repeated in the contents field, would have another factor drifting it away from a good correlation with raw byte count. But probably the biggest variance will be over infoboxes, tables, picture captions, hidden comments and the like. If you strip all of them out, including perhaps even the headings, captions and table contents, then you are going to get a very poor fit between article length and readable byte size. But I would be surprised if you could get Fabian's minimum display size of 95 bytes from 6,000 byte articles without having at least one article that consisted almost entirely of tables and which had been reduced to a sentence or two of narrative. So my suspicion is that Aaron's plot is at least including the displayed contents of tables et al whilst Fabian is only measuring the prose sections and completely stripping out anything in a table. Both approaches of course have their merits, and there are even some editors who were recent edit warring to keep articles they cared about free from clutter by infoboxes and tables. Regards Jonathan On 5 August 2013 21:16, Floeck, Fabian (AIFB) <fabian.floeck(a)kit.edu> wrote: