Hi researchers,
I could use a little help with understanding these dumps:
https://dumps.wikimedia.org/enwikisource/latest/
https://dumps.wikimedia.org/enwiki/20150901/
I'm trying to verify the claim that ENWP is the world's largest open text project, and to do that I need to verify that ENWP is larger than English Wikisource. Which files should I be comparing?
Are there any other projects that could make a claim to be a larger open text project than ENWP? Perhaps there's a library somewhere that has such a huge volume of out-of-copyright materials that the combined bytes of published text are larger than ENWP?
Thanks!
Pine
Hi Pine,
TL;DR: best to just say it's the largest encyclopedia ever. That should be safe.
Claims like this are hard to make because terms that seem concrete from afar tend to break down up close. For example: What do you mean by largest?
Largest in bytes? Words? Content "units" (articles vs. manuscripts in this case, I guess)? Contributors?
What do you mean by "open text project"? Is archive.org an open text project? It has 8.2 million books. How would you compare the two? Does 1 book = 1 article?
Having said all that, I'm curious how others have/would craft a claim like this. My guess is that most of us who've written for an academic audience have settled for some variant of "largest encyclopedia" (you've got to put something in your Introduction paragraph, after all). What sayst?
J
In reply to Pine W wiki.pine@gmail.com, Tue, Sep 15, 2015 at 4:45 PM
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
I'm pretty sure that English Wikipedia is the largest English language encyclopaedia, but there are some humongous ones in China.
Baidu Baike, with almost 12.5 million articles, is way bigger than any one language version of Wikipedia, and Baike.com (formerly Hudong) is about a million bigger still.
OK, they are more inclusionist than us, recipes included, and they have somewhat dropped the distinction between a dictionary and an encyclopaedia.
So you can claim that Wikipedia, with nearly 35 million articles in 288 languages, is the largest encyclopaedia ever. Adding Wiktionary would make that even bigger.
Source: Wikipedia. I'm afraid I don't speak Chinese, so I can't check them myself.
Of course, article count is a flawed metric: combining almost all the individual Pokemon articles into a handful of lists reduced the number of Wikipedia articles by hundreds, but still left us with more information on Pokemon than I would want to see in a printed encyclopaedia. But then, can anyone suggest a meaningful metric for comparing such projects? Participants? Contributed edits? Shelf space if printed in traditional encyclopaedia-sized books? Gigabytes of text? Trays of microfiche?
Regards
Jonathan
In reply to Jonathan T. Morgan jmorgan@wikimedia.org (Senior Design Researcher, Wikimedia Foundation; User:Jmorgan (WMF)), 16 Sep 2015 at 01:24
I was thinking in terms of GB of text.
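One rough way to put a number on "GB of text" is to stream each project's XML dump and total the wikitext of its content pages. The sketch below is only an illustration under stated assumptions, not a method anyone in this thread proposed. It assumes the files to compare are the `pages-articles` exports (for example `enwiki-20150901-pages-articles.xml.bz2` from the directories linked above), and the names `measure_dump` and `measure_file` are hypothetical. It counts raw wikitext, markup and templates included, and by default only pages in namespace 0. Wikisource keeps much of its transcribed text in other namespaces, so the namespace choice can change the answer considerably.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def measure_dump(stream, namespaces=("0",)):
    """Stream a MediaWiki XML export and total the wikitext it contains.

    Returns the number of pages counted, the UTF-8 byte size of their
    wikitext, and a whitespace-delimited word count. Only pages whose
    <ns> value is listed in `namespaces` are counted.
    """
    totals = {"pages": 0, "bytes": 0, "words": 0}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        ns = None
        text = ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "ns":
                ns = (child.text or "").strip()
            elif name == "text":
                # With several revisions per page, the last one seen wins.
                text = child.text or ""
        if ns in namespaces:
            totals["pages"] += 1
            totals["bytes"] += len(text.encode("utf-8"))
            totals["words"] += len(text.split())
        elem.clear()  # drop the page's children to keep memory use down
    return totals


def measure_file(path, namespaces=("0",)):
    """Convenience wrapper for a bz2-compressed dump file (path is assumed)."""
    with bz2.open(path, "rb") as stream:
        return measure_dump(stream, namespaces)


# Tiny in-memory stand-in for a real dump, so the sketch runs as-is.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><ns>0</ns>
    <revision><text>Hello wiki world</text></revision></page>
  <page><title>Talk:A</title><ns>1</ns>
    <revision><text>discussion</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(measure_dump(io.BytesIO(SAMPLE)))
    # {'pages': 1, 'bytes': 16, 'words': 3}
```

Running `measure_file` on the two dumps would give comparable byte, word, and page totals, but the result still depends on the definitional choices raised earlier in the thread: which namespaces count as content, and whether markup should count as text.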
I too have wondered about creating closer ties between Wiktionary, Wikipedia and Wikisource so that it's easier for someone to start their search on one site and quickly find relevant pages on the other sites. This might (among other things) lead to an increase in pageviews. (Adding Toby to this email chain to see if he has any thoughts about that.) It would also conceivably lead to an increase in the "size" of Wikipedia (measured in bytes, content pages, and contributors) if Wiktionary and Wikisource were, for purposes of the reader, practically the same site. The downside might be increased complexity for contributors as the number of workflows increases, and the standards for inclusion may be different.
Pine
In reply to WereSpielChequers werespielchequers@gmail.com, Wed, Sep 16, 2015 at 12:21 AM
Search is a Discovery team focus, rather than a Readership focus. I'd suggest reaching out to Dan Garry (we have been talking about project integration very recently, actually).
In reply to Pine W wiki.pine@gmail.com, 16 September 2015 at 15:32
Oh, ok! Now pinging Discovery Dan. (:
In reply to Oliver Keyes okeyes@wikimedia.org (Count Logula, Wikimedia Foundation), Sep 16, 2015 1:04 PM
On Thu, Sep 17, 2015 at 2:13 AM, Pine W wiki.pine@gmail.com wrote:
Oh, ok! Now pinging Discovery Dan. (:
On that topic please see https://phabricator.wikimedia.org/T103102 as well.
Cheers Lydia