You may be interested in the word cloud I created with the full archive of foundation-l: http://commons.wikimedia.org/wiki/File:Foundation-l_word_cloud_small.png You can find it at a bigger resolution and with the "source code" (if you want to improve it) here: http://commons.wikimedia.org/wiki/File:Foundation-l_word_cloud.png
Nemo
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
Could you exclude the headers from the cloud?
Milos Rancic, 04/10/2010 11:29:
May you exclude headers from the cloud?
Well, I did. Which additional (parts of) headers would you like to exclude? (Suggest them on the talk page.) I left in only time zones, years and months, to give a clue about activity at different times, and text/plain vs. html, given the frequent discussions on this topic. :-p
Nemo
On Mon, Oct 4, 2010 at 8:48 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Milos Rancic, 04/10/2010 11:29:
May you exclude headers from the cloud?
Well, I did. Which additional (parts of) headers would you like to exclude? (Suggest them on talk page.) I left only timezones, years and months to give a clue on activity in different times; and text/plain vs. html given the frequent discussions there are on this topic. :-p
It's probably easier to strip them entirely before pushing them into the generator, rather than using them as stopwords.
Andrew Garrett, 04/10/2010 11:49:
It's probably easier to strip them entirely before pushing them into the generator, rather than using them as stopwords.
Ehm, I can't do that. :-p Moreover, I didn't want to exclude /everything/ (e.g. subjects, names, dates).
K. Peachey, 04/10/2010 11:51:
I don't have an issue with it, but you may wish to double-check the licensing you have attached to those uploads, since my understanding is that copyright and ownership do apply to emails.
Yes, but not to /words/ (and their frequency), AFAIK.
Nemo
On Mon, Oct 4, 2010 at 2:49 AM, Andrew Garrett agarrett@wikimedia.org wrote:
On Mon, Oct 4, 2010 at 8:48 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Milos Rancic, 04/10/2010 11:29:
May you exclude headers from the cloud?
Well, I did. Which additional (parts of) headers would you like to exclude? (Suggest them on talk page.) I left only timezones, years and months to give a clue on activity in different times; and text/plain vs. html given the frequent discussions there are on this topic. :-p
It's probably easier to strip them entirely before pushing them into the generator, rather than using them as stopwords.
This is fun! thanks for doing it. It would be interesting to see a version with all of the headers stripped out (dates & email terms: mailman/mimedel, etc.) so the content words would really show up. I like that "community" is huge and "individual" is tiny :)
Phoebe
phoebe ayers, 04/10/2010 17:29:
This is fun! thanks for doing it. It would be interesting to see a version with all of the headers stripped out (dates & email terms: mailman/mimedel, etc.) so the content words would really show up.
If someone tells me how to do what Werdna suggested (I'm not a programmer)... In the meantime, just add stopwords to the talk page as phoebe did: http://commons.wikimedia.org/wiki/File_talk:Foundation-l_word_cloud.png
I like that "community" is huge and "individual" is tiny :)
With 1000 words there are lots of funny things to discover. :-D
Nemo
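For readers who want to try the stopword approach discussed above, here is a minimal sketch in Python. It is not the generator Nemo used; the function name `word_frequencies`, the tokenizing pattern, and the sample stopwords are assumptions made for illustration only.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords):
    """Count words in text, skipping anything listed in stopwords.

    Hypothetical helper: the tokenizer below (lowercase letters, digits,
    apostrophes and hyphens) is an assumption, not the rule used by the
    word cloud generator in this thread.
    """
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    stop = {w.lower() for w in stopwords}
    return Counter(w for w in words if w not in stop)


# Example: header-like terms are suppressed by listing them as stopwords.
freq = word_frequencies(
    "Community community mailman individual",
    stopwords=["mailman", "mimedel"],
)
print(freq.most_common(2))
```

The resulting counts could then be fed to whatever word cloud tool is at hand. Stripping the headers before counting, as Andrew suggested, would make most of the stopword list unnecessary.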
Looking at the contents of the gzip'ed archives, stripping out the headers does not look trivial, but it appears that it could be done in most cases. A whole other problem is quoted text. Any preference on whether or not that should be included as well? If it is included, the word counts are not entirely accurate.
Peter
On Tue, Oct 5, 2010 at 7:48 AM, Peter Gehres in2thats12@gmail.com wrote:
Looking at the contents of the gzip'ed archives, stripping out the headers does not look trivial, but it appears that it could be done in most cases. A whole other problem is quoted text. Any preference on whether or not that should be included as well? If it is included, the word counts are not entirely accurate.
If it is including quoted passages, a simple way to address this is to remove any line starting with '>' and all attachments.
btw, very interesting Nemo!
-- John Vandenberg
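John's suggestion of dropping every line that starts with '>' can be sketched in a few lines of Python. This is only an illustration: the function name `strip_quoted_lines` is hypothetical, and attachments are not handled here.

```python
def strip_quoted_lines(body: str) -> str:
    """Return body without the lines that start with '>'.

    Only lines whose very first character is '>' are dropped, as
    suggested in the thread. Indented quotes and attachments are
    left alone in this sketch.
    """
    kept = [line for line in body.splitlines() if not line.startswith(">")]
    return "\n".join(kept)


message = "Very nice.\n> I uploaded my version.\n>> nested quote\nThanks!"
print(strip_quoted_lines(message))
```

Note that quotes which have lost their '>' markers, or which were never marked that way, would survive this filter and still be counted.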
If it is including quoted passages, a simple way to address this is to remove any line starting with '>' and all attachments.
That is what I was planning to do. I called it a problem because of its effect on word incidence.
I am currently working on a Python implementation that strips headers and quoted passages. One problem I have discovered is that the gzip'd archives often contain multiple copies of the same message (matching Message-IDs in the headers). I am removing the duplicates, and the count after this operation matches the count shown in the online archives.
-Peter
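The de-duplication step Peter describes might look roughly like the sketch below. It is not his implementation: it assumes each message is already available as a raw string, and the function name `dedupe_by_message_id` is hypothetical. It uses only the standard library `email` package.

```python
from email import message_from_string


def dedupe_by_message_id(raw_messages):
    """Parse raw messages and keep the first copy of each Message-ID.

    Assumption for this sketch: messages without a Message-ID header
    are always kept, since they cannot be matched against each other.
    """
    seen = set()
    unique = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = msg.get("Message-ID")  # header lookup is case-insensitive
        if msg_id is not None:
            msg_id = msg_id.strip()
            if msg_id in seen:
                continue  # duplicate copy, skip it
            seen.add(msg_id)
        unique.append(msg)
    return unique


first = "Message-ID: <1@example.org>\nSubject: hi\n\nbody one\n"
second = "Message-ID: <2@example.org>\nSubject: yo\n\nbody two\n"
messages = dedupe_by_message_id([first, second, first])
print(len(messages))
```

Once a message is parsed this way, `get_payload()` returns the body of a non-multipart message without its headers, which covers the header-stripping step for plain-text mail.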
I uploaded my version of the "cloud" [1] to the same location. I removed all duplicate emails from the archives and omitted all subjects and quoted text.
phoebe ayers, 04/10/2010 17:29:
This is fun! thanks for doing it. It would be interesting to see a version with all of the headers stripped out (dates & email terms: mailman/mimedel, etc.) so the content words would really show up.
Are there any other tweaks you want?
-Peter Gehres
[1] http://commons.wikimedia.org/wiki/File:Foundation-l_word_cloud.png
On Tue, Oct 5, 2010 at 3:07 PM, Peter Gehres in2thats12@gmail.com wrote:
I uploaded my version of the "cloud" [1] to the same location. I removed all duplicate emails from the archives and omitted all subjects and quoted text.
Very nice.
Not surprisingly...
Wikipedia - huge
Commons - medium (blue, beneath 'foundation-l')
Wikinews - small (brown, near 'see')
Wiktionary - tiny (blue, in the t of 'foundation-l')
Can anyone see wikisource? (weep)
-- John Vandenberg
Peter Gehres, 05/10/2010 06:07:
I uploaded my version of the "cloud" [1] to the same location. I removed all duplicate emails from the archives and omitted all subjects and quoted text.
Thank you! It's indeed much more readable. :-) I moved it to http://commons.wikimedia.org/wiki/File:Foundation-l_word_cloud_without_heade... because it's completely different (resolution, author, description etc.). What about adding your script to the description so that others can easily create their version (also for other lists)?
John Vandenberg, 05/10/2010 06:48:
Wikipedia - huge
Commons - medium (blue, beneath 'foundation-l')
Wikinews - small (brown, near 'see')
Wiktionary - tiny (blue, in the t of 'foundation-l')
Can anyone see wikisource? (weep)
Perhaps we need a bigger resolution. :-p I found "mediawiki" (blue, under "see") and "wikiversity" (green, under "policy").
Nemo
Thank you! It's indeed much more readable. :-) I moved it to http://commons.wikimedia.org/wiki/File:Foundation-l_word_cloud_without_heade... because it's completely different (resolution, author, description etc.). What about adding your script to the description so that others can easily create their version (also for other lists)?
I am happy to release the source code for mine, but it is much more than a script. I used Python's email parser to parse each file into email objects and stored them in a SQLite database. That leaves the door open for much more flexibility. It is not much more than a bunch of hacks at the moment, but I am happy to fill requests and to clean up the code over time.
Peter
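As a rough illustration of the parse-then-store design Peter describes, here is a small sketch using Python's standard `email` and `sqlite3` modules. It is not his code: the table layout, the column names, and the function name `store_messages` are all assumptions.

```python
import sqlite3
from email import message_from_string


def store_messages(conn, raw_messages):
    """Parse raw messages and store one row per Message-ID.

    Hypothetical schema for this sketch. Multipart messages are
    skipped to keep the example short.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "message_id TEXT PRIMARY KEY, sender TEXT, date TEXT, "
        "subject TEXT, body TEXT)"
    )
    for raw in raw_messages:
        msg = message_from_string(raw)
        if msg.is_multipart():
            continue
        # INSERT OR IGNORE silently drops repeated Message-IDs.
        conn.execute(
            "INSERT OR IGNORE INTO messages VALUES (?, ?, ?, ?, ?)",
            (
                msg.get("Message-ID"),
                msg.get("From"),
                msg.get("Date"),
                msg.get("Subject"),
                msg.get_payload(),
            ),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
raw = "Message-ID: <1@example.org>\nFrom: a@example.org\nSubject: hi\n\nhello world\n"
store_messages(conn, [raw, raw])
print(conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0])
```

With the messages in a database, variants of the cloud (per year, per author, with or without subjects) become queries rather than new parsing passes, which is presumably the flexibility Peter has in mind.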
I don't have an issue with it, but you may wish to double-check the licensing you have attached to those uploads, since my understanding is that copyright and ownership do apply to emails. -Peachey
On 4 October 2010 10:51, K. Peachey p858snake@yahoo.com.au wrote:
I don't have an issue with it, but you may wish to double-check the licensing you have attached to those uploads, since my understanding is that copyright and ownership do apply to emails.
Not even within Commons level of copyright paranoia is a word count chart encumbered.
- d.
On Mon, Oct 4, 2010 at 6:10 AM, David Gerard dgerard@gmail.com wrote:
On 4 October 2010 10:51, K. Peachey p858snake@yahoo.com.au wrote:
I don't have an issue with it, but you may wish to double-check the licensing you have attached to those uploads, since my understanding is that copyright and ownership do apply to emails.
Not even within Commons level of copyright paranoia is a word count chart encumbered.
I'm sure we can find somebody to debate it, if you'd like to go down that road ;-)
Nemo: This is really cool!
-Chad
On Mon, Oct 4, 2010 at 12:10 PM, David Gerard dgerard@gmail.com wrote:
On 4 October 2010 10:51, K. Peachey p858snake@yahoo.com.au wrote:
I don't have an issue with it, but you may wish to double-check the licensing you have attached to those uploads, since my understanding is that copyright and ownership do apply to emails.
Not even within Commons level of copyright paranoia is a word count chart encumbered.
Like!
Delphine
I like it a lot, thanks :)
It's fun that some particular time zones are more visible than others, but could you please generate another version which strips all headers like Date:? A more content-oriented version would also be interesting.
See you soon,