Hi! This e-mail was sent to wiki-research-l. Shall we suggest collecting and systematizing these resources somewhere on Meta and linking to them from the RCom page? Although I don't know exactly where to suggest it. Cheers! Mayo
________________________________________
From: wiki-research-l-bounces@lists.wikimedia.org [wiki-research-l-bounces@lists.wikimedia.org] On Behalf Of mohamad mehdi [mohamad_mehdi@hotmail.com]
Sent: 18 April 2011 15:19
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] Wikipedia Literature Review - Tools and Data Sets
Hi everyone,
This is a follow-up on a previous thread (Wikipedia data sets) related to the Wikipedia literature review (Chitu Okoli). As I mentioned in my previous email, part of our study is to identify the data collection methods and data sets used in Wikipedia studies. We therefore searched for online tools used to extract Wikipedia articles and for pre-compiled Wikipedia article data sets, and were able to identify the following list. Please let us know of any other sources you know about. Also, we would like to know whether there is an existing Wikipedia page that includes such a list, so we can add to it. Otherwise, where do you suggest adding this list so that it is noticeable and useful for the community?
Data sets:
http://download.wikimedia.org/ /* official Wikipedia database dumps */
http://datamob.org/datasets/tag/wikipedia /* multiple data sets (English Wikipedia articles that have been transformed into XML) */
http://wiki.dbpedia.org/Datasets /* structured information from Wikipedia */
http://labs.systemone.at/wikipedia3 /* Wikipedia³ is a conversion of the English Wikipedia into RDF. It's a monthly updated dataset containing around 47 million triples. */
http://www.scribd.com/doc/9582/integrating-wikipediawordnet /* article about integrating WordNet and Wikipedia with YAGO */
http://www.infochimps.com/datasets/taxobox-wikipedia-infoboxes-with-taxonomi...
http://www.infochimps.com/link_frame?dataset=11043 /* Wikipedia Datasets for the Hadoop Hack | Cloudera */
http://www.infochimps.com/link_frame?dataset=11166 /* Wikipedia: Lists of common misspellings/For machines */
http://www.infochimps.com/link_frame?dataset=11028 /* Building a (fast) Wikipedia offline reader */
http://www.infochimps.com/link_frame?dataset=11004 /* Using the Wikipedia page-to-page link database */
http://www.infochimps.com/link_frame?dataset=11285 /* List of films */
http://www.infochimps.com/link_frame?dataset=11598 /* MusicBrainz Database */
http://dammit.lt/wikistats/ /* Wikitech-l page counters */
http://snap.stanford.edu/data/wiki-meta.html /* complete Wikipedia edit history (up to January 2008) */
http://aws.amazon.com/datasets/2596?_encoding=UTF8&jiveRedirect=1 /* Wikipedia Page Traffic Statistics */
http://aws.amazon.com/datasets/2506 /* Wikipedia XML Data */
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets?q=Wikipedia+ /* list of Wikipedia data sets */
Examples:
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/top-1000-acces... /* Top 1000 Accessed Wikipedia Articles */
http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/wikipedia-hits... /* Wikipedia Hits */
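For anyone who wants to work directly with the database dumps from the first link, here is a rough, untested Python sketch that streams article titles and raw wikitext out of a pages-articles dump. The filename and the XML schema version in the code are assumptions; adjust them to the dump you actually download.

    # Rough sketch: stream <page> elements out of a pages-articles dump.
    # DUMP and NS are assumptions -- point them at the file you downloaded
    # and at the schema version declared at the top of that dump.
    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles.xml.bz2"      # illustrative filename
    NS = "{http://www.mediawiki.org/xml/export-0.5/}"  # schema version varies by dump

    def iter_pages(path):
        with bz2.open(path, "rb") as f:
            for _, elem in ET.iterparse(f):
                if elem.tag == NS + "page":
                    title = elem.findtext(NS + "title")
                    text = elem.findtext(NS + "revision/" + NS + "text") or ""
                    yield title, text
                    elem.clear()  # keep memory use flat on multi-GB dumps

    if __name__ == "__main__":
        for i, (title, text) in enumerate(iter_pages(DUMP)):
            print(title, len(text))
            if i >= 9:  # peek at the first ten pages only
                break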
Tools to extract data from Wikipedia:
http://www.evanjones.ca/software/wikipedia2text.html /* Extracting Text from Wikipedia */
http://www.infochimps.com/link_frame?dataset=11121 /* Wikipedia article traffic statistics */
http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-f... /* Generating a Plain Text Corpus from Wikipedia */
http://www.infochimps.com/datasets/wikipedia-articles-title-autocomplete
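In the same spirit as wikipedia2text and the plain-text corpus script above, a plain-text extract for a single article can also be pulled straight from the MediaWiki API. A minimal, untested Python sketch follows; the parameter names are to the best of my knowledge, so please check them against the live documentation at /w/api.php before relying on them.

    # Minimal sketch: fetch the plain text of one article via the MediaWiki API
    # (TextExtracts). Endpoint and parameters are assumptions -- verify them.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_plain_text(title):
        params = urllib.parse.urlencode({
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,   # strip wiki markup, return plain text
            "format": "json",
            "titles": title,
        })
        req = urllib.request.Request(
            API + "?" + params,
            headers={"User-Agent": "wiki-research-example/0.1 (research use)"},
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        pages = data["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")

    if __name__ == "__main__":
        print(fetch_plain_text("Wikipedia")[:500])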
Thank you, Mohamad Mehdi
Well, we obviously need to set up a page on Meta linked from the RCom page. If this has not been done yet, I can do it today.
Cheers Yaroslav
On Tue, 19 Apr 2011 12:25:55 -0700, "Fuster, Mayo" Mayo.Fuster@EUI.eu wrote:
Hi everybody,
we are currently creating a new "Research:" namespace on Meta:
https://bugzilla.wikimedia.org/show_bug.cgi?id=28742
As soon as the request is processed, we will be able to use this dedicated namespace to host all RCom-related activities and documentation, which will allow us to filter and search pages more easily, create shortcuts, and set up special properties for all pages in the namespace.
You may want to wait until the change is implemented; otherwise, we will rename any existing research pages to the new namespace as soon as it goes live.
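As an aside, once the namespace exists, filtering by it also becomes easy programmatically. A hypothetical Python sketch that lists every page in the new namespace via the API (the namespace ID below is a placeholder; look it up with a siteinfo query once the change is live):

    # Hypothetical sketch: list all pages in a given namespace on Meta
    # via the MediaWiki API (list=allpages with apnamespace).
    import json
    import urllib.parse
    import urllib.request

    API = "https://meta.wikimedia.org/w/api.php"
    RESEARCH_NS = 206  # assumed ID for the new Research: namespace -- verify first

    def list_namespace_pages(ns_id):
        titles = []
        params = {
            "action": "query",
            "list": "allpages",
            "apnamespace": ns_id,
            "aplimit": 500,
            "format": "json",
        }
        while True:
            req = urllib.request.Request(
                API + "?" + urllib.parse.urlencode(params),
                headers={"User-Agent": "rcom-namespace-example/0.1"},
            )
            with urllib.request.urlopen(req) as resp:
                data = json.load(resp)
            titles += [p["title"] for p in data["query"]["allpages"]]
            cont = data.get("continue")
            if not cont:
                return titles
            params.update(cont)  # carry the continuation token into the next request

    if __name__ == "__main__":
        for title in list_namespace_pages(RESEARCH_NS):
            print(title)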
Dario
On May 1, 2011, at 11:22 PM, Yaroslav M. Blanter wrote:
On Mon, 2 May 2011 11:06:36 -0700, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
Thanks, Dario.
I am travelling until the end of the week, and I hope the request will have been implemented by then.
Cheers Yaroslav
Hi Yaroslav,
the request has just been implemented: there's a brand new "Research:" namespace we can now work with. I'll take a stab at it this week and start moving stuff around; this is roughly the structure I had in mind:
We should organize the Research section of Meta around three different types of resources:
• Projects - listing current and past studies, spanning both internal WMF research and external research
• Policies - the body of official policies produced by the RCom
• Resources - including tools, materials and pointers for researchers
(We could also include a fourth category, i.e. "People" or "Teams", for researchers to maintain their contact information and add a description of their current and past work, should User pages not be suitable for this purpose.)
Pages belonging to the same category should all ideally have a predefined structure and use the same templates.
On top of these 3 (or 4) main sections we will need to create some easily accessible top pages, such as:
Research:Index
This will be the landing page pointing visitors to the relevant subpages (something along the lines of Aaron's intro page would work great). Erik suggested we pull into this page the list of research projects to make them easily accessible. My feeling, though, is that the "Projects/" prefix is useful to quickly identify research project pages from the URL. I suggest we keep the projects directory under Research:Projects and add a prominent navigational template to all pages to make it easy to access this directory from any other page on Research.
Research:Committee
We should move to this node all information about the RCom, its activities and areas of interest, as well as any open discussions/drafts. I suggest that we move actual RCom-driven research projects (such as the Expert participation survey) under Projects/ and that we separate discussions from official documents by moving the latter under Policies/ – the lack of a clear distinction between past and ongoing discussions on the one hand and official documents on the other has plagued Meta for a long time, and I'd love to address this issue in the overhaul of the Research section.
We should also decide whether or not to move all legacy research pages to the new namespace (see http://meta.wikimedia.org/wiki/Category:Research). If we do (which sounds like a good option to me), we will need to flag obsolete or archived discussions to keep them separate from the current RCom work.
Thoughts? If this makes sense to you all, we can start moving/rewriting pages. We will need to make a public announcement on wiki-research-l once we have the basic structure in place so others can contribute.
Dario
On May 2, 2011, at 2:58 PM, Yaroslav M. Blanter wrote:
On Mon, 2 May 2011 16:06:19 -0700, Dario Taraborelli dtaraborelli@wikimedia.org wrote:
Maybe indeed Research:Archived for obsolete and archived discussions?
I am not sure about People. If they are just associated with one particular project, they may want to keep their pages in the project subspace; if not, they are just Meta users (like all of us), and there is no need to move their bios to the Research subspace.
Do we want to keep proposals for studies (as well as "much wanted" studies) as a separate subspace?
Cheers Yaroslav