(cross-posting Sebastiano’s post from the analytics list, this may be of interest to both the wikidata and wiki-research-l communities)
Begin forwarded message:
> From: Sebastiano Vigna <vigna(a)di.unimi.it>
> Subject: [Analytics] Distributing an official graph
> Date: December 9, 2013 at 10:09:31 PM PST
>
> [Reposted from private discussion after Dario's request]
>
> My problem is that of exploring the graph structure of Wikipedia
>
> 1) easily;
> 2) reproducibly;
> 3) in a way that does not depend on parsing artifacts.
>
> Presently, when people want to do this, they either do their own parsing of the dumps, or they use the SQL data, or they download a dataset like
>
> http://law.di.unimi.it/webdata/enwiki-2013/
>
> which has everything "cooked up".
>
> My frustration in the last few days came when trying to add the category links. I didn't realize (well, it's not very well documented) that bliki extracts all links and renders them in HTML *except* for the category links, which are instead accessible programmatically. Once I got there, I was able to make some progress.
>
> Nonetheless, I think that the graph of Wikipedia connections (hyperlinks and category links) is really a mine of information, and it is a pity that so much huffing and puffing is necessary to do something as simple as a reverse visit of the category links from "People" to get all people pages (in practice this is a bit more complicated, as there are many false positives, but after a couple of fixes it worked quite well).
>
> Moreover, one has continuously this feeling of walking on eggshells: a small change in bliki, a small change in the XML format, and everything might stop working in such a subtle manner that you realize it only after a long time.
>
> I was wondering if Wikimedia would be interested in distributing in compressed form the Wikipedia graph. That would be the "official" Wikipedia graph--the benefits, in particular for people working on leveraging semantic information from Wikipedia, would be really significant.
>
> I would (obviously) propose to use our Java framework, WebGraph, which is actually quite standard for distributing large (well, actually much larger) graphs, such as ClueWeb09 http://lemurproject.org/clueweb09/, ClueWeb12 http://lemurproject.org/clueweb12/ and the recent Common Crawl hyperlink graph http://webdatacommons.org/hyperlinkgraph/index.html. But any format is OK, even a pair of integers per line. The advantages of a binary compressed form are reduced network utilization, instantaneous availability of the information, etc.
>
> Probably it would be useful to actually distribute several graphs with the same dataset--e.g., the category links, the content links, etc. It is immediate, using WebGraph, to build a union (i.e., a superposition) of any set of such graphs and use it transparently as a single graph.
>
> In my mind the distributed graph should have a contiguous ID space, say, induced by the lexicographical order of the titles (possibly placing template pages at the start or at the end of the ID space). We should provide graphs, and a bidirectional node<->title map. All such information would use about 300M of space for the current English Wikipedia. People could then associate pages to nodes using the title as a key.
>
> But this last part is just rambling. :)
>
> Let me know if you people are interested. We can of course take care of the process of cooking up the information once it is out of the SQL database.
>
> Ciao,
>
> seba
>
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
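Sebastiano's proposal above (contiguous node IDs induced by lexicographic title order, a bidirectional node<->title map, and unions of per-edge-type graphs) is easy to prototype. A minimal toy sketch in Python over a hypothetical pair-of-integers edge format; this is only an illustration of the idea, not WebGraph, and all names are made up:

```python
def build_id_space(titles):
    """Assign contiguous node IDs induced by lexicographic title order."""
    ordered = sorted(set(titles))
    title_to_id = {t: i for i, t in enumerate(ordered)}
    return title_to_id, ordered  # the list doubles as the id -> title map

def union_graphs(*edge_sets):
    """Superpose several graphs (e.g. content links and category links)
    defined over the same node ID space."""
    merged = set()
    for edges in edge_sets:
        merged |= edges
    return merged

def reverse_neighbors(edges, target):
    """One step of a reverse visit: all sources linking to `target`."""
    return {s for s, t in edges if t == target}
```

In WebGraph proper the union and the transposed graph are built by the library itself; the point of the sketch is only that, once IDs are contiguous and shared across graphs, superposition and reverse traversal become trivial set operations.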
Maximilian Klein wrote:
>... Can you also think of any other dimensions or heuristics
> to programmatically rate?
Ref tags per article text bytes works pretty well, even by itself.
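A sketch of that heuristic (illustrative Python; the regex is a rough approximation of ref-tag matching, not MediaWiki's parser):

```python
import re

# Matches opening <ref> tags, including <ref name="..."> and <ref/>,
# but not closing </ref> tags.
REF_TAG = re.compile(r"<ref[\s>/]", re.IGNORECASE)

def ref_density(wikitext):
    """Heuristic quality signal: ref tags per byte of article text."""
    nbytes = len(wikitext.encode("utf-8"))
    if nbytes == 0:
        return 0.0
    return len(REF_TAG.findall(wikitext)) / nbytes
```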
Also, please consider readability metrics. At this point on enwiki, I
would say about a third of our real reader-impeding quality issues have
more to do with overly technical, jargon-laden articles (which usually
also have word- and sentence-length problems) than with underdeveloped
exposition. This is especially true of our math articles, many of which
are almost useless for undergraduates, let alone students at the earlier
grade levels where the corresponding concepts are introduced.
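Word- and sentence-length signals of the kind mentioned are cheap to compute. A crude sketch (illustrative Python, not a validated readability formula):

```python
import re

def length_stats(text):
    """Crude readability signals: mean sentence length in words and
    mean word length in characters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0, 0.0
    return len(words) / len(sentences), sum(map(len, words)) / len(words)
```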
The good news is that this doesn't seem to be happening in other topic
areas like biology, physics, or medicine. But math is kind of a disaster
area in that respect, and it's not getting better with time.
Not wiki-related per se, but many people on this list might be interested.
G
> *From: *Rayid Ghani <rayid(a)uchicago.edu <mailto:rayid@uchicago.edu>>
> *Subject: **Data Science for Social Good Summer Fellowship*
> *Date: *December 9, 2013 3:00:10 PM EST
> *To: *Rayid Ghani <rayid(a)uchicago.edu <mailto:rayid@uchicago.edu>>
>
> Hi,
> I'm running the Eric & Wendy Schmidt "Data Science for Social Good" Summer
> Fellowship again this year at the University of Chicago and need help in
> recruiting strong students (grad students or junior/senior undergrads with
> CS, Machine Learning, and/or Stats background). The goal is to get up to
> 50 students in Chicago this summer and have them work on high-impact
> social problems (in education, healthcare, energy, transportation, crime,
> etc.) using Machine Learning, Data Mining, and other related buzzwords. The
> students will work with full-time mentors from academia and industry. The
> fellowships are paid competitively and we will provide housing as well.
>
> More details are at http://dssg.uchicago.edu <http://dssg.uchicago.edu/>.
> Applications for the fellowship are due February 1, 2014.
>
> If you have (or know of) strong CS/Stats/Econometrics/Applied Math/Policy
> students who have an interest in making an impact by working on high-impact
> social problems using machine learning/data mining/stats, please forward this
> to them.
>
> Thanks,
> Rayid
>
> P.S. We’re also looking for full-time mentors (strong technical folks with
> real-world experience who want to spend the summer in Chicago working with a
> team of fellows).
>
> Rayid Ghani
> Computation Institute & Harris School of Public Policy
> University of Chicago
> rayid(a)uchicago.edu <mailto:rayid@uchicago.edu>
> http://www.rayidghani.com <http://www.rayidghani.com/>
Has there ever been a general purpose encyclopedia which was found
suitable for medical student instruction?
What are our median level readers going to do if we suddenly start
including enough pathophysiology images to please the med school
instructors? I'm not entirely sure it will help them, although on the
other hand it might encourage them to see a professional, which is what
they often should be doing instead of reading Wikipedia. (But if
wishes were horses, beggars would ride....)
>... Daniel Mietchen wrote:
>>
>> A similar paper on 39 gastroenterology/ hepatology articles on the
>> English Wikipedia came to different conclusions:
>> https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Medicine#Paper:_.2…
(a la the old Nature study)
http://www.helsinkitimes.fi/finland/finland-news/domestic/8619-world-s-larg…
World’s largest study on Wikipedia: Better than its reputation
05 Dec 2013
Wikipedia is the most popular encyclopedia in the world, and now in many
countries it is also the only one. But can the information in Wikipedia be
trusted at all?
Helsingin Sanomat newspaper evaluated 134 articles in the Finnish-language
version of Wikipedia with 96 experts. As far as we know, our study is the
most extensive individual investigation on the trustworthiness of Wikipedia
in the whole world.
We found out that Wikipedia is better than its reputation: seventy per
cent of the articles got good points for accuracy. But in many ways the
Finnish Wikipedia is also far from flawless.
Olavi Koistinen HS
Hundreds of thousands of people use the Finnish-language version of
Wikipedia every day. Its articles rank at the top in Google search results,
whether you are looking for information on climate change, Michael Jackson,
Sydney, ball bearings, cancer, allosaurus, or any number of other topics.
Wikipedia is the only Finnish encyclopedia that is still being updated. The
golden age of printed encyclopedias was in the 1990s, but with the advent
of the internet, their sales collapsed. The last traditional Finnish
encyclopedia, the online version of WSOY’s Facta, was closed down in 2011.
The reliability of the English language Wikipedia has been studied for
years. The most famous study was published by the science magazine Nature
in 2005, comparing the number of errors in Wikipedia with those in the
Encyclopaedia Britannica. The result of the comparison was that Wikipedia
was almost as accurate as the Encyclopaedia Britannica. The study involved
42 articles from each encyclopaedia.
However, no thorough investigations have been made about the reliability of
the Finnish Wikipedia – until now.
We first chose 134 articles in the Finnish Wikipedia, covering different
areas of life, to be studied. Then we asked an expert with a thorough
knowledge on the subject matter to evaluate each article. There were a
total of 96 people making assessments, most of them professors or other
university researchers. They work in eight different Finnish universities.
The Finnish Wikipedia currently has nearly 340 000 articles, so the random
selection of 134 articles naturally does not give a complete picture of the
reliability of the encyclopedia. Our study is nevertheless the most
extensive single investigation of the trustworthiness of Wikipedia in the
whole world, as far as we know, at least if measured by the number of
articles examined.
“With this measure the study is unique in its scale”, says researcher Arto
Lanamäki of the University of Bergen in Norway. Not even the
English-language Wikipedia has undergone such an extensive single survey.
The experts read the articles and scored them based on six different
indicators, which were lack of errors, coverage and balance, sourcing,
topicality, neutrality, and clarity. The articles were graded on a scale of
1-5, in which 1 was the worst mark and 5 was the best.
The entire result dataset was published as open data, and is available
here.
First the good news: The Finnish Wikipedia is largely error-free. The lack
of errors is the area in which Wikipedia clearly got its best score. For
instance, if someone uses Wikipedia to find the year in which Christopher
Columbus discovered America, the information in the article is most
likely to be correct.
No less than 70 per cent of the articles were judged to be good (4) or
excellent (5) with respect to lack of errors. According to the indicative
evaluation scale, a four means that the article has only “scattered
small errors, no big ones”.
This is how we evaluated the Finnish language Wikipedia
Journalists of Helsingin Sanomat first listed 150 topics from different
walks of life. Only after that did we check if any Wikipedia articles
actually existed on the topic (the vast majority did exist). 134 Wikipedia
articles were selected for the final evaluation.
To evaluate each article we chose a university-level researcher with
knowledge on the subject matter to be an evaluator. There were a total of
96 evaluators, of which 94 were researchers. In addition, two
experienced sports journalists evaluated 8 Wikipedia articles about
sports.
Some of the evaluators assessed more than one article.
The evaluation was done via a web form.
The basic principle of the evaluation was: if an amateur read the
article, would he or she get a truthful impression of the subject matter?
Each evaluator scored the article on the basis of six criteria: lack of
errors, coverage and balance, sourcing, topicality, neutrality, and
clarity. The scale of the points was 5=excellent, 4=good, 3=adequate,
2=tolerable, 1=poor. The evaluators were also given the chance to comment
on the article under each criterion.
Most of the evaluations were made in early November. They were based on
copies of Wikipedia articles as they were on October 22, 2013.
On the question of lack of errors the most common mark was four. What does
it mean in practice? According to the evaluators, factors such as these:
“No misleading errors, but some imprecision.” (Higgs boson)
“Quite proficient text, hardly any actual factual errors. However, odd
transliteration stands out.” (Icon)
“Geological periods do not quite hit the mark, the factual errors are
insignificant.” (Climate change)
However, the conclusion should not be drawn that Wikipedia can be trusted
naively. A total of 14 articles were graded as poor (1) or tolerable (2).
“Already in the first sentence the definition of photosynthesis is odd, and
several mistakes were found.” (Photosynthesis)
“The article is very uneven in quality, and at times quite propagandistic.”
(Syrian civil war)
Many of the articles containing the most errors were about complicated or
abstract phenomena, such as the eurozone debt crisis, evolution, or the
Syrian civil war. On the other hand, several articles dealing with equally
complex subjects got high marks for lack of errors. For instance, the
articles for the big bang and climate change were praised for their
precision.
Although most of the articles were extolled for their accuracy, it can be
concluded that there is more variation in the quality of articles in
Finnish Wikipedia than in traditional encyclopedias. According to studies,
the same also applies to English Wikipedia: the best articles are
brilliant, but many are weak.
How can a reader assess the quality of a Wikipedia article?
A good way to examine the reliability of any particular article is to check
its factual sources. If the sources of information are listed at the end
of an entry or inside the text itself, the reader can become acquainted
with them and form an opinion of the credibility of the information on that
basis.
According to our study there are serious shortcomings in the Finnish
Wikipedia specifically in the sourcing of articles. No less than 38 per
cent of the articles got a grade of poor (1) or tolerable (2). The
evaluators had harsh criticism for these articles.
“Sources were not used at all and it shows.” (Internal combustion engine)
“On the basis of the text it would seem that sources have mainly included
TV documentaries or children’s dinosaur books.” (Allosaurus)
“Based primarily on a single disputed work. Sources from antiquity are
considered inadequate and hostile.” (Caligula)
The last comment highlights a broader problem. The evaluators were critical
that many articles were based mostly or partly on one source. This
sometimes causes problems: the information from individual works can be
selective, and the interpretations biased.
“The article seeks to be neutral. However, the points of emphasis of the
material that it was based on can clearly be seen.” (Protestant Reformation)
Of Wikipedia’s own ideals, neutrality is one of the most important.
Wikipedia emphasises that the articles need to be written from a “neutral”
point of view.
On the basis of our survey the Finnish Wikipedia has been written mainly
with a balanced approach, with 56 per cent of the articles getting good (4)
or excellent (5) marks. If some articles in the sample were slanted in
their statements or points of view, it usually was not attributable to the
writer’s deliberate partiality.
“The article is positive in its attitude and aims at neutrality. The
problems in the content are connected with the source material and the
writer’s basic knowledge: it is clearly not a result of anything
deliberate.” (Middle Ages)
Of the 134 articles that were surveyed, only six were found to have been
written with a clear bias, in the opinion of the reviewer. The articles
were connected with politics of the United States (Osama bin Laden’s death,
the US Democratic Party, the 9/11 terror attacks, Alan Greenspan) and
events in the Middle East (The civil war in Syria, the second intifada).
“In conditions of war and with an intense conflict dominating, the
neutrality requirement for the article is difficult. Occasionally, when
reading this article it seems, however, that no actual attempt was made to
attain neutrality.” (Civil war in Syria)
“A fairly flattering article for the Democrats. Other points of view would
also exist.” (US Democratic Party)
It is interesting that the experts felt that the wrong kind of “neutrality”
and avoiding expressing points of view is also a problem. If the writer is
afraid to make interpretations of any kind, the article sometimes ends up
being superficial.
“There are no interpretations and consequently no points of view either.”
(Kingdom of Mali)
“Generally a text only lists things instead of pondering them and
presenting well-founded evaluations.” (Finnish composer Kaija Saariaho)
“As the article seeks to be very objective, it starts strangely by
describing Pinochet as a ‘president’ – after all, he was one of the
best-known dictators of the 20th century.” (Augusto Pinochet)
Sometimes individual facts in an article are correct, but the text fails to
mention relevant aspects of the topic, or mentions them too briefly. The
article might also ramble on excessively about insignificant details. In
such a situation the lay reader might get an inaccurate image of, for
instance, what the most important turning points of a country’s history
are, or what the most important achievements of a researcher might be.
This is a quality factor which the evaluators measured by scoring each
article based on “coverage and balance”. It has an impact on how well
Wikipedia can convey an overall image of the matter to the reader.
The experts’ marks on coverage and balance were divided almost equally
between good and bad. So at least half of the articles could have been more
comprehensive and balanced.
“The article primarily describes marriage in the Western tradition, mainly
Roman law and Christianity. Other cultures and religions have been left
completely outside the examination.” (Marriage)
“The article puts far too much emphasis on personal history and even on
related insignificant details. The presentation of the main topic, the
scientific work, is far too short and superficial compared with the rest of
the material.” (Albert Einstein)
Printed encyclopedias were often criticised as containing obsolete
information even when they were fresh off the press. Wikipedia has a better
chance to stay topical, since articles can be updated at any time.
The evaluators gave Finnish Wikipedia fairly high marks for topicality,
with 43 per cent of the articles getting either excellent or good marks,
but on the other hand, 31 per cent were poor or just tolerable in terms of
freshness. When something big happens, breaking news is often added to
Wikipedia quickly, but the follow-up on events is poor.
“All sources are old, which means that the content of the text has not been
updated after 2011 to any practical degree.” (Osama bin Laden’s death)
On the other hand, articles can become obsolete even if the topic is a
mountain range that rose from inside the earth in prehistoric times, or a
notable person who died in the 1930s; research on these topics often still
goes on.
“The biggest problem of the article is that it does not reflect the current
situation of international research.” (Middle Ages)
“Within psychoanalysis, much has happened since Freud, and continues to
happen, and Freud’s ideas have been re-evaluated many times. There is
almost nothing about this in the article.” (Sigmund Freud)
Table of results on the study of Finnish Wikipedia
Grade           1: Lack     2: Coverage   3: Sourcing   4: Topicality   5: Neutrality   6: Clarity
                of errors   and balance
1 (poor)          3 %          5 %           17 %           13 %             5 %            2 %
2 (tolerable)     7 %         24 %           21 %           19 %            16 %           11 %
3 (adequate)     19 %         41 %           34 %           25 %            22 %           31 %
4 (good)         47 %         25 %           21 %           30 %            34 %           42 %
5 (excellent)    23 %          5 %            7 %           13 %            22 %           13 %

% = share of articles within the whole sample of 134 articles
So what are we supposed to think about all of these results? Can
information from Finnish Wikipedia be trusted or not?
There are two points of view on this matter.
Those who have felt so far that everything that is in Wikipedia is true
would do well to re-examine their naiveté to a certain degree.
Those who have felt that Wikipedia’s content is nothing but inaccurate and
biased pseudo-information, should ease up a little. Based on our survey,
this is not the case; as a source of information, Wikipedia is better
than its reputation – which is not a particularly good one.
Arto Lanamäki, who has studied Wikipedia at the University of Bergen, says
that people often take a very suspicious view of Wikipedia, even if they
themselves use it regularly for seeking information.
“In studies, the same article has been placed in the framework of the
Encyclopaedia Britannica and Wikipedia, and brought to different people for
evaluation. It is quite common for people to take a more suspicious view of
the article when it is presented in the framework of Wikipedia”, Lanamäki
says.
On the other hand, there are some good reasons for the doubts: the quality
of Wikipedia fluctuates considerably. Seventy per cent of the articles got
good points for accuracy in our study. If the bar is lowered slightly, then
90 per cent of the articles were at least “adequate” in the view of the
reviewers. That is a vast majority, but you might also ask yourself a
question: would you trust a printed encyclopedia if you knew that every
tenth article in it was inaccurate?
Wikipedia’s undeniable strength, however, is that the information is
updated and upgraded all the time. If you paid good money in the early
2000s for a set of books for your bookshelf, it is already obsolete in many
respects. Wikipedia’s massive popularity indicates that our view of factual
information is in flux. Rising alongside information confirmed by experts
and printed in dignified books is peer-produced information that is
constantly spreading and accumulating on the internet.
If you see that something written in Wikipedia makes no sense, do everyone
a favour and edit the article so that it is better. You don’t even need a
Wikipedia user account for it.
Ask not how Wikipedia can help you – ask how you can help Wikipedia.
(This article was originally published in Finnish in Helsingin Sanomat on
November 30th. The HS working group taking part in the drafting of the
survey included Riikka Haikarainen, Tuomas Kaseva, Niko Kettunen, Olavi
Koistinen, Veikko Lautsi, Siri Markula, Sami Simola and Timo Paukku.)
Finnish Wikipedia has only a few hundred regular writers
By Olavi Koistinen HS
Wikipedia can be written or edited by anyone. About 70,000 people have
made one or more edits on the Finnish Wikipedia.
However, most of the content comes from a much smaller group of active
contributors.
About 200 people write or edit Finnish Wikipedia regularly, says Joonas
Lyytinen, a veteran Wikipedia contributor. He also says that there is an
even more active core group, consisting of only 20–30 people, who produce
a very large amount of content and take care of many administrative tasks
in the Finnish Wikipedia. As part of that group, Lyytinen himself has
written about 2,000 articles for the Finnish version of Wikipedia.
A typical Wikipedia author is a male university student. Often he is a
layman who enjoys writing Wikipedia articles because it is a nice way to
learn new things – by researching and writing about them.
Researchers also write for Wikipedia, but less than laypeople do. At least
some of the researchers feel hesitant with respect to Wikipedia, says
University of Bergen researcher Arto Lanamäki, who has studied the Finnish
Wikipedia.
The articles in Wikipedia are often the result of teamwork, and there is an
ongoing debate about the content in the discussion pages of Wikipedia.
In this debate, a researcher thoroughly familiar with the subject does not
get any credibility over a layman just because of his or her position or
title – the best argument wins.
Arguing with hobbyists can be stressful for many researchers. “Often a
writer has tried to explain his or her point of view by saying ‘don’t you
understand that I have a doctorate and I have researched this topic?’
However, this will not work with Wikipedia”, Lanamäki says.
There is no advance censorship, but active users of Wikipedia monitor new
updates constantly to counter vandalism. Middle school students, for
instance, sometimes try to sabotage articles.
Finnish Wikipedia would benefit from having more writers. There are just 5
million Finnish speakers and only a few real experts in some fields of
knowledge. Wikipedia writers also suffer from a somewhat negative image
among occasional users of the service. This might even hinder some people
from participating in writing Wikipedia articles.
“It has been observed in academic studies that an impediment to
participation in Wikipedia is that people do not want to be labelled
Wikipedia nerds”, Lanamäki says.
Olavi Koistinen HS
--
* I use this address for lists; send personal messages to phoebe.ayers <at>
gmail.com *
Google has released over time a huge amount of open data from or about Wikipedia. Check them out:
http://googleresearch.blogspot.com/2013/12/free-language-lessons-for-comput…
Some highlights:
50,000 Lessons on How to Read: a Relation Extraction Corpus
What is it: A human-judged dataset of two relations involving public figures on Wikipedia: about 10,000 examples of “place of birth” and 40,000 examples of “attended or graduated from an institution.”
40 Million Entities in Context
What is it: A disambiguation set consisting of pointers to 10 million web pages with 40 million entities that have links to Wikipedia. This is another entity resolution corpus, since the links can be used to disambiguate the mentions, but unlike the ClueWeb example above, the links are inserted by the web page authors and can therefore be considered human annotation.
Distributing the Edit History of Wikipedia Infoboxes
What is it: The edit history of 1.8 million infoboxes in Wikipedia pages in one handy resource. Attributes on Wikipedia change over time, and some of them change more than others. Understanding attribute change is important for extracting accurate and useful information from Wikipedia.
Dictionaries for linking Text, Entities, and Ideas
What is it: We created a large database of 175 million strings paired with 7.5 million concepts, annotated with counts, all mined from Wikipedia. The concepts in this case are Wikipedia articles, and the strings are anchor-text spans that link to the concepts in question.
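A string-to-concept dictionary with counts supports a simple most-frequent-sense baseline for entity linking. A minimal sketch, assuming a hypothetical (anchor string, concept, count) triple format; the real dataset's schema may differ:

```python
from collections import defaultdict

def build_dictionary(triples):
    """Build string -> {concept: count} from (anchor, concept, count) rows."""
    d = defaultdict(dict)
    for s, c, n in triples:
        d[s][c] = d[s].get(c, 0) + n
    return d

def most_likely_concept(d, s):
    """Baseline linker: map a string to its most frequently linked concept."""
    candidates = d.get(s)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

Real entity-linking systems condition on surrounding context rather than picking the globally most frequent sense, but this baseline is a common starting point with anchor-text dictionaries of this kind.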
Dario
(ht Nicolas Torzec)
The November 2013 issue of the Wikimedia Research Newsletter is BIG - check it out:
https://meta.wikimedia.org/wiki/Research:Newsletter/2013/November
In this issue:
• 1 What drives people to contribute to Wikipedia? Experiment suggests reciprocity and social image motivations
• 2 Does "cultural imperialism" prevent the incorporation of indigenous knowledge on Wikipedia?
• 3 How PR professionals see Wikipedia: Trends from second US survey
• 4 Report from the inaugural L2 Wiki Research Hackathon
• 5 Briefly
• 5.1 "Iron Law of Oligarchy" (1911) confirmed on Wikia wikis
• 5.2 Twitter activity leads Wikipedia activity by an hour
• 5.3 "Google loves Wikipedia"
• 5.4 New article assessment algorithm scores quality of editors, too
• 5.5 "How do metrics of link analysis correlate to quality, relevance and popularity in Wikipedia?"
• 5.6 Usage of images and sounds is related to the quality of Wikipedia articles
• 5.7 Student perception of Wikipedia's credibility is significantly influenced by their professors' opinion
• 5.8 Non-participation of female students on Wikipedia influenced by school, peers and lack of community awareness
• 5.9 Gender gap coverage in media and blogs
• 5.10 German Wikipedia articles become static while English ones continue to develop
• 5.11 New sockpuppet corpus
• 5.12 Workshop on "User behavior and content generation on Wikipedia"
••• 18 publications were covered in this issue •••
Thanks to Piotr Konieczny, Brian Keegan, Nicolas Jullien, Amir E. Aharoni, Henrique Andrade, Daniel Mietchen, Giovanni Luca Ciampaglia, and Aaron Halfaker for contributing.
Dario Taraborelli and Tilman Bayer
--
Wikimedia Research Newsletter
https://meta.wikimedia.org/wiki/Research:Newsletter/
* Follow us on Twitter/Identi.ca: @WikiResearch
* Receive this newsletter by mail: https://lists.wikimedia.org/mailman/listinfo/research-newsletter
* Subscribe to the RSS feed: http://blog.wikimedia.org/c/research-2/wikimedia-research-newsletter/feed/
** Apologies for multiple postings; please circulate widely **
Websci'14 Call for Data Visualization Challenge
===============================================
We are delighted to announce the Web Science 2014 Visualization Challenge!
The Web has generated huge amounts of data at massive scale, but making
sense of these datasets and representing them in a compact and
easily-interpretable way remains very difficult. The goal of this challenge
is to encourage innovative visualizations of Web data. We particularly
encourage entries that reflect the interdisciplinary spirit of the Web
Science conference. To enable this visualization, we have prepared several
large-scale, easy-to-use, publicly-available datasets:
1. Web traffic data, including more than 200 million HTTP requests from
browsers to servers;
2. Twitter data, including a sample of more than 22
million tweets;
3. Social bookmarking data, consisting of about 430,000 bookmarked pages;
4. Co-authorship of academic papers, consisting of about 21.5 million
papers and 10.8 million authors.
Complete details on these datasets are available here:
http://cnets.indiana.edu/groups/nan/webtraffic/websci14-data. All of the
datasets are stored in simple file formats, so that they can be easily used
without much technical expertise.
We are pleased to offer a cash prize of at least $1000 to be split among
the winning entries. Winners will be announced and displayed at the
Web Science conference in June 2014, presented on the Web Science website
(http://websci14.org), and the winners will be encouraged to present a
poster at the conference describing their work. The entries will be judged
based on four criteria: (1) innovative use of data, (2) clarity of
visualization, (3) quality of design, and (4) potential impact.
Rules
1. For fairness, the visualization must be primarily based on the
data that we provide. Other datasets may be used to augment ours, but these
datasets must be publicly-available and described in detail in the
documentation (see #4 below).
2. The visualization must be a static image, and must be submitted as a
PDF. In addition to the main PDF, please submit a PNG version at a
resolution of about 640x480, for display on Web pages, social media sites,
mobile devices, etc. This PNG version need not contain the full
visualization, but should be an appropriate representation (e.g. a subset
of the full PDF).
3. Please include a separate PDF file containing a description of the
visualization, including: (1) name(s), affiliation(s), and contact
information of the creator(s), (2) the purpose of the visualization, (3)
which dataset(s) were used, (4) a brief description of how the
visualization was created, and (5) any other information you would like to
share with the judges.
4. By submitting your visualization, you agree to allow us to display your
visualization at the conference and on the Web Science website and social
media channels. (We will give proper attribution, of course.) You also
certify that you are the copyright holder of the visualization and are
authorized to give us this permission.
5. Entries are due by 11:59PM Hawaii time on April 15, 2014. Please e-mail
your entry to David Crandall <djcran(a)indiana.edu>. (If you do not receive a
confirmation email within 24 hours, your entry has not been received and
should be re-sent.)
Panel of judges
* Yong-Yeol Ahn, Indiana University
* Katy Borner, Indiana University
* Mark Meiss, Google
* Dimitar Nikolov, Indiana University
* Maximilian Schich, University of Texas
For questions, please contact David Crandall <djcran(a)indiana.edu>.
For more information about the 2014 Web Science Conference, please see
http://websci14.org.
--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gciampag(a)indiana.edu
✆ 1-812-855-7261