[Wikimedia-l] Fwd: [Wmfcc-l] [press] Erik Zachte in Wired

Sue Gardner sgardner at wikimedia.org
Sat Jan 4 01:31:29 UTC 2014

Just wanted to share this article, because it makes me so happy!
Erik's one of our earliest contributors and *we've* all depended on
his work for years, but it's mostly invisible to the world beyond
Wikimedia. It makes me really happy to see him get some external
recognition :-)


---------- Forwarded message ----------
From: "Jay Walsh" <jwalsh at wikimedia.org>
Date: 27 Dec 2013 12:20
Subject: [Wmfcc-l] [press] Erik Z in Wired
To: "Communications Committee" <wmfcc-l at lists.wikimedia.org>


Meet the Stats Master Making Sense of Wikipedia’s Massive Data Trove

Erik Zachte. Photo: Lane Hartwell/Wikimedia Foundation

There are websites, and then there’s Wikipedia. The internet behemoth
boasts 30 million articles written in more than 285 languages, tweaked
by 70,000 active editors and viewed by 530 million visitors worldwide
each month. As mountains of information go, it’s Everest. Teasing out
trends from the open source encyclopedia’s archives is a task few
would even attempt. Yet Erik Zachte did just that.

Zachte used his statistical intuition to create “Wikistats,” an online
statistics package that’s more than a trove of charts and graphs for
data geeks. It’s the most direct measure yet of Wikipedia’s success in
achieving its central objective: making the sum of all human knowledge
available to everyone everywhere.

“When I discovered Wikipedia I felt thrilled from the outset,” says
Zachte, who was working as an IT guy at KLM Airlines in the early days
of the Wiki revolution. Not content simply to edit articles, he joined
the mailing lists in which a fervid network of volunteers debated how
to increase the site’s functionality. As Wikipedia exploded in
popularity, power users complained there was no consistent way to
measure its growth in article count from the beginning.

“In 2003 there was already an online page counter if I remember
correctly, but not much else,” says Zachte. He realized it was
possible to extract far more descriptive data from historical metadata
in Wikipedia’s massive database dumps, copies of all raw content that
are available to anyone in XML format.

He started crunching numbers and quickly became famous among fellow
Wikiholics for developing Wikistats. The site’s monthly reports filled
a valuable niche for descriptive metrics in the Wiki community, with
measures like article count, number of editors, and edits per article
that serve as proxy indicators of Wiki quality. Impressed by Zachte’s
stat-fu, the nonprofit Wikimedia Foundation that supports the
Wikipedia infrastructure made him its data analyst in 2008.

Since then, Zachte’s figures – all of which are open source and in the
public domain – have revealed ongoing challenges to the organization’s
growth, as well as noteworthy trends.

Wikistats data made it clear that a core of Wikipedians does an
outsize portion of the editing. As of October, 4.7 million people have
contributed to the English language Wikipedia, but just over 26,000
people have made more than 1,000 edits. In fact, that relatively small
group of people has made 73 percent of all edits. While a small core
of very active editors has remained stable, a larger pool of active
editors (those making at least five edits monthly) in all Wikipedia
language editions peaked at 90,000 in 2007 and has dropped since. As
of October, the count stands at 70,000.
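
The "active editor" measure described above (at least five edits in a
month) can be illustrated with a minimal sketch. This is not Wikistats
code; the function name, the (editor, month) input format, and the
sample data are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict


def active_editors_per_month(edits, threshold=5):
    """Count editors who made at least `threshold` edits in each month.

    `edits` is an iterable of (editor, month) pairs, one pair per edit,
    for example ("alice", "2013-10"). This input shape is hypothetical.
    """
    # (editor, month) -> number of edits by that editor in that month
    edits_per_editor_month = Counter(edits)

    active = defaultdict(int)
    for (editor, month), edit_count in edits_per_editor_month.items():
        if edit_count >= threshold:
            active[month] += 1
    return dict(active)


# Made-up sample data: one pair per edit.
sample_edits = (
    [("alice", "2013-10")] * 6
    + [("bob", "2013-10")] * 2
    + [("carol", "2013-10")] * 5
    + [("alice", "2013-11")] * 5
)

print(active_editors_per_month(sample_edits))
```

Here bob's two edits fall below the threshold, so only alice and carol
count as active in the first month.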

That has some worried that a shrinking community indicates declining
quality, and it has prompted concerted efforts within the Wikimedia
Foundation to boost editor engagement, which the organization considers
one of the foremost indicators of Wikipedia’s success. In 2009, the
organization
launched an ambitious five-year strategic plan to drastically increase
language and content diversity by encouraging internet users in the
“Global South” – particularly the developing regions of Africa, Asia,
the Middle East, and Latin America – to contribute. Wikistats metrics
gauge its progress each month.

“Many projects exist within WMF to influence editor influx and
retention,” says Zachte, “but in the end Wikistats gives the final
count: Are we on the right track?”

The numbers show reason for measured optimism. While the largest and
most densely populated language editions like English, German, French,
and Japanese, have seen the number of active editors level off or even
decline since about 2007, newer editor networks in highly populous
languages like Chinese, Arabic, and Persian continue to grow. In
addition, the global share of page edits is slowly shifting to
populous countries in the southern hemisphere, some of which, like
India and the Philippines, use and edit Wikipedia overwhelmingly in

Zachte’s reports also reveal idiosyncratic patterns of activity in
different languages.

For example, some volunteer coders program bots to create article
stubs in massive bursts, hoping other users will expand the articles
over time. While bots can supplement the work of active editor
networks, Wikistats summaries show that some language editions are
populated almost entirely by bot-created stubs – like the Cebuano and
Waray-Waray Wikipedias, which rocketed to almost one million articles
this year despite tiny editor networks that are unlikely to fill in
those blanks anytime soon.

Zachte’s animation of growth for all Wikipedia language sites measures
four aspects of each site: bubbles representing each language
slide across an x-axis indicating their age and up a y-axis measuring
their article count, expanding as their editor networks grow and
changing color as average article size grows. Image: Erik Zachte

The data also provide raw material for striking visualizations, which
Zachte sometimes creates and posts on his blog, Infodisiac, and
compiles from other authors on Wikistats.

For years, Zachte was the only staffer working on general metrics
about Wikipedia, but today the Wikimedia Foundation has many
analysts and engineers crunching data. The organization is preparing
to absorb Zachte’s work into a much more powerful data infrastructure.

“The plan is to take the existing functionality of Wikistats and
modernize it across the board,” says Toby Negrin, Wikimedia’s director
of analytics. “Erik’s work is amazing, but we need to make the data
more accessible and update it faster.”

One recent update is a streamlined Monthly Report Card that tracks
user engagement by language and geographical region, with customizable
graphs measuring factors like unique visitors, page views, and editing
activity over time. Other extensions will capture and analyze all
Wikimedia traffic, and provide metrics for editor engagement projects
like Wikipedia Zero, which gives users in developing countries free
Wikipedia access on their mobile devices.

Zachte embraces the changes. “Most of what I built will be phased out
over the coming years,” he says. “I’m fine with that. All software has
a limited lifespan.”

Until the new infrastructure can take over, Zachte maintains the
scripts that populate Wikistats reports while working from home in
Leiden, the Netherlands. Occasionally, he works on analytic pet
projects. His next idea focuses on measuring content diversity across
different Wikipedia language editions.

“In early years Wikipedia was often characterized as mostly geek
content: physics and sci-fi,” he says. “People don’t do that anymore,
but is our content really balanced now? Do we have similar depth of
content for ballet or folk culture or fashion?”

Most articles in larger Wikipedias are assigned multiple categories –
for example, the English-language entry for Barack Obama lists 45. But
users can assign a single article many different categories, and each
category can have an unlimited number of parent categories. That makes
it difficult to easily compare the number of articles in each category
as an indicator of content diversity.

Zachte’s idea is that comparing word frequencies within articles to
word frequencies for all named categories in a language (the English
Wikipedia has over 1 million, according to a 2012 estimate) can more
effectively categorize articles, and create profiles of which topics
receive heavier coverage. He has written a proposal, but it’s still
unclear how it fits into Wikimedia’s current budget. It might just be
a hobby project – or, open source to the end, he concedes that someone
else might as well scoop him.
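
The article gives only the outline of this idea, so the following is a
loose sketch of one way word frequencies could be compared, not
Zachte's proposal. It assumes each category is represented by some
sample text, and it uses cosine similarity as the comparison, which is
a choice made here rather than anything stated in the article. All
function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return relative word frequencies for lowercase alphabetic words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two sparse word-frequency dicts."""
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_category(article_text, category_texts):
    """Score an article against each category profile; return the top one."""
    article = word_frequencies(article_text)
    scores = {
        name: cosine_similarity(article, word_frequencies(text))
        for name, text in category_texts.items()
    }
    return max(scores, key=scores.get), scores


# Made-up category profiles and article text, for illustration only.
category_texts = {
    "ballet": "ballet dance dancer stage choreography dance",
    "physics": "physics energy particle quantum energy mass",
}
best, scores = best_category(
    "The dancer performed the ballet on stage", category_texts
)
print(best)
```

A real profile would need far more text per category and some handling
of common words such as "the", which this sketch leaves out.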

“Now I have given away the basic concept,” he says. “Someone can base
her thesis on this, and beat me to it, which is fine. Science would
progress faster if it did not thrive on secrecy.”

Another Zachte animation visualizes all Wikipedia edits on a specific
day in July 2011, on a world map in which 369,483 edits in multiple
languages appear as geographically distributed bursts of color in a
sped-up version of real time. Image: Erik Zachte


Jay Walsh

Wmfcc-l mailing list
Wmfcc-l at lists.wikimedia.org

More information about the Wikimedia-l mailing list