Who here recalls a published report of research that determined Wikipedia
was the source of most digital knowledge bases? If my memory serves me
correctly, it was some amazingly huge number, like 90%+ of them use
Wikipedia as their source for content.
Can anyone help with the URL of the report? Thanks!
Stella Yu | STELLARESULTS | 415 690 7827
"Chronicling heritage brands and legendary people."
2nd Call for Posters & Demos
SEMANTiCS 2017 - The Linked Data Conference
13th International Conference on Semantic Systems
September 11–14, 2017
For details please go to: https://2017.semantics.cc/calls
Important Dates (Posters & Demos Track):
*Submission Deadline: extended: July 25, 2017 (11:59 pm, Hawaii time)
*Notification of Acceptance: August 10, 2017 (11:59 pm, Hawaii time)
*Camera-Ready Paper: August 18, 2017 (11:59 pm, Hawaii time)
As in previous years, SEMANTiCS’17 proceedings will be published by
ACM ICPS (pending) and in the CEUR-WS proceedings.
This year, SEMANTiCS features a special Data Science track, which is an
opportunity to bring together researchers and practitioners interested
in data science and its intersection with Linked Data to present their
ideas and discuss the most important scientific, technical and
socio-economical challenges of this emerging field.
SEMANTiCS 2017 will especially welcome submissions on the following hot
topics:
*Metadata, Versioning and Data Quality Management
*Semantics for Safety, Security & Privacy
*Web Semantics, Linked (Open) Data & schema.org
*Corporate Knowledge Graphs
*Knowledge Integration and Language Technologies
*Economics of Data, Data Services and Data Ecosystems
Special Track (please check appropriate topic in submission system)
Following the success of previous years, we welcome any submissions
related but not limited to the following ‘horizontal’ (research) and
‘vertical’ (industries) topics:
*Enterprise Linked Data & Data Integration
*Knowledge Discovery & Intelligent Search
*Business Models, Governance & Data Strategies
*Semantics in Big Data
*Data Portals & Knowledge Visualization
*Semantic Information Management
*Document Management & Content Management
*Terminology, Thesaurus & Ontology Management
*Smart Connectivity, Networking & Interlinking
*Smart Data & Semantics in IoT
*Semantics for IT Safety & Security
*Semantic Rules, Policies & Licensing
*Community, Social & Societal Aspects
Data Science Special Track Horizontals:
*Large-Scale Data Processing (stream processing, handling large-scale data)
*Data Analytics (Machine Learning, Predictive Analytics, Network Analytics)
*Communicating Data (Data Visualization, UX & Interaction Design)
*Cross-cutting Issues (Ethics, Privacy, Security, Provenance)
*Industry & Engineering
*Life Sciences & Health Care
*Galleries, Libraries, Archives & Museums (GLAM)
*Education & eLearning
*Media & Data Journalism
*Publishing, Marketing & Advertising
*Tourism & Recreation
*Financial & Insurance Industry
*Telecommunication & Mobile Services
*Sustainable Development: Climate, Water, Air, Ecology
*Energy, Smart Homes & Smart Grids
*Food, Agriculture & Farming
*Safety, Security & Privacy
*Transport, Environment & Geospatial
Posters & Demos Track
The Posters & Demonstrations Track invites innovative work in progress,
late-breaking research and innovation results, and smaller contributions
in all fields related to the broadly understood Semantic Web. These
include submissions on innovative applications with impact on end users
such as demos of solutions that users may test or that are yet in the
conceptual phase, but are worth discussing, and also applications, use
cases or pieces of code that may attract developers and potential
research or business partners. This also concerns new data sets made
publicly available.
The informal setting of the Posters & Demonstrations Track encourages
participants to present innovations to the research community and business
users, find new partners or clients, and engage in discussions about
the presented work. Such discussions can be invaluable inputs for the
future work of the presenters, while offering conference participants an
effective way to broaden their knowledge of the emerging research trends
and to network with other researchers.
Poster and demo submissions should consist of a paper that describes the
work and its contribution to the field or novelty aspects. Submissions must
be original and must not have been submitted for publication elsewhere.
Accepted papers will be published in HTML (RASH) in CEUR and, as such,
the camera-ready version of the papers will be required in HTML,
following the poster and demo guidelines (https://goo.gl/3BEpV7). Papers
should be submitted through EasyChair
(https://easychair.org/conferences/?conf=semantics2017) and should be
less than 2200 words in length (equivalent to 4 pages), including the
whole content of the paper.
For the initial reviewing phase, authors may submit a PDF version of the
paper following any layout. After acceptance, authors are required to
submit the camera-ready in HTML (RASH).
Submissions will be reviewed by experienced and knowledgeable
researchers and practitioners; each submission will receive detailed
feedback. For demos, we encourage authors to include links enabling the
reviewers to test the application or review the component.
For details please go to: https://2017.semantics.cc/calls
Thank you, Leila, Stuart, and Pine.
We will follow up on these comments and pointers.
A few additional words about this research -
Our narrow definition of formal expertise focuses on those with academic
qualifications who have published a scholarly work (i.e., one that appears
in Google Scholar) on the topic of the specific Wikipedia articles where one
was active.
We acknowledge that many experts do not have academic qualifications.
The choice of "formal" (i.e. academic in this context) expertise enabled a
concrete operationalization and measurement.
We welcome any ideas for pinpointing informal experts.
We are currently in the first phase of research, where we try to identify
these formal experts. We've spent a considerable amount of time identifying
500 such experts, and now we use machine learning techniques to
automatically spot them (preliminary results are quite good).
Once this is done, we can start asking interesting questions, such as:
- What is the relative role of these formal experts to overall content
contributed to Wikipedia?
- Are formal experts' contributions "better"? (e.g., do they survive longer
or result in an increased quality score, per ORES?)
- Who are those formal experts? Anonymous contributors? Registered users?
Do they take additional roles within the community?
- Formal experts' motivation
Any other ideas for taking this research forward are more than welcome.
Ofer, Einat and Alex
There are only two weeks left to benefit from the reduced registration
fee for the SEMANTICS2017 <https://2017.semantics.cc/> conference in
*To get the discount, please register
<https://2017.semantics.cc/prices> before July 1st, 2017.*
Looking forward to meeting you at the conference!
Semantics Organizing Team
I've been working on taxonomy learning from Wikipedia categories in my
Here's a recap of the approach I proposed to address the pruning problem
you faced. It's a pipeline with a bottom-up direction, i.e., from the
leaves up to the root.
Stage 1: leaf nodes
INPUT = category + category links SQL dumps, like you do
1.1. extract the full set of article pages;
1.2. extract categories that are linked to article pages only, by
looking at the outgoing links for each article;
1.3. identify the set of categories with no sub-categories.
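As a sketch, stage 1 could be implemented over the parsed dumps like this. The data structures and names here are illustrative assumptions: a set of article page IDs, a set of category page IDs, and the (cl_from, cl_to) pairs from categorylinks:

```python
def leaf_categories(article_ids, category_ids, category_links):
    """Stage 1: categories that are linked from article pages only and
    have no sub-categories. A link (source, target) means page `source`
    belongs to category `target`; a category acquires a sub-category
    exactly when a category-namespace page links to it."""
    from_articles = set()
    from_categories = set()
    for source, target in category_links:
        if source in article_ids:
            from_articles.add(target)
        elif source in category_ids:
            from_categories.add(target)  # `target` has a sub-category
    # Linked from articles only, hence also no sub-categories:
    return from_articles - from_categories
```

Note that steps 1.2 and 1.3 collapse into one set difference here, because a category gains a sub-category precisely when a category page links to it.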
Stage 2: prominent nodes
INPUT = stage 1 output
2.1. traverse the leaf graph (see the algorithm below);
2.2. NLP to identify categories that hold is-a relations, i.e., *noun
phrases* with *plural head*, inspired by the YAGO approach [2, 3];
2.3. (optional) set a usage weight based on the number of category
interlanguage links (more links = more usage across language chapters).
These 2 stages should output the clean dataset you're looking for.
Based on that, you can then build the taxonomy.
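Step 2.2 could be roughly approximated without a full parser; the sketch below uses a naive suffix heuristic to decide whether the head noun of a category title is plural. The YAGO papers [2, 3] use proper noun-phrase parsing, so treat this only as a placeholder:

```python
PREPOSITIONS = {"in", "of", "by", "from", "at", "on", "for", "with"}

def is_class_category(title):
    """Heuristic for step 2.2: a category such as 'Airports_in_Germany'
    tends to hold is-a relations when its head noun is plural. We take
    the head to be the last word before a preposition, and call it
    plural if it ends in 's' but not 'ss' (naive: 'Physics' or 'News'
    would be misclassified)."""
    words = title.replace("_", " ").split()
    head_words = []
    for word in words:
        if word.lower() in PREPOSITIONS:
            break
        head_words.append(word)
    if not head_words:
        return False
    head = head_words[-1].lower()
    return head.endswith("s") and not head.endswith("ss")
```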
Feel free to ping me if you need more information.
Input: L (leaf nodes set)
Output: PN (prominent nodes set)

for all l in L do
    isProminent = true;
    P = getTransitiveParents(l);
    for all p in P do
        C = getChildren(p);
        areAllLeaves = true;
        for all c in C do
            if c not in L then
                areAllLeaves = false;
        if areAllLeaves then
            isProminent = false;
    if isProminent then
        add l to PN;
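In Python, the same traversal could be sketched as follows, representing the category graph as a child-to-parents dict; the helper names mirror the pseudocode and are assumptions, not existing code:

```python
def get_transitive_parents(graph, node):
    """Collect all ancestors of `node` in a child -> parents mapping."""
    seen = set()
    stack = [node]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def get_children(graph, node):
    """All direct children of `node` in a child -> parents mapping."""
    return {child for child, parents in graph.items() if node in parents}

def prominent_nodes(graph, leaves):
    """A leaf stays prominent unless some ancestor has only leaf children."""
    prominent = set()
    for leaf in leaves:
        is_prominent = True
        for p in get_transitive_parents(graph, leaf):
            children = get_children(graph, p)
            if children and all(c in leaves for c in children):
                is_prominent = False
                break
        if is_prominent:
            prominent.add(leaf)
    return prominent
```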
[2] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic
knowledge. In Proceedings of the 16th International Conference on World
Wide Web, pages 697–706. ACM, 2007.
[3] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a
spatially and temporally enhanced knowledge base from Wikipedia.
Artificial Intelligence, 194:28–61, 2013.
On 7/11/17 03:21, wiki-research-l-request(a)lists.wikimedia.org wrote:
> Date: Mon, 10 Jul 2017 18:20:47 -0700
> From: Leila Zia<leila(a)wikimedia.org>
> To: Research into Wikimedia content and communities
> Subject: [Wiki-research-l] category extraction question
> Content-Type: text/plain; charset="UTF-8"
> Hi all,
> [If you are not interested in discussions related to the category system
> (on English Wikipedia)
> , you can stop here. :)]
> We have run into a problem that some of you may have thought about or
> addressed before. We are trying to clean up the category system on English
> Wikipedia by turning the category structure to an IS-A hierarchy. (The
> output of this work can be useful for the research on template
> recommendation, for example, but the use-cases won't stop there). One
> issue that we are facing is the following:
> We are currently using the SQL dumps to extract categories associated
> with every article on English Wikipedia (main namespace).
> Using this approach, we get 5 categories associated with the Flow
> cytometry bioinformatics article:
> The problem is that only the first two categories are the ones we are
> interested in. We have one cleaning step through which we only keep
> categories that belong to category Article and that step removes the last
> category above, but the other two Wikipedia_... remain there. We need to
> somehow prune the data and clean it from those two categories.
> One way we could do the above would be to parse wikitext instead of the SQL
> dumps and focus on extracting categories marked by pattern [[Category:XX]],
> but in that case, we would lose a good category such as
> because that's generated by a template.
> Any ideas on how we can start with a "cleaner" dataset of categories
> related to the topic of the articles as opposed to maintenance related or
> other types of categories?
> The exact code we use is:
> SELECT p.page_id id, p.page_title title, cl.cl_to category
> FROM categorylinks cl
> JOIN page p
> on cl.cl_from = p.page_id
> where cl_type = 'page'
> and page_namespace = 0
> and page_is_redirect = 0
> and the edges of the category graph are extracted with
> SELECT p.page_title category, cl.cl_to parent
> FROM categorylinks cl
> JOIN page p
> ON p.page_id = cl.cl_from
> where p.page_namespace = 14
---------- Forwarded message ----------
From: Melody Kramer <mkramer(a)wikimedia.org>
Date: Mon, Jul 10, 2017 at 2:26 PM
Subject: [Wikimedia-l] [fellowship] Opportunity for people working on "open
projects that support a healthy Internet."
I wanted to pass along an opportunity that I saw earlier today via Twitter:
It sets up people working on "open projects that support a healthy
Internet" with a mentor, a cohort of like-minded people from all over the
world, and a trip to Mozfest, which is a London-based open Internet
conference I've attended/presented at in past years and found really
mind-expanding due to the cross-disciplinary conversations that take place.
You can see previous projects here:
https://mozilla.github.io/leadership-training/round-3/projects/ — it
looks like there's quite a
broad cross-section and many of the projects across the movement might be
applicable. The post notes participants will learn about "best practices
for project setup and communication, tools for collaboration, community
building, and running events."
Thank you to Leila for suggesting I pass this along to this listserv. Feel
free to share it broadly.
Melody Kramer <https://www.mediawiki.org/wiki/User:MKramer_(WMF)>
Senior Audience Development Manager
Read a random featured article from Wikipedia!
Wikimedia-l mailing list, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
https://meta.wikimedia.org/
New messages to: Wikimedia-l(a)lists.wikimedia.org
This is just a friendly reminder that we plan to turn off the RCStream
service after July 7th.
We’re tracking as best we can the progress of porting clients over at
https://phabricator.wikimedia.org/T156919. But, we can only help with what
we know about. If you’ve got something still running on RCStream that
hasn’t yet ported, let us know, and/or switch soon!
On Wed, Feb 8, 2017 at 9:28 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
> Hi everyone!
> Wikimedia is releasing a new service today: EventStreams
> <https://wikitech.wikimedia.org/wiki/EventStreams>. This service allows
> us to publish arbitrary streams of JSON event data to the public.
> Initially, the only stream available will be good ol’ RecentChanges
> <https://www.mediawiki.org/wiki/Manual:RCFeed>. This event stream
> overlaps functionality already provided by irc.wikimedia.org and RCStream
> <https://wikitech.wikimedia.org/wiki/RCStream>. However, this new
> service has advantages over these (now deprecated) services.
> We can expose more than just RecentChanges.
> Events are delivered over streaming HTTP (chunked transfer) instead of
> IRC or socket.io. This requires less client side code and fewer
> special routing cases on the server side.
> Streams can be resumed from the past. By using EventSource, a
> disconnected client will automatically resume the stream from where it left
> off, as long as it resumes within one week. In the future, we would like
> to allow users to specify historical timestamps from which they would like
> to begin consuming, if this proves safe and tractable.
> I did say deprecated! Okay okay, we may never be able to fully deprecate
> irc.wikimedia.org. It’s used by too many (probably sentient by now) bots
> out there. We do plan to obsolete RCStream, and to turn it off in a
> reasonable amount of time. The deadline iiiiiis July 7th, 2017. All
> services that rely on RCStream should migrate to the HTTP based
> EventStreams service by this date. We are committed to assisting you in
> this transition, so let us know how we can help.
> Unfortunately, unlike RCStream, EventStreams does not have server side
> event filtering (e.g. by wiki) quite yet. How and if this should be done
> is still under discussion <https://phabricator.wikimedia.org/T152731>.
> The RecentChanges data you are used to remains the same, and is available
> at https://stream.wikimedia.org/v2/stream/recentchange. However, we may
> have something different for you, if you find it useful. We have been
> internally producing new Mediawiki specific events
> for a while now, and could expose these via EventStreams as well.
> Take a look at these events, and tell us what you think. Would you find
> them useful? How would you like to subscribe to them? Individually as
> separate streams, or would you like to be able to compose multiple event
> types into a single stream via an API? These things are all possible.
> I asked for a lot of feedback in the above paragraphs. Let’s try and
> centralize this discussion over on the mediawiki.org EventStreams talk
> page <https://www.mediawiki.org/wiki/Talk:EventStreams>. In summary,
> the questions are:
> What RCStream clients do you maintain, and how can we help you migrate
> to EventStreams?
> Is server side filtering, by wiki or arbitrary event field, useful to
> you? <https://www.mediawiki.org/wiki/Topic:Tkjkabtyakpm967t>
> Would you like to consume streams other than RecentChanges?
> <https://www.mediawiki.org/wiki/Topic:Tkjk4ezxb4u01a61> (Currently
> available events are described here
> - Andrew Otto
I'm a Master's student working under the supervision of Drs. Arazy and
Minkov.
My research explores the extent to which "recognized domain experts"
contribute to Wikipedia.
(I use a narrow definition for "recognized domain experts" to include those
with academic qualifications in the relevant topic).
I manually tracked these experts using a variety of sources, and then used
machine learning methods to automatically identify domain experts among
Wikipedia editors.
I'm writing to explore whether this research is of interest to the
community and to learn if other people have already tackled this research
question.
Thank you in advance for pointing me to relevant research projects.
Forwarding this to more research people. In case anyone needs to do
research on MoodBar, get in touch with us; those tables will be deleted
otherwise.
---------- Forwarded message ----------
From: Nuria Ruiz <nuria(a)wikimedia.org>
Date: Fri, Jul 7, 2017 at 4:08 PM
Subject: [Analytics] Dropping MoodBar extension tables from all wikis
To: "A mailing list for the Analytics Team at WMF and everybody who has an
interest in Wikipedia and analytics." <analytics(a)lists.wikimedia.org>
Cc: Manuel Arostegui <marostegui(a)wikimedia.org>
This is an FYI that the MoodBar extension has been undeployed and, as such,
its tables will be removed from all wikis. See https://phabricator.wikimedia.
It looks like this extension sparked some interest in the past and there
were some research projects about it. Please let us know (before August
7th) whether we should keep the tables for any reason.