Hi all,
If you use Hive on stat1002/1004, you might have seen a deprecation
warning when you launch the hive client, saying that it is being replaced
with Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
Curious, what percentage of digital assistants (Alexa, Siri, Cortana,
Google) cite Wikipedia when a person asks a question?
Does the current Wikipedia mobile app support voice search?
Are there any reports on this? Thanks in advance!
Sincere regards,
Stella
--
Stella Yu | STELLARESULTS | 415 690 7827
"Chronicling heritage brands and legendary people."
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia (see the sketch after this list)
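For illustration, here is a rough sketch of how the dump could be loaded for the uses above. The file name "clickstream.tsv", the example article "London", and the column names "prev", "curr", and "n" are assumptions made for this sketch, so please check the header of the released TSV before running it.

import csv
from collections import defaultdict

# counts[referer][article] = number of (referer, article) clicks.
# Assumes the TSV has a header row with the column names used below.
# (For the full 22M-pair dump you may prefer a streaming or dataframe approach.)
counts = defaultdict(dict)
with open("clickstream.tsv", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        counts[row["prev"]][row["curr"]] = int(row["n"])

# Most frequent links people clicked on from a given article:
top_out = sorted(counts.get("London", {}).items(), key=lambda kv: -kv[1])[:10]

# Markov chain over articles: P(next article | current article) by row normalization.
def transition_probs(prev):
    row = counts.get(prev, {})
    total = sum(row.values())
    return {curr: n / total for curr, n in row.items()} if total else {}

print(top_out)
print(transition_probs("London"))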
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi all!
tl;dr: Stop using stat100[23] by September 1st.
We’re finally replacing stat1002 and stat1003. These boxes are out of
warranty, and are running Ubuntu Trusty, while most of the production fleet
is already on Debian Jessie or even Debian Stretch.
stat1005 is the new stat1002 replacement. If you have access to stat1002,
you also have access to stat1005. I’ve copied over home directories from
stat1002.
stat1006 is the new stat1003 replacement. If you have access to stat1003,
you also have access to stat1006. I’ve copied over home directories from
stat1003.
I have not migrated any personal cron jobs running on stat1002 or
stat1003. I need your help for this!
Both of these boxes are running Debian Stretch. As such, packages that
your work depends on may have been upgraded. Please log into the new boxes and
try stuff out! If you find anything that doesn’t work, please let me know
by commenting on https://phabricator.wikimedia.org/T152712.
Please be fully migrated to the new nodes by September 1st. This will give
us enough time to fully decommission stat1002 and stat1003 by the end of
this quarter.
I’ve only done a single rsync of home directories. If there is new data on
stat1002 or stat1003 that you want rsynced over, let me know on the ticket.
A few notes:
- stat1002 used to have /a. This has been removed in favor of /srv. /a no
longer exists.
- Home directories are now much larger. You no longer need to create
personal directories in /srv.
- /tmp is still small, so please be careful. If you are running long jobs
that generate temporary data, please have those jobs write into your home
directory, rather than /tmp.
- We might implement user home directory quotas in the future.
Thanks all! I’ll send another email in about a month’s time to remind you
of the impending deadline of Sept 1.
-Andrew Otto
---------- Forwarded message ---------
From: Fil Menczer <filmenczer(a)gmail.com>
Date: Wed, Aug 30, 2017 at 8:57 PM
Subject: Postdoctoral Fellowship at IU: Simulation of Information Diffusion
in Online Social Networks
To: Combating Fake News: The Science of Misinformation <
fakenewssci(a)googlegroups.com>, <pfgn(a)googlegroups.com>
Please help spread the word about this position, which is highly
relevant to the diffusion of fake news and misinformation:
http://cnets.indiana.edu/blog/2017/08/30/socialsim-postdoc/
The Center for Complex Networks and Systems Research
(cnets.indiana.edu) at Indiana University, Bloomington has an open
postdoctoral position to study how information spreads through complex
online social networks. The position is funded by the DARPA program on
Computational Simulation of Online Social Behavior (SocialSim). The
anticipated start date for this position is January 1, 2018
(negotiable). This is an annually renewable appointment for up to 3
years, subject to performance and funding.
The postdoc will join a dynamic and interdisciplinary team that
includes computer, physical, and cognitive scientists. The postdoc
will work with PIs Filippo Menczer, Yong-Yeol Ahn and Alessandro
Flammini, other postdocs, and several PhD students on analysis and
modeling of social media data. Areas of focus will include empirical
analysis of information diffusion patterns, agent-based models for the
spread of information, and cognitive models of information processing.
Go to osome.iuni.iu.edu for further details on the team and related
current projects.
The ideal candidate will have a PhD in computing, mathematical or
physical sciences; a strong background in analysis and modeling of
complex systems and networks; a strong interest in computational
social science; and solid programming skills necessary to handle big
data and develop large scale simulations.
To apply, upload a letter of interest, a CV, a list of publications,
and names and emails for three professional references using this
online application link
(http://indiana.peopleadmin.com/postings/4468). Questions may be sent
by email to Tara Holbrook (https://cnets.indiana.edu/contact/). For
best consideration, apply by 8 October 2017.
Indiana University is an equal employment and affirmative action
employer and a provider of ADA services. All qualified applicants will
receive consideration for employment without regard to age, ethnicity,
color, race, religion, sex, sexual orientation or identity, national
origin, disability status or protected veteran status.
-Fil
Filippo Menczer
Professor of Informatics and Computer Science
Center for Complex Networks and Systems Research
Indiana University Network Science Institute
http://cnets.indiana.edu/fil
--
Giovanni Luca Ciampaglia <http://glciampaglia.com/> *∙* Assistant
Research Scientist, Indiana University
SocInfo 2017 <http://socinfo2017.oii.ox.ac.uk/> *∙* Register NOW
<http://socinfo2017.oii.ox.ac.uk/#registration>
WWW 2018 <https://www2018.thewebconf.org/> *∙* Alternate track on
Journalism,
Misinformation, and Fact Checking
<https://www2018.thewebconf.org/call-for-papers/misinformation-cfp/>
Hi all,
==Question==
Do you know of a dataset we can use as ground truth for aligning
sections of one article in two languages? I'm thinking a tool such as
Content Translation may capture this data somewhere, or there may be
some other community initiative that has matched a subset of the
sections between two versions of one article in two languages. Any
insights/directions are appreciated. :) I'm not going to worry about
which language pairs we have this dataset for right now; the first
question is: do we have anything? :)
==Context==
As part of the research we are doing to build recommendation systems
that can recommend sections (or templates) for already existing
Wikipedia articles, we are looking at the problem of section alignment
between languages, i.e., given two languages x and y and two versions
of article a in these two languages, can an algorithm (with relatively
high accuracy) tell us which section in the article in language x
corresponds to which section in the article in language y?
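For concreteness, here is a rough, hypothetical sketch of the kind of alignment output we have in mind: a language-agnostic baseline that compares sections by the overlap of the language-independent items (e.g. Wikidata IDs) they link to. The function, threshold, and toy data below are invented for illustration only and are not a proposed solution.

def jaccard(a, b):
    """Jaccard overlap between two sets of linked item IDs."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def align_sections(sections_x, sections_y, threshold=0.1):
    """Greedy alignment: for each section in language x, pick the section in
    language y whose set of linked items overlaps most, if above threshold."""
    alignment = []
    for sx, links_x in sections_x.items():
        best_sy, best_score = None, 0.0
        for sy, links_y in sections_y.items():
            score = jaccard(links_x, links_y)
            if score > best_score:
                best_sy, best_score = sy, score
        if best_sy is not None and best_score >= threshold:
            alignment.append((sx, best_sy, best_score))
    return alignment

# Toy data: section title -> set of linked item IDs (invented for illustration).
en = {"History": {"Q12544", "Q2277", "Q1747689"}, "Geography": {"Q46", "Q4628"}}
es = {"Historia": {"Q12544", "Q2277"}, "Geografía": {"Q46", "Q183"}}
print(align_sections(en, es))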
Thanks,
Leila
--
Leila Zia
Senior Research Scientist
Wikimedia Foundation
Hi Leila,
Off the top of my head, I can only think of this paper, which I read a while
ago:
https://eprints.soton.ac.uk/403386/1/tweb_gottschalk_demidova_multiwiki.pdf
I assume what needs to be considered is the (lack of) content overlap of
articles in different languages in general, as in, for example,
http://dl.acm.org/citation.cfm?id=1753370, which also compares different
language Wikipedias, but more in the sense of completeness.
Sounds like interesting work, looking forward to seeing what you come up
with!
All the best,
Lucie
On 30 August 2017 at 00:13, Leila Zia <leila(a)wikimedia.org> wrote:
> Hi Scott,
>
>
> On Mon, Aug 28, 2017 at 2:01 AM, Scott Hale <computermacgyver(a)gmail.com>
> wrote:
> > Dear Leila,
> >
> > ==Question==
> >> Do you know of a dataset we can use as ground truth for aligning
> >> sections of one article in two languages?
> >>
> >
> > This question is super interesting to me. I am not aware of any ground
> > truth data, but could imagine trying to build some from
> > [[Template:Translated_page]]. At least on enwiki it has a "section"
> > parameter that is to be set:
>
> nice! :) Thanks for sharing it. It is definitely worth looking into.
> I did some searching across a few languages, and its usage is
> limited (in es, around 600 uses, for example); once you start slicing and
> dicing it, the labels become too few. But still, we may be able to use
> it now or in the future.
>
> >> ==Context==
> >> As part of the research we are doing to build recommendation systems
> >> that can recommend sections (or templates) for already existing
> >> Wikipedia articles, we are looking at the problem of section alignment
> >> between languages, i.e., given two languages x and y and two versions
> >> of article a in these two languages, can an algorithm (with relatively
> >> high accuracy) tell us which section in the article in language x
> >> corresponds to which section in the article in language y?
> >>
> >
> >
> > While I am not aware of research on Wikipedia section alignment per se,
> > there is a good amount of work on sentence alignment and building
> > parallel/bilingual corpora that seems relevant to this [1-4]. I can
> > imagine an approach that would look for near matches across two Wikipedia
> > articles in different languages and then examine the distribution of these
> > sentences within sections to see if one or more sections looked to be
> > omitted. One challenge is the sub-article problem [5], with which of course
> > you are already familiar. I wonder whether computing the overlap in article
> > links a la Omnipedia [6] and then examining the distribution of these
> > between sections would work and be much less computationally intensive. I
> > fear, however, that this could over-identify sections further down an
> > article as missing, given (I believe) that article links are often
> > concentrated towards the beginning of an article.
>
> exactly.
>
> a side note: we are trying to stay away, as much as possible, from
> research/results that rely on NLP techniques, as the introduction of
> NLP will usually translate relatively quickly to limitations on what
> languages our methodologies can scale to.
>
> Thanks, again! :)
>
> Leila
>
> >
> > [1] Learning Joint Multilingual Sentence Representations with Neural
> > Machine Translation. 2017
> > https://arxiv.org/abs/1704.04154
> >
> > [2] Fast and Accurate Sentence Alignment of Bilingual Corpora. 2002.
> > https://www.microsoft.com/en-us/research/publication/fast-and-accurate-sentence-alignment-of-bilingual-corpora/
> >
> > [3] Large scale parallel document mining for machine translation. 2010.
> > http://www.aclweb.org/anthology/C/C10/C10-1124.pdf
> >
> > [4] Building Bilingual Parallel Corpora Based on Wikipedia. 2010.
> > http://www.academia.edu/download/39073036/building_bilingual_parallel_corpora.pdf
> >
> > [5] Problematizing and Addressing the Article-as-Concept Assumption in
> > Wikipedia. 2017
> > http://www.brenthecht.com/publications/cscw17_subarticles.pdf
> >
> > [6] Omnipedia: Bridging the Wikipedia Language Gap. 2012.
> > http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf
> >
> > Best wishes,
> > Scott
> >
> > --
> > Dr Scott Hale
> > Senior Data Scientist
> > Oxford Internet Institute, University of Oxford
> > Turing Fellow, Alan Turing Institute
> > http://www.scotthale.net/
> > scott.hale(a)oii.ox.ac.uk
Dear Sir or Madam,
Thank you for your support of my AICCSA paper. It is an honour for me. I have completed nearly all of the required revisions. However, its English has not yet been proofread. Could you please check and improve the language quality of my AICCSA paper? It is currently available at https://1drv.ms/w/s!AiC69hcGxSVPl1UI1SV81mkr2uVu. As for the grant to attend AICCSA 2017, you can still endorse it at https://meta.wikimedia.org/wiki/Grants:Project/Rapid/Csisc/Presenting_Wikid….
Yours Sincerely,
Houcemeddine Turki
AICCSA-Copie-_2_ 2.docx<https://1drv.ms/w/s!AiC69hcGxSVPl1UI1SV81mkr2uVu>
Shared via OneDrive
________________________________
From: ANLP2017 <anlp2017(a)easychair.org>
Sent: Wednesday, 16 August 2017, 00:53
To: Houcemeddine Turki
Subject: ANLP2017 notification for paper 3
Dear Dr. Houcemeddine Turki,
Congratulations! On behalf of the ANLP 2017 workshop and the Conference Committees of the 14th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2017), October 30th to November 3rd, 2017,
we are happy to inform you that your paper entitled:
Using WikiData to create a multi-lingual multi-dialectal dictionary for Arabic dialects
has been accepted for presentation and inclusion in the Proceedings of AICCSA - ANLP 2017, published by IEEE.
Please see the reviewers’ comments below on your paper. These comments are intended to help you improve your paper for final publication. The listed comments should be addressed, as final acceptance is conditional upon an appropriate response to the requirements and comments. The conference committee retains a list of certain critical comments to be addressed by authors, and will check that these have been addressed in the camera-ready version.
What is next:
-------------
The AICCSA website has now been updated with the required information. Please find below the details for the camera-ready submission and the registration.
Camera ready submission:
Due date: 31/8/2017
Submission information can be found at the following link: http://www.aiccsa.net/AICCSA2017/submission
Paper Registration:
Due date: 8/9/2017
The registration information can be found at the following link: http://www.aiccsa.net/AICCSA2017/registration
We are looking forward to meeting you at AICCSA 2017.
Best Regards,
AICCSA - ANLP 2017 Organization Team.
==============================
----------------------- REVIEW 1 ---------------------
PAPER: 3
TITLE: Using WikiData to create a multi-lingual multi-dialectal dictionary for Arabic dialects
AUTHORS: Houcemeddine Turki, Denny Vrandečić, Helmi Hamdi and Imed Adel
Overall evaluation: 1 (weak accept)
----------- Overall evaluation -----------
This is interesting work with valid assumptions, and its proposed ideas are in line with the expectations of the event. Overall, I think this paper is interesting and makes a good contribution to this topic. However, the authors are advised to address the following points in their revised version:
- Please elaborate in detail on the proposed approach, focusing more on the relations between its components, as they are the core of the solution and need more justification of why they are used.
- The overall technical exposition must be strengthened with more concrete examples.
- The authors are urged to summarize and list the key observations from the paper.
- The paper is readable, but a language review by a native speaker is recommended.
- There are some minor editorial issues, such as the quality of the plots, the equations, etc.
- Many references have *incomplete* bibliographic information (a missing publication venue, for instance). This must be corrected.
In summary, it is a well-prepared paper.
----------------------- REVIEW 2 ---------------------
PAPER: 3
TITLE: Using WikiData to create a multi-lingual multi-dialectal dictionary for Arabic dialects
AUTHORS: Houcemeddine Turki, Denny Vrandečić, Helmi Hamdi and Imed Adel
Overall evaluation: 0 (borderline paper)
----------- Overall evaluation -----------
The paper's technical content is very marginal, the paper has many language and editorial issues, and better presentation of the results and figure quality are needed. I think the paper is not ready for publication yet.
Other issues to be considered too:
- The paper lacks clarity in motivating the proposed research and in stating its expected outcome.
- The review of the state of the art lacks an analysis of the existing work and the positioning of the research within the state of the art.
- The methodology is too general and does not convincingly show the feasibility of the proposed approach.
- The research lacks a concrete illustration on a case study.
Recommendations to the authors:
- To clearly state the objective of the research in terms of the problems to address and the expected results, and to show how the proposed research will advance the state of the art by overcoming the limitations of the existing work.
- To present an analysis of the state of the art and discuss the benefits/limitations of the existing approaches with respect to the addressed research problem.
- To be more precise in the description of the methodology and show how the methodology would achieve the stated objectives.
- To discuss the future plans with respect to the research's state of progress and its limitations.
There are serious editorial issues in the paper that need to be fixed, and many of the figures are fuzzy and need to be reconsidered.