Hi all,
If you use Hive on stat1002/1004, you might have seen a deprecation
warning when you launch the hive client, saying that it is being replaced
with Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or in #wikimedia-analytics, or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
Curious, what percentage of digital assistants (Alexa, Siri, Cortana,
Google) cite Wikipedia when a person asks a question?
Does the current Wikipedia mobile app support voice search?
Are there any reports on this? Thanks in advance!
Sincere regards,
Stella
--
Stella Yu | STELLARESULTS | 415 690 7827
"Chronicling heritage brands and legendary people."
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia (see the sketch after this list)
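For illustration, here is a rough sketch of how the dump could be loaded for the uses above. The file name "clickstream.tsv", the example article "London", and the column names "prev", "curr", and "n" are assumptions made for this sketch, so please check the header of the released TSV before running it.

import csv
from collections import defaultdict

# counts[referer][article] = number of (referer, article) clicks.
# Assumes the TSV has a header row with the column names used below.
# (For the full 22M-pair dump you may prefer a streaming or dataframe approach.)
counts = defaultdict(dict)
with open("clickstream.tsv", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        counts[row["prev"]][row["curr"]] = int(row["n"])

# Most frequent links people clicked on from a given article:
top_out = sorted(counts.get("London", {}).items(), key=lambda kv: -kv[1])[:10]

# Markov chain over articles: P(next article | current article) by row normalization.
def transition_probs(prev):
    row = counts.get(prev, {})
    total = sum(row.values())
    return {curr: n / total for curr, n in row.items()} if total else {}

print(top_out)
print(transition_probs("London"))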
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi all!
tl;dr: Stop using stat100[23] by September 1st.
We’re finally replacing stat1002 and stat1003. These boxes are out of
warranty, and are running Ubuntu Trusty, while most of the production fleet
is already on Debian Jessie or even Debian Stretch.
stat1005 is the new stat1002 replacement. If you have access to stat1002,
you also have access to stat1005. I’ve copied over home directories from
stat1002.
stat1006 is the new stat1003 replacement. If you have access to stat1003,
you also have access to stat1006. I’ve copied over home directories from
stat1003.
I have not migrated any personal cron jobs running on stat1002 or
stat1003. I need your help for this!
Both of these boxes are running Debian Stretch. As such, packages that
your work depends on may have been upgraded. Please log into the new boxes and
try stuff out! If you find anything that doesn’t work, please let me know
by commenting on https://phabricator.wikimedia.org/T152712.
Please be fully migrated to the new nodes by September 1st. This will give
us enough time to fully decommission stat1002 and stat1003 by the end of
this quarter.
I’ve only done a single rsync of home directories. If there is new data on
stat1002 or stat1003 that you want rsynced over, let me know on the ticket.
A few notes:
- stat1002 used to have /a. This has been removed in favor of /srv. /a no
longer exists.
- Home directories are now much larger. You no longer need to create
personal directories in /srv.
- /tmp is still small, so please be careful. If you are running long jobs
that generate temporary data, please have those jobs write into your home
directory, rather than /tmp.
- We might implement user home directory quotas in the future.
Thanks all! I’ll send another email in about a month’s time to remind you
of the impending deadline of Sept 1.
-Andrew Otto
---------- Forwarded message ---------
From: Fil Menczer <filmenczer(a)gmail.com>
Date: Wed, Aug 30, 2017 at 8:57 PM
Subject: Postdoctoral Fellowship at IU: Simulation of Information Diffusion
in Online Social Networks
To: Combating Fake News: The Science of Misinformation <
fakenewssci(a)googlegroups.com>, <pfgn(a)googlegroups.com>
Please help spread the word about this position, which is highly
relevant to the diffusion of fake news and misinformation:
http://cnets.indiana.edu/blog/2017/08/30/socialsim-postdoc/
The Center for Complex Networks and Systems Research
(cnets.indiana.edu) at Indiana University, Bloomington has an open
postdoctoral position to study how information spreads through complex
online social networks. The position is funded by the DARPA program on
Computational Simulation of Online Social Behavior (SocialSim). The
anticipated start date for this position is January 1, 2018
(negotiable). This is an annually renewable appointment for up to 3
years, subject to performance and funding.
The postdoc will join a dynamic and interdisciplinary team that
includes computer, physical, and cognitive scientists. The postdoc
will work with PIs Filippo Menczer, Yong-Yeol Ahn and Alessandro
Flammini, other postdocs, and several PhD students on analysis and
modeling of social media data. Areas of focus will include empirical
analysis of information diffusion patterns, agent-based models for the
spread of information, and cognitive models of information processing.
Go to osome.iuni.iu.edu for further details on the team and related
current projects.
The ideal candidate will have a PhD in computing, mathematical or
physical sciences; a strong background in analysis and modeling of
complex systems and networks; a strong interest in computational
social science; and solid programming skills necessary to handle big
data and develop large scale simulations.
To apply, upload a letter of interest, a CV, a list of publications,
and names and emails for three professional references using this
online application link
(http://indiana.peopleadmin.com/postings/4468). Questions may be sent
by email to Tara Holbrook (https://cnets.indiana.edu/contact/). For
best consideration, apply by 8 October 2017.
Indiana University is an equal employment and affirmative action
employer and a provider of ADA services. All qualified applicants will
receive consideration for employment without regard to age, ethnicity,
color, race, religion, sex, sexual orientation or identity, national
origin, disability status or protected veteran status.
-Fil
Filippo Menczer
Professor of Informatics and Computer Science
Center for Complex Networks and Systems Research
Indiana University Network Science Institute
http://cnets.indiana.edu/fil
--
Giovanni Luca Ciampaglia <http://glciampaglia.com/> *∙* Assistant
Research Scientist, Indiana University
SocInfo 2017 <http://socinfo2017.oii.ox.ac.uk/> *∙* Register NOW
<http://socinfo2017.oii.ox.ac.uk/#registration>
WWW 2018 <https://www2018.thewebconf.org/> *∙* Alternate track on
Journalism,
Misinformation, and Fact Checking
<https://www2018.thewebconf.org/call-for-papers/misinformation-cfp/>
Hi all,
==Question==
Do you know of a dataset we can use as ground truth for aligning
sections of one article in two languages? I'm thinking a tool such as
Content Translation may capture this data somewhere, or there may be
some other community initiative that has matched a subset of the
sections between two versions of one article in two languages. Any
insights/directions are appreciated. :) I'm not going to worry about
which language pairs we have this dataset for right now; the first
question is: do we have anything? :)
==Context==
As part of the research we are doing to build recommendation systems
that can recommend sections (or templates) for already existing
Wikipedia articles, we are looking at the problem of section alignment
between languages, i.e., given two languages x and y and two versions
of article a in these two languages, can an algorithm (with relatively
high accuracy) tell us which section in the article in language x
corresponds to which section in the article in language y?
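For concreteness, here is a rough, hypothetical sketch of the kind of alignment output we have in mind: a language-agnostic baseline that compares sections by the overlap of the language-independent items (e.g. Wikidata IDs) they link to. The function, threshold, and toy data below are invented for illustration only and are not a proposed solution.

def jaccard(a, b):
    """Jaccard overlap between two sets of linked item IDs."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def align_sections(sections_x, sections_y, threshold=0.1):
    """Greedy alignment: for each section in language x, pick the section in
    language y whose set of linked items overlaps most, if above threshold."""
    alignment = []
    for sx, links_x in sections_x.items():
        best_sy, best_score = None, 0.0
        for sy, links_y in sections_y.items():
            score = jaccard(links_x, links_y)
            if score > best_score:
                best_sy, best_score = sy, score
        if best_sy is not None and best_score >= threshold:
            alignment.append((sx, best_sy, best_score))
    return alignment

# Toy data: section title -> set of linked item IDs (invented for illustration).
en = {"History": {"Q12544", "Q2277", "Q1747689"}, "Geography": {"Q46", "Q4628"}}
es = {"Historia": {"Q12544", "Q2277"}, "Geografía": {"Q46", "Q183"}}
print(align_sections(en, es))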
Thanks,
Leila
--
Leila Zia
Senior Research Scientist
Wikimedia Foundation
Hi Leila,
Off the top of my head, I can only think of this paper, which I read a while
ago:
https://eprints.soton.ac.uk/403386/1/tweb_gottschalk_demidova_multiwiki.pdf
I assume what needs to be considered is the (lack of) content overlap of
articles in different languages in general, as in, for example,
http://dl.acm.org/citation.cfm?id=1753370, which also compares different
language Wikipedias, but more in the sense of completeness.
Sounds like interesting work, looking forward to seeing what you come up
with!
All the best,
Lucie
On 30 August 2017 at 00:13, Leila Zia <leila(a)wikimedia.org> wrote:
> Hi Scott,
>
>
> On Mon, Aug 28, 2017 at 2:01 AM, Scott Hale <computermacgyver(a)gmail.com>
> wrote:
> > Dear Leila,
> >
> > ==Question==
> >> Do you know of a dataset we can use as ground truth for aligning
> >> sections of one article in two languages?
> >>
> >
> > This question is super interesting to me. I am not aware of any ground
> > truth data, but could imagine trying to build some from
> > [[Template:Translated_page]]. At least on enwiki it has a "section"
> > parameter that is to be set:
>
> nice! :) Thanks for sharing it. It is definitely worth looking into.
> I did some searching across a few languages, and its usage is
> limited (in es, around 600 uses, for example); once you start slicing and
> dicing it, the labels become too few. But still, we may be able to use
> it now or in the future.
>
> >> ==Context==
> >> As part of the research we are doing to build recommendation systems
> >> that can recommend sections (or templates) for already existing
> >> Wikipedia articles, we are looking at the problem of section alignment
> >> between languages, i.e., given two languages x and y and two versions
> >> of article a in these two languages, can an algorithm (with relatively
> >> high accuracy) tell us which section in the article in language x
> >> corresponds to which section in the article in language y?
> >>
> >
> >
> > While I am not aware of research on Wikipedia section alignment per se,
> > there is a good amount of work on sentence alignment and building
> > parallel/bilingual corpora that seems relevant to this [1-4]. I can
> > imagine an approach that would look for near matches across two Wikipedia
> > articles in different languages and then examine the distribution of these
> > sentences within sections to see if one or more sections looked to be
> > omitted. One challenge is the sub-article problem [5], with which of course
> > you are already familiar. I wonder whether computing the overlap in article
> > links a la Omnipedia [6] and then examining the distribution of these
> > between sections would work and be much less computationally intensive. I
> > fear, however, that this could over-identify sections further down an
> > article as missing, given (I believe) that article links are often
> > concentrated towards the beginning of an article.
>
> exactly.
>
> a side note: we are trying to stay away, as much as possible, from
> research/results that rely on NLP techniques, as the introduction of
> NLP will usually translate relatively quickly to limitations on what
> languages our methodologies can scale to.
>
> Thanks, again! :)
>
> Leila
>
> >
> > [1] Learning Joint Multilingual Sentence Representations with Neural
> > Machine Translation. 2017
> > https://arxiv.org/abs/1704.04154
> >
> > [2] Fast and Accurate Sentence Alignment of Bilingual Corpora. 2002.
> > https://www.microsoft.com/en-us/research/publication/fast-and-accurate-sentence-alignment-of-bilingual-corpora/
> >
> > [3] Large scale parallel document mining for machine translation. 2010.
> > http://www.aclweb.org/anthology/C/C10/C10-1124.pdf
> >
> > [4] Building Bilingual Parallel Corpora Based on Wikipedia. 2010.
> > http://www.academia.edu/download/39073036/building_bilingual_parallel_corpora.pdf
> >
> > [5] Problematizing and Addressing the Article-as-Concept Assumption in
> > Wikipedia. 2017
> > http://www.brenthecht.com/publications/cscw17_subarticles.pdf
> >
> > [6] Omnipedia: Bridging the Wikipedia Language Gap. 2012.
> > http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf
> >
> > Best wishes,
> > Scott
> >
> > --
> > Dr Scott Hale
> > Senior Data Scientist
> > Oxford Internet Institute, University of Oxford
> > Turing Fellow, Alan Turing Institute
> > http://www.scotthale.net/
> > scott.hale(a)oii.ox.ac.uk
Dear Sir or Madam,
Thank you for your support of my AICCSA paper. It is an honour for me. I have completed nearly all of the required revisions. However, its English has not yet been proofread. Could you please check and improve the language quality of my AICCSA paper? It is currently available at https://1drv.ms/w/s!AiC69hcGxSVPl1UI1SV81mkr2uVu. As for the grant to attend AICCSA 2017, you can still endorse it at https://meta.wikimedia.org/wiki/Grants:Project/Rapid/Csisc/Presenting_Wikid….
Yours Sincerely,
Houcemeddine Turki
AICCSA-Copie-_2_ 2.docx<https://1drv.ms/w/s!AiC69hcGxSVPl1UI1SV81mkr2uVu>
Shared via OneDrive
________________________________
From: ANLP2017 <anlp2017(a)easychair.org>
Sent: Wednesday, 16 August 2017, 00:53
To: Houcemeddine Turki
Subject: ANLP2017 notification for paper 3
Dear Dr. Houcemeddine Turki,
Congratulations! On behalf of the ANLP 2017 workshop and the Conference Committees of the 14th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2017), October 30th to November 3rd, 2017,
we are happy to inform you that your paper entitled:
Using WikiData to create a multi-lingual multi-dialectal dictionary for Arabic dialects
has been accepted for presentation and inclusion in the Proceedings of AICCSA - ANLP 2017, published by IEEE.
Please see the reviewers’ comments below on your paper. These comments are intended to help you improve your paper for final publication. The listed comments should be addressed, as final acceptance is conditional upon an appropriate response to the requirements and comments. The conference committee retains a list of certain critical comments to be addressed by authors, and will check that these have been addressed in the camera-ready version.
What is next:
-------------
The AICCSA website has now been updated with the required information. Please find below the details for the camera-ready submission and the registration.
Camera ready submission:
Due date: 31/8/2017
Submission information can be found at the following link: http://www.aiccsa.net/AICCSA2017/submission
Paper Registration:
Due date: 8/9/2017
The registration information can be found at the following link: http://www.aiccsa.net/AICCSA2017/registration
We are looking forward to meeting you at AICCSA 2017.
Best Regards,
AICCSA - ANLP 2017 Organization Team.
==============================
----------------------- REVIEW 1 ---------------------
PAPER: 3
TITLE: Using WikiData to create a multi-lingual multi-dialectal dictionary for Arabic dialects
AUTHORS: Houcemeddine Turki, Denny Vrandečić, Helmi Hamdi and Imed Adel
Overall evaluation: 1 (weak accept)
----------- Overall evaluation -----------
This is interesting work with valid assumptions, and its proposed ideas are in line with the expectations of the event. Overall, I think this paper is interesting and makes a good contribution to this topic. However, the authors are advised to address the following points in their revised version:
- Please elaborate in detail on the proposed approach, focusing more on the relations between its components, as they are the core of the solution and need more justification of why they are used.
- The overall technical exposition must be strengthened with more concrete examples.
- The authors are urged to summarize and list the key observations from the paper.
- The paper is readable, but a language review by a native speaker is recommended.
- There are some minor editorial issues, such as the quality of the plots, the equations, etc.
- Many references have *incomplete* bibliographic information (a missing publication venue, for instance). This must be corrected.
In summary, it is a well-prepared paper.
----------------------- REVIEW 2 ---------------------
PAPER: 3
TITLE: Using WikiData to create a multi-lingual multi-dialectal dictionary for Arabic dialects
AUTHORS: Houcemeddine Turki, Denny Vrandečić, Helmi Hamdi and Imed Adel
Overall evaluation: 0 (borderline paper)
----------- Overall evaluation -----------
The paper's technical content is very marginal, the paper has many language and editorial issues, and better presentation of the results and figure quality are needed. I think the paper is not ready for publication yet.
Other issues to be considered too:
- The paper lacks clarity in motivating the proposed research and in stating its expected outcome.
- The review of the state of the art lacks an analysis of the existing work and the positioning of the research within the state of the art.
- The methodology is too general and does not convincingly show the feasibility of the proposed approach.
- The research lacks a concrete illustration on a case study.
Recommendations to the authors:
- To clearly state the objective of the research in terms of the problems to address and the expected results, and to show how the proposed research will advance the state of the art by overcoming the limitations of the existing work.
- To present an analysis of the state of the art and discuss the benefits/limitations of the existing approaches with respect to the addressed research problem.
- To be more precise in the description of the methodology and show how the methodology would achieve the stated objectives.
- To discuss the future plans with respect to the research's state of progress and its limitations.
There are serious editorial issues in the paper that need to be fixed, and many of the figures are fuzzy and need to be reconsidered.