Analytics July 2013

analytics@lists.wikimedia.org

38 participants
38 discussions

Re: [Analytics] Visualizing Indic Wikipedia projects.
by sumandro 13 Mar '14


Erik,

Thanks a lot for the appreciation.

As Sajjad mentioned, we have already obtained an edit-per-location dataset from Evan (Rosen) that has the following column structure:

*language,country,city,start,end,fraction,ts*

*start* and *end* denote the beginning and ending dates for counting the number of edits, and *ts* is the time stamp. The *fraction*, however, gives a national ratio of edit activity; that is, it gives 'total edits from that city for that language Wikipedia project' divided by 'total edits from that country for that language Wikipedia project'. Hence, it cannot be used to understand global edit contributions to a Wikipedia project (for a time period).

It seems that the original data (from which this dataset is extracted) should also have the global fractions: total edits from a city divided by total edits from the whole world, for a project, for a time period.

Would you know if the global fractions can also be derived from the XML dumps? Or, even better, is the relevant raw data available in CSV form somewhere else?

Bests,
sumandro
-------------
sumandro
ajantriks.net

On Wednesday 15 May 2013 12:32 AM, analytics-request(a)lists.wikimedia.org wrote:
> Date: Tue, 14 May 2013 19:40:00 +0200
> From: "Erik Zachte" <ezachte(a)wikimedia.org>
> Subject: Re: [Analytics] Visualizing Indic Wikipedia projects.
>
> Awesome work! I like the flexibility of the charts, easy to switch metrics and presentation mode.
>
> 1. WMF has never captured ip->geo data on city level, but afaik this is going to change with Kraken.
>
> 2. Total edits per article per year can be derived from the xml dumps. I may have some csv data that come in handy.
>
> For edit wars you need to track reverts on a per-article basis, right? That can also be derived from dumps.
>
> For long history you need full archive dumps and need to calc checksum per revision text. (stub dumps have checksum but only for last year or two)
>
> Erik Zachte
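To make the difference between the national and the global fraction concrete, here is a minimal sketch. It assumes raw per-city edit counts are available, which the dataset described above does not provide (it carries only the national fraction). The sample counts, place names, and the `fractions` helper are hypothetical and for illustration only.

```python
from collections import defaultdict

# Hypothetical raw edit counts keyed by (language, country, city).
# These numbers are made up; they are not taken from any real dataset.
edits = {
    ("bn", "India", "Kolkata"): 60,
    ("bn", "India", "Delhi"): 20,
    ("bn", "Bangladesh", "Dhaka"): 120,
}


def fractions(edit_counts):
    """Return the national and global share of edits for every city."""
    country_totals = defaultdict(int)   # (language, country) -> edits
    language_totals = defaultdict(int)  # language -> edits worldwide
    for (lang, country, _city), n in edit_counts.items():
        country_totals[(lang, country)] += n
        language_totals[lang] += n

    shares = {}
    for (lang, country, city), n in edit_counts.items():
        shares[(lang, country, city)] = {
            # city edits / country edits: what the *fraction* column holds
            "national": n / country_totals[(lang, country)],
            # city edits / worldwide edits: the ratio asked about above
            "global": n / language_totals[lang],
        }
    return shares


shares = fractions(edits)
for key, value in sorted(shares.items()):
    print(key, value)
```

In this toy data Dhaka has a national fraction of 1.0 but a global fraction of 0.6, which illustrates why the national ratio alone cannot be turned into a global one without the per-country totals.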

8 10

the use of the templates: comparison between different wikipedias
by Yury Katkov 11 Mar '14


Hi everyone!

Has anyone tried to observe how different Wikipedias use templates: how often, what the average depth of template calls is, etc.?

-----
Yury Katkov, WikiVote

5 7

foundationwiki pageviews underreporting
by Federico Leva (Nemo) 31 Oct '13


Henrik updated the top view charts, and a few days ago foundationwiki was added to webstatscollector. http://stats.grok.se/www.f/top shows:

Most viewed articles in 201304

Rank  Article         Page views
1     Trang chủ       912
2     Portada galega  324
3     Home            182
4     Local chapters  172
etc.

This seems highly unlikely. Is the problem known?

Nemo

3 3

Statistics on gadget & bot usage on all wikis
by Sumana Harihareswara 30 Sep '13


Summary: we have some new stats regarding gadget usage across WMF sites, but I'd like more analysis of gadget & bot usage.

Oliver Keyes has some code and results up at https://github.com/Ironholds/MetaAnalysis/tree/master/GadgetUsage to analyze "data around gadgets being used on various wikimedia projects":

"GadgetUsage.r is the generation script. It is dependent on (a) access to the analytics slaves and (b) the list of databases

"gadget_data.tsv is the raw data, consisting of an aggregate number of users for each preference on each wiki, with preference, wiki and wiki type (source, wiki, versity, etc) defined.

"gadgets_by_wikis.tsv is a rework of the data to look at what gadgets are used on multiple wikis, and how many wikis that is. It also includes an aggregate of the number of users across those wikis using the gadget.

"wikis_by_gadgets.tsv is a rework that looks at the number of distinct gadgets on each individual wiki. Unsurprisingly there's a power law."

This helps a lot with addressing one of the analytics "dreams" from https://www.mediawiki.org/wiki/Analytics/Dreams - "What proportion of logged-in editors have activated any gadgets at all? What are the most popular gadgets?"

However, Oliver's data "is based on preference data - it may or may not include data for those gadgets set as defaults." So if someone could improve this to ensure that we appropriately count gadget usage for gadgets that default to on, that would be very helpful.

My team would also like to know:

* who maintains the most popular gadgets? (so we can invite them to hackathons, help get them training, get those gadgets localised and ported to other wikis, and so on)
* when were the gadgets last updated? (so we can identify stale ones that enthusiastic volunteers could take over maintaining)
* similar stats regarding bot usage -- what bots are making the most edits, or edits that in aggregate change the most bytes? who owns those bots? what wikis are they active on? (so we can help maintainers better, ensure they hear about API breaking changes, etc., and develop a bot inventory/directory to make it easier for other wikis' users to start using useful bots)

If there's anyone interested in taking this on, either inside or outside WMF's Analytics team, that would be great. Otherwise I anticipate that the Engineering Community Team will take it on sometime in the October-December 2013 period.

--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation
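The two reworked files described above can be thought of as two aggregations over the raw per-wiki preference counts. The following is a rough sketch of that idea only, not Oliver's R script: the column names (`preference`, `wiki`, `wiki_type`, `users`), the sample rows, and the `summarize` helper are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the shape described for gadget_data.tsv:
# one row per (preference, wiki) with an aggregate user count.
# Column names and values are assumed, not taken from the real file.
SAMPLE_TSV = (
    "preference\twiki\twiki_type\tusers\n"
    "gadget-HotCat\tenwiki\twiki\t500\n"
    "gadget-HotCat\tcommonswiki\tcommons\t300\n"
    "gadget-popups\tenwiki\twiki\t200\n"
)


def summarize(tsv_text):
    """Build the two reworked views from the raw per-wiki rows."""
    wikis_per_gadget = defaultdict(set)
    users_per_gadget = defaultdict(int)
    gadgets_per_wiki = defaultdict(set)

    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        gadget, wiki = row["preference"], row["wiki"]
        wikis_per_gadget[gadget].add(wiki)
        users_per_gadget[gadget] += int(row["users"])
        gadgets_per_wiki[wiki].add(gadget)

    # gadget -> (number of wikis using it, total users across those wikis)
    gadgets_by_wikis = {
        g: (len(wikis), users_per_gadget[g])
        for g, wikis in wikis_per_gadget.items()
    }
    # wiki -> number of distinct gadgets in use there
    wikis_by_gadgets = {w: len(gs) for w, gs in gadgets_per_wiki.items()}
    return gadgets_by_wikis, wikis_by_gadgets


gadgets_by_wikis, wikis_by_gadgets = summarize(SAMPLE_TSV)
print(gadgets_by_wikis)
print(wikis_by_gadgets)
```

Because the input is preference data, a gadget that is enabled by default would only show up here if a row for it exists, which is the gap the message asks someone to close.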

6 7

Re: [Analytics] [wmfresearch] stat1 -> stat1002 migration of private data
by Andrew Otto 21 Aug '13


Hello again!

Ok, we're actually going to do this this time. As far as we know, people who need access to private webrequest data have migrated their stuff over to stat1002.eqiad.wmnet. The private webrequest data that currently exists on stat1 will soon be deleted. Soon is August 7th. That's in 1 week. We announced this back in May, so there should have been plenty of notice.

If you are still using the webrequest logs in /a/squid/archive on stat1, find me on IRC (ottomata) or email me and we can work together to make sure you can continue to do your work on stat1002.

On Wednesday August 7th, we will be removing private webrequest logs from stat1.

Thanks all!
-Andrew Otto

On May 20, 2013, at 2:13 PM, Andrew Otto <otto(a)wikimedia.org> wrote:
>> "Before that happens, you should make sure that any personal stuff on stat1 that you need for number crunching is copied over to stat1002."
>
> from your note it looks like this is only related to webrequest data, is that correct?
>
> Yup! That is correct. stat1002 will be primarily used as a sensitive private data host. Only those users that have personal unpuppetized code and cronjobs that use this data need to worry about moving them from stat1 to stat1002.
>
> what are the criteria for deciding who has access to stat1002? I see that contractors like Aaron Halfaker or Jonathan Morgan currently don't have access to it.
>
> The criteria will be the same as before: RT request + manager approval. However, the request should only be made if the user actually needs access to the webrequest logs to do analysis. For example, if the main reason someone already has access to stat1 is so that they can access the research slave databases, then they won't need access to stat1002.
>
> can you give us more information on the long-term plans/scope of stat1 vs stat1002 (and update https://office.wikimedia.org/wiki/Data_access as needed)?
>
> I've added a small bit about stat1002 on that page.
>
> I don't know much about a long term plan for stat1. It is hosted at the Tampa datacenter, and in the long term (yearish?) all the machines there will have to be decommissioned or relocated elsewhere. When it finally does move, it will most likely no longer have a public IP. stat1 is intended to be used as a workspace for analysts to do their thing on non-private data.
>
> -Ao

3 5

zero_carrier vs. zero_country
by Christian Aistleitner 19 Aug '13


Hi,

when doing some basic sanity checks between the output of the existing zero_country and zero_carrier Pig scripts, it seems that the sum of the number of requests in the output of zero_country per day is ~40k larger than for zero_carrier.

First, I've been told that the sums of the number of requests have to match. Afterwards, I've been told that this is ok, as zero_country should hold all of the mobile requests from a country, and zero_carrier is a drill-down on the specific carriers.

When reading the Pig scripts/Java code, it is obvious that the first explanation does not match the code. The scripts take completely different paths through our code base and count completely different things :-(

However, the latter explanation does not make much sense to me either, as it's hard to believe that the requests from our zero partners make up >90% of each country's mobile requests. Besides, this explanation would not match how we generate the raw log files.

Whom could I ask about what the desired semantics of zero_{carrier,country} are?

Best regards,
Christian

--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a
4040 Linz, Austria
Email: christian(a)quelltextlich.at
Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
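The kind of sanity check described in this message can be sketched as follows. This is only an illustration: the row layout `(day, key, requests)`, the sample figures, and the helper names are assumptions, and the real Pig outputs may be shaped differently.

```python
from collections import defaultdict

# Hypothetical rows in the shape (day, key, requests). The figures are
# invented so that the daily gap is 40k, echoing the message above.
zero_country_rows = [
    ("2013-08-01", "ME", 100_000),
    ("2013-08-01", "IN", 350_000),
]
zero_carrier_rows = [
    ("2013-08-01", "carrier-a", 90_000),
    ("2013-08-01", "carrier-b", 320_000),
]


def daily_totals(rows):
    """Sum the request counts per day."""
    totals = defaultdict(int)
    for day, _key, requests in rows:
        totals[day] += requests
    return dict(totals)


def daily_gap(country_rows, carrier_rows):
    """Per-day difference: country total minus carrier total."""
    country = daily_totals(country_rows)
    carrier = daily_totals(carrier_rows)
    days = sorted(set(country) | set(carrier))
    return {d: country.get(d, 0) - carrier.get(d, 0) for d in days}


gap = daily_gap(zero_country_rows, zero_carrier_rows)
carrier_share = {
    day: daily_totals(zero_carrier_rows)[day] / total
    for day, total in daily_totals(zero_country_rows).items()
}
print(gap)            # difference in requests per day
print(carrier_share)  # carrier total as a share of the country total
```

With these made-up figures the carrier total is about 91% of the country total, which is the kind of ratio the message finds hard to reconcile with the "drill-down" explanation.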

4 11

Chatting about the Wikimedia tech community metrics in Hong Kong
by Jesus M. Gonzalez-Barahona 05 Aug '13


Hi all,

I'm one of the people working on the Wikimedia tech community metrics [1] that Quim announced some days ago. I'm attending WikiSym+OpenSym in Hong Kong from Aug 4 to Aug 6, the days just before Wikimania. If any of you happen to be there and want to chat about software development metrics, how they are being implemented for Wikimedia, or just have a coffee while talking about counting commits and estimating time-to-fix, please drop me a line.

Saludos,
Jesus.

[1] http://korma.wmflabs.org/ (still pre-beta)

--
Bitergia: http://bitergia.com
http://blog.bitergia.com

2 2

Updates in Community Metrics: Databases for analytics
by Alvaro del Castillo 31 Jul '13


Hi guys,

All SQL databases used for data analytics are published, updated daily, in the Community Metrics portal:

http://korma.wmflabs.org/browser/data/db/

You can use these databases to do your own analytics with the data. Do you need formats other than SQL to make the data more comfortable to use?

Cheers
--
|\_____/| Alvaro del Castillo
[o] [o]   acs(a)bitergia.com - CTO, Software Engineer
|  V  |   http://www.bitergia.com
|     |
-ooo-ooo-

1 0

Fwd: [PRESS] How can find out where stocks go? Search them Wikipedia (Hebrew)
by Itzik Edri 30 Jul '13


Cross-posting; this may also be relevant here.

---------- Forwarded message ----------
From: Itzik Edri <itzik(a)infra.co.il>
Date: 2013/7/29
Subject: [PRESS] How can find out where stocks go? Search them Wikipedia (Hebrew)
To: Communications Committee <wmfcc-l(a)lists.wikimedia.org>

Translation of the article summary: Physicist Dr. Dror Kenett and his colleagues have applied research methods of physicists and mathematicians to the capital markets and revealed surprising findings. Among other things, they showed that interest in financial articles on Wikipedia may indicate the direction in which the stock market is going.

http://www.themarker.com/markets/marketmoney/1.2082609

[Full article, translated from the Hebrew:]

*Research // How can you find out where the stock markets are headed? Search Wikipedia*

Physicist Dr. Dror Kenett and his research partners applied research methods of physicists and mathematicians to the capital markets and discovered surprising findings; among other things, they found that strong interest in financial terms on Wikipedia can indicate the direction in which the market is moving

09:55 28.07.2013
By: Efrat Neuman

Where are the financial markets headed? Will the upward trend on Wall Street continue? Has the time come to shift money to Europe? What is expected in the emerging markets? How should a global investment portfolio be allocated? Whoever can answer these questions correctly and invest accordingly could get rich quickly. But no one has certain answers. That is why battalions of investors and analysts sit every day poring over economic analyses and forecasts. They weigh all the information they produce against their personal risk preference, and in this way build their own investment portfolio or that of their clients.

There is indeed no unequivocal answer to these questions, but some people are still trying to find one, and not all of them are economists. These days a group of physicists at Boston University is working on studies that may shed more light on the question of where the markets are going. The group, led by Prof. Eugene Stanley, examines the connection between social media and economic events. One of the group's members, the Israeli Dr. Dror Kenett, published a paper in May in the journal Nature with colleagues from the group. The paper discusses a study that examined whether views of Wikipedia pages dealing with financial terms can predict changes in the Dow Jones index. "Wikipedia provides access to the number of views per term," Kenett says. "We examined the views of terms at weekly and daily resolution and showed that there are certain terms, for example debt, crisis or bubble, that can give a good prediction of a change in the index."

In parallel, similar work on Google was done by another member of the group and published in April, and Kenett was involved in that as well. "We reached the same conclusion in the work on Google, and now work is being done on Twitter, examining whether tweets on financial topics can predict the behavior of the markets."

What does a physicist like Kenett have to do with research on financial matters? This field, which connects economics and physics, is called "econophysics", and Kenett says he hopes to advance it in Israel. "As in physics, in finance-oriented research too the goal is to look for general and simple laws that lie within the complex system," he explains. "Physicists bring a new point of view and new methods to the financial world. They are interested in large systems and in the interaction between the components of those systems. In the case of financial markets, this is also a large system, whose components are people."

"Developing tools for creating stability"

Kenett began working in econophysics as part of his master's degree in physics, and continued in this field in his doctoral work under the supervision of Professor Eshel Ben-Jacob, in the laboratory for complex biological systems at the School of Physics and Astronomy of Tel Aviv University. About a year ago Kenett published a paper in the scientific journal PLoS ONE, together with Prof. Ben-Jacob and German researchers, following a study that examined the influence and dependence between capital markets in different countries. The researchers examined data on 1,124 stocks between 2000 and 2010 in six capital markets: Britain, Germany, the US, China, Japan and India.

"The idea in this project was to check and understand how the markets influence one another. The methodology used correlation analysis. Previous studies showed that high correlation between stocks usually occurs during sharp declines in the stock exchange: the stronger the declines, the stronger the correlation, because everyone flees together. We checked whether correlated events, such as a rise in stocks in one market, also spill over to another market and cause a rise there. The aim of the study was to examine the spillover of crises from one market to another.

"From the investors' point of view, this gives an indication of what might happen. If something happens in Greece, how can it affect the market I am in? It is also important from the regulator's side. True, it can also be used for investment purposes, but our main goal was to develop a tool for creating financial stability, and that is what we are aiming for in the follow-up research."

The analyses in the study were complex. The researchers looked at about 200 stocks in each market, and calculated the correlations between the stocks and the way they change over time. "Already at this stage you see interesting things, for example that there are periods in which the correlations strengthen; there are bursts of correlations. We checked whether there is similarity between the changes in correlations in market A and in market B, whether a rise in the correlation in market A also predicts that the correlation in market B is going to rise or fall.

"In general it can be said that the study found that there is influence between markets, and now we are extending the research to additional markets and to shorter time resolutions, which will make it possible to track the flow of economic information and economic influence between the world's markets," Kenett explains. "The idea is to describe how interaction between the system's components, for example between stocks, brings about a change in the overall behavior (that is, of the local market as a whole), which drives the behavior of individuals in another market. In other words, the intention is to examine how the individual affects the whole, which in turn affects the individual, and to examine the feedback between them.

"We developed a methodology that describes this. The conclusions were that this methodology can be used to identify links between markets through the correlations, and beyond that, to examine how one market will affect another. For example, we saw in the patterns that Japan is more similar to Western markets such as the US, and less similar to China and India. We saw evidence that there are times when correlations between countries in the same direction can be identified. Beyond that, the results of the initial study show that it is possible to identify times when correlations in one market will predict correlations in another market, and this is highly significant information."

Stock and index: who pulls whom?

A newer work, recently published in the journal Nature, examined the relationship between an index and the stocks that compose it. The work was done by Kenett as part of a research fellowship program of the Israel Securities Authority, under the supervision of Dr. Gitit Gur-Gershgoren, the authority's chief economist and head of its economic research department, and in collaboration with Prof. Ben-Jacob and Prof. Stanley of Boston University.

Kenett explains that the motivation for the study is earlier work showing that at daily resolution the index influences, and in a certain sense pulls, the stocks that compose it. The stocks are the ones that tend to follow changes in the index, and not the other way around. On the face of it, this may sound strange. The index is composed of stocks, so when the stocks rise, the index rises. But because the index is traded in its own right and everyone sees it, follows it and invests in products that track it, this draws a response from investors. At the resolution of a single day, the index is the one that pulls the stocks.

The research question of Kenett and his partners was whether this also holds at short time resolutions, for example of a few seconds. In the period for which the data were examined, 2006-2010, the value of the TA-25 index was published every 30 seconds. The researchers checked the prices of the stocks composing the index every 15 seconds and, according to their weights, built a simulated, synthetic index. They compared the synthetic index, updated every 15 seconds, with the value of TA-25, published every 30 seconds. The aim was to see who pulls whom, and the conclusion was that the stocks lead the index, contrary to what is seen at daily resolution. In the time span between index publications, traders have a window that can be exploited, a kind of arbitrage.

"The study itself is observational, and it shows that the group of traders operating at high frequency has access to information before the general trading public. When they know that the stocks lead the index, in the interval until the next publication they can carry out trades that will yield them profits," Kenett explains. "Shortly after we finished the study, in 2010, the stock exchange changed the index publication interval to 15 seconds. On the New York Stock Exchange, for example, the interval is between one second and 15 seconds, in order to prevent situations in which sophisticated investors know how to exploit the situation."
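The synthetic-index idea at the end of the article (sample the component prices every 15 seconds, combine them by weight, and compare with a value published only every 30 seconds) can be illustrated with a toy sketch. The article does not give the actual index formula, so a plain weighted sum is assumed here; the tickers, weights, and prices are hypothetical.

```python
# Toy illustration only: a plain weighted sum is an assumption, and the
# tickers, weights and prices below are invented for the example.
weights = {"AAA": 0.6, "BBB": 0.4}

# Hypothetical component prices sampled every 15 seconds.
snapshots = [
    {"AAA": 100.0, "BBB": 50.0},  # t = 0 s
    {"AAA": 101.0, "BBB": 50.0},  # t = 15 s
    {"AAA": 101.0, "BBB": 52.0},  # t = 30 s
]


def synthetic_index(prices, component_weights):
    """Weighted sum of the component prices at one sampling instant."""
    return sum(w * prices[name] for name, w in component_weights.items())


# Synthetic series at 15-second resolution.
synthetic = [synthetic_index(p, weights) for p in snapshots]

# A value published every 30 seconds sees only every second sample.
published = synthetic[::2]

print(synthetic)
print(published)
```

In this toy series the move at t = 15 s is visible in the 15-second synthetic index one sample before the next 30-second publication, which is the kind of window the article describes.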

3 2

Which list of zero carriers is current?
by Christian Aistleitner 30 Jul '13


Hello,

during discussion of zero, I've been told that [1] should hold the current information about zero carriers. However, that list does not match what we use in Kraken [2].

Does an authoritative, up-to-date, known-good list exist somewhere? If not, whom could I ask about mismatches?

Best regards,
Christian

[1] https://wikimediafoundation.org/wiki/Mobile_partnerships#Where_is_Wikipedia…

[2] For example, Kraken lists that only “en” and “ru” are free for Montenegro's Telenor, while [1] claims all languages are free.

--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a
4040 Linz, Austria
Email: christian(a)quelltextlich.at
Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------

2 4
