Hi all!
We’ve recently made Spark 2.1 available in the Analytics Hadoop cluster.
It is installed on stat1004 and stat1005 alongside Spark 1.6. To use Spark
2, you should access it via the spark2* (and pyspark2) executables, rather
than the usual spark-shell, spark-submit, etc.
I’ve added a little bit of documentation
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark> about
this on wikitech.
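For example, a minimal PySpark 2 job submitted via the Spark 2 launcher
(shown here as spark2-submit; adjust to whatever the spark2* executable is
actually called on stat1004/stat1005, and the script/app names are just
illustrative) looks something like this:

    # Submit with the Spark 2 launcher instead of the usual spark-submit, e.g.:
    #   spark2-submit --master yarn smoke_test.py
    from pyspark.sql import SparkSession

    # In Spark 2.x, SparkSession is the single entry point
    # (replacing the separate SparkContext/SQLContext setup from 1.6).
    spark = SparkSession.builder.appName("spark2-smoke-test").getOrCreate()

    # Trivial sanity check that the session is up and running.
    df = spark.range(1000)
    print("row count:", df.count())
    print("spark version:", spark.version)

    spark.stop()

The same kind of thing works interactively via pyspark2.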
We’d like to deploy Spark 2.2, but we first need to upgrade Hadoop to use
Java 8 rather than Java 7. Hopefully this will happen in early 2018.
analytics/refinery/source
<https://github.com/wikimedia/analytics-refinery-source> still uses Spark
1, but we’d also like to update jobs and dependencies there to use Spark 2
soon.
Anyway, let me know if there are any questions. Enjoy!
- Andrew Otto
Systems Engineer, WMF
Hello,
The slide deck from today's quarterly metrics presentation of the Wikimedia
Foundation's Readers team (which is an appendix to the main quarterly
check-in presentation) has been published. [1] [2]
This deck gives an overview of the core metrics regarding readership of
Wikimedia sites, including data about search, maps, and the Wikipedia
portal from the Discovery team.
[1]
https://commons.wikimedia.org/wiki/File%3AWikimedia_Foundation_Readers_metr…
[2]
https://commons.wikimedia.org/wiki/File%3AAudiences_2_check-in_Q1_October_2…
--
deb tankersley
Product Manager, Discovery
Wikimedia Foundation
Hi,
When using the Wikipedia dump files, I am unable to find many categories
and pages in the dump.
E.g. under the Areas_of_computer_science category I get only 13
subcategories and 2 pages instead of 17 subcategories and 2 pages.
Furthermore, one page, "Computational_creativity", is not present as a
member of the category.
I am using the following Wikipedia dump files to extract the categorylinks
and page details:
1.6G Sep 21 00:45 enwiki-20170920-page.sql.gz
21M Sep 21 00:45 enwiki-20170920-category.sql.gz
113M Sep 21 00:55 enwiki-20170920-redirect.sql.gz
2.2G Sep 21 03:10 enwiki-20170920-categorylinks.sql.gz
221M Sep 21 03:13 enwiki-20170920-page_props.sql.gz
I use https://github.com/napsternxg/WikiUtils to parse the sql.gz dump
files, but I also tried searching the sql.gz files directly and couldn't
find any entry for 16300571 in either the page.sql.gz or the
category.sql.gz file. 16300571 supposedly refers to the
Computational_creativity page, as the following categories are linked to
this page:
16300571 'All_NPOV_disputes' 'page'
16300571 'All_articles_needing_additional_references' 'page'
16300571 'All_articles_with_dead_external_links' 'page'
16300571 'All_articles_with_unsourced_statements' 'page'
16300571 'Areas_of_computer_science' 'page'
16300571 'Articles_needing_additional_references_from_May_2013' 'page'
16300571 'Articles_with_French-language_external_links' 'page'
16300571 'Articles_with_dead_external_links_from_November_2016' 'page'
16300571 'Articles_with_permanently_dead_external_links' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_April_2016' 'page'
16300571 'Articles_with_unsourced_statements_from_December_2015' 'page'
16300571 'Articles_with_unsourced_statements_from_January_2010' 'page'
16300571 'Articles_with_unsourced_statements_from_October_2016' 'page'
16300571 'Artificial_intelligence' 'page'
16300571 'Arts' 'page'
16300571 'CS1_maint:_Extra_text:_authors_list' 'page'
16300571 'Cognitive_psychology' 'page'
16300571 'Computational_fields_of_study' 'page'
16300571 'Creativity_techniques' 'page'
16300571 'NPOV_disputes_from_January_2013' 'page'
16300571 'Philosophical_movements' 'page'
16300571 'Webarchive_template_wayback_links' 'page'
16300571 'Wikipedia_articles_needing_clarification_from_November_2008' 'page'
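For reference, the kind of search I mean is essentially the sketch below
(just a string match over the gzipped SQL, not a real parser, so a hit or
miss is only a hint; the filename is the page dump listed above):

    import gzip

    # Look for an INSERT tuple whose first column (page_id) is 16300571,
    # i.e. a tuple that starts with "(16300571,". This is a rough check:
    # it scans the raw SQL text rather than parsing it.
    needle = b"(16300571,"
    found = False
    with gzip.open("enwiki-20170920-page.sql.gz", "rb") as f:
        for line in f:
            if needle in line:
                found = True
                break
    print("page id 16300571 present in page dump:", found)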
More details can be found at:
https://twitter.com/TheShubhanshu/status/925736635572072449
Is there something I am doing wrong, or are these rows just missing from
the dumps?
Regards,
Shubhanshu Mishra
Research Assistant,
iSchool at University of Illinois at Urbana-Champaign
--------------------------------------------------
Website: http://shubhanshu.com
LinkedIn Profile: http://www.linkedin.com/in/shubhanshumishra
Blog <http://shubhanshu.com/blog> || Facebook
<http://www.facebook.com/shubhanshu.mishra> || Twitter
<http://www.twitter.com/TheShubhanshu> || LinkedIn
<http://www.linkedin.com/in/shubhanshumishra>
As was previously announced on the xmldatadumps-l list, the sql/xml dumps
generated twice a month will be written to an internal server, starting
with the November run. This is in part to reduce load on the web/rsync/nfs
server, which until now has also been doing this work. We also want
separation of roles for a few other reasons.
Because I want to get this right, because there are a lot of moving parts,
and because I don't want to rsync all the prefetch data over to these boxes
again next month after cancelling the move:
********
If needed, the November full run will be delayed for a few days.
If the November full run takes too long, the partial run, usually starting
on the 20th of the month, will not take place.
*********
Additionally, as described in an earlier email on the xmldatadumps-l list:
*********
files will show up on the web server/rsync server with a substantial
delay. Initially this may be a day or more. This includes index.html and
other status files.
*********
You can keep track of developments here:
https://phabricator.wikimedia.org/T178893
If you know folks who are not on the lists in the recipients field for this
email, please forward it to them and suggest that they subscribe to this list.
Thanks,
Ariel