I am interested in looking at the links between webpages on Wikipedia
for scientific research. I have been to the dumps documentation,
which suggested that the latest pages-articles dump is likely the one
people want. However, I'm unclear on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different
files, and I can't tell whether one of them contains
only link information. Is there a description of what each file contains?
(2) The enwiki-latest-pages-articles.xml file uncompresses to
31.55 GB. Is it correct that this contains the current snapshot of all
pages and articles in Wikipedia? (I only ask because the size seems surprising to me.)
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear
on the method used to denote a link. It would appear that links are
denoted by [[link]] or [[link | word]]. Such patterns would be fairly
easy to find using perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the
first to formulate ...... in his
If I must search through the pages-articles file, and if the [[ ]]
notation is overloaded, is there a description of the patterns that
are used in this file? I.e., a way for me to ensure that I'm only
grabbing links, not figure captions or some other content.
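To make question (3) concrete, here is roughly the kind of extraction I have
in mind (a rough Python sketch rather than the perl I'd actually use; the
regex and the extract_links helper below are just my own naive attempt, and
the nested File: caption above is exactly the sort of case that confuses it):

    import re

    # Naive pattern: treat [[target]] and [[target|label]] as links.
    LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

    def extract_links(wikitext):
        """Return plain link targets, skipping File:/Image:/Category: links.
        Links nested inside image captions are still picked up, which is
        exactly the ambiguity I'm asking about."""
        links = []
        for target in LINK_RE.findall(wikitext):
            target = target.strip()
            if target.lower().startswith(("file:", "image:", "category:")):
                continue
            links.append(target)
        return links

    print(extract_links("See [[William Godwin]] and [[Anarchism|anarchists]]."))
    # -> ['William Godwin', 'Anarchism']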
Thanks for your help!
As the subject says, the first mirror of our XML dumps is up, hosted at
C3SL in Brazil. We're really excited about it. Details are listed on
the main index page on our download server
( http://dumps.wikimedia.org/ ) and are reproduced below for everyone's
convenience.
Site: Centro de Computação Científica e Software Livre (C3SL), at the
Universidade Federal do Paraná in Brazil.
Contents: the 5 most current complete and successful dumps of each project.
Access: HTTP: http://wikipedia.c3sl.ufpr.br/
FTP: ftp://wikipedia.c3sl.ufpr.br/wikipedia/
rsync: rsync://wikipedia.c3sl.ufpr.br/wikipedia/
A big thank you to the folks there for providing the space and working
with us to make it happen.
Please forward this on to researchers or others who might want to know
about it but aren't on these lists.
Ariel Glenn
Software Developer / Systems Engineer
Wikimedia Foundation
On 23/10/11 23:19, Fred Zimmerman wrote:
> Good points! As the post indicates, I'm not very experienced with any
> of these tools and made a lot of dummy mistakes. Part of the point of
> the post is that even if you are pretty dumb, you can get this done if
> you persevere!
> The SQL import of the links would have been helpful; most of the
> instructions I found recommended using pages-articles.xml, which is a
> good place to start. There's no need for two copies of the big wiki.xml
> file, I would suggest just re-naming the original download to enwiki.xml
> right at the outset.
The file I mentioned to skip was enwiki.sql, which is not XML, but a
copy of the same content expressed as SQL queries.
So, I figured I'd share a commercial tool that's turned freeware recently.
I've been analyzing my own writing to figure out what it conveys, and it's
more accurate than I'd expect from a free tool with no human involved. I love data
mining and analysis software. I'm not selling this, I'm not affiliated with
them or anything, and I'm not advertising it, but if you're actually ripping
Wikipedia or working with XML and large amounts of text, most likely you'd
be interested in this. I've been running it on Wikipedia pages and other
websites as well. I've generated very, very interesting and cool
results with it. Without further ado and long-windedness,
Designed for Information Science, Market Research, Sociological Analysis and
Scientific studies, Tropes is a Natural Language Processing and Semantic
Classification software that guarantees pertinence and quality in Text Analysis.
There are French, Brazilian, Portuguese and Spanish versions. It's definitely
not a waste of your time to test it out. I'm going to be using it often, as I
write often for myself and as a freelancer. I'm also using it to proof and
edit a book I've been writing for the past month, before I publish it in
a few weeks. It's too bad there's only a Windows binary version and no
source code available.
From their landing page ( http://www.semantic-knowledge.com/ ):
Semantic-Knowledge <http://www.semantic-knowledge.com/company.htm> is a
leading provider of Natural Language Processing (NLP) software, including
Semantic Search Engine, Text Analysis, Intelligent Desktop Search, Text
Mining, Knowledge Discovery and Classification systems:
Tropes <http://www.semantic-knowledge.com/tropes.htm> - High Performance Text Analysis
Designed for Semantic Classification, Keyword Extraction, Linguistic and
Qualitative Analysis, Tropes software is a perfect tool for Information
Science, Market Research, Sociological Analysis, Scientific and Medical
studies, and more.
Zoom <http://www.semantic-knowledge.com/zoom.htm> - Semantic Search Engine
With its fast Natural Language Information Retrieval system, Integrated Web
Spider, built-in Semantic Networks and on-the-fly Semantic classifications,
Zoom is a powerful Windows Search Engine designed for Document Management,
Competitive Intelligence, Press Analysis and Text Mining.
Index <http://www.semantic-knowledge.com/indexing-components.htm> - Semantic
Data Processing for Servers
Overtext Index is a classification system designed for large-scale Customer
Relationship Management (CRM), Natural Language Third Party Information
Retrieval, Knowledge Management, Business Intelligence and Strategic Watch.
This software is designed to help you face an increasingly dense flow of
information:
- accelerate your reading rate,
- analyze in-depth and objectively,
- extract relevant information,
- classify automatically, therefore structure information.
Because they offer considerable time savings and enhanced visibility of
strategic data, Tropes <http://www.semantic-knowledge.com/tropes.htm> and
Zoom <http://www.semantic-knowledge.com/zoom.htm> software yield an
exceptional Return On Investment (ROI). They generally show a profit from
their first use, sometimes in a matter of hours! You don't believe us? Try
the free version of Tropes Zoom in our download area.
Our software is based on powerful text analysis technology, using
dictionaries that contain hundreds of thousands of preset semantic
classifications, and reliable analysis techniques resulting from years of
research. Products are available now for the English, French, Spanish,
Portuguese and Brazilian Portuguese languages.
Thomas C. Stowe
Email/GChat/MS Live Messenger: stowe.thomas(a)gmail.com
Texas Computer Services: http://www.txpcservices.com
Survive2 SHTF/Disaster Prep/Homesteading Information:
I'm trying out relative links in the index.html files. For the next few
days expect that some jobs will complete with the old absolute urls and
I'll have to fix them by hand. New jobs should "just work". I've
converted most of the older index pages for the last five dumps from
each project. Let me know if there are any problems. I'm hoping that
nobody has scripts that rely on having the full url in there.
This change was needed for mirror sites to function properly. At some
point it will be helpful for https access as well.
It doesn't really look like a mirror to me. The download links still point to download.wikimedia.org.
> From: ariel(a)wikimedia.org
> To: xmldatadumps-l(a)lists.wikimedia.org; wikitech-l(a)lists.wikimedia.org
> Date: Thu, 13 Oct 2011 21:20:12 +0300
> Subject: [Xmldatadumps-l] first mirror of most recent XML dumps, at C3SL in Brazil
> As the subject says, the first mirror of our XML dumps is up, hosted at
> C3SL in Brazil. We're really excited about it. Details are listed on
> the main index page on our download server
> ( http://dumps.wikimedia.org/ ) and are reproduced below for everyone's
> Site: Centro de Computação Científica e Software Livre (C3SL), at the
> Universidade Federal do Paraná in Brazil.
> Contents: the 5 most current complete and successful dumps of each
> Access: HTTP: http://wikipedia.c3sl.ufpr.br/
> FTP: ftp://wikipedia.c3sl.ufpr.br/wikipedia/
> rsync: rsync://wikipedia.c3sl.ufpr.br/wikipedia/
> A big thank you to the folks there for providing the space and working
> with us to make it happen.
> Please forward this on to researchers or others who might want to know
> about it but aren't on these lists.
> Ariel Glenn
> Software Developer / Systems Engineer
> Wikimedia Foundation
I've started up the smaller wikis in 4 processes. I did a few manual
inspections, but as always the eyes of the community are much sharper than
mine, at the very least because there are more of them! So please get
your eyes on these and let me know if you see anything odd. I'll likely
start several larger wiki jobs in a little while and if I hear nothing
bad by the end of the day I'll fire off en wiki as well.
I am hoping that someone here can help me. There is a lot of conflicting
advice on the mirroring Wikipedia page. I am setting up a local mirror of
the English Wikipedia pages-articles...xml and I have been stopped by repeated
failures in MySQL configuration.
I have downloaded the .gz data from the dumps and run it through mwdumper
to create an SQL file without problems, but things keep breaking down on the
way into MySQL. I have had a lot of agony with the InnoDB log files, etc. and
have learned how to make them bigger, but I'm still apparently missing something.
My target machine is a 32-bit Ubuntu AWS instance with 1.7 GB RAM, 1 core, and 200
GB of disk, which is only for this project. I can make it bigger if necessary.
Can someone take a look at this my.cnf file and tell me what I need to
change to get this to work?
This is what the my.cnf file looks like:
wait_timeout = 120
long_query_time = 1
#max_connections = 2500
#max_user_connections = 2500
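One thing I should add: I haven't included any InnoDB settings above. The
kind of lines I've been experimenting with look roughly like the following,
with values picked more or less at random on my part, so they may well be
the problem on a 1.7 GB machine:

# values below are guesses, not settings I'm confident in:
innodb_log_file_size = 256M
innodb_buffer_pool_size = 1G
max_allowed_packet = 64M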