I am interested in looking at the links between webpages on Wikipedia
for scientific research. I have been to the dumps documentation,
which suggested that the latest pages-articles dump is likely the one
people want. However, I'm unclear on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different
files, and I can't tell whether one of them contains
only link information. Is there a description of what each file contains?
(2) The enwiki-latest-pages-articles.xml file uncompresses to
31.55 GB. Is it correct that this contains the current snapshot of all
pages and articles in Wikipedia? (I only ask because the size seems surprising to me.)
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear
on the method used to denote a link. It would appear that links are
denoted by [[link]] or [[link | word]]. Such patterns would be fairly
easy to find using perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the
first to formulate ...... in his
If I must search through the pages-articles file, and if the [[ ]]
notation is overloaded, is there a description of the patterns that
are used in this file? I.e., a way for me to ensure that I'm only
grabbing links, not figure captions or some other content.
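To make question (3) concrete, here is roughly the kind of extraction I have
in mind (a rough Python sketch rather than the perl I'd actually use; the
regex and the extract_links helper below are just my own naive attempt, and
the nested File: caption above is exactly the sort of case that confuses it):

    import re

    # Naive pattern: treat [[target]] and [[target|label]] as links.
    LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

    def extract_links(wikitext):
        """Return plain link targets, skipping File:/Image:/Category: links.
        Links nested inside image captions are still picked up, which is
        exactly the ambiguity I'm asking about."""
        links = []
        for target in LINK_RE.findall(wikitext):
            target = target.strip()
            if target.lower().startswith(("file:", "image:", "category:")):
                continue
            links.append(target)
        return links

    print(extract_links("See [[William Godwin]] and [[Anarchism|anarchists]]."))
    # -> ['William Godwin', 'Anarchism']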
Thanks for your help!
As the subject says, the first mirror of our XML dumps is up, hosted at
C3SL in Brazil. We're really excited about it. Details are listed on
the main index page on our download server
( http://dumps.wikimedia.org/ ) and are reproduced below for everyone's
convenience.
Site: Centro de Computação Científica e Software Livre (C3SL), at the
Universidade Federal do Paraná in Brazil.
Contents: the 5 most current complete and successful dumps of each project.
Access: HTTP: http://wikipedia.c3sl.ufpr.br/
FTP: ftp://wikipedia.c3sl.ufpr.br/wikipedia/
rsync: rsync://wikipedia.c3sl.ufpr.br/wikipedia/
A big thank you to the folks there for providing the space and working
with us to make it happen.
Please forward this on to researchers or others who might want to know
about it but aren't on these lists.
Ariel Glenn
Software Developer / Systems Engineer
Wikimedia Foundation
On 23/10/11 23:19, Fred Zimmerman wrote:
> Good points! As the post indicates, I'm not very experienced with any
> of these tools and made a lot of dummy mistakes. Part of the point of
> the post is that even if you are pretty dumb, you can get this done if
> you persevere!
> The SQL import of the links would have been helpful; most of the
> instructions I found recommended using pages-articles.xml, which is a
> good place to start. There's no need for two copies of the big wiki.xml
> file, I would suggest just re-naming the original download to enwiki.xml
> right at the outset.
The file I mentioned to skip was enwiki.sql, which is not XML, but a
copy of the same content expressed as SQL queries.
So, I figured I'd share a commercial tool that's turned freeware recently.
I've been analyzing my own writing to figure out what it conveys, and it's
more accurate than I'd expect from a free tool with no human involved. I love data
mining and analysis software. I'm not selling this, I'm not affiliated with
them or anything, and I'm not advertising it, but if you're actually ripping
Wikipedia or working with XML and large amounts of text, most likely you'd
be interested in this. I've been running it on Wikipedia pages and other
websites as well. I've generated very, very interesting and cool
results with it. Without further ado and long-windedness,
Designed for Information Science, Market Research, Sociological Analysis and
Scientific studies, Tropes is a Natural Language Processing and Semantic
Classification software that guarantees pertinence and quality in Text Analysis.
There are French, Brazilian, Portuguese and Spanish versions. It's definitely
not a waste of your time to test it out. I'm going to be using it often, as I
write often for myself and as a freelancer. I'm also using it to proof and
edit a book I've been writing for the past month, before I publish it in
a few weeks. It's too bad there's only a Windows binary version and no
source code available.
From their landing page ( http://www.semantic-knowledge.com/ ):
Semantic-Knowledge <http://www.semantic-knowledge.com/company.htm> is a
leading provider of Natural Language Processing (NLP) software, including
Semantic Search Engine, Text Analysis, Intelligent Desktop Search, Text
Mining, Knowledge Discovery and Classification systems:
Tropes <http://www.semantic-knowledge.com/tropes.htm> - High Performance Text Analysis
Designed for Semantic Classification, Keyword Extraction, Linguistic and
Qualitative Analysis, Tropes software is a perfect tool for Information
Science, Market Research, Sociological Analysis, Scientific and Medical
studies, and more.
Zoom <http://www.semantic-knowledge.com/zoom.htm> - Semantic Search Engine
With its fast Natural Language Information Retrieval system, Integrated Web
Spider, built-in Semantic Networks and on-the-fly Semantic classifications,
Zoom is a powerful Windows Search Engine designed for Document Management,
Competitive Intelligence, Press Analysis and Text Mining.
Index <http://www.semantic-knowledge.com/indexing-components.htm> - Semantic
Data Processing for Servers
Overtext Index is a classification system designed for large-scale Customer
Relationship Management (CRM), Natural Language Third Party Information
Retrieval, Knowledge Management, Business Intelligence and Strategic Watch.
This software is designed to help you face an increasingly dense flow of
information:
- accelerate your reading rate,
- analyze in-depth and objectively,
- extract relevant information,
- classify automatically, therefore structure information.
Because they offer considerable time savings and enhanced visibility of
strategic data, Tropes <http://www.semantic-knowledge.com/tropes.htm> and
Zoom <http://www.semantic-knowledge.com/zoom.htm> software yield an
exceptional Return On Investment (ROI). They generally show a profit from
their first use, sometimes in a matter of hours! You don't believe us? Try
the free version of Tropes Zoom in our download area.
Our software is based on powerful text analysis technology, using
dictionaries that contain hundreds of thousands of preset semantic
classifications, and reliable analysis techniques resulting from years of
research. Products are available now for the English, French, Spanish,
Portuguese and Brazilian Portuguese languages.
Thomas C. Stowe
Email/GChat/MS Live Messenger: stowe.thomas(a)gmail.com
Texas Computer Services: http://www.txpcservices.com
Survive2 SHTF/Disaster Prep/Homesteading Information:
I'm trying out relative links in the index.html files. For the next few
days expect that some jobs will complete with the old absolute urls and
I'll have to fix them by hand. New jobs should "just work". I've
converted most of the older index pages for the last five dumps from
each project. Let me know if there are any problems. I'm hoping that
nobody has scripts that rely on having the full url in there.
This change was needed for mirror sites to function properly. At some
point it will be helpful for https access as well.
It doesn't really look like a mirror to me. The download links still point to download.wikimedia.org.
> From: ariel(a)wikimedia.org
> To: xmldatadumps-l(a)lists.wikimedia.org; wikitech-l(a)lists.wikimedia.org
> Date: Thu, 13 Oct 2011 21:20:12 +0300
> Subject: [Xmldatadumps-l] first mirror of most recent XML dumps, at C3SL in Brazil
> As the subject says, the first mirror of our XML dumps is up, hosted at
> C3SL in Brazil. We're really excited about it. Details are listed on
> the main index page on our download server
> ( http://dumps.wikimedia.org/ ) and are reproduced below for everyone's
> Site: Centro de Computação Científica e Software Livre (C3SL), at the
> Universidade Federal do Paraná in Brazil.
> Contents: the 5 most current complete and successful dumps of each
> Access: HTTP: http://wikipedia.c3sl.ufpr.br/
> FTP: ftp://wikipedia.c3sl.ufpr.br/wikipedia/
> rsync: rsync://wikipedia.c3sl.ufpr.br/wikipedia/
> A big thank you to the folks there for providing the space and working
> with us to make it happen.
> Please forward this on to researchers or others who might want to know
> about it but aren't on these lists.
> Ariel Glenn
> Software Developer / Systems Engineer
> Wikimedia Foundation
I've started up the smaller wikis in 4 processes. I did a few manual
inspections, but as always the eyes of the community are much sharper than
mine, at the very least because there are more of them! So please get
your eyes on these and let me know if you see anything odd. I'll likely
start several larger wiki jobs in a little while and if I hear nothing
bad by the end of the day I'll fire off en wiki as well.
I am hoping that someone here can help me. There is a lot of conflicting
advice on the mirroring Wikipedia page. I am setting up a local mirror of
the English Wikipedia pages-articles...xml and I have been stopped by repeated
failures in MySQL configuration.
I have downloaded the .gz data from the dumps and run it through mwdumper
to create an SQL file without problems, but things keep breaking down on the
way into MySQL. I have had a lot of agony with the InnoDB log files, etc. and
have learned how to make them bigger, but I'm still apparently missing something.
My target machine is a 32-bit Ubuntu AWS instance with 1.7 GB RAM, 1 core, and 200
GB of disk, which is only for this project. I can make it bigger if necessary.
Can someone take a look at this my.cnf file and tell me what I need to
change to get this to work?
This is what the my.cnf file looks like:
wait_timeout = 120
long_query_time = 1
#max_connections = 2500
#max_user_connections = 2500
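One thing I should add: I haven't included any InnoDB settings above. The
kind of lines I've been experimenting with look roughly like the following,
with values picked more or less at random on my part, so they may well be
the problem on a 1.7 GB machine:

# values below are guesses, not settings I'm confident in:
innodb_log_file_size = 256M
innodb_buffer_pool_size = 1G
max_allowed_packet = 64M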