I am interested in looking at the links between webpages on Wikipedia
for scientific research. I have been to the dumps documentation,
which suggested that the latest pages-articles dump is likely the one
people want. However, I'm unclear on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different
files, and I can't tell whether one of them contains
only link information. Is there a description of what each file contains?
(2) The enwiki-latest-pages-articles.xml file uncompresses to
31.55 GB. Is it correct that this contains the current snapshot of all
pages and articles in Wikipedia? (I only ask because the size seems surprising to me.)
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear
on the method used to denote a link. It would appear that links are
denoted by [[link]] or [[link | word]]. Such patterns would be fairly
easy to find using perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the
first to formulate ...... in his
If I must search through the pages-articles file, and if the [[ ]]
notation is overloaded, is there a description of the patterns that
are used in this file? I.e., a way for me to ensure that I'm only
grabbing links, not figure captions or some other content.
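To make question (3) concrete, here is roughly the kind of extraction I have
in mind (a rough Python sketch rather than the perl I'd actually use; the
regex and the extract_links helper below are just my own naive attempt, and
the nested File: caption above is exactly the sort of case that confuses it):

    import re

    # Naive pattern: treat [[target]] and [[target|label]] as links.
    LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

    def extract_links(wikitext):
        """Return plain link targets, skipping File:/Image:/Category: links.
        Links nested inside image captions are still picked up, which is
        exactly the ambiguity I'm asking about."""
        links = []
        for target in LINK_RE.findall(wikitext):
            target = target.strip()
            if target.lower().startswith(("file:", "image:", "category:")):
                continue
            links.append(target)
        return links

    print(extract_links("See [[William Godwin]] and [[Anarchism|anarchists]]."))
    # -> ['William Godwin', 'Anarchism']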
Thanks for your help!
As the subject says, the first mirror of our XML dumps is up, hosted at
C3SL in Brazil. We're really excited about it. Details are listed on
the main index page on our download server
( http://dumps.wikimedia.org/ ) and are reproduced below for everyone's
convenience.
Site: Centro de Computação Científica e Software Livre (C3SL), at the
Universidade Federal do Paraná in Brazil.
Contents: the 5 most current complete and successful dumps of each project.
Access: HTTP: http://wikipedia.c3sl.ufpr.br/
FTP: ftp://wikipedia.c3sl.ufpr.br/wikipedia/
rsync: rsync://wikipedia.c3sl.ufpr.br/wikipedia/
A big thank you to the folks there for providing the space and working
with us to make it happen.
Please forward this on to researchers or others who might want to know
about it but aren't on these lists.
Ariel Glenn
Software Developer / Systems Engineer
Wikimedia Foundation
On 23/10/11 23:19, Fred Zimmerman wrote:
> Good points! As the post indicates, I'm not very experienced with any
> of these tools and made a lot of dummy mistakes. Part of the point of
> the post is that even if you are pretty dumb, you can get this done if
> you persevere!
> The SQL import of the links would have been helpful; most of the
> instructions I found recommended using pages-articles.xml, which is a
> good place to start. There's no need for two copies of the big wiki.xml
> file, I would suggest just re-naming the original download to enwiki.xml
> right at the outset.
The file I mentioned to skip was enwiki.sql, which is not XML, but a
copy of the same content expressed as SQL queries.
So, I figured I'd share a commercial tool that's turned freeware recently.
I've been analyzing my own writing to figure out what it conveys, and it's
more accurate than I'd expect from a free tool with no human involved. I love data
mining and analysis software. I'm not selling this, I'm not affiliated with
them or anything, and I'm not advertising it, but if you're actually ripping
Wikipedia or working with XML and large amounts of text, most likely you'd
be interested in this. I've been running it on Wikipedia pages and other
websites as well. I've generated very, very interesting and cool
results with it. Without further ado and long-windedness,
Designed for Information Science, Market Research, Sociological Analysis and
Scientific studies, Tropes is a Natural Language Processing and Semantic
Classification software that guarantees pertinence and quality in Text Analysis.
There are French, Brazilian, Portuguese and Spanish versions. It's definitely
not a waste of your time to test it out. I'm going to be using it often, as I
write often for myself and as a freelancer. I'm also using it to proof and
edit a book I've been writing for the past month, before I publish it in
a few weeks. It's too bad there's only a Windows binary version and no
source code available.
From their landing page ( http://www.semantic-knowledge.com/ ):
Semantic-Knowledge <http://www.semantic-knowledge.com/company.htm> is a
leading provider of Natural Language Processing (NLP) software, including
Semantic Search Engine, Text Analysis, Intelligent Desktop Search, Text
Mining, Knowledge Discovery and Classification systems:
Tropes <http://www.semantic-knowledge.com/tropes.htm> - High Performance Text Analysis
Designed for Semantic Classification, Keyword Extraction, Linguistic and
Qualitative Analysis, Tropes software is a perfect tool for Information
Science, Market Research, Sociological Analysis, Scientific and Medical
studies, and more.
Zoom <http://www.semantic-knowledge.com/zoom.htm> - Semantic Search Engine
With its fast Natural Language Information Retrieval system, Integrated Web
Spider, built-in Semantic Networks and on-the-fly Semantic classifications,
Zoom is a powerful Windows Search Engine designed for Document Management,
Competitive Intelligence, Press Analysis and Text Mining.
Index <http://www.semantic-knowledge.com/indexing-components.htm> - Semantic
Data Processing for Servers
Overtext Index is a classification system designed for large-scale Customer
Relationship Management (CRM), Natural Language Third Party Information
Retrieval, Knowledge Management, Business Intelligence and Strategic Watch.
This software is designed to help you face an increasingly dense flow of
information:
- accelerate your reading rate,
- analyze in-depth and objectively,
- extract relevant information,
- classify automatically, therefore structure information.
Because they offer considerable time savings and enhanced visibility of
strategic data, Tropes <http://www.semantic-knowledge.com/tropes.htm> and
Zoom <http://www.semantic-knowledge.com/zoom.htm> software yield an
exceptional Return On Investment (ROI). They generally show a profit from
their first use, sometimes in a matter of hours! You don't believe us? Try
the free version of Tropes Zoom in our download area.
Our software is based on powerful text analysis technology, using
dictionaries that contain hundreds of thousands of preset semantic
classifications, and reliable analysis techniques resulting from years of
research. Products are available now for the English, French, Spanish,
Portuguese and Brazilian Portuguese languages.
Thomas C. Stowe
Email/GChat/MS Live Messenger: stowe.thomas(a)gmail.com
Texas Computer Services: http://www.txpcservices.com
Survive2 SHTF/Disaster Prep/Homesteading Information:
I'm trying out relative links in the index.html files. For the next few
days expect that some jobs will complete with the old absolute urls and
I'll have to fix them by hand. New jobs should "just work". I've
converted most of the older index pages for the last five dumps from
each project. Let me know if there are any problems. I'm hoping that
nobody has scripts that rely on having the full url in there.
This change was needed for mirror sites to function properly. At some
point it will be helpful for https access as well.
It doesn't really look like a mirror to me. The download links still point to download.wikimedia.org.
> From: ariel(a)wikimedia.org
> To: xmldatadumps-l(a)lists.wikimedia.org; wikitech-l(a)lists.wikimedia.org
> Date: Thu, 13 Oct 2011 21:20:12 +0300
> Subject: [Xmldatadumps-l] first mirror of most recent XML dumps, at C3SL in Brazil
> As the subject says, the first mirror of our XML dumps is up, hosted at
> C3SL in Brazil. We're really excited about it. Details are listed on
> the main index page on our download server
> ( http://dumps.wikimedia.org/ ) and are reproduced below for everyone's
> Site: Centro de Computação Científica e Software Livre (C3SL), at the
> Universidade Federal do Paraná in Brazil.
> Contents: the 5 most current complete and successful dumps of each
> Access: HTTP: http://wikipedia.c3sl.ufpr.br/
> FTP: ftp://wikipedia.c3sl.ufpr.br/wikipedia/
> rsync: rsync://wikipedia.c3sl.ufpr.br/wikipedia/
> A big thank you to the folks there for providing the space and working
> with us to make it happen.
> Please forward this on to researchers or others who might want to know
> about it but aren't on these lists.
> Ariel Glenn
> Software Developer / Systems Engineer
> Wikimedia Foundation
I've started up the smaller wikis in 4 processes. I did a few manual
inspections, but as always the eyes of the community are much sharper than
mine, at the very least because there are more of them! So please get
your eyes on these and let me know if you see anything odd. I'll likely
start several larger wiki jobs in a little while and if I hear nothing
bad by the end of the day I'll fire off en wiki as well.
I am hoping that someone here can help me. There is a lot of conflicting
advice on the mirroring Wikipedia page. I am setting up a local mirror of
the English Wikipedia pages-articles...xml and I have been stopped by repeated
failures in MySQL configuration.
I have downloaded the .gz data from the dumps and run it through mwdumper
to create an SQL file without problems, but things keep breaking down on the
way into MySQL. I have had a lot of agony with the InnoDB log files, etc. and
have learned how to make them bigger, but I'm still apparently missing something.
My target machine is a 32-bit Ubuntu AWS instance with 1.7 GB RAM, 1 core, and 200
GB of disk, which is only for this project. I can make it bigger if necessary.
Can someone take a look at this my.cnf file and tell me what I need to
change to get this to work?
This is what the my.cnf file looks like:
wait_timeout = 120
long_query_time = 1
#max_connections = 2500
#max_user_connections = 2500
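One thing I should add: I haven't included any InnoDB settings above. The
kind of lines I've been experimenting with look roughly like the following,
with values picked more or less at random on my part, so they may well be
the problem on a 1.7 GB machine:

# values below are guesses, not settings I'm confident in:
innodb_log_file_size = 256M
innodb_buffer_pool_size = 1G
max_allowed_packet = 64M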