On 23/10/11 23:19, Fred Zimmerman wrote:
> Good points! As the post indicates, I'm not very experienced with any
> of these tools and made a lot of dummy mistakes. Part of the point of
> the post is that even if you are pretty dumb, you can get this done if
> you persevere!
>
> The SQL import of the links would have been helpful; most of the
> instructions I found recommended using pages-articles.xml, which is a
> good place to start. There's no need for two copies of the big wiki.xml
> file; I would suggest just renaming the original download to enwiki.xml
> right at the outset.
The file I suggested skipping was enwiki.sql, which is not XML but a
copy (as SQL queries) of the XML data.
So, I figured I'd share a commercial tool that recently became freeware.
I've been analyzing my own writing with it to figure out what it conveys, and
it's more accurate than I'd expect from a free tool with no human in the loop.
I love data mining and analysis software. I'm not selling this, I'm not
affiliated with them, and I'm not advertising it, but if you're actually
ripping Wikipedia or working with XML and large amounts of text, you'd most
likely be interested in this. I've been running it on Wikipedia pages and
other websites as well, and I've generated very interesting and cool
results with it. Without further ado and long-windedness:
Designed for Information Science, Market Research, Sociological Analysis and
Scientific studies, Tropes is a Natural Language Processing and Semantic
Classification software that guarantees *pertinence and quality in Text
Analysis*.
http://www.semantic-knowledge.com/download.htm
There are French, Brazilian, Portuguese and Spanish versions. It's definitely
not a waste of your time to test it out. I'm going to use it often, as I
write regularly for myself and as a freelancer. I'm also using it to proof and
edit a book I've been writing for the past month, before I publish it in
a few weeks. It's too bad there's only a Windows binary and no
source code available.
From their landing page ( http://www.semantic-knowledge.com/ ):
Semantic-Knowledge <http://www.semantic-knowledge.com/company.htm> is a
leading provider of Natural Language Processing (NLP) software, including
Semantic Search Engine, Text Analysis, Intelligent Desktop Search, Text
Mining, Knowledge Discovery and Classification systems:
------------------------------
Tropes <http://www.semantic-knowledge.com/tropes.htm> - High Performance Text Analysis
Designed for Semantic Classification, Keyword Extraction, Linguistic and
Qualitative Analysis, Tropes software is a perfect tool for Information
Science, Market Research, Sociological Analysis, Scientific and Medical
studies, and more.
------------------------------
Zoom <http://www.semantic-knowledge.com/zoom.htm> - Semantic Search Engine
With its fast Natural Language Information Retrieval system, Integrated Web
Spider, built-in Semantic Networks and on-the-fly Semantic classifications,
Zoom is a powerful Windows Search Engine designed for Document Management,
Competitive Intelligence, Press Analysis and Text Mining.
------------------------------
Overtext Index <http://www.semantic-knowledge.com/indexing-components.htm> - Semantic Data Processing for Servers
Overtext Index is a classification system designed for large-scale Customer
Relationship Management (CRM), Natural Language Third Party Information
Retrieval, Knowledge Management, Business Intelligence and Strategic Watch
systems.
------------------------------
This software is designed to help you face an increasingly dense
information flood:
- accelerate your reading rate,
- analyze in-depth and objectively,
- extract relevant information,
- classify automatically, therefore structure information.
Because they offer considerable time savings and enhanced visibility of
strategic data, Tropes <http://www.semantic-knowledge.com/tropes.htm> and
Zoom <http://www.semantic-knowledge.com/zoom.htm> software yield an
exceptional Return On Investment (ROI). They generally show a profit as of
their first use, sometimes in a matter of hours! You don't believe us? *Try
the free version of Tropes Zoom in our download
area* <http://www.semantic-knowledge.com/download.htm>.
Our software is based on powerful text analysis technology, using
dictionaries that contain hundreds of thousands of preset semantic
classifications, and reliable analysis techniques resulting from years of
scientific research.
Products available now for the English, French, Spanish, Portuguese and
Brazilian languages.
Enjoy!
Thomas C. Stowe
Email/GChat/MS Live Messenger: stowe.thomas(a)gmail.com
Texas Computer Services: http://www.txpcservices.com
Portfolio/VCard/Resume: http://www.thomasstowe.info
Blog: http://www.sc3ne.com
Survive2 SHTF/Disaster Prep/Homesteading Information:
http://www.survive2.com
Phone/SMS/VoiceMail: +1-210-704-7289
Skype: thomasstowe
I'm trying out relative links in the index.html files. For the next few
days, expect that some jobs will complete with the old absolute URLs and
I'll have to fix them by hand. New jobs should "just work". I've
converted most of the older index pages for the last five dumps from
each project. Let me know if there are any problems. I'm hoping that
nobody has scripts that rely on having the full URL in there.
This change was needed for mirror sites to function properly. At some
point it will be helpful for https access as well.
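To illustrate what the change amounts to (the filename and path below are made up, not real index contents): stripping the download host from an href is all that turns an absolute link into a relative one, and it is roughly what a script that depended on full URLs would need to reverse.

```shell
# Hypothetical old-style index entry with an absolute link:
printf '%s\n' '<a href="http://download.wikimedia.org/enwiki/20111007/enwiki-20111007-md5sums.txt">md5sums</a>' > index.html

# Dropping the host prefix leaves a relative link that resolves against
# whatever server is actually serving the page, which is what lets the
# same index.html work on any mirror:
sed 's|http://download\.wikimedia\.org/||g' index.html
```

A script that wants absolute URLs back can run the same substitution in reverse, prepending its preferred mirror's base URL.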
Ariel
It doesn't really look like a mirror to me. The download links still point to download.wikimedia.org.
Regards,
Hydriz
http://simple.wikipedia.org/wiki/User:Hydriz
> From: ariel(a)wikimedia.org
> To: xmldatadumps-l(a)lists.wikimedia.org; wikitech-l(a)lists.wikimedia.org
> Date: Thu, 13 Oct 2011 21:20:12 +0300
> Subject: [Xmldatadumps-l] first mirror of most recent XML dumps, at C3SL in Brazil
>
> As the subject says, the first mirror of our XML dumps is up, hosted at
> C3SL in Brazil. We're really excited about it. Details are listed on
> the main index page on our download server
> ( http://dumps.wikimedia.org/ ) and are reproduced below for everyone's
> convenience:
>
> Site: Centro de Computação Científica e Software Livre (C3SL), at the
> Universidade Federal do Paraná in Brazil.
> Contents: the 5 most current complete and successful dumps of each
> project
> Access: HTTP: http://wikipedia.c3sl.ufpr.br/
> FTP: ftp://wikipedia.c3sl.ufpr.br/wikipedia/
> rsync: rsync://wikipedia.c3sl.ufpr.br/wikipedia/
>
> A big thank you to the folks there for providing the space and working
> with us to make it happen.
>
> Please forward this on to researchers or others who might want to know
> about it but aren't on these lists.
>
> Ariel Glenn
> Software Developer / Systems Engineer
> Wikimedia Foundation
> ariel(a)wikimedia.org
>
>
> _______________________________________________
> Xmldatadumps-l mailing list
> Xmldatadumps-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
I've started up the smaller wikis in 4 processes. I did a few manual
inspections, but as always the eyes of the community are much sharper than
mine, at the very least because there are more of them! So please get
your eyes on these and let me know if you see anything odd. I'll likely
start several larger wiki jobs in a little while, and if I hear nothing
bad by the end of the day I'll fire off en wiki as well.
Happy 1.18,
Ariel
Hi,
I am hoping that someone here can help me. There is a lot of conflicting
advice on the mirroring-Wikipedia page. I am setting up a local mirror of
the English Wikipedia pages-articles...xml and I have been stopped by repeated
failures in MySQL configuration.
I have downloaded the .gz data from the dumps and run it through mwdumper
to create an SQL file without problems, but things keep breaking down on the
way into MySQL. I have had a lot of agony with InnoDB log files, etc., and
have learned how to make them bigger, but I'm still apparently missing some
pieces.
My target machine is an AWS instance: 32-bit Ubuntu, 1.7 GB RAM, 1 core,
200 GB disk, used only for this project. I can make it bigger if necessary.
Can someone take a look at this my.cnf file and tell me what I need to
change to get this to work?
this is what my.cnf file looks like:
[mysqladmin]
user=
[mysqld]
basedir=/opt/bitnami/mysql
datadir=/opt/bitnami/mysql/data
port=3306
socket=/opt/bitnami/mysql/tmp/mysql.sock
tmpdir=/opt/bitnami/mysql/tmp
character-set-server=UTF8
collation-server=utf8_general_ci
max_allowed_packet=128M
wait_timeout = 120
long_query_time = 1
log_slow_queries
log_queries_not_using_indexes
query_cache_limit=2M
query_cache_type=1
query_cache_size=128M
innodb_additional_mem_pool_size=8M
innodb_buffer_pool_size=256M
innodb_log_file_size=128M
#tmp_table_size=64M
#max_connections = 2500
#max_user_connections = 2500
innodb_flush_method=O_DIRECT
#key_buffer_size=64M
[mysqld_safe]
mysqld=mysqld.bin
[client]
default-character-set=UTF8
port=3306
socket=/opt/bitnami/mysql/tmp/mysql.sock
[manager]
port=3306
socket=/opt/bitnami/mysql/tmp/mysql.sock
pid-file=/opt/bitnami/mysql/tmp/manager.pid
default-mysqld-path=/opt/bitnami/mysql/bin/mysqld.bin
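Not a verified fix, but a few InnoDB settings are commonly raised for one-off bulk imports like this; the values below are guesses scaled to a 1.7 GB machine, so treat them as a starting point rather than a prescription.

```ini
[mysqld]
# A larger buffer pool is the single biggest win for a bulk import;
# roughly half of RAM is a common starting point on a dedicated box.
innodb_buffer_pool_size=768M

# Bigger redo logs mean fewer checkpoint stalls during mass INSERTs.
# Caution: on MySQL of this era, after changing innodb_log_file_size you
# must shut the server down cleanly and delete the old ib_logfile* files
# before restarting, or mysqld will refuse to start. That mismatch is a
# very common cause of the InnoDB log-file agony described above.
innodb_log_file_size=256M

# Relax durability for the duration of the import only; a crash just
# means redoing the import.
innodb_flush_log_at_trx_commit=0

# The query cache only costs memory and locking on a write-heavy load.
query_cache_size=0
```

The rest of the posted my.cnf looks serviceable for this purpose; the existing max_allowed_packet=128M should already be large enough for mwdumper's batched INSERT statements.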
Hi,
I'd like to know if there is a webpage that is more of a walk-through on
how to create a mirror of Wikipedia. I have had a few people ask me how
this can be done here at CLEF 2011 and I was not able to give them a
satisfactory answer.
-- とある白い猫 (To Aru Shiroi Neko)
I've shot all but a couple of jobs, and those will die before I go to bed.
Tomorrow, after the first round of deployments and fixes has gone around
and the dust has settled, I'll crank the dumps back up until it's time
for the second round.
Ariel
As people prepare for the "het deployment" of MW 1.18, you are going to
see some interruptions of the dumps while I get these hosts ready for
the switch. As a side effect of one of the configuration changes, the
host running the "large-ish" wikis is temporarily not running any jobs.
I should be starting those jobs again on Monday. The four "small wiki"
processes should be ok.
When we get around to the actual deployment, I will likely stop all jobs
until deployment is complete, as I cannot begin to guess what the output
would look like if the codebase were switched out underneath the dumpers
in the middle of a run.
Ariel