Xmldatadumps-l

xmldatadumps-l@lists.wikimedia.org

720 discussions

Start a nNew thread

No new german dump
by Andreas Meier 21 Nov '11

21 Nov '11

Hello, today we ja-dump was finished, so a new de-dump should start. Best regards Andreas

2 1

inter-page links in the data dump
by Greg Morrison 17 Nov '11

17 Nov '11

I am interested in looking at the links between webpages on wikipedia for scientific research. I have been to http://en.wikipedia.org/wiki/Wikipedia:Database_download which suggested that the latest pages-articles is likely the one people want. However, I'm unclear on some things. (1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different files, and I can't actually tell if one of them would actually contain only link information. Is there a description of what each file contains? (2) The enwiki-latest-pages-articles.xml file uncompresses as 31.55GB. Is it correct that this contains the current snapshot of all pages and articles in wikipedia? (I only ask because this seems small) (3) If I am constrained to use latest-pages-articles.xml, I'm unclear on the method used to denote a link. It would appear that links are denoted by [[link]] or [[link | word]]. Such patterns would be fairly easy to find using perl. However, I've noticed some odd cases, such as "[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first to formulate ...... in his work".<refname="EB1910" />]]" If I must search through the page-articles file, and if the [[ ]] notation is overloaded, is there a description of the patterns that are used in this file? I.e. a way for me to ensure that I'm only grabbing links, not figure captions or some other content. Thanks for your help!

5 5

dumps halted for updates
by Ariel T. Glenn 15 Nov '11

15 Nov '11

Some OS security updates were installed and there's code updates that have just been deployed, so dumps are halted for a little while; I'll be resuming them later today. Ariel

1 1

dump titles form other namespaces than 0?
by Ariel T. Glenn 02 Nov '11

02 Nov '11

A while back (over 2 years ago, urk!) we had a request for dumps of titles of things other than articles [1]. I haven't seen that request repeated, but I'm wondering how useful that would be to folks and which namespaces we should dump, if we were going to add a few. Article talk pages? Other? Ariel [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=19542

1 0

first mirror of most recent XML dumps, at C3SL in Brazil
by Ariel T. Glenn 24 Oct '11

24 Oct '11

As the subject says, the first mirror of our XML dumps is up, hosted at C3Sl in BRazil. We're really excited about it. Details are listed on the main index page on our download server ( http://dumps.wikimedia.org/ ) and are reproduced below for everyone's convenience: Site: Centro de Computação Científica e Software Livre (C3SL), at the Universidade Federal do Paraná in Brazil. Contents: the 5 most current complete and successful dumps of each project Access: HTTP: http://wikipedia.c3sl.ufpr.br/ FTP: ftp://wikipedia.c3sl.ufpr.br/wikipedia/ rsync: rsync://wikipedia.c3sl.ufpr.br/wikipedia/ A big thank you to the folks there for providing the space and working with us to make it happen. Please forward this on to researchers or others who might want to know about it but aren't on these lists. Ariel Glenn Software Developer / Systems Engineer Wikimedia Foundation ariel(a)wikimedia.org

2 5

Re: [Xmldatadumps-l] lessons learned from cloning Wikipedia and indexing with Solr
by Platonides 23 Oct '11

23 Oct '11

On 23/10/11 23:19, Fred Zimmerman wrote: > Good points! As the post indicates, I'm not very experienced with any > of these tools and made a lot of dummy mistakes. Part of the point of > the post is that even if you are pretty dumb, you can get this done if > you persevere! > > The SQL import of the links would have been helpful; most of the > instructions I found recommended using pages-articles.xml, which is a > good place to start. There's no need for two copies of the big wiki.xml > file, I would suggest just re-naming the original download to enwiki.xml > right at the outset. The file I mentioned to skip was enwiki.sql, which is not xml, but a copy (in SQL queries) of the xml.

1 0

Off-topic, but super-cool free and unique software that's somewhat relevant (Language Analysis of Wikipedia)
by Thomas Stowe 23 Oct '11

23 Oct '11

So, I figured I'd share a commercial tool that's turned freeware recently. I've been analyzing my own writing to figure out what it conveys and it's more accurate than I'd expect for free without a human being. I love data mining and analysis software. I'm not selling this, I'm not affiliated with them or anything and I'm not advertising it but if you're actually ripping wikipedia or working with XML and large amounts of text, most likely you'd be interested in this. I've been running it on Wikipedia pages and other websites as well, too. I've generated very, very interesting and cool results with it. Without further adieu and long-windedness, Designed for Information Science, Market Research, Sociological Analysis and Scientific studies, Tropes is a Natural Language Processing and Semantic Classification software that guarantees *pertinence and quality in Text Analysis*. http://www.semantic-knowledge.com/ http://www.semantic-knowledge.com/download.htm There are French, Brazilian, Portugese and Spanish versions. It's definitely not a waste of your time to test it out. I'm going to be use it often as I write often for myself and as a freelancer. I'm also using it to proof and edit a book I've been writing for the past month, too before I publish it in a few weeks. It's too bad there's only a Windows binary version and no source code available. >From their landing page ( http://www.semantic-knowledge.com/ ): Semantic-Knowledge <http://www.semantic-knowledge.com/company.htm> is a leading provider of Natural Language Processing (NLP) software, including Semantic Search Engine, Text Analysis, Intelligent Desktop Search, Text Mining, Knowledge Discovery and Classification systems: ------------------------------ [image: Tropes Text Analysis] <http://www.semantic-knowledge.com/tropes.htm> Tropes <http://www.semantic-knowledge.com/tropes.htm> - High Performance Text Analysis Designed for Semantic Classification, Keyword Extraction, Linguistic and Qualitative Analysis, Tropes software is a perfect tool for Information Science, Market Research, Sociological Analysis, Scientific and Medical studies, and more.. ------------------------------ [image: Zoom Semantic Search Engine]<http://www.semantic-knowledge.com/zoom.htm> Zoom <http://www.semantic-knowledge.com/zoom.htm> - Semantic Search Engine With its fast Natural Language Information Retrieval system, Integrated Web Spider, built-in Semantic Networks and on-the-fly Semantic classifications, Zoom is a powerful Windows Search Engine designed for Document Management, Competitive Intelligence, Press Analysis and Text Mining. ------------------------------ [image: Overtext Index Semantic Indexing]<http://www.semantic-knowledge.com/indexing-components.htm> Overtext Index <http://www.semantic-knowledge.com/indexing-components.htm> - Semantic Data Processing for Servers Overtext Index is a classification system designed for large-scale Customer Relationship Management (CRM), Natural Language Third Party Information Retrieval, Knowledge Management, Business Intelligence and Strategic Watch systems ------------------------------ This software is designed to help you to face a increasingly dense information flood: - accelerate your reading rate, - analyze in-depth and objectively, - extract relevant information, - classify automatically, therefore structure information. Because they offer considerable time savings and enhanced visibility of strategic data, Tropes <http://www.semantic-knowledge.com/tropes.htm> and Zoom <http://www.semantic-knowledge.com/zoom.htm> software yield an exceptional Return On Investments (ROI). They generally show a profit as of their first use, sometimes in a matter of hours! You don't believe us? *Try the free version of Tropes Zoom in our download area*<http://www.semantic-knowledge.com/download.htm> . Our software is based on powerful text analysis technology, using dictionaries that contain hundreds of thousands of preset semantic classifications, and reliable analysis techniques resulting from years of scientific research. Products available now for the English, French, Spanish, Portuguese and Brazilian languages. Enjoy! Thomas C. Stowe Email/GChat/MS Live Messenger: stowe.thomas(a)gmail.com Texas Computer Services: http://www.txpcservices.com Portfolio/VCard/Resume: http://www.thomasstowe.info Blog: http://www.sc3ne.com Survive2 SHTF/Disaster Prep/Homesteading Information: http://www.survive2.com Phone/SMS/VoiceMail: +1-210-704-7289 Skype: thomasstowe

1 0

lessons learned from cloning Wikipedia and indexing with Solr
by Fred Zimmerman 23 Oct '11

23 Oct '11

for future reference http://business.zimzaz.com/wordpress/2011/10/how-to-clone-wikipedia-mirror-… ----------------------------------------------------- Subscribe to the Nimble Books Mailing List http://eepurl.com/czS- for monthly updates

2 1

change to index.html files: relative urls for files
by Ariel T. Glenn 17 Oct '11

17 Oct '11

I'm trying out relative links in the index.html files. For the next few days expect that some jobs will complete with the old absolute urls and I'll have to fix them by hand. New jobs should "just work". I've converted most of the older index pages for the last five dumps from each project. Let me know if there are any problems. I'm hoping that nobody has scripts that rely on having the full url in there. This change was needed for mirror sites to function properly. At some point it will be helpful for https access as well. Ariel

1 0

Re: [Xmldatadumps-l] first mirror of most recent XML dumps, at C3SL in Brazil
by Hydriz Wikipedia 14 Oct '11

14 Oct '11

It doesn't really look like a mirror to me. The download links still point to download.wikimedia.org. Regards, Hydriz http://simple.wikipedia.org/wiki/User:Hydriz > From: ariel(a)wikimedia.org > To: xmldatadumps-l(a)lists.wikimedia.org; wikitech-l(a)lists.wikimedia.org > Date: Thu, 13 Oct 2011 21:20:12 +0300 > Subject: [Xmldatadumps-l] first mirror of most recent XML dumps, at C3SL in Brazil > > As the subject says, the first mirror of our XML dumps is up, hosted at > C3Sl in BRazil. We're really excited about it. Details are listed on > the main index page on our download server > ( http://dumps.wikimedia.org/ ) and are reproduced below for everyone's > convenience: > > Site: Centro de Computação Científica e Software Livre (C3SL), at the > Universidade Federal do Paraná in Brazil. > Contents: the 5 most current complete and successful dumps of each > project > Access: HTTP: http://wikipedia.c3sl.ufpr.br/ > FTP: ftp://wikipedia.c3sl.ufpr.br/wikipedia/ > rsync: rsync://wikipedia.c3sl.ufpr.br/wikipedia/ > > A big thank you to the folks there for providing the space and working > with us to make it happen. > > Please forward this on to researchers or others who might want to know > about it but aren't on these lists. > > Ariel Glenn > Software Developer / Systems Engineer > Wikimedia Foundation > ariel(a)wikimedia.org > > > _______________________________________________ > Xmldatadumps-l mailing list > Xmldatadumps-l(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

2 2

← Newer
1
...
57
58
59
60
61
62
63
...
72
Older →

Jump to page:

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Xmldatadumps-l