Xmldatadumps-l

xmldatadumps-l@lists.wikimedia.org

720 discussions

post-1.18 dumps starting up
by Ariel T. Glenn 07 Oct '11

07 Oct '11

I've started up the smaller wikis in 4 processes. I did a few manual inspections but as always the eyes of the community are much sharper thn mine, at the very least because there are more of them! So please get your eyes on these and let me know if you see anything odd. I'll likely start several larger wiki jobs in a little while and if I hear nothing bad by the end of the day I'll fire off en wiki as well. Happy 1.18, Ariel

1 0

help with my.cnf for local Wiki mirror
by Fred Zimmerman 06 Oct '11

06 Oct '11

Hi, I am hoping that someone here can help me. THere is a lot of conflicting advice on the mirroring wikipedia page. I am setting up a local mirror of english Wikipedia pages-articles...xml and I have been stopped by repeated failures in mySQL configuration. I have downloaded the .gz data from the dumped and run it through mwdumper to create an SQL file without problems, but things keep breaking down on the way into mysql. I have had a lot of agony with inno_db log files, etc. and have learned how to make them bigger, but I'm still apaprently missing some pieces. my target machine is an AWS instance 32 bit Ubuntu 1.7GB RAM, 1 Core, 200 GB, which is only for this project. I can make it bigger if necessary. Can someone take a look at this my.cnf file and tell me what I need to change to get this to work? this is what my.cnf file looks like: [mysqladmin] user= [mysqld] basedir=/opt/bitnami/mysql datadir=/opt/bitnami/mysql/data port=3306 socket=/opt/bitnami/mysql/tmp/mysql.sock tmpdir=/opt/bitnami/mysql/tmp character-set-server=UTF8 collation-server=utf8_general_ci max_allowed_packet=128M wait_timeout = 120 long_query_time = 1 log_slow_queries log_queries_not_using_indexes query_cache_limit=2M query_cache_type=1 query_cache_size=128M innodb_additional_mem_pool_size=8M innodb_buffer_pool_size=256M innodb_log_file_size=128M #tmp_table_size=64M #max_connections = 2500 #max_user_connections = 2500 innodb_flush_method=O_DIRECT #key_buffer_size=64M [mysqld_safe] mysqld=mysqld.bin [client] default-character-set=UTF8 port=3306 socket=/opt/bitnami/mysql/tmp/mysql.sock [manager] port=3306 socket=/opt/bitnami/mysql/tmp/mysql.sock pid-file=/opt/bitnami/mysql/tmp/manager.pid default-mysqld-path=/opt/bitnami/mysql/bin/mysqld.bin

1 0

Mirroring wikipedia
by とある白い猫 29 Sep '11

29 Sep '11

Hi, I'd like to know if there is a webpage that is more of a walk-through on how o create a mirror for wikipedia. I have had a few people asked me how this can be done here at CLEF2011 and I was not able to give them a satisfactory answer. -- とある白い猫 (To Aru Shiroi Neko)

2 2

het deployment round one tonight
by Ariel T. Glenn 26 Sep '11

26 Sep '11

I've shot all but a couple jobs, and those will die before I go to bed. Tomorrow after the first round of deployments and fixes has gone around and the dust has stelled, I'll crank the dumps back up until it's time for the second round. Ariel

1 0

dump bumps in the road
by Ariel T. Glenn 24 Sep '11

24 Sep '11

As people prepare for the "het deployment" of mw 1.18 you are going to see some interruptions of the dumps while I get these hosts ready for the switch. As a side effect of one of the configuration changes, the host running the "large-ish" wikis is temporarily not running any jobs. I should be starting those jobs again on Moday. The four "small wiki" processes should be ok. When we get around to the actual deployment, I will likely stop all jobs til deployment is complete, as I cannot begin to guess what the output would look like if the codebase were switched out underneath the dumpers in the middle of a run. Ariel

2 2

Re: [Xmldatadumps-l] [Foundation-l] Request: WMF commitment as a long term cultural archive?
by emijrp 21 Sep '11

21 Sep '11

Hi all; Just like the scripts to preserve wikis[1], I'm working in a new script to download all Wikimedia Commons images packed by day. But I have limited spare time. Sad that volunteers have to do this without any help from Wikimedia Foundation. I started too an effort in meta: (with low activity) to mirror XML dumps.[2] If you know about universities or research groups which works with Wiki[pm]edia XML dumps, they would be a possible successful target to mirror them. If you want to download the texts into your PC, you only need 100GB free and to run this Python script.[3] I heard that Internet Archive saves XML dumps quarterly or so, but no official announcement. Also, I heard about Library of Congress wanting to mirror the dumps, but not news since a long time. L'Encyclopédie has an "uptime"[4] of 260 years[5] and growing. Will Wiki[pm]edia projects reach that? Regards, emijrp [1] http://code.google.com/p/wikiteam/ [2] http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps [3] http://code.google.com/p/wikiteam/source/browse/trunk/wikipediadownloader.py [4] http://en.wikipedia.org/wiki/Uptime [5] http://en.wikipedia.org/wiki/Encyclop%C3%A9die 2011/6/2 Fae <faenwp(a)gmail.com> > Hi, > > I'm taking part in an images discussion workshop with a number of > academics tomorrow and could do with a statement about the WMF's long > term commitment to supporting Wikimedia Commons (and other projects) > in terms of the public availability of media. Is there an official > published policy I can point to that includes, say, a 10 year or 100 > commitment? > > If it exists, this would be a key factor for researchers choosing > where to share their images with the public. > > Thanks, > Fae > -- > http://enwp.org/user_talk:fae > Guide to email tags: http://j.mp/faetags > > _______________________________________________ > foundation-l mailing list > foundation-l(a)lists.wikimedia.org > Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l >

1 0

another month, another dump. ho hum :-P
by Ariel T. Glenn 19 Sep '11

19 Sep '11

The September en wikipedia dumps are done. Folks who use them, note that this is the first run with the generation of a pile of smaller files. The naming scheme as you will have noticed has an additional string: -p<first-page-id-contained>p<last-pageid-contained> Expect the specific groupings to change from one run to the next; it's time-based, rather than based on the number of pages or revisions. You may notice a gap of a few numbers between files; this would indicate that those pages were deleted and not included in the dump at all. Since there were no issues with the network, database servers, broken MW deployments etc., the run finished without any need for restarts of a particular step; this is probably the fastest we'll ever see it run, in a little under 8 days. Any issues, please let me know. I expect people will need a script to download these files easily; didn't someone on this list have a tool in the works? Ariel

5 7

dump of ptwikibooks broken
by Bernd Fehling 09 Sep '11

09 Sep '11

Dear list, I noticed that ptwikibooks has some problems since end of May. Is someone able to fix this? Regards, Bernd Fehling

1 0

another month...
by Ariel T. Glenn 02 Sep '11

02 Sep '11

...another dump. August is done, July 7z are done, the last of the May history and 7z are done. That brings us up to date. I expect to test new code with production of many small files, as previously discussed on this list, starting within the next few days. This test will be for en wikipedia only, as that's the dump that's hardest to run to completion. The results might be a perfectly good dump, or not. Even if they are, I do not plan to try running en wikipedia dumps twice a month, so don't get your hopes up. (Who would process all that data every two weeks anyways?) Ariel

3 3

Category pages in enwiki-latest-pages-articles.xml
by Imre Kovács 24 Aug '11

24 Aug '11

Hello everyone, I hope I'm not disturbing you too much, I have the following question: I'm considering to download the enwiki-latest-pages-articles.xml, but I need to know if this contains enough information to rebuild the category structure (parent categories, subcategories, including the Category:Contents, etc.). Does the dump include the category pages or only the articles? Thank you very much, Imre

2 1

← Newer
1
...
58
59
60
61
62
63
64
...
72
Older →

Jump to page:

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Xmldatadumps-l