I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001!
This is exciting, because there is lots of article history in here which was assumed to be lost forever.
I've long been interested in Wikipedia's history, and I've tried in the past to locate such backups. I asked various people who might have had one. I had given up hope.
The history of particularly old Wikipedia articles, as seen in the present Wikipedia database, is incomplete, due to UseMod's policy of deleting old revisions of pages after about a month. The script which Brion wrote to import the article histories from UseMod to MediaWiki only fetched those revisions which hadn't been purged yet.
I didn't want to believe that those revisions had been lost forever, and I even opened the UseMod source code and stared forlornly at the unlink() call. What I (and Brion before) missed is that UseMod appends a record of every change made to two files, called diff_log and rclog. In these two files is a record of every change made to Wikipedia from January 15 to August 17, 2001.
I've put the two log files up on the web, at:
http://noc.wikimedia.org/~tstarling/wikipedia-logs-2001-08-17.7z
The 7-zip archive is only 8.4MB -- much more manageable than today's backups.
rclog contains IP addresses. The UseMod software made IP addresses of logged-in users public, so the people who made these edits had no expectation that their IP address would be kept private. That, coupled with the passage of time, makes me think that no harm to user privacy can come from releasing these files.
-- Tim Starling
That's fantastic news, and just in time for the 10th anniversary too, when I'm sure the early days of Wikipedia will be in the limelight. Great find, Tim!
Would it be at all possible to import these into the current system? I know someone was importing edits from the Nostalgia wiki. It would be wonderful to finally have a complete article history.
Pete / the wub
On 14 December 2010 15:54, Tim Starling tstarling@wikimedia.org wrote:
I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001!
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
Hi,
I am thinking of recommending a wiki database to a research project planned at Erfurt University. The group I have to advise is planning to edit late 17th and early 18th century letters of the "republic of letters" with the aim to reconstruct the flow of ideas and the personal networks that generated this flow. A wiki should be a superb tool for the editing process the project will have to get through. Yet I am more interested in tools we would later on use to analyse our data (we will probably create pages of individual letters, other pages on authors and topics, and, of course, categories etc.).
My question is now: I have seen exploits (yet never taken any notes) that analysed wikis and gave network structures of the interrelated pages and category trees. One such thing was shown here only recently:
http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2010-11-29/News_an...
...yet the digest given here would be too vague for our purposes. We would probably have to plan the entire wiki in a way that we could get definite pictures of the development of 17th century intellectual networks (how do they spread on the European map? Who is communicating with whom? Who is playing what role in the process?), and of the flow of topics within these networks.
Ideas of who would provide technical solutions and give advice on how to create such a wiki in a manner that it can be analysed fruitfully would be most welcome,
regards Olaf Simons
Gotha Research Centre, Germany ...and Germany's wikipedia
Hi Olaf,
This would be a good WikiProject within Wikisource, or on top of Wikisource.
Do you have scans of the letters?
http://en.wikipedia.org/wiki/Wikisource
Wikisource is already set up to manage the transcription and presentation of the letters, pages about authors, etc., and the community will pitch in with setting up your data.
You can focus on the linking between texts, analysis, etc.
The wiki-research-l list may be of interest to you.
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
On Wed, Dec 15, 2010 at 5:15 AM, Olaf Simons olaf.simons@pierre-marteau.com wrote:
Hi,
I am thinking of recommending a wiki database to a research project planned at Erfurt University. The group I have to advise is planning to edit late 17th and early 18th century letters of the "republic of letters" with the aim to reconstruct the flow of ideas and the personal networks that generated this flow. A wiki should be a superb tool for the editing process the project will have to get through. Yet I am more interested in tools we would later on use to analyse our data (we will probably create pages of individual letters, other pages on authors and topics, and, of course, categories etc.).
regards Olaf Simons
Gotha Research Centre, Germany ...and Germany's wikipedia
It's quite interesting that this topic has surfaced. The applications of such software might be of great interest in many areas. Some of those applications seem so powerful that it seems likely that this might be already well developed. The application mentioned in the opening of this thread concerns "late 17th and early 18th century letters of the "republic of letters" with the aim to reconstruct the flow of ideas and the personal networks that generated this flow" and the "tools we would later on use to analyze our data." Reference was made to analysis of wikis that "gave network structures of the interrelated pages and category trees" while recognizing the need to go much further, in order to "get definite pictures of the development of 17th century intellectual networks (how do they spread on the European map? Who is communicating with whom? Who is playing what role in the process?), and of the flow of topics within these networks."
Consider now a different object of study: foreign diplomatic relations, drug trafficking (no pun intended), global warfare development, political intrigue or, at a smaller scale, organizational intrigue. From a historical point of view, the results might provide great depth of knowledge. In real time, as the events unfold, this could be a powerful tool to understand how things evolve in a certain direction.
The Wikimedia projects' power structure is definitely a serious candidate for such analysis.
Sincerely,
Virgilio A. P. Machado (Vapmachado)
I have to say this is super cool. It's like digging up a time capsule right before the 10th anniversary. One of my favorite early edits:
"This is the new WikiPedia! The idea here is to write a complete encyclopedia from scratch, without peer review process, etc. Some people think that this may be a hopeless endeavor, that the result will necessarily suck. We aren't so sure. So, let's get to work!"
-Chad
This is fantastic, and the timing could not be better.
If anyone finds anything noteworthy, please add it to the timeline of Wikipedia that we're building at the 10th anniversary wiki,[1] as well as the other tools for cataloging interesting tidbits from our history.[2]
1. http://ten.wikipedia.org/wiki/Wikipedia_timeline 2. http://ten.wikipedia.org/wiki/Share
On Tue, Dec 14, 2010 at 8:11 AM, Chad innocentkiller@gmail.com wrote:
I have to say this is super cool. It's like digging up a time capsule right before the 10th anniversary. One of my favorite early edits:
"This is the new WikiPedia! The idea here is to write a complete encyclopedia from scratch, without peer review process, etc. Some people think that this may be a hopeless endeavor, that the result will necessarily suck. We aren't so sure. So, let's get to work!"
-Chad
Great news indeed!
Now I can finally figure out when my first edit was :-)
Magnus
Tim,
wonderful news! Thank you for making them publicly available!
Of course I immediately downloaded them, and I must have a look at them later this week. Though they are from before I became active (2003), I am very curious whether the articles in these files still exist, and how much they have changed.
teun spaans
On 12/14/2010 7:54 AM, Tim Starling wrote:
I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001!
I guess producing database dumps was easier in those days. Seriously though, this is absolutely fantastic news!
--Michael Snow
Wow, Tim. Just wow!
Is it just me who sees NYT carrying a headline, "On eve of 10th anniversary, Wikipedia developers turn up earliest records" ?
FT2
Hi;
Thanks Tim. Congratulations.
Is Wikipedia:UuU[1] now out-of-date?
Regards, emijrp
[1] http://en.wikipedia.org/wiki/Wikipedia:UuU
AWESOME. This is so cool. I've copied the research list too, since there are many Wikipedia historians that will be eager to see the older versions.
I hope we can get them up in a browsable way, like nostalgia.wikipedia.org!
-- phoebe
This is definitely a tremendous asset leading up to our big bday in January. I hope we can extract and post some of the real gems.
Thanks for the resourcefulness and the sharing, Tim.
On Dec 14, 2010, at 10:04 AM, phoebe ayers wrote:
AWESOME. This is so cool. I've copied the research list too, since there are many Wikipedia historians that will be eager to see the older versions.
I hope we can get them up in a browsable way, like nostalgia.wikipedia.org!
-- phoebe
Would prefer on its own wiki as this is comprehensive up to a given date. Maybe January2001.wikipedia.org -- immediate impact.
(DNS software cannot handle 2001.wikipedia.org)
FT2
On Tue, Dec 14, 2010 at 7:54 AM, Tim Starling tstarling@wikimedia.org wrote:
I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001!
This is exciting, because there is lots of article history in here which was assumed to be lost forever.
Wow, this is really, really amazing! I'm not sure just how you avoided having a heart attack after seeing this:
HomePage|979586833
1c1
< Describe the new page here.
---
> This is the new WikiPedia!
Great work!
Rob
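For anyone who wants to date entries like the one in the excerpt above, here is a minimal sketch. It assumes that each entry header has the form PageName|number and that the number is a Unix timestamp in seconds; that reading is inferred from this single excerpt, not confirmed against the UseMod source. The function name parse_entry_header is made up for illustration.

```python
from datetime import datetime, timezone


def parse_entry_header(header: str):
    """Split a 'PageName|timestamp' header into (page name, UTC datetime).

    Assumption: the part after the last '|' is a Unix timestamp in seconds.
    """
    # rpartition splits on the last '|', so a page name containing '|'
    # would still be kept intact.
    page, _, stamp = header.rpartition("|")
    when = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
    return page, when


if __name__ == "__main__":
    page, when = parse_entry_header("HomePage|979586833")
    print(page, when.isoformat())
```

Under that assumption, the HomePage entry quoted above decodes to 15 January 2001, 19:27:13 UTC, which is consistent with the logs starting on January 15.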
On 14.12.2010 16:54, Tim Starling wrote:
I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001!
That's wonderful news. Is this for enWP only or were all languages in one database back then?
Ciao Henning
On Tue, Dec 14, 2010 at 8:36 PM, Henning Schlottmann h.schlottmann@gmx.net wrote:
That's wonderful news. Is this for enWP only or were all languages in one database back then?
There was only English back in the day...
Hi Magnus,
On 14.12.2010 22:35, Magnus Manske wrote:
There was only English back in the day...
Not true. The first other languages were introduced on March 15 and could be part of this archive if the different Wikipedias were in one database under UseMod.
Do you remember how this worked?
Ciao Henning
On Tue, Dec 14, 2010 at 9:49 PM, Henning Schlottmann h.schlottmann@gmx.net wrote:
Hi Magnus,
On 14.12.2010 22:35, Magnus Manske wrote:
There was only English back in the day...
Not true. The first other languages were introduced on March 15 and could be part of this archive if the different Wikipedias were in one database under UseMod.
My earliest recorded entry in de.wikipedia dates from September 2001 (and I have a low two-digit user ID, which was created upon the switch to MediaWiki), so there seem to be some versions missing indeed. Do you know the oldest preserved edit on de.wp?
Do you remember how this worked?
AFAIR, every language had its own UseMod setup. My import script only took the last version; Brion later wrote one that filled in the previous ones from the stored diffs.
Magnus
On 14.12.2010 23:47, Magnus Manske wrote:
On Tue, Dec 14, 2010 at 9:49 PM, Henning Schlottmann wrote:
Not true. The first other languages were introduced on March 15 and could be part of this archive if the different Wikipedias were in one database under UseMod.
My earliest recorded entry in de.wikipedia dates from September 2001 (and I have a low two-digit user ID, which was created upon the switch to MediaWiki), so there seem to be some versions missing indeed. Do you know the oldest preserved edit on de.wp?
Local lore claims it is your edit http://de.wikipedia.org/w/index.php?title=Polymerase-Kettenreaktion&oldid=2613 in Polymerase-Kettenreaktion. But I never checked that.
Do you remember how this worked?
AFAIR, every language had its own UseMod setup. My import script only took the last version; Brion later wrote one that filled in the previous ones from the stored diffs.
That's unfortunate but only a small dent in the wonderful news that Wikipedia has its very first (English) edits back.
Ciao Henning
On 15/12/10 07:36, Henning Schlottmann wrote:
On 14.12.2010 16:54, Tim Starling wrote:
I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001!
That's wonderful news. Is this for enWP only or were all languages in one database back then?
Just English, unfortunately.
You may find this interesting:
http://web.archive.org/web/20020817032335/www.nupedia.com/pipermail/intlwiki-l.mbox/intlwiki-l.mbox
-- Tim Starling
I hope some of you may have seen/discussed these pages (as well as the connected pages):
http://web.archive.org/web/20010418152404/www.nupedia.com/
up to
http://web.archive.org/web/20030730075209/http://www.nupedia.org/
Of course, the domain name then was nupedia.org.
-vp
On Wed, Dec 15, 2010 at 02:30, Tim Starling tstarling@wikimedia.org wrote:
On 15/12/10 07:36, Henning Schlottmann wrote:
On 14.12.2010 16:54, Tim Starling wrote:
I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001!
That's wonderful news. Is this for enWP only or were all languages in one database back then?
Just English, unfortunately.
You may find this interesting:
http://web.archive.org/web/20030318055654/http://nupedia.com/pipermail/inter...
http://web.archive.org/web/20020817032335/www.nupedia.com/pipermail/intlwiki...
-- Tim Starling
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
And here is the first http://wikipedia.com archive link available at web archive.
http://web.archive.org/web/20010727112808/http://www.wikipedia.org/
Is there any database backup of Nupedia? Or were the articles posted as HTML pages?
ViswaPrabha (വിശ്വപ്രഭ), 15/12/2010 01:03:
And here is the first http://wikipedia.com archive link available at web archive.
http://web.archive.org/web/20010727112808/http://www.wikipedia.org/
No, the first is http://web.archive.org/web/20010331173908/http://www.wikipedia.com/
Tim Starling, 15/12/2010 00:30:
You may find this interesting:
Uh, didn't know anything about it.
http://web.archive.org/web/20020817032335/www.nupedia.com/pipermail/intlwiki-l.mbox/intlwiki-l.mbox
Isn't intlwiki-l completely archived on gmane? http://blog.gmane.org/gmane.science.linguistics.wikipedia.international If not, we could import this mbox.
Nemo
Good news from Wiki-research-l in case you're not subscribed to it...
Nemo
-------- Original Message --------
Subject: Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered
Date: Thu, 16 Dec 2010 13:53:14 -0500
From: Joseph Reagle
I have the first 10K edits up, reconstructed into their respective pages, at: http://cyber.law.harvard.edu/~reagle/wp-redux/
-------- Original Message --------
Subject: Re: [Wiki-research-l] [WikiEN-l] Old Wikipedia backups discovered
Date: Fri, 17 Dec 2010 00:03:00 +1100
From: Tim Starling
On 16/12/10 23:10, Joseph Reagle wrote:
On Wednesday, December 15, 2010, Tim Starling wrote:
There were some changes made to the page text that weren't represented in diff_log, specifically changing certain camel-case links to free links.
It appears my problems were related to some CR/LF issues not round-tripping between diff and patch, but I hope to be able to address that. And yes, in addition to some of the CamelCase issues, I expect another problem is that if a page is blanked, "Describe the new page here." will reappear outside of the diff_log.
I don't think that will be a problem. But there are other problems that I've encountered.
UseMod had a deletion feature. It turns out to be easy enough to skip deleted pages, since they don't have a corresponding entry in rclog.
It also had an admin-only rename feature, which optionally fixed links in all pages. This accounts for the free link changes I was seeing earlier. And it had a link replacement feature which could be invoked without a page move. These features were rarely used because of the arcane interface; people usually just moved pages by copying and pasting. But during the free-link conversion, a lot of pages were renamed using the admin-only feature.
All these admin-only features were unlogged, but it turns out to be possible to reconstruct page moves, because when a page was moved, its name was updated in rclog but not in diff_log. By finding the first diff_log entry with the new name, you can roughly work out when the page moves were done.
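The comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual import script: it assumes both logs have already been parsed into lists of `(timestamp, page_name)` pairs and that a timestamp is enough to match an rclog entry to its diff_log entry. `infer_moves` and that data layout are hypothetical.

```python
def infer_moves(rclog, diff_log):
    """Guess page moves from name mismatches between the two logs.

    rclog, diff_log: lists of (timestamp, page_name) pairs (assumed layout).
    Returns {(old_name, new_name): approximate_move_time}.
    """
    # Name recorded in rclog for each edit (updated on a move).
    rc_name_at = {ts: name for ts, name in rclog}

    # Earliest diff_log entry for each name (never updated on a move).
    first_seen = {}
    for ts, name in sorted(diff_log):
        first_seen.setdefault(name, ts)

    moves = {}
    for ts, old_name in sorted(diff_log):
        new_name = rc_name_at.get(ts)
        if new_name is None:
            continue  # no rclog entry: treat the page as deleted and skip it
        if new_name != old_name:
            # The first diff_log entry under the new name gives a rough
            # time for the move; None if the page was never edited again.
            moves[(old_name, new_name)] = first_seen.get(new_name)
    return moves

rclog = [(100, "FreeLink"), (200, "FreeLink"), (300, "Other")]
diff_log = [(100, "CamelCase"), (200, "FreeLink"), (300, "Other"), (400, "Gone")]
print(infer_moves(rclog, diff_log))
```

In this toy data the edit at time 100 appears as "CamelCase" in diff_log but as "FreeLink" in rclog, so the page is taken to have been moved at roughly time 200, and the entry at time 400 is skipped because it has no rclog counterpart.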
Anyway, I'm developing a script which will import the dump into a modified MediaWiki instance, the idea being that I can then export XML from it. Once it works, I'll upload the XML to somewhere. I'm not sure when that will be.
-- Tim Starling
wikimedia-l@lists.wikimedia.org