Hello
There are many different archives available for download from Wikipedia, so I am
not sure which one to get.
Please tell me which one contains a list of page titles with their categories.
For example:
Sit-Verb
Cat-Noun-Animal
Him-Pronoun
Dog-Noun-Animal
Happy-Adjective
Washington-Noun-Place
Metallica-Noun-Band
What is the URL to the archive that I need to download to get this kind of
information?
I don't even care what format it is in. XML, CSV, or SQL is fine, as long as
it's parseable.
aaron(a)svn.wikimedia.org wrote:
> Revision: 34543
> Author: aaron
> Date: 2008-05-10 00:48:07 +0000 (Sat, 10 May 2008)
>
> Log Message:
> -----------
> Add a way to do different JOINs with $tables
Whoo hoo! This'll make implementing JOINs in the API much easier, and
provides a way to combine FORCE INDEX with JOINs, which used to be
impossible to do portably.
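
To illustrate the idea (in Python rather than MediaWiki's PHP, and not the actual code from the revision), here is a minimal sketch of a query builder that takes per-table join conditions and per-table index hints side by side. The function name `build_select`, its parameters, and the example table and column names are all assumptions made for this sketch.

```python
def build_select(tables, fields, conds=(), join_conds=None,
                 use_index=None, index_hints=True):
    """Build a SELECT string from a table list plus per-table options.

    join_conds maps a table name to (join_type, on_clause).
    use_index maps a table name to an index name.
    index_hints=False drops the hints for backends without FORCE INDEX.
    All of these names are hypothetical, chosen for this sketch only.
    """
    join_conds = join_conds or {}
    use_index = use_index or {}
    plain, joined = [], []
    for table in tables:
        ref = table
        # The index hint is attached to the table reference itself, so it
        # works the same whether the table is joined or listed plainly.
        if index_hints and table in use_index:
            ref += f" FORCE INDEX ({use_index[table]})"
        if table in join_conds:
            join_type, on_clause = join_conds[table]
            joined.append(f"{join_type} {ref} ON ({on_clause})")
        else:
            plain.append(ref)
    sql = f"SELECT {','.join(fields)} FROM {','.join(plain)}"
    if joined:
        sql += " " + " ".join(joined)
    if conds:
        sql += " WHERE " + " AND ".join(conds)
    return sql


query = build_select(
    ["recentchanges", "page"],
    ["rc_id", "page_title"],
    conds=["rc_namespace=0"],
    join_conds={"page": ("LEFT JOIN", "rc_cur_id=page_id")},
    use_index={"recentchanges": "rc_timestamp"},
)
print(query)
```

Because the hint and the join condition are both keyed by table name, a backend that does not understand FORCE INDEX can simply be called with `index_hints=False` without touching the calling code.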
Roan Kattouw (Catrope)
[courtesy copy to foundation-l, though I suggest that discussion, if any, be
centralised on wikitech-l]
Hi all,
the search index for the mailing list archives was last rebuilt in January.
Now, after having asked about this several times here and in other
places, I have learnt (and obviously had to accept) that rebuilding the search
index is quite a resource-intensive process which has resulted in crashes.
To put it bluntly, I dare suggest from a non-technical POV that the "htdig"
(that's the name, isn't it?) experiment has failed. If we can only update
our search index every 6 months or so, it is pointless to have it.
Instead, I suggest that http://lists.wikimedia.org/robots.txt be modified so as
to allow Google (and other search engines) to crawl /pipermail/ again. I do
not really see the privacy issues with this: Nabble, Gmane, etc. are
Google-searchable as well, and I really don't see the point in barring Google
from our own archive.
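
As a rough way to check the effect of such a change, the following Python sketch uses the standard library's `urllib.robotparser` to test whether a crawler may fetch a /pipermail/ URL under two rule sets. The two robots.txt bodies are assumptions written for this example; they are not the actual contents of the file on lists.wikimedia.org.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rule set that bars every crawler from the archive.
BLOCKING = """User-agent: *
Disallow: /pipermail/
"""

# Hypothetical rule set with an empty Disallow, which permits everything.
PERMISSIVE = """User-agent: *
Disallow:
"""

ARCHIVE_URL = "http://lists.wikimedia.org/pipermail/wikitech-l/"

for label, rules in (("blocking", BLOCKING), ("permissive", PERMISSIVE)):
    print(label, is_allowed(rules, "Googlebot", ARCHIVE_URL))
```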
To be very honest, I do not even remember why we decided to bar
Google from http://lists.wikimedia.org/pipermail.
Was it due to privacy concerns? If so, which ones, and why is
lists.wikimedia.org as an archive different from Nabble/Gmane?
Thanks,
Michael
--
Michael Bimmler
mbimmler(a)gmail.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
catrope(a)svn.wikimedia.org wrote:
> * Removed FORCE INDEX (rc_timestamp) from ApiQueryRecentchanges: it's
> nigh impossible to integrate with addJoin() and it doesn't seem to be
> necessary anyway (my MySQL instance automatically chooses
> rc_timestamp)
Got hit by a large rash of filesorts bringing a couple of DB servers to
a crawl due to this.
Since it's mixed in with a large number of other changes, I just
reverted the entire API directory to r34430 for now (in r34527).
- -- brion vibber (brion @ wikimedia.org)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkgkkmwACgkQwRnhpk1wk44IcACeNrBVBGVSJlntItHlsHFD3Dge
bqcAnRAsDGoWWzXml4mfyHJO21vB2eLP
=ieh5
-----END PGP SIGNATURE-----
On Thu, May 8, 2008 at 2:33 PM, <catrope(a)svn.wikimedia.org> wrote:
> Revision: 34431
> Author: catrope
> Date: 2008-05-08 12:33:20 +0000 (Thu, 08 May 2008)
>
> Log Message:
> -----------
> API:
> * Added ApiQueryBase::addJoin() which provides a cleaner interface to construct JOIN queries. Behind the scenes this still uses the old, ugly way, but it'll be easy to rewrite when/if the Database class gets its own function for JOINs
> * Used addJoin() in query modules where necessary
> * Removed FORCE INDEX (rc_timestamp) from ApiQueryRecentchanges: it's nigh impossible to integrate with addJoin() and it doesn't seem to be necessary anyway (my MySQL instance automatically chooses rc_timestamp)
>
I've heard rumours on #wikimedia-tech that MySQL 4 sometimes chooses
the wrong index unless forced. So I don't know whether this actually
works.
Bryan
> Revision: 34388
> `ug1`.`ug_user`=`user_id`";
Ugh. Please don't hardcode backticks: not only are they unneeded here, but
they horribly break any chance we have of cross-database compatibility.
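
A minimal sketch of the alternative, written in Python for illustration and not taken from MediaWiki: either leave simple identifiers unquoted, or route all quoting through one helper that knows the quote character of each backend. The helper names and the dialect table below are assumptions for this example.

```python
# Identifier quote character per backend (hypothetical lookup table).
QUOTE_CHARS = {"mysql": "`", "postgres": '"', "sqlite": '"'}


def quote_identifier(name, dialect):
    """Quote one identifier for the given dialect, doubling embedded quotes."""
    q = QUOTE_CHARS[dialect]
    return q + name.replace(q, q + q) + q


def qualified(alias, column, dialect):
    """Build a quoted alias.column reference for the given dialect."""
    return f"{quote_identifier(alias, dialect)}.{quote_identifier(column, dialect)}"


for dialect in ("mysql", "postgres"):
    # Same join condition as in the quoted diff, built without hardcoding.
    left = qualified("ug1", "ug_user", dialect)
    print(f"{left}={quote_identifier('user_id', dialect)}")
```

With the quoting in one place, the calling code never contains a literal backtick, so the same condition can be emitted for MySQL or PostgreSQL.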
--
Greg Sabino Mullane greg(a)turnstep.com
PGP Key: 0x14964AC8 200805071519
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
nad(a)svn.wikimedia.org wrote:
> Add SQLite database class
Cool! :D
Couple notes after a quick pass over it...
> + if ("$wgSQLiteDataDir" == '') $wgSQLiteDataDir = dirname($_SERVER['DOCUMENT_ROOT']).'/data';
> + if (!is_dir($wgSQLiteDataDir)) mkdir($wgSQLiteDataDir,0700);
This default sounds a bit insecure, as the raw database files would be
exposed to the web unless PHP's running as a different user from the
static web server.
That means deleted data, user email addresses, password hashes, etc.
would be exposed to download.
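
One possible safeguard, sketched here in Python rather than PHP, is to refuse a data directory that resolves to a location inside the web server's document root. The function name and the example paths are hypothetical, and this check does not cover other ways a directory can be exposed (aliases, misconfigured virtual hosts, and so on).

```python
import os


def is_inside(path, root):
    """Return True if path resolves to root itself or a location beneath it."""
    path = os.path.realpath(path)
    root = os.path.realpath(root)
    # commonpath compares whole path components, so "/srv/site/public-data"
    # is not mistaken for a child of "/srv/site/public".
    return os.path.commonpath([path, root]) == root


DOCUMENT_ROOT = "/srv/site/public"  # hypothetical document root

for candidate in ("/srv/site/public/data", "/srv/private/data"):
    verdict = "web-exposed" if is_inside(candidate, DOCUMENT_ROOT) else "outside docroot"
    print(candidate, "->", verdict)
```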
> + /**
> + * Use MySQL's naming (accounts for prefix etc) but remove surrounding backticks
> + */
> + function tableName($name) {
> + $t = parent::tableName($name);
> + if (!empty($t)) $t = substr($t,1,-1);
I believe this will produce bad output for anything using an explicit
DB, e.g. `dbname`.`prefix_table`.
Dunno whether that'd actually work here anyway, though. :)
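
To make the problem concrete, here is a small Python sketch (PHP's `substr($t,1,-1)` corresponds to the slice `t[1:-1]`). `strip_naive` mirrors the quoted approach; `strip_backticks` is a hypothetical alternative that unquotes each dot-separated part, and assumes that identifiers themselves never contain a dot.

```python
def strip_naive(name):
    """Drop the first and last character, like substr($t, 1, -1)."""
    return name[1:-1] if name else name


def strip_backticks(name):
    """Remove surrounding backticks from each dot-separated part."""
    parts = []
    for part in name.split("."):
        if len(part) >= 2 and part.startswith("`") and part.endswith("`"):
            part = part[1:-1]
        parts.append(part)
    return ".".join(parts)


for quoted in ("`prefix_table`", "`dbname`.`prefix_table`"):
    print(quoted, "->", strip_naive(quoted), "|", strip_backticks(quoted))
```

For the qualified name, the naive slice leaves the inner backticks in place, while the per-part version yields a plain dotted name.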
- -- brion vibber (brion @ wikimedia.org)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iEYEARECAAYFAkgiQRsACgkQwRnhpk1wk45EZQCg0OKYyarZJ7lTXgqn28W9/YHU
4ZQAniCaE+x+dNhh6E8kV+sIj2LsKv9u
=Wd1O
-----END PGP SIGNATURE-----