Hello all.
I'm putting the finishing touches on a script that exports Wikipedia in a format that can be directly imported into Yahoo!'s (and others') search engine. It's nothing pretty (in fact, it's my first PHP), but I'd be grateful if two things would happen:
* Someone would look at it (I attached it) and say "this sucks because..."
* Someone would give me CVS write access, so I could add the file to the repository.
Eventually, this should be on a cron, updating periodically so that the search engines which use these results stay up-to-date. I saved a copy of the spec here:
http://www.bomis.com/idif/spec.pdf
I didn't come up with any good way of getting keywords for a given page. Using the linked page titles was a suggestion. Other ideas?
Thanks.
Another thing: our HTML output is not yet guaranteed to be valid XML. This could cause a problem for literal inclusion of the HTML content into the XML output.
-- brion vibber (brion @ pobox.com)
I figured as much. The way the spec is worded, I'm not sure it needs to be valid XML. They basically say that anything aside from "</CONTENT>" can be between the CONTENT tags.
Jason
Brion Vibber wrote:
Another thing: our HTML output is not yet guaranteed to be valid XML. This could cause a problem for literal inclusion of the HTML content into the XML output.
-- brion vibber (brion @ pobox.com)
On Apr 1, 2004, at 18:20, Jason Richey wrote:
I'm putting the finishing touches on a script that exports Wikipedia in a format that can be directly imported into Yahoo!'s (and others') search engine. It's nothing pretty (in fact, it's my first PHP), but I'd be grateful if two things would happen:
- Someone would look at it (I attached it) and say "this sucks because..."
A couple notes...
Unless a charset is explicitly set, XML is assumed to be Unicode (auto-detected between UTF-8, UTF-16 big-endian, and UTF-16 little-endian). The output should either be converted to UTF-8 or marked as ISO-8859-1 like this: <?xml version="1.0" encoding="ISO-8859-1" ?>
Actually it might be better to mark it as Windows-1252 rather than ISO-8859-1, since sometimes bad Windows-specific characters sneak in and could make the feed invalid if included literally.
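A minimal sketch of the conversion approach, assuming the iconv extension is available; treating the input as Windows-1252 (a superset of ISO-8859-1) also catches the stray Windows-specific characters mentioned above:

<?php
// Rough sketch, not from the attached script: convert page HTML to UTF-8
// before writing it into the feed. Every Windows-1252 byte maps to a real
// Unicode character, so smart quotes and the like come out as proper
// characters rather than C1 control codes.
function feed_to_utf8($html) {
    return iconv('WINDOWS-1252', 'UTF-8', $html);
}

// The feed can then use the default encoding in its declaration:
echo '<?xml version="1.0" encoding="UTF-8" ?>' . "\n";
?>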
Page titles may contain ampersands and some other funky chars, and need to be escaped in a URL; you can use urlencode() to get it URL-encoded and htmlspecialchars() to make it safe for literal inclusion in XML. The $title_text should also be run through htmlspecialchars().
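For example (a hedged sketch; only $title_text comes from the script itself, the other variable and the element names are placeholders):

<?php
// Sketch: build a safe URL and a safe display title for one article.
// $title_url (the underscore/DB form of the title) is an assumed variable
// name; the element names below are placeholders, not taken from the spec.
$url = 'http://en.wikipedia.org/wiki/' . urlencode($title_url);

// htmlspecialchars() escapes &, <, > and quotes so both values can be
// dropped literally into the XML feed.
printf("<URL>%s</URL>\n<TITLE>%s</TITLE>\n",
       htmlspecialchars($url),
       htmlspecialchars($title_text));
?>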
Otherwise it looks like it ought to be more or less functional, if a bit hard-coded in some places. (I haven't tested it yet.)
I didn't come up with any good way of getting keywords for a given page. Using the linked page titles was a suggestion. Other ideas?
I believe the keyword code Magnus recently added to CVS grabs additional keywords from the set of links in the page (which you could pull out of a join on links and cur). That's one possibility...
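A rough sketch of that join, with the caveat that the links-table column names and their exact semantics are assumptions against the 2004 schema and should be checked before use:

<?php
// Sketch only: fetch titles of pages linked from the current article to use
// as keywords. Assumes $cur_id holds the current article's cur_id, l_from
// holds the linking page's cur_id and l_to the target's cur_id; some schema
// versions store a title in l_from instead, so verify against the live DB.
$sql = "SELECT c.cur_title AS keyword
        FROM links AS l, cur AS c
        WHERE l.l_from = " . intval($cur_id) . "
          AND c.cur_id = l.l_to
        LIMIT 20";
$res = mysql_query($sql);
$keywords = array();
while ($row = mysql_fetch_object($res)) {
    $keywords[] = str_replace('_', ' ', $row->keyword);
}
?>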
-- brion vibber (brion @ pobox.com)
Jason Richey wrote:
- Someone would look at it (I attached it) and say "this sucks because..."
OK, since you asked for it... :)
$sql = "SELECT cur_title as title from cur where cur_namespace=0";
This query sucks big time.
Do you know what this does? This retrieves the titles of ALL ARTICLES in Wikipedia. Do you know how many there are? ...
The Main Page states 239180, and that's only articles that meet certain criteria...
$data = getPageData($s->title);
It seems that getPageData() retrieves the text of a page. In other words, it performs yet another database query. And you're calling that FOR EVERY ARTICLE in Wikipedia!
I'm afraid I don't understand the purpose of the script. It seems to me that it is generating one ridiculously huge file that contains all of Wikipedia. What use would such a file be to anyone, even Yahoo?
I stress I don't really understand the purpose of the script, nor do I know exactly what Yahoo!'s (or anyone else's) requirements are, but it would seem far more sensible to me to have several smaller files, each containing maybe at most 100 articles or at most 1 MB of data or so. Each file should then contain a list of cur_ids, and then you can easily check for each file whether any of the articles therein have changed since the last update.
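A minimal sketch of that idea (the per-record renderer and the file naming are hypothetical):

<?php
// Sketch of the chunking suggestion: write articles into numbered files of
// at most $chunk_size entries and remember which cur_ids went into which
// file, so a later run can rebuild only the chunks whose articles have
// changed. Assumes an open DB connection, as in the export script.
$chunk_size = 100;
$chunk_no   = 0;
$count      = 0;
$manifest   = array();   // chunk number => list of cur_ids
$out        = null;

$res = mysql_query("SELECT cur_id, cur_title FROM cur WHERE cur_namespace=0");
while ($row = mysql_fetch_object($res)) {
    if ($count % $chunk_size == 0) {
        if ($out) fclose($out);
        $chunk_no++;
        $out = fopen("feed-part-$chunk_no.xml", "w");
    }
    fwrite($out, makeIdifRecord($row->cur_id, $row->cur_title)); // hypothetical helper
    $manifest[$chunk_no][] = $row->cur_id;
    $count++;
}
if ($out) fclose($out);
// The manifest would be saved (e.g. serialize() to disk) and compared
// against cur_touched on the next run.
?>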
Of course, that's just a suggestion.
Greetings, Timwi
Timwi wrote:
$sql = "SELECT cur_title as title from cur where cur_namespace=0";
This query sucks big time.
Do you know what this does? This retrieves the titles of ALL ARTICLES in Wikipedia. Do you know how many there are? ...
Right. The purpose here is to make a friendly giant XML file to enable Yahoo (and presumably other like-minded whoevers) to grab a single giant document to study rather than having to crawl over the whole site. The purpose (from our point of view) is to keep search engines more up to date, since this file can be downloaded by them once per day or hour rather than a crawl taking weeks.
I'm afraid I don't understand the purpose of the script. It seems to me that it is generating one ridiculously huge file that contains all of Wikipedia. What use would such a file be to anyone, even Yahoo?
*nod* It's so they can do the same thing they would do with a crawl of the site, but virtually instantaneously.
I stress I don't really understand the purpose of the script, nor do I know exactly what Yahoo!'s (or anyone else's) requirements are, but it would seem far more sensible to me to have several smaller files, each containing maybe at most 100 articles or at most 1 MB of data or so. Each file should then contain a list of cur_ids, and then you can easily check for each file whether any of the articles therein have changed since the last update.
It does seem that rather than feeding them One Big File, we could feed them files of diffs or whatever. But that'd be more complex and require greater co-ordination. This at least has the virtue of simplicity.
It shouldn't run more than once per day at first. I'm not sure what their goals are with respect to how often they would *like* to receive it, but daily is a fine start.
--Jimbo
On Apr 2, 2004, at 00:53, Jimmy Wales wrote:
It shouldn't run more than once per day at first. I'm not sure what their goals are with respect to how often they would *like* to receive it, but daily is a fine start.
It would take hours just to run a complete dump, which would be the equivalent of a sizeable fraction of our total daily page views. (Best case might be 100ms per page for 240,000 pages =~ 6 hours 40 minutes)
If we're going to run something like this daily, some sort of incremental updates are a must, though we can probably get away with stuffing the saved data per page in a database or such and slurping it back out fairly quickly.
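One way the intermediate storage might look, sketched against a hypothetical cache table (none of these names exist in the schema):

<?php
// Hypothetical cache table:
//   CREATE TABLE idif_cache (
//     ic_cur_id  INT NOT NULL PRIMARY KEY,
//     ic_touched CHAR(14) NOT NULL,   -- cur_touched at render time
//     ic_xml     MEDIUMTEXT NOT NULL
//   );
// Only pages whose cur_touched no longer matches the cached value get
// re-rendered; everything else is slurped straight back out of the table.
$res = mysql_query(
    "SELECT c.cur_id, c.cur_title, c.cur_touched, ic.ic_touched AS cached
     FROM cur AS c LEFT JOIN idif_cache AS ic ON ic.ic_cur_id = c.cur_id
     WHERE c.cur_namespace = 0");
while ($row = mysql_fetch_object($res)) {
    if ($row->cached === $row->cur_touched) {
        continue;   // cached copy is still current
    }
    $xml = makeIdifRecord($row->cur_id, $row->cur_title);  // hypothetical renderer
    mysql_query(sprintf(
        "REPLACE INTO idif_cache (ic_cur_id, ic_touched, ic_xml)
         VALUES (%d, '%s', '%s')",
        $row->cur_id, $row->cur_touched, mysql_escape_string($xml)));
}
?>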
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
It would take hours just to run a complete dump, which would be the equivalent of a sizeable fraction of our total daily page views. (Best case might be 100ms per page for 240,000 pages =~ 6 hours 40 minutes)
If we're going to run something like this daily, some sort of incremental updates are a must, though we can probably get away with stuffing the saved data per page in a database or such and slurping it back out fairly quickly.
*nod* That's very interesting.
As of this weekend, assuming things go well, we will have geoffrin and suda as database servers, and there's no problem in the short run with having gunther (still semi-on-loan from bomis) continue to run a 3rd replicator. So perhaps gunther could be tasked with generating this daily, so as to make sure it isn't interfering with anything else.
Am I right in assuming that the load on the main db of having a replicator attached is quite small?
Although we are not a business, so that pageviews don't equal revenue, I still like to make decisions as if that were true. For the organization as a whole, with my grand long-term vision of the Wikimedia Foundation being an organization similar in size and scope to, say, the National Geographic Society or Consumers Union, that's not a bad approximation. More pageviews means more fame, which will ultimately mean more donations, more book sales, whatever.
Therefore, feeding data in a helpful way to major search engines is a very inexpensive form of advertising. It's worth doing, I think.
--Jimbo
Brion Vibber wrote:
On Apr 2, 2004, at 00:53, Jimmy Wales wrote:
It shouldn't run more than once per day at first. I'm not sure what their goals are with respect to how often they would *like* to receive it, but daily is a fine start.
It would take hours just to run a complete dump, which would be the equivalent of a sizeable fraction of our total daily page views. (Best case might be 100ms per page for 240,000 pages =~ 6 hours 40 minutes)
If we're going to run something like this daily, some sort of incremental updates are a must, though we can probably get away with stuffing the saved data per page in a database or such and slurping it back out fairly quickly.
I can't see where the proposal that some have made to have all the projects on one database would help this situation.
Ec
On Apr 2, 2004, at 10:14, Ray Saintonge wrote:
Brion Vibber wrote:
On Apr 2, 2004, at 00:53, Jimmy Wales wrote:
It shouldn't run more than once per day at first. I'm not sure what their goals are with respect to how often they would *like* to receive it, but daily is a fine start.
It would take hours just to run a complete dump, which would be the equivalent of a sizeable fraction of our total daily page views. (Best case might be 100ms per page for 240,000 pages =~ 6 hours 40 minutes)
If we're going to run something like this daily, some sort of incremental updates are a must, though we can probably get away with stuffing the saved data per page in a database or such and slurping it back out fairly quickly.
I can't see where the proposal that some have made to have all the projects on one database would help this situation.
I can't see how it would hurt, either. In fact it doesn't seem to have any bearing on this at all.
-- brion vibber (brion @ pobox.com)
On Apr 2, 2004, at 00:35, Timwi wrote:
$sql = "SELECT cur_title as title from cur where cur_namespace=0";
This query sucks big time.
Do you know what this does? This retrieves the titles of ALL ARTICLES in Wikipedia.
That's kinda the point, yeah. It might be better to skip redirects, though; otherwise they should be handled in some distinct way.
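For the redirect case, a small tweak to the query would do it, using the cur_is_redirect flag:

<?php
// Sketch: same title query, but with redirects filtered out.
$sql = "SELECT cur_title AS title
        FROM cur
        WHERE cur_namespace = 0 AND cur_is_redirect = 0";
?>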
It seems that getPageData() retrieves the text of a page. In other words, it performs yet another database query. And you're calling that FOR EVERY ARTICLE in Wikipedia!
That's obviously a bit inefficient, but yes. Incremental updates of only changed pages could hypothetically lead to faster output generation after the first run, though this would require some intermediate storage (since we don't yet have a running parser cache).
I'm afraid I don't understand the purpose of the script. It seems to me that it is generating one ridiculously huge file that contains all of Wikipedia. What use would such a file be to anyone, even Yahoo?
(It would produce a series of files up to about 12.5 megabytes in length, not one big file.)
A text base without the unnecessary UI elements could improve search results, and I suppose can be kept more complete more easily than constant spidering of a 200k+ page site. *shrug* If that's the data format they want, hey fine, though having to download the entire set of a couple hundred megabytes for every update doesn't sound ideal.
Jason, would each output need to be self-contained, or can they accept incremental updates in IDIF? How often would they pull updates?
-- brion vibber (brion @ pobox.com)
On Fri, 2 Apr 2004, Brion Vibber wrote:
On Apr 2, 2004, at 00:35, Timwi wrote:
$sql = "SELECT cur_title as title from cur where cur_namespace=0";
This query sucks big time.
Do you know what this does? This retrieves the titles of ALL ARTICLES in Wikipedia.
That's kinda the point, yeah. It might be better to skip redirects, though; otherwise they should be handled in some distinct way.
It seems that getPageData() retrieves the text of a page. In other words, it performs yet another database query. And you're calling that FOR EVERY ARTICLE in Wikipedia!
Is it possible to leverage the already existing periodic database dump, for example importing it into some machine not live on the web and generating the necessary XML dumps and diffs there? If it's only working on the cur table, it's not even that heavy on memory usage.
Alfio
MySQL has a facility for distributing databases. It sounds as though you all are trying to solve a problem at the application level, when it is a data distribution, caching and integrity problem. And this is exactly the kind of problem that databases can solve.
If the wiki database is synchronized out, clients of that synchronization can do stuff for Yahoo or do whatever they want.
MySQL handles all this "changed since" checking and such. It also does it more efficiently than a PHP script. So, what problem are you all trying to solve?
- ray
On Apr 2, 2004, at 1:19 AM, Alfio Puglisi wrote:
On Fri, 2 Apr 2004, Brion Vibber wrote:
On Apr 2, 2004, at 00:35, Timwi wrote:
$sql = "SELECT cur_title as title from cur where cur_namespace=0";
This query sucks big time.
Do you know what this does? This retrieves the titles of ALL ARTICLES in Wikipedia.
That's kinda the point, yeah. It might be better to skip redirects, though; otherwise they should be handled in some distinct way.
It seems that getPageData() retrieves the text of a page. In other words, it performs yet another database query. And you're calling that FOR EVERY ARTICLE in Wikipedia!
Is it possible to leverage the already existing periodic database dump, for example importing it into some machine not live on the web and generating the necessary XML dumps and diffs there? If it's only working on the cur table, it's not even that heavy on memory usage.
Alfio
Ray Kiddy wrote:
MySQL handles all this "changed since" checking and such. It also does it more efficiently than a PHP script. So, what problem are you all trying to solve?
We have another organization (Yahoo in this instance, but to make the problem more general, let's assume that other cases will arise) which we would like to accommodate by supplying them with a feed that is ready-made for their existing software. They have a specification for input which we can easily meet. We want to do so in the most efficient way.
Yahoo wants an XML file of a particular kind. It's not complex for us to generate it. We may wonder why they want this file, and of course based on feedback from here, either Jason or I will talk to them to propose something else if it seems helpful to do so. But, this is the format that they want, and we want them to have it.
--Jimbo
On Apr 3, 2004, at 1:38 PM, Jimmy Wales wrote:
Ray Kiddy wrote:
MySQL handles all this "changed since" checking and such. It also does it more efficiently than a PHP script. So, what problem are you all trying to solve?
We have another organization (Yahoo in this instance, but to make the problem more general, let's assume that other cases will arise) which we would like to accommodate by supplying them with a feed that is ready-made for their existing software. They have a specification for input which we can easily meet. We want to do so in the most efficient way.
But there are people at Yahoo who know MySQL. Jeremy Zawodny comes to mind. He works there, and his O'Reilly book 'High Performance MySQL' is coming out soon. The word "Replication" is in the title of the book, so they could probably even help set this up so as not to impact Wikipedia overall.
It will probably not be too difficult to come up with a feed oriented to them, but what happens when another organization wants the same feed? But, oh yes, in a slightly different form. And another. And another.
Yahoo wants an XML file of a particular kind. It's not complex for us to generate it. We may wonder why they want this file, and of course based on feedback from here, either Jason or I will talk to them to propose something else if it seems helpful to do so. But, this is the format that they want, and we want them to have it.
Again, the people you might be talking to might not be the database people, so maybe they think that the solution they need is not a database solution.
But you can let them replicate the database. It will be as efficient as any application-level solution and they can do anything they want with that data, spiced to their tastes.
Respectfully, I would say that if there are two things that wiki people could spend their time on:
1) growing the wiki community and the database that keeps the collective corpus together, and
2) feeding this to various corporate clients, again and again and again, with all their ever-changing requirements
then one of these seems to be closer to the core mission than the other.
- ray
Ray Kiddy wrote:
But there are people at Yahoo who know MySQL. Jeremy Zawodny comes to mind. He works there, and his O'Reilly book 'High Performance MySQL' is coming out soon. The word "Replication" is in the title of the book, so they could probably even help set this up so as not to impact Wikipedia overall.
O.k., I'll ask, but...
It will probably not be too difficult to come up with a feed oriented to them, but what happens when another organization wants the same feed? But, oh yes, in a slightly different form. And another. And another.
My view is that if each individual feed is of benefit to us, then we provide what each of them needs. :-) To me, your question is like "What if after this one person gives you an enormous free benefit, someone else wants to give you even more? And someone else?" It's delightful, that's what. :-)
Having said that, of course we want to be efficient, but search engines that want to send us traffic are not leeches; they are benefactors of our cause.
Again, the people you might be talking to might not be the database people, so maybe they think that the solution they need is not a database solution.
That's right. The people I'm talking to are not the database people. I can ask, but I don't think it really makes sense for *them* to have to run a different solution for every site they deal with. For them, the easy thing is to have a standard XML format and go from there.
- growing the wiki community and the database that keeps the collective corpus together, and
- feeding this to various corporate clients, again and again and again, with all their ever-changing requirements
then one of these seems to be closer to the core mission than the other.
I agree completely, but I don't really see how these compete with each other.
--Jimbo
Brion Vibber wrote:
A text base without the unnecessary UI elements could improve search results, and I suppose can be kept more complete more easily than constant spidering of a 200k+ page site. *shrug* If that's the data format they want, hey fine, though having to download the entire set of a couple hundred megabytes for every update doesn't sound ideal.
If the problem is that each article has navigational elements on the top, bottom, and left, which multiplies the time required to crawl things, then how about allowing something like &raw=1, which would output only the parsed page, without a skin or anything, with all the links in it also pointing to &raw=1? We could then arrange with Yahoo to have them spider these pages but still send people to the URL without &raw=1.
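A hypothetical sketch of how such a mode might hook in (the parameter and the variables are illustrative, not existing MediaWiki code):

<?php
// Hypothetical: if the request carries raw=1, emit only the parsed article
// body with no skin chrome. $title_text and $parsed_article_html stand in
// for whatever the wiki code already has at this point; they are not real
// MediaWiki globals.
if (isset($_GET['raw']) && $_GET['raw'] == '1') {
    header('Content-Type: text/html; charset=ISO-8859-1');
    echo '<html><head><title>' . htmlspecialchars($title_text) . "</title></head><body>\n";
    echo $parsed_article_html;   // internal links rewritten to carry &raw=1
    echo "\n</body></html>";
    exit;
}
?>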
On Apr 2, 2004, at 01:39, Timwi wrote:
If the problem is that each article has navigational elements on the top, bottom, and left, which multiplies the time required to crawl things, then how about allowing something like &raw=1, which would output only the parsed page, without a skin or anything, with all the links in it also pointing to &raw=1? We could then arrange with Yahoo to have them spider these pages but still send people to the URL without &raw=1.
Then they'll still have to make 240,000+ HTTP connections to check every individual page for updates, which can take days or weeks depending on the crawl delay.
-- brion vibber (brion @ pobox.com)
On Fri, 02 Apr 2004 01:55:31 -0800, Brion Vibber wrote:
Then they'll still have to make 240,000+ HTTP connections to check every individual page for updates, which can take days or weeks depending on the crawl delay.
What about adding a Squid running at Yahoo to our cache purge list? They could constantly crawl that one, and only purged pages would be fetched (from the WP caches configured as parent to the Yahoo box, no extra DB access).
That's zero extra DB usage, and no scripting on our part is required. And the WP HTML should be pretty easy for them to parse by cutting out the content of <div id='content'>.
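For illustration, the cutting-out step on their side might look something like this sketch; it assumes the div id is quoted exactly as shown, so the needle would need adjusting to match the actual skin output:

<?php
// Sketch: return the inner HTML of <div id='content'> by counting div
// nesting, so nested divs inside the content don't end the match early.
function extract_content_div($html) {
    $open_tag = "<div id='content'>";
    $start = strpos($html, $open_tag);
    if ($start === false) return false;
    $inner_start = $start + strlen($open_tag);
    $pos   = $inner_start;
    $depth = 1;
    while ($depth > 0) {
        $open  = strpos($html, '<div', $pos);
        $close = strpos($html, '</div>', $pos);
        if ($close === false) return false;   // malformed markup
        if ($open !== false && $open < $close) {
            $depth++;
            $pos = $open + 4;
        } else {
            $depth--;
            $pos = $close + 6;
        }
    }
    // $pos now sits just past the matching </div>.
    return substr($html, $inner_start, $pos - 6 - $inner_start);
}
?>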
On Fri, 02 Apr 2004 14:29:19 +0200, Gabriel Wicke wrote:
Sorry for responding to myself, but..
What about just recording the updated (purged) URLs in a text file? Could be on the WP servers, could be done at Yahoo if they want real-time. No need for a Squid on their part in that case; a simple HTTP daemon logging the requests would be enough to build the list at Yahoo. Or a five-line or so function that appends the URL to an hourly-rotated log file at Wikipedia, which could be fetched by Yahoo or others.
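That function might be as simple as this sketch (path and naming purely illustrative):

<?php
// Sketch of the hourly-rotated purge log: append each purged URL to a file
// named for the current UTC hour.
function logPurgedUrl($url) {
    $file = '/var/log/wikipedia/purged-' . gmdate('YmdH') . '.log';
    $fh = fopen($file, 'a');
    if ($fh) {
        fwrite($fh, $url . "\n");
        fclose($fh);
    }
}
?>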