On 08/06/12 19:00, Lars Aronsson wrote:
One problem is that I can't see which namespaces are "content"
namespaces in any of the database dumps. I can only see this from the API,
http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop…
The API only provides the current value, which can change over time. I
can't get the value that was in effect when the database dump was
generated.
That's not a problem. You can process an old dump with the current
contentness values.
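
A rough Python sketch of pulling the content namespace ids out of a
siteinfo response you have already fetched and saved (your script is
Perl, but the idea carries over). It assumes the namespaces list marks
content namespaces with a "content" key, an empty string in the older
JSON output and a boolean in the newer one; please check that against
what your wiki actually returns. The function name and the sample data
are made up for illustration.

```python
import json


def content_namespace_ids(siteinfo):
    """Return the sorted ids of the namespaces flagged as content."""
    namespaces = siteinfo["query"]["namespaces"]
    # Accept either an object keyed by id or a plain list of entries.
    entries = namespaces.values() if isinstance(namespaces, dict) else namespaces
    ids = []
    for ns in entries:
        # Assumption: a missing key or an explicit False means "not content";
        # anything else (including the empty string) means "content".
        if ns.get("content", False) is not False:
            ids.append(int(ns["id"]))
    return sorted(ids)


# Hand-written, trimmed sample; not real output from any wiki.
SAMPLE = json.loads("""
{"query": {"namespaces": {
  "0":   {"id": 0, "content": "", "*": ""},
  "1":   {"id": 1, "*": "Talk"},
  "6":   {"id": 6, "*": "File"},
  "100": {"id": 100, "*": "Portal"}
}}}
""")

print(content_namespace_ids(SAMPLE))
```

Saving the response next to the dump you process also gives you a
record of the values you used.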
What could have changed?
* Namespace wasn't listed as content, but it is now.
As the content of a namespace will be of the same kind, it was a bug
that it wasn't flagged as content, so it's right to use the newer value.
* Namespace didn't exist before, it was added as content.
No problem: a namespace that didn't exist yet has no pages in the old
dump, so it doesn't affect the results whichever way it's marked.
* Namespace was listed as content, but now it isn't.
Wouldn't happen, unless it was a short-lived config typo. Still, it's
better to use the newer data.
You should of course store the namespaces you treated as content along
with the filtered external links, but you're the one generating that file.
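
One possible way to do that, sketched in Python: write the namespace ids
as a comment line at the top of the output, then one tab-separated row
per kept link. The file layout, the function name and the
(ns, title, url) row shape are all my own assumptions, not anything your
script already does.

```python
import io


def write_filtered_links(rows, content_ns, out):
    """Write rows from content namespaces to `out`; return how many were kept.

    rows: iterable of (ns, title, url) tuples (hypothetical shape).
    content_ns: namespace ids treated as content for this run.
    """
    content_ns = sorted(set(content_ns))
    # Record which namespaces counted as content, so that a later run with
    # different values can be told apart from this one.
    out.write("# content_ns: " + ",".join(str(n) for n in content_ns) + "\n")
    kept = 0
    for ns, title, url in rows:
        if ns in content_ns:
            out.write(f"{ns}\t{title}\t{url}\n")
            kept += 1
    return kept


# Usage with made-up rows and an in-memory buffer instead of a real file.
buf = io.StringIO()
rows = [
    (0, "A", "http://example.org/"),
    (1, "Talk:A", "http://example.org/x"),
]
print(write_filtered_links(rows, [0], buf))
print(buf.getvalue(), end="")
```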
The biggest problem may be if many pages with external links in a
pseudo-namespace were moved to a real non-content namespace, as that would
be detected as a loss of those links (because they shouldn't have been
counted in the first place).
Another problem is that I want to count links that I find in the File:
(ns=6)
There's usually no content there (license templates, fair use
rationales...). Given that you won't be correctly computing the external
links from transcluded Commons images, I wouldn't count it (except on
Commons itself, where images are the content).
and Portal: (mostly ns=100) namespaces,
I think these could be argued both ways. Do you see many external links
from portal pages? I think they should list pages in the current wiki,
so I wouldn't expect external links there.
but these aren't marked as
content namespaces by the API. Shouldn't they be?
Is anybody else doing similar things? Do you have opinions on what should
count as content?
Should I submit my script (300 lines of Perl) somewhere?
Yes. Probably somewhere at
svn.wikimedia.org/mediawiki/trunk/tools