On 08/06/12 19:00, Lars Aronsson wrote:
One problem is that I can't see which namespaces are "content" namespaces in any of the database dumps. I can only see this from the API, http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=...
The API only provides the current value, which can change over time. I can't get the value that was in effect when the database dump was generated.
That's not a problem. You can process an old dump with current contentness values.
What could have changed?
* Namespace wasn't listed as content, but it is now. As the content of a namespace will be of the same kind, it was a bug that it wasn't flagged as content, so it's right to use the newer value.
* Namespace didn't exist before, it was added as content. No problem: a namespace missing from the old dump doesn't affect the results, whichever way it's marked now.
* Namespace was listed as content, but now it isn't. Wouldn't happen, unless it was a short-lived config typo. Still, it's better to use newer data.
You should of course store the ns you treated as content with the filtered external link, but you're generating that file.
The biggest problem may be if many pages with external links in a pseudo-namespace moved to a real non-content namespace, as that would be detected as a loss of those links (because they shouldn't have been there).
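To illustrate applying the current contentness values to an old dump and storing them with the result, here is a rough Python sketch. It assumes the classic JSON output of meta=siteinfo with siprop=namespaces, where a content namespace carries a "content" key. The sample response, the (namespace, title, url) record layout and the function names are made up for the example; they are not taken from the script under discussion.

```python
import json

# Made-up siteinfo response in the shape returned by
# action=query&meta=siteinfo&siprop=namespaces&format=json (assumption:
# classic format, where a content namespace has "content": "").
SAMPLE_SITEINFO = json.loads("""
{"query": {"namespaces": {
  "0":   {"id": 0,   "*": "",       "content": ""},
  "1":   {"id": 1,   "*": "Talk"},
  "6":   {"id": 6,   "*": "File"},
  "100": {"id": 100, "*": "Portal"}
}}}
""")


def content_namespaces(siteinfo):
    """Return the set of namespace ids flagged as content."""
    namespaces = siteinfo["query"]["namespaces"]
    # Presence of the key marks a content namespace; an explicit False
    # (as in formatversion=2 output) is treated as "not content".
    return {
        ns["id"]
        for ns in namespaces.values()
        if ns.get("content", False) is not False
    }


def filter_links(links, content_ns):
    """Keep (namespace, page_title, url) records from content namespaces."""
    return [link for link in links if link[0] in content_ns]


# Hypothetical external-link records extracted from a dump.
links = [
    (0, "Perl", "http://www.perl.org/"),
    (1, "Perl", "http://example.org/talk"),
    (100, "Science", "http://example.org/portal"),
]

content_ns = content_namespaces(SAMPLE_SITEINFO)
kept = filter_links(links, content_ns)

# Store the namespaces that were treated as content next to the filtered links.
result = {"content_namespaces": sorted(content_ns), "links": kept}
print(json.dumps(result, indent=2))
```

If you do decide to count File: or Portal: as well, you can pass `content_ns | {6, 100}` to `filter_links` and the stored list will show which namespaces were added by hand.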
Another problem is that I want to count links that I find in the File: (ns=6)
There's usually no content there (license templates, fair use rationales...). Given that you won't be correctly computing the external links from transcluded Commons images, I wouldn't count it (except for Commons, where images are the content).
and Portal: (mostly ns=100) namespaces,
I think these could be argued both ways. Do you see many external links from portal pages? I think they should list pages in the current wiki, so I wouldn't expect external links there.
but these aren't marked as content namespaces by the API. Shouldn't they be?
Is anybody else doing similar things? Do you have opinions on what should count as content?
Should I submit my script (300 lines of Perl) somewhere?
Yes. Probably somewhere at svn.wikimedia.org/mediawiki/trunk/tools