TL;DR: Look for protocol relative links.
For those wikis I have tested, the output looks very
similar to
what I got from looking at the replicated database, except that
all external links to WMF sites seem to have been removed from
the SQL dumps.
That's interesting. There are such links in the dumps:
gzip -d <
/mnt/user-store/dumps/frwiki-group-writable/frwiki-20111123-externallinks.sql.gz
| grep -oP ".{17}commons.wikimedia.org[^']*" | more
(
572952,'http://commons.wikimedia.org/wiki/Image:Georges_Seurat_043.jpg
(
572952,'http://commons.wikimedia.org/wiki/Image:Tour_Eiffel_exposition_…
(
603411,'http://commons.wikimedia.org/wiki/Template:Deletion_requests#Im…
(
150418,'http://commons.wikimedia.org/wiki/Category:The_Yorck_Project
(
564205,'http://commons.wikimedia.org/wiki/Category:Icons
(
446364,'http://commons.wikimedia.org/wiki/Image:Viaduc_Millau_France.PNG
(
446364,'http://commons.wikimedia.org/wiki/Image:La2-demis-france.png
(
339872,'http://commons.wikimedia.org/wiki/Category:Flags_of_municipalit…
(
339872,'http://commons.wikimedia.org/wiki/Category:Flags_of_municipalit…
(
339872,'http://commons.wikimedia.org/wiki/Category:Maps_of_municipaliti…
(
339872,'http://commons.wikimedia.org/wiki/Accueil
(
518924,'http://commons.wikimedia.org/wiki/Special:Search?search=tintagel
(
488187,'http://commons.wikimedia.org/skins-1.5/common/metadata.js
(
342102,'http://commons.wikimedia.org/wiki/Category:Photography
(
547524,'http://commons.wikimedia.org/wiki/Special:Contributions/Pereubu
(
601588,'http://commons.wikimedia.org/wiki/Main_Page
(
601588,'http://commons.wikimedia.org/w/index.php?title=Image:Uz%C3%A8s_…
(
366318,'http://commons.wikimedia.org/wiki/Category:Chinese_stroke_order
(
208246,'http://commons.wikimedia.org/wiki/Category:SVG_flags
(
208246,'http://commons.wikimedia.org/wiki/Category:Rolleiflex
(
350993,'http://commons.wikimedia.org/wiki/Paris
(
232033,'http://commons.wikimedia.org/wiki/Category:Bolivia
(
488180,'http://commons.wikimedia.org/wiki/Template:Deletion_requests#Im…
(
488180,'http://commons.wikimedia.org/wiki/Template:Deletion_requests#Im…
(
209441,'http://commons.wikimedia.org/wiki/Accueil etc.
Although few of them seem to come from NS_MAIN:
sql frwiki_p "select page_namespace from page where page_id IN (0$(gzip
-d < frwiki-20111123-externallinks.sql.gz | grep --line-buffered -oP
"([0-9]+)(?=,'http://commons.wikimedia.org[^']*)" | uniq | head -500 |
sed s/^/,/g ))"
The trick seems to be that most of them they will be present through
//commons.wikimedia.org, not
http://commons.wikimedia.org
Talk pages will have more url-copying, while articles will use templates
like {{Autres projets}}
sql frwiki_p "select page_namespace from page
where page_id IN (0$(gzip -d < frwiki-20111123-externallinks.sql.gz | grep
--line-buffered -oP "([0-9]+)(?=,'//commons.wikimedia.org[^']*)" | uniq
| head -500 | sed s/^/,/g ))"
does show many NS_MAIN entries.
I suspect you are only taking into accoutn http: and https: links with
your perl script, while all of them with your sql query.
select count(*) from externallinks join page on
(el_from = page_id) where el_to like '//commons.wikimedia.org/%' and
page_namespace = 0;
+----------+
| count(*) |
+----------+
| 101439 |
+----------+
select count(*) from externallinks join page on
(el_from = page_id) where el_to like 'http://commons.wikimedia.org/%' and
page_namespace = 0;
+----------+
| count(*) |
+----------+
| 399 |
+----------+
101439 + 399 = 101838 ~= 101619 which is your db result.