Hi,

I am trying to work with the pagelinks dump file to extract the number of articles that link to a specific wiki page (the same idea as the "What links here" tool: https://en.wikipedia.org/wiki/Help:What_links_here).

However, when I loop over this SQL-like file, I find very odd cases of links that don't seem to exist on Wikipedia. For example, the first line in the file indicates a link between page IDs 1939 and 2, but page ID 2 doesn't even seem to exist. Even for pages that do exist, the link indicated in the file does not appear on the actual wiki page.

Am I missing anything?
Hi,

Please note that the source and the target point to IDs in different tables: pl_from points to page_id in the page table, while pl_target_id points to lt_id in the linktarget table. Also note that pages can link to non-existent pages; that's one of the reasons the target doesn't point to page_id but to a separate table.

Hope that helps; let me know if it doesn't fix your problem.

Best
Amir (he/him)
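To illustrate the two joins described above, here is a hedged sketch (assuming the dump tables have been loaded into MySQL; untested) that resolves pagelinks rows for one source page, such as the 1939 example, into readable titles:

```sql
-- Resolve pagelinks rows for one source page to readable titles.
-- pl_from joins against page.page_id; pl_target_id joins against
-- linktarget.lt_id, which may name a page that does not exist yet.
SELECT p.page_title    AS source_title,
       lt.lt_namespace AS target_ns,
       lt.lt_title     AS target_title
FROM pagelinks  pl
JOIN page       p  ON p.page_id = pl.pl_from
JOIN linktarget lt ON lt.lt_id  = pl.pl_target_id
WHERE pl.pl_from = 1939;
```

Because the join to linktarget goes by title rather than page ID, red links (targets with no page row) still show up here, which matches the behavior described above.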
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-leave@lists.wikimedia.org
Yes, this is very helpful, thanks a lot!

So, if I understand correctly, I need a mapping from the linktarget table to the page table so that both IDs are consistent. Maybe I'm overcomplicating things: what I'm actually looking for is the "in-degree" of wiki articles, i.e., the number of articles that point to a specific article. I'm only interested in existing articles in namespace 0. Is there a better approach?

Thanks again!

--
Best, Abraham

Abraham I.
Postdoc Researcher, University of Michigan | School of Information
pronouns: he/him
https://www.avrahami-israeli.com/
On Fri, Feb 28, 2025 at 2:09 PM Amir Sarabadani ladsgroup@gmail.com wrote:
You look up the linked page in linktarget, then count the links to it using pagelinks.

If you want to list the articles most linked from other articles, you could do:

SELECT lt_title, COUNT(pl_from) AS count
FROM linktarget
JOIN pagelinks ON (lt_id = pl_target_id)
WHERE pl_from_namespace = 0 AND lt_namespace = 0
GROUP BY pl_target_id
ORDER BY count DESC
LIMIT 100;

which for enwiki returns the result below.

Beware that templates present on many articles may bias the output (depending on what you want to measure). For instance, the high counts for identifiers (ISBN, GPS, DOI...) come from articles that include coordinates or cite ISBNs in their references, not necessarily articles *discussing* those topics. Depending on what kind of conclusions you seek, that may be useful or noise. Also note that this doesn't include links through the category system, which would need to be counted separately.
Regards
+-------------------------------------------------+---------+
| ISBN_(identifier)                               | 1557626 |
| Geographic_coordinate_system                    | 1242077 |
| Doi_(identifier)                                |  646084 |
| Wayback_Machine                                 |  566217 |
| Wikidata                                        |  473357 |
| Taxonomy_(biology)                              |  469525 |
| Time_zone                                       |  444403 |
| Global_Biodiversity_Information_Facility        |  441781 |
| United_States                                   |  436111 |
| ISSN_(identifier)                               |  435609 |
| ASCII                                           |  418733 |
| Typographic_ligature                            |  416779 |
| Greek_alphabet                                  |  416495 |
| Diacritical_mark                                |  416336 |
| Open_Tree_of_Life                               |  380712 |
| IMDb_(identifier)                               |  370081 |
| Binomial_nomenclature                           |  366398 |
| Animal                                          |  351794 |
| Catalogue_of_Life                               |  351292 |
| Eukaryote                                       |  342025 |
| Interim_Register_of_Marine_and_Nonmarine_Genera |  305561 |
| INaturalist                                     |  283754 |
| Daylight_saving_time                            |  276830 |
| Encyclopedia_of_Life                            |  264940 |
| France                                          |  258105 |
| S2CID_(identifier)                              |  254469 |
| Wikispecies                                     |  248856 |
| Association_football                            |  243125 |
| Record_label                                    |  235444 |
| OCLC_(identifier)                               |  232519 |
| National_Center_for_Biotechnology_Information   |  230716 |
| Arthropod                                       |  220520 |
| Music_genre                                     |  216694 |
| PMID_(identifier)                               |  214443 |
| Germany                                         |  208326 |
| United_Kingdom                                  |  202985 |
| Insect                                          |  196545 |
| List_of_sovereign_states                        |  196378 |
| Record_producer                                 |  193718 |
| India                                           |  185757 |
| Italy                                           |  183172 |
| Surname                                         |  182182 |
| World_War_II                                    |  181185 |
| Australia                                       |  175278 |
| Synonym_(taxonomy)                              |  174819 |
| The_New_York_Times                              |  166959 |
| Integrated_Taxonomic_Information_System         |  166733 |
| Japan                                           |  164887 |
| Canada                                          |  164580 |
| Russia                                          |  161388 |
| Poland                                          |  157299 |
| Spain                                           |  156856 |
| England                                         |  156133 |
| Bibcode_(identifier)                            |  146095 |
| JSTOR_(identifier)                              |  142740 |
| Common_name                                     |  141434 |
| New_York_City                                   |  135577 |
| Biological_nomenclature                         |  133976 |
| PMC_(identifier)                                |  132931 |
| China                                           |  132623 |
| Central_European_Time                           |  129616 |
| London                                          |  127722 |
| Central_European_Summer_Time                    |  126416 |
| Brazil                                          |  125904 |
| Netherlands                                     |  124340 |
| Album                                           |  124049 |
| Single_(music)                                  |  121842 |
| Sweden                                          |  117744 |
| Barcode_of_Life_Data_System                     |  116870 |
| English_language                                |  112510 |
| Switzerland                                     |  111163 |
| U.S._state                                      |  106763 |
| Russian_language                                |  105655 |
| Given_name                                      |  104402 |
| Iran                                            |  101650 |
| Billboard_(magazine)                            |  101132 |
| Lepidoptera                                     |  100443 |
| Village                                         |   99583 |
| Turkey                                          |   96563 |
| Alma_mater                                      |   96208 |
| The_Guardian                                    |   95350 |
| Belgium                                         |   94848 |
| National_Register_of_Historic_Places            |   94507 |
| County_seat                                     |   92805 |
| World_Register_of_Marine_Species                |   92677 |
| Austria                                         |   90980 |
| World_War_I                                     |   90910 |
| Public_domain                                   |   90410 |
| Midfielder                                      |   90404 |
| South_Africa                                    |   90243 |
| National_Park_Service                           |   89946 |
| AllMusic                                        |   89904 |
| Argentina                                       |   89179 |
| Mexico                                          |   88627 |
| California                                      |   88542 |
| IUCN_Red_List                                   |   88078 |
| Soviet_Union                                    |   86969 |
| Plant                                           |   86862 |
| YouTube                                         |   86425 |
| H:S                                             |   86326 |
+-------------------------------------------------+---------+
100 rows in set (1 hour 2 min 7.110 sec)
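Since the original question was about the in-degree of *existing* namespace-0 articles only, a hedged variant of the query above (a sketch against the standard MediaWiki schema; untested on the live replicas) could join back to the page table on namespace and title, so that red links and targets without a page row are dropped:

```sql
-- In-degree of existing, non-redirect namespace-0 articles only.
-- linktarget rows also cover pages that do not exist, so the extra
-- join to page filters those out.
SELECT p.page_title, COUNT(*) AS in_degree
FROM pagelinks  pl
JOIN linktarget lt ON lt.lt_id = pl.pl_target_id
JOIN page       p  ON p.page_namespace = lt.lt_namespace
                  AND p.page_title     = lt.lt_title
WHERE pl.pl_from_namespace = 0
  AND lt.lt_namespace = 0
  AND p.page_is_redirect = 0
GROUP BY p.page_id
ORDER BY in_degree DESC;
```

Dropping the ORDER BY and any LIMIT would yield the full per-article list, though as the timing above suggests, such a scan over enwiki is expensive.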
On Sat, 1 Mar 2025 at 04:23, Abraham Israeli isabrah@umich.edu wrote:
Wow, this is really useful, thanks so much!

I actually need the list of all Wikipedia articles (namespace 0). I know it's an enormous list of several million articles, but I'm looking for a way to assess articles by the number of incoming links they have across all of Wikipedia. Is there a recommended tool for this? Would PAWS handle it?
On Sun, Mar 2, 2025 at 6:22 PM Platonides platonides@gmail.com wrote: