Most of this information is available pre-calculated in the CirrusSearch
dumps. Each article is represented by a line of JSON in those dumps. There
is a field called 'incoming_links', which is the number of unique articles
in the content namespace(s) that link to that article. Each article
additionally contains an 'outgoing_link' field, which is a list of strings
naming the pages the article links to (incoming_links is calculated by
querying the outgoing_link field). I've done graph work on Wikipedia before
using these dumps, and the outgoing_link field is typically enough to build
a full graph.
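For example, something along these lines (untested, and the file name is
just an example; if I remember right the dumps are newline-delimited JSON
in Elasticsearch bulk format, so metadata lines alternate with the actual
documents) will pull the link graph out of a dump:

import gzip
import json

# Rough sketch: read a CirrusSearch content dump and build a
# title -> outgoing links mapping.
DUMP = "enwiki-20180301-cirrussearch-content.json.gz"  # example path

graph = {}

with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        if "index" in doc:
            continue  # skip the bulk-action metadata lines
        title = doc.get("title")
        graph[title] = doc.get("outgoing_link", [])
        # doc.get("incoming_links") holds the pre-computed incoming count

print(len(graph), "articles loaded")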
On Sun, Mar 18, 2018 at 2:18 PM, John <phoenixoverride(a)gmail.com> wrote:
I would second the recommendation of using the dumps for such a large
graphing project. If it's more than a couple hundred pages, the
API/database queries can get bulky.
On Sun, Mar 18, 2018 at 5:07 PM Brian Wolff <bawolff(a)gmail.com> wrote:
Hi,
You can run longer queries by getting access to Toolforge
(https://wikitech.wikimedia.org/wiki/Portal:Toolforge) and running them
from the command line.
However, the query in question might still take an excessively long time
(if you are doing all of Wikipedia). I would expect that query to result
in about 150 MB of data and maybe take days to complete.
You can also break it down into parts by adding WHERE page_title >= 'a'
AND page_title < 'b'.
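For example, a rough (untested) sketch that prints one query per starting
letter; the SELECT here is only an illustration of a per-page link count,
so substitute whatever the actual Quarry query computes:

import string

# Hypothetical sketch: emit one query per initial letter so each piece
# stays within Quarry's 30-minute limit. The first range has no lower
# bound so it also catches titles sorting before 'A' (digits, punctuation);
# the last range has no upper bound.
letters = string.ascii_uppercase
for i, letter in enumerate(letters):
    conditions = []
    if i > 0:
        conditions.append(f"page_title >= '{letter}'")
    if i + 1 < len(letters):
        conditions.append(f"page_title < '{letters[i + 1]}'")
    where = f"WHERE {' AND '.join(conditions)} " if conditions else ""
    print(
        "SELECT page_title, COUNT(*) AS outgoing "
        "FROM pagelinks JOIN page ON pl_from = page_id "
        f"{where}"
        "GROUP BY page_title;"
    )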
Note, also of interest: full dumps of all the links are available at
https://dumps.wikimedia.org/enwiki/20180301/enwiki-20180301-pagelinks.sql.gz
(you will also need the corresponding page table dump to convert page ids
to page names).
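A rough, hypothetical sketch of that id-to-name step, assuming the usual
MediaWiki page table layout where the first three columns of each row are
page_id, page_namespace and page_title (the file name is just an example,
and the regex is not a full SQL parser, though it copes with
backslash-escaped quotes):

import gzip
import re

# Build a page_id -> title map from the page table dump so that pl_from
# values in the pagelinks dump can be turned into page names.
PAGE_DUMP = "enwiki-20180301-page.sql.gz"  # example path

row = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'")
id_to_title = {}

with gzip.open(PAGE_DUMP, "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        if not line.startswith("INSERT INTO"):
            continue
        for page_id, namespace, title in row.findall(line):
            if namespace == "0":  # keep only articles (main namespace)
                id_to_title[int(page_id)] = title

print(len(id_to_title), "article titles indexed")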
--
Brian
On Sunday, March 18, 2018, Nick Bell <bhink03(a)gmail.com> wrote:
Hi there,
I'm a final-year Mathematics student at the University of Bristol, and I'm
studying Wikipedia as a graph for my project.

I'd like to get data regarding the number of outgoing links on each page,
and the number of pages with links to each page. I have already inquired
about this on the Analytics Team mailing list, who gave me a few
suggestions.
One of these was to run the code at this link
https://quarry.wmflabs.org/query/25400
with these instructions:

"You will have to fork it and remove the "LIMIT 10" to get it to run on
all the English Wikipedia articles. It may take too long or produce too
much data, in which case please ask on this list for someone who can run
it for you."
I ran the code as instructed, but the query was killed because it took
longer than 30 minutes to run. I asked if anyone on the mailing list could
run it for me, but no one replied saying they could. The person who wrote
the code suggested I try this mailing list to see if anyone can help.
I'm a beginner in programming and coding, so any and all help you can give
me would be greatly appreciated.

Many thanks,
Nick Bell
University of Bristol
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l