<div dir="ltr">I load "page_titles" and "page_ids" from a file and put them in a dictionary. One option I haven't tried would be to put them into a database table and INNER JOIN it with the pagelinks table to obtain only the links for those articles. Still, if the list has 300,000 articles, even though that is just 20% of the database, it is still a lot.<div><br></div><div>Marc</div></div><div class="gmail_extra"><br><div class="gmail_quote">2015-03-13 19:51 GMT+01:00 John <span dir="ltr"><<a href="mailto:phoenixoverride@gmail.com" target="_blank">phoenixoverride@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Where are you getting your list of pages from?<br></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <span dir="ltr"><<a href="mailto:marcmiquel@gmail.com" target="_blank">marcmiquel@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi John,<div><br></div><div><br></div><div>My queries obtain the "inlinks" and "outlinks" for some articles I have in a group (x). Then I check (using Python) whether they have inlinks and outlinks from another group of articles. For now I am doing one query per article. I wanted to obtain all links for group (x) and then do this check... But getting all links for groups as big as 300,000 articles would imply 6 million links.
Is it possible to obtain all of this, or is there a MySQL/RAM limit?</div><div><br></div><div>Thanks.</div><div><br></div><div>Marc</div><div><br><div class="gmail_extra"><br><div class="gmail_quote">2015-03-13 19:29 GMT+01:00 <span dir="ltr"><<a href="mailto:labs-l-request@lists.wikimedia.org" target="_blank">labs-l-request@lists.wikimedia.org</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Send Labs-l mailing list submissions to<br>
<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web, visit<br>
<a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" target="_blank">https://lists.wikimedia.org/mailman/listinfo/labs-l</a><br>
or, via email, send a message with subject or body 'help' to<br>
<a href="mailto:labs-l-request@lists.wikimedia.org" target="_blank">labs-l-request@lists.wikimedia.org</a><br>
<br>
You can reach the person managing the list at<br>
<a href="mailto:labs-l-owner@lists.wikimedia.org" target="_blank">labs-l-owner@lists.wikimedia.org</a><br>
<br>
When replying, please edit your Subject line so it is more specific<br>
than "Re: Contents of Labs-l digest..."<br>
<br>
<br>
Today's Topics:<br>
<br>
1. dimension well my queries for very large tables like<br>
pagelinks - Tool Labs (Marc Miquel)<br>
2. Re: dimension well my queries for very large tables like<br>
pagelinks - Tool Labs (John)<br>
3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)<br>
4. Re: Questions regarding the Labs Terms of use (Ryan Lane)<br>
5. Re: Questions regarding the Labs Terms of use (Pine W)<br>
<br>
<br>
----------------------------------------------------------------------<br>
<br>
Message: 1<br>
Date: Fri, 13 Mar 2015 17:59:09 +0100<br>
From: Marc Miquel <<a href="mailto:marcmiquel@gmail.com" target="_blank">marcmiquel@gmail.com</a>><br>
To: "<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>" <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>
Subject: [Labs-l] dimension well my queries for very large tables like<br>
pagelinks - Tool Labs<br>
Message-ID:<br>
<CANSEGinZBWYsb0Y9r9Yk8AZo3COwzT4NTs7YkFxj=<a href="mailto:naEa9d6%2Bw@mail.gmail.com" target="_blank">naEa9d6+w@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Hello guys,<br>
<br>
I have a question regarding Tool Labs. I am doing research on links, and<br>
although I know very well what I am looking for, I am struggling to get it<br>
efficiently...<br>
<br>
I would like your opinion because you know the system very well, and<br>
what is feasible and what is not.<br>
<br>
Let me explain what I need to do:<br>
I have a list of articles for different languages, and I need to check<br>
their pagelinks to see where they point to and which pages point to<br>
them.<br>
<br>
I currently run one query for each article id in this list, which ranges<br>
from 80,000 articles in some Wikipedias to 300,000 or more in others. I have to do it<br>
several times and it is very time-consuming (several days). I wish I could<br>
just count the total number of links in each case, but I need to look at some of<br>
the links for each article.<br>
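<br>
A minimal sketch of the check described above, assuming the links are already available as (source id, target id) pairs: membership tests against Python sets pick out only the links that connect group (x) with another group. The ids, the sample links, and the names group_x and group_y are made up for illustration.<br>

```python
from collections import defaultdict

# Hypothetical (source_id, target_id) link pairs; real data would come from pagelinks.
links = [(1, 10), (1, 11), (2, 10), (3, 12), (10, 1), (12, 2)]

group_x = {1, 2, 3}   # articles being studied (assumed ids)
group_y = {10, 12}    # the other group of articles (assumed ids)

outlinks_to_y = defaultdict(set)   # article in x -> articles in y it links to
inlinks_from_y = defaultdict(set)  # article in x -> articles in y linking to it

for source, target in links:
    # Set membership is O(1) on average, so one pass over all links is enough.
    if source in group_x and target in group_y:
        outlinks_to_y[source].add(target)
    if target in group_x and source in group_y:
        inlinks_from_y[target].add(source)
```

Because the loop only needs one link at a time, it works the same whether the pairs come from a list in memory or from a database cursor.<br>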
<br>
I was thinking about fetching all pagelinks and iterating over them in Python<br>
(which is the language I use for all this). This would be much faster<br>
because I would avoid the per-article queries I am running now. But<br>
the pagelinks table has millions of rows and I cannot load it all at once because MySQL<br>
would die. I could buffer the results, but I haven't tested whether that works.<br>
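<br>
One way to read a large link table without holding it all in memory is to fetch the result set in fixed-size batches. The sketch below is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for the MySQL replica, the pagelinks table is a toy with just the pl_from and pl_title columns, and iter_links is a hypothetical helper name.<br>

```python
import sqlite3

def iter_links(conn, batch_size=1000):
    """Yield (pl_from, pl_title) rows, fetching them batch by batch."""
    cur = conn.cursor()
    cur.execute("SELECT pl_from, pl_title FROM pagelinks ORDER BY pl_from, pl_title")
    while True:
        rows = cur.fetchmany(batch_size)  # at most batch_size rows per call
        if not rows:
            break
        for row in rows:
            yield row

# Toy in-memory database standing in for the real replica (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pagelinks (pl_from INTEGER, pl_title TEXT)")
conn.executemany(
    "INSERT INTO pagelinks VALUES (?, ?)",
    [(1, "A"), (1, "B"), (2, "A"), (3, "C"), (3, "A")],
)

# Count outlinks per source page while holding only one batch at a time.
outlink_counts = {}
for pl_from, pl_title in iter_links(conn, batch_size=2):
    outlink_counts[pl_from] = outlink_counts.get(pl_from, 0) + 1
```

With MySQL client libraries the default cursor usually buffers the whole result on the client, so an unbuffered (server-side) cursor would likely be needed for batching to actually save memory.<br>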
<br>
I am considering creating a personal table in the database with titles and<br>
ids, and inner joining it to obtain only the pagelinks for these 300,000<br>
articles. That way I would retrieve just 20% of the database instead of<br>
100%. That could be around 8M rows in some cases (page_title or page_id, one<br>
of the two per row), or even more, loaded into Python dictionaries and<br>
lists. Would that be a problem? I have no idea how much RAM this<br>
requires or how much I can use in Tool Labs.<br>
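<br>
The personal-table idea could look roughly like the sketch below. It is an illustration under assumptions, not a tested Tool Labs setup: sqlite3 stands in for MySQL, pagelinks is reduced to the columns pl_from, pl_namespace and pl_title, and the table name my_articles and the sample rows are made up.<br>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT);
    CREATE TABLE my_articles (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE INDEX my_articles_title ON my_articles (page_title);
""")
conn.executemany("INSERT INTO pagelinks VALUES (?, ?, ?)", [
    (1, 0, "Beta"), (1, 0, "Gamma"), (2, 0, "Alpha"),
    (3, 0, "Alpha"), (3, 0, "Delta"),
])
# Group (x): only these articles are of interest (hypothetical ids and titles).
conn.executemany("INSERT INTO my_articles VALUES (?, ?)", [(1, "Alpha"), (2, "Beta")])

# Outlinks: links whose source page is in the group.
outlinks = conn.execute("""
    SELECT pl.pl_from, pl.pl_title
    FROM pagelinks AS pl
    INNER JOIN my_articles AS a ON a.page_id = pl.pl_from
    WHERE pl.pl_namespace = 0
    ORDER BY pl.pl_from, pl.pl_title
""").fetchall()

# Inlinks: links whose target title belongs to a page in the group.
inlinks = conn.execute("""
    SELECT a.page_id, pl.pl_from
    FROM pagelinks AS pl
    INNER JOIN my_articles AS a ON a.page_title = pl.pl_title
    WHERE pl.pl_namespace = 0
    ORDER BY a.page_id, pl.pl_from
""").fetchall()
```

Two joins replace the per-article queries: one matches on the source id for outlinks, the other on the target title for inlinks, so only rows touching the group are returned.<br>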
<br>
I am totally lost when I run into these problems of scale... I thought<br>
about asking on the IRC channel, but it seemed too long and<br>
too specific. Any hint you can give me would really help.<br>
<br>
Thank you very much!<br>
<br>
Cheers,<br>
<br>
Marc Miquel<br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Fri, 13 Mar 2015 13:07:20 -0400<br>
From: John <<a href="mailto:phoenixoverride@gmail.com" target="_blank">phoenixoverride@gmail.com</a>><br>
To: Wikimedia Labs <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>
Subject: Re: [Labs-l] dimension well my queries for very large tables<br>
like pagelinks - Tool Labs<br>
Message-ID:<br>
<CAP-JHpn=ToVYdT7i-imp7+XLTkgQ1PtieORx7BTuVAJw=<a href="mailto:YSbFQ@mail.gmail.com" target="_blank">YSbFQ@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
What kind of queries are you doing? Odds are they can be optimized.<br>
<br>
<br>
------------------------------<br>
<br>
Message: 3<br>
Date: Fri, 13 Mar 2015 17:36:00 +0000<br>
From: Tim Landscheidt <<a href="mailto:tim@tim-landscheidt.de" target="_blank">tim@tim-landscheidt.de</a>><br>
To: <a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a><br>
Subject: Re: [Labs-l] Questions regarding the Labs Terms of use<br>
Message-ID: <<a href="mailto:878uf0vlz3.fsf@passepartout.tim-landscheidt.de" target="_blank">878uf0vlz3.fsf@passepartout.tim-landscheidt.de</a>><br>
Content-Type: text/plain<br>
<br>
(anonymous) wrote:<br>
<br>
> [...]<br>
<br>
> To be clear: I'm not going to make my code proprietary in<br>
> any way. I just wanted to know whether I'm entitled to ask<br>
> for the source of every Labs bot ;-)<br>
<br>
Everyone is entitled to /ask/, but I don't think you have a<br>
right to /receive/ the source :-).<br>
<br>
AFAIK, there are two main reasons for the clause:<br>
<br>
a) WMF doesn't want to have to deal with individual licences<br>
that may or may not have the potential for litigation<br>
("The Software shall be used for Good, not Evil"). By<br>
requiring OSI-approved, tried and true licences, the risk<br>
is negligible.<br>
<br>
b) Bots and tools running on an infrastructure financed by<br>
donors, like contributions to Wikipedia & Co., shouldn't<br>
be usable for blackmail. No one should be in a legal<br>
position to demand something "or else ..." The perpetuity<br>
of OS licences guarantees that everyone can be truly<br>
thankful to developers without having to fear that<br>
otherwise they shut down devices, delete content, etc.<br>
<br>
But the nice thing about collaboratively developed open<br>
source software is that it usually is of a better quality,<br>
so clandestine code is often not that interesting.<br>
<br>
Tim<br>
<br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 4<br>
Date: Fri, 13 Mar 2015 11:52:18 -0600<br>
From: Ryan Lane <<a href="mailto:rlane32@gmail.com" target="_blank">rlane32@gmail.com</a>><br>
To: Wikimedia Labs <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>
Subject: Re: [Labs-l] Questions regarding the Labs Terms of use<br>
Message-ID:<br>
<<a href="mailto:CALKgCA3Lv-SQoeibEsm7Ckc0gaPJwph_b0HSTx%2BactaKMDuXmg@mail.gmail.com" target="_blank">CALKgCA3Lv-SQoeibEsm7Ckc0gaPJwph_b0HSTx+actaKMDuXmg@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa <<a href="mailto:ricordisamoa@openmailbox.org" target="_blank">ricordisamoa@openmailbox.org</a>><br>
wrote:<br>
<br>
> From <a href="https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use" target="_blank">https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use</a><br>
> (verbatim): "Do not use or install any software unless the software is<br>
> licensed under an Open Source license".<br>
> What about tools and services made up of software themselves? Do they have<br>
> to be Open Source?<br>
> Strictly speaking, do the Terms of use require that all code be made<br>
> available to the public?<br>
> Thanks in advance.<br>
><br>
><br>
As the person who wrote the initial terms and included this I can speak to<br>
the spirit of the term (I'm not a lawyer, so I won't try to go into any<br>
legal issues).<br>
<br>
I created Labs with the intent that it could be used as a mechanism to fork<br>
the projects as a whole, if necessary. A means to this end was including<br>
non-WMF employees in the process of infrastructure operations (which is<br>
outside the goals of the tools project in Labs). Tools/services that<br>
can't be distributed publicly harm that goal. Tools/services that aren't<br>
open source completely break that goal. It's fine if you wish to not<br>
maintain the code in a public git repo, but if another tool maintainer<br>
wishes to publish your code, there should be nothing blocking that.<br>
<br>
Depending on external closed source services is a debatable topic. I know<br>
in the past we've decided to allow it. It goes against the spirit of the<br>
project, but it doesn't require us to distribute closed-source software in<br>
the case of a fork.<br>
<br>
My personal opinion is that your code should be in a public repository to<br>
encourage collaboration. As the terms are written, though, your code is<br>
required to be open source, and any libraries it depends on must be as well.<br>
<br>
- Ryan<br>
<br>
------------------------------<br>
<br>
Message: 5<br>
Date: Fri, 13 Mar 2015 11:29:47 -0700<br>
From: Pine W <<a href="mailto:wiki.pine@gmail.com" target="_blank">wiki.pine@gmail.com</a>><br>
To: Wikimedia Labs <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>
Subject: Re: [Labs-l] Questions regarding the Labs Terms of use<br>
Message-ID:<br>
<CAF=dyJjO69O-ye+327BU6wC_k_+AQvwUq0rfZvW0YaV=<a href="mailto:P%2BiCaA@mail.gmail.com" target="_blank">P+iCaA@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Question: are there heightened security or privacy risks posed by having<br>
non-open-source code running in Labs?<br>
<br>
Is anyone proactively auditing Labs software for open source compliance,<br>
and if not, should this be done?<br>
<br>
Pine<br>
<br>
------------------------------<br>
<br>
<br>
<br>
End of Labs-l Digest, Vol 39, Issue 13<br>
**************************************<br>
</blockquote></div><br></div></div></div>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div>