<div dir="ltr">I load "page_titles" and "page_ids" from a file and put them in a dictionary. One option I haven't tried would be putting them into a database and INNER JOINing with the pagelinks table to obtain only the links for those articles. Still, if the list has 300,000 articles, even though this is just 20% of the database, it is still a lot.<div><br></div><div>Marc</div></div><div class="gmail_extra"><br><div class="gmail_quote">2015-03-13 19:51 GMT+01:00 John <span dir="ltr"><<a href="mailto:phoenixoverride@gmail.com" target="_blank">phoenixoverride@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Where are you getting your list of pages from?<br></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <span dir="ltr"><<a href="mailto:marcmiquel@gmail.com" target="_blank">marcmiquel@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi John,<div><br></div><div><br></div><div>My queries obtain the "inlinks" and "outlinks" for some articles I have in a group (x). Then I check (using Python) whether they have inlinks and outlinks from another group of articles. For now I am doing one query per article. I wanted to obtain all links for group (x) and then do this check... But getting all links for groups as big as 300000 articles would imply 6 million links. 
Is it possible to obtain all this or is there a MySQL/RAM limit?</div><div><br></div><div>Thanks.</div><div><br></div><div>Marc</div><div><br><div class="gmail_extra"><br><div class="gmail_quote">2015-03-13 19:29 GMT+01:00  <span dir="ltr"><<a href="mailto:labs-l-request@lists.wikimedia.org" target="_blank">labs-l-request@lists.wikimedia.org</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Send Labs-l mailing list submissions to<br>

        <a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a><br>

<br>

To subscribe or unsubscribe via the World Wide Web, visit<br>

        <a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" target="_blank">https://lists.wikimedia.org/mailman/listinfo/labs-l</a><br>

or, via email, send a message with subject or body 'help' to<br>

        <a href="mailto:labs-l-request@lists.wikimedia.org" target="_blank">labs-l-request@lists.wikimedia.org</a><br>

<br>

You can reach the person managing the list at<br>

        <a href="mailto:labs-l-owner@lists.wikimedia.org" target="_blank">labs-l-owner@lists.wikimedia.org</a><br>

<br>

When replying, please edit your Subject line so it is more specific<br>

than "Re: Contents of Labs-l digest..."<br>

<br>

<br>

Today's Topics:<br>

<br>

   1. dimension well my queries for very large tables like<br>

      pagelinks - Tool Labs (Marc Miquel)<br>

   2. Re: dimension well my queries for very large tables like<br>

      pagelinks - Tool Labs (John)<br>

   3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)<br>

   4. Re: Questions regarding the Labs Terms of use (Ryan Lane)<br>

   5. Re: Questions regarding the Labs Terms of use (Pine W)<br>

<br>

<br>

----------------------------------------------------------------------<br>

<br>

Message: 1<br>

Date: Fri, 13 Mar 2015 17:59:09 +0100<br>

From: Marc Miquel <<a href="mailto:marcmiquel@gmail.com" target="_blank">marcmiquel@gmail.com</a>><br>

To: "<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>" <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>

Subject: [Labs-l] dimension well my queries for very large tables like<br>

        pagelinks - Tool Labs<br>

Message-ID:<br>

        <CANSEGinZBWYsb0Y9r9Yk8AZo3COwzT4NTs7YkFxj=<a href="mailto:naEa9d6%2Bw@mail.gmail.com" target="_blank">naEa9d6+w@mail.gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

Hello guys,<br>

<br>

I have a question regarding Tool Labs. I am doing research on links and,<br>

although I know very well what I am looking for, I struggle with how to get it<br>

efficiently...<br>

<br>

I need your opinion because you know the system very well, including<br>

what's feasible and what is not.<br>

<br>

Let me explain what I need to do:<br>

I have a list of articles in different languages whose pagelinks I need to<br>

check, to see where they point to and which pages point at<br>

them.<br>

<br>

I currently run one query for each article id in this list, which ranges<br>

from 80000 articles in some Wikipedias to 300000 or more in others. I have to do it<br>

several times and it is very time consuming (several days). I wish I could<br>

just count the total number of links in each case, but I need to look at only some of<br>

the links of each article.<br>
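One way to cut down the number of round trips described above is to ask for many article ids in a single query. The following is only a minimal sketch: the batch size of 1000 and the helper names `chunked` and `outlinks_query` are assumptions for illustration, and the column names follow the MediaWiki `pagelinks` table (`pl_from`, `pl_namespace`, `pl_title`).

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(ids: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `ids` with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def outlinks_query(batch: Sequence[int]) -> Tuple[str, Tuple[int, ...]]:
    """Build one parameterized query that covers a whole batch of page ids."""
    # One "%s" placeholder per id; the driver fills them in safely.
    placeholders = ", ".join(["%s"] * len(batch))
    sql = (
        "SELECT pl_from, pl_namespace, pl_title "
        "FROM pagelinks WHERE pl_from IN (" + placeholders + ")"
    )
    return sql, tuple(batch)


# Hypothetical ids standing in for the real list of articles.
page_ids = list(range(1, 2501))
queries = [outlinks_query(batch) for batch in chunked(page_ids, 1000)]
```

Each `(sql, params)` pair would then be passed to `cursor.execute(sql, params)` with a MySQL driver that uses the `%s` placeholder style (MySQLdb and PyMySQL do). With 300000 ids and batches of 1000 this is 300 queries instead of 300000; the best batch size is something to measure rather than assume.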

<br>

I was thinking about fetching all pagelinks and iterating over them in Python<br>

(which is the language I use for all this). This would be much faster<br>

because I would avoid the one-query-per-article approach I am using now. But the<br>

pagelinks table has millions of rows and I cannot load all of it because MySQL<br>

would die. I could buffer the results, but I haven't tried whether that works.<br>
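The buffering idea mentioned above can be sketched with the standard DB-API `fetchmany` call, which returns a limited number of rows at a time. This is an illustration under stated assumptions, not a tested Tool Labs recipe: `stream_rows` is a hypothetical helper, and an in-memory `sqlite3` database with made-up rows stands in for the MySQL replica so that the example runs anywhere.

```python
import sqlite3
from typing import Any, Iterator, Tuple


def stream_rows(cursor: Any, batch_size: int = 10000) -> Iterator[Tuple]:
    """Yield rows from an already executed DB-API cursor, batch by batch."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row


# sqlite3 is only a stand-in for the real database; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT)"
)
conn.executemany(
    "INSERT INTO pagelinks VALUES (?, 0, ?)",
    [(i % 7, "Title_%d" % i) for i in range(25)],
)

cur = conn.execute("SELECT pl_from, pl_title FROM pagelinks")
outlink_counts = {}
for pl_from, pl_title in stream_rows(cur, batch_size=10):
    # Only the current batch is held in memory, plus whatever we aggregate.
    outlink_counts[pl_from] = outlink_counts.get(pl_from, 0) + 1
```

With MySQLdb or PyMySQL the default cursor copies the whole result set to the client when the query is executed, so `fetchmany` only saves memory when combined with an unbuffered cursor such as `SSCursor`.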

<br>

I am considering creating a personal table in the database with titles and<br>

ids, and inner joining it to obtain only the pagelinks for these 300,000<br>

articles. With this I would retrieve just 20% of the database instead of<br>

100%. That could still be around 8M rows (page_title or page_id, one<br>

of the two per row), or even more, loaded into Python dictionaries and<br>

lists. Would that be a problem? I have no idea how much RAM this<br>

requires or how much I can use in Tool Labs.<br>
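The personal-table idea above might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the table name `my_pages`, the sample rows, the restriction to namespace 0, and the use of an in-memory `sqlite3` database in place of the MySQL replica. The joins themselves are ordinary SQL and follow the `pagelinks` columns `pl_from`, `pl_namespace` and `pl_title`.

```python
import sqlite3
from collections import defaultdict

# sqlite3 stands in for the real database; my_pages is a hypothetical table
# holding the ids and titles of the articles in the group of interest.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT);
    CREATE TABLE my_pages (page_id INTEGER PRIMARY KEY, page_title TEXT);
""")
conn.executemany("INSERT INTO pagelinks VALUES (?, ?, ?)", [
    (1, 0, "B"), (1, 0, "C"), (2, 0, "A"), (3, 0, "A"), (9, 0, "Z"),
])
conn.executemany("INSERT INTO my_pages VALUES (?, ?)", [(1, "A"), (2, "B")])

# Outlinks: links whose source page is in the group.
outlinks = defaultdict(set)
for page_id, target in conn.execute(
    "SELECT p.page_id, pl.pl_title FROM my_pages AS p "
    "INNER JOIN pagelinks AS pl ON pl.pl_from = p.page_id "
    "WHERE pl.pl_namespace = 0"
):
    outlinks[page_id].add(target)

# Inlinks: links whose target title is in the group.
inlinks = defaultdict(set)
for page_title, source_id in conn.execute(
    "SELECT p.page_title, pl.pl_from FROM my_pages AS p "
    "INNER JOIN pagelinks AS pl ON pl.pl_title = p.page_title "
    "WHERE pl.pl_namespace = 0"
):
    inlinks[page_title].add(source_id)
```

Only rows that touch the group reach Python, so memory grows with the number of matching links rather than with the size of the whole table. How much RAM several million such pairs need depends on title lengths and on the Python version, so measuring on a sample (for instance with the standard `tracemalloc` module) is probably safer than guessing.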

<br>

I am totally lost when I run into these problems of scale... I thought<br>

about asking on the IRC channel, but it seemed too long and<br>

too specific. Any hint would really help.<br>

<br>

Thank you very much!<br>

<br>

Cheers,<br>

<br>

Marc Miquel<br>


<br>

------------------------------<br>

<br>

Message: 2<br>

Date: Fri, 13 Mar 2015 13:07:20 -0400<br>

From: John <<a href="mailto:phoenixoverride@gmail.com" target="_blank">phoenixoverride@gmail.com</a>><br>

To: Wikimedia Labs <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>

Subject: Re: [Labs-l] dimension well my queries for very large tables<br>

        like pagelinks - Tool Labs<br>

Message-ID:<br>

        <CAP-JHpn=ToVYdT7i-imp7+XLTkgQ1PtieORx7BTuVAJw=<a href="mailto:YSbFQ@mail.gmail.com" target="_blank">YSbFQ@mail.gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

What kind of queries are you doing? Odds are they can be optimized.<br>

<br>


<br>

------------------------------<br>

<br>

Message: 3<br>

Date: Fri, 13 Mar 2015 17:36:00 +0000<br>

From: Tim Landscheidt <<a href="mailto:tim@tim-landscheidt.de" target="_blank">tim@tim-landscheidt.de</a>><br>

To: <a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a><br>

Subject: Re: [Labs-l] Questions regarding the Labs Terms of use<br>

Message-ID: <<a href="mailto:878uf0vlz3.fsf@passepartout.tim-landscheidt.de" target="_blank">878uf0vlz3.fsf@passepartout.tim-landscheidt.de</a>><br>

Content-Type: text/plain<br>

<br>

(anonymous) wrote:<br>

<br>

> [...]<br>

<br>

> To be clear: I'm not going to make my code proprietary in<br>

> any way. I just wanted to know whether I'm entitled to ask<br>

> for the source of every Labs bot ;-)<br>

<br>

Everyone is entitled to /ask/, but I don't think you have a<br>

right to /receive/ the source :-).<br>

<br>

AFAIK, there are two main reasons for the clause:<br>

<br>

a) WMF doesn't want to have to deal with individual licences<br>

   that may or may not have the potential for litigation<br>

   ("The Software shall be used for Good, not Evil").  By<br>

   requiring OSI-approved, tried and true licences, the risk<br>

   is negligible.<br>

<br>

b) Bots and tools running on an infrastructure financed by<br>

   donors, like contributions to Wikipedia & Co., shouldn't<br>

   be usable for blackmail.  No one should be in a legal<br>

   position to demand something "or else ..."  The perpetuity<br>

   of OS licences guarantees that everyone can be truly<br>

   thankful to developers without having to fear that<br>

   otherwise they shut down devices, delete content, etc.<br>

<br>

But the nice thing about collaboratively developed open<br>

source software is that it usually is of a better quality,<br>

so clandestine code is often not that interesting.<br>

<br>

Tim<br>

<br>

<br>

<br>

<br>

------------------------------<br>

<br>

Message: 4<br>

Date: Fri, 13 Mar 2015 11:52:18 -0600<br>

From: Ryan Lane <<a href="mailto:rlane32@gmail.com" target="_blank">rlane32@gmail.com</a>><br>

To: Wikimedia Labs <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>

Subject: Re: [Labs-l] Questions regarding the Labs Terms of use<br>

Message-ID:<br>

        <<a href="mailto:CALKgCA3Lv-SQoeibEsm7Ckc0gaPJwph_b0HSTx%2BactaKMDuXmg@mail.gmail.com" target="_blank">CALKgCA3Lv-SQoeibEsm7Ckc0gaPJwph_b0HSTx+actaKMDuXmg@mail.gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa <<a href="mailto:ricordisamoa@openmailbox.org" target="_blank">ricordisamoa@openmailbox.org</a>><br>

wrote:<br>

<br>

> From <a href="https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use" target="_blank">https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use</a><br>

> (verbatim): "Do not use or install any software unless the software is<br>

> licensed under an Open Source license".<br>

> What about tools and services made up of software themselves? Do they have<br>

> to be Open Source?<br>

> Strictly speaking, do the Terms of use require that all code be made<br>

> available to the public?<br>

> Thanks in advance.<br>

><br>

><br>

As the person who wrote the initial terms and included this I can speak to<br>

the spirit of the term (I'm not a lawyer, so I won't try to go into any<br>

legal issues).<br>

<br>

I created Labs with the intent that it could be used as a mechanism to fork<br>

the projects as a whole, if necessary. A means to this end was including<br>

non-WMF employees in the process of infrastructure operations (which is<br>

outside the goals of the tools project in Labs). Tools/services that<br>

can't be distributed publicly harm that goal. Tools/services that aren't<br>

open source completely break that goal. It's fine if you wish to not<br>

maintain the code in a public git repo, but if another tool maintainer<br>

wishes to publish your code, there should be nothing blocking that.<br>

<br>

Depending on external closed source services is a debatable topic. I know<br>

in the past we've decided to allow it. It goes against the spirit of the<br>

project, but it doesn't require us to distribute closed-source software in<br>

the case of a fork.<br>

<br>

My personal opinion is that your code should be in a public repository to<br>

encourage collaboration. As the terms are written, though, your code is<br>

required to be open source, and any libraries it depends on must be as well.<br>

<br>

- Ryan<br>


<br>

------------------------------<br>

<br>

Message: 5<br>

Date: Fri, 13 Mar 2015 11:29:47 -0700<br>

From: Pine W <<a href="mailto:wiki.pine@gmail.com" target="_blank">wiki.pine@gmail.com</a>><br>

To: Wikimedia Labs <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>

Subject: Re: [Labs-l] Questions regarding the Labs Terms of use<br>

Message-ID:<br>

        <CAF=dyJjO69O-ye+327BU6wC_k_+AQvwUq0rfZvW0YaV=<a href="mailto:P%2BiCaA@mail.gmail.com" target="_blank">P+iCaA@mail.gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

Question: are there heightened security or privacy risks posed by having<br>

non-open-source code running in Labs?<br>

<br>

Is anyone proactively auditing Labs software for open source compliance,<br>

and if not, should this be done?<br>

<br>

Pine<br>


<br>

------------------------------<br>

<br>

_______________________________________________<br>

Labs-l mailing list<br>

<a href="mailto:Labs-l@lists.wikimedia.org" target="_blank">Labs-l@lists.wikimedia.org</a><br>

<a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" target="_blank">https://lists.wikimedia.org/mailman/listinfo/labs-l</a><br>

<br>

<br>

End of Labs-l Digest, Vol 39, Issue 13<br>

**************************************<br>

</blockquote></div><br></div></div></div>

</blockquote></div><br></div>

</div></div></blockquote></div><br></div>