<div dir="ltr">OK, no offense, but you're not being helpful. Are you pulling from a category, using page text, or some format already in the database? The reason I am asking is that, depending on how you are selecting the articles, it might be possible to query the database in a manner that optimizes the process and makes the overall time drop drastically. There have been a few queries I have played with in the past that originally took hours or even days, and we were able to get them down to a few minutes. However, the vague information we have been given so far isn't much to work with.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 13, 2015 at 3:01 PM, Marc Miquel <span dir="ltr"><<a href="mailto:marcmiquel@gmail.com" target="_blank">marcmiquel@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I get them through a selection I make based on other parameters, more related to the content. The selection of these 300,000, which could be 30,000 or even 500,000 in other cases (like the German wiki), is not an issue. The link analysis to see whether these 300,000 receive links from another group of articles is my concern... <div><br></div><div>Marc</div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">2015-03-13 19:56 GMT+01:00 John <span dir="ltr"><<a href="mailto:phoenixoverride@gmail.com" target="_blank">phoenixoverride@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Where are you getting the list of 300k pages from? 
I want to get a feel for the kinds of queries you're running so that we can optimize the process for you.<br></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 13, 2015 at 2:53 PM, Marc Miquel <span dir="ltr"><<a href="mailto:marcmiquel@gmail.com" target="_blank">marcmiquel@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I load "page_titles" and "page_ids" from a file and put them in a dictionary. One option I haven't used would be putting them into a database and INNER JOINing with the pagelinks table to obtain just the links for those articles. Still, if the list is 300,000 articles, even though this is just 20% of the database, it is still a lot.<div><br></div><div>Marc</div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">2015-03-13 19:51 GMT+01:00 John <span dir="ltr"><<a href="mailto:phoenixoverride@gmail.com" target="_blank">phoenixoverride@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Where are you getting your list of pages from?<br></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <span dir="ltr"><<a href="mailto:marcmiquel@gmail.com" target="_blank">marcmiquel@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi John,<div><br></div><div><br></div><div>My queries are to obtain "inlinks" and "outlinks" for some articles I have in a group (x). 
Then I check (using Python) whether they have inlinks and outlinks from another group of articles. For now I am doing a query for each article. I wanted to obtain all links for group (x) and then do this check... But getting all links for groups as big as 300,000 articles would imply 6 million links. Is it possible to obtain all of this, or is there a MySQL/RAM limit?</div><div><br></div><div>Thanks.</div><div><br></div><div>Marc</div><div><br><div class="gmail_extra"><br><div class="gmail_quote">2015-03-13 19:29 GMT+01:00 <span dir="ltr"><<a href="mailto:labs-l-request@lists.wikimedia.org" target="_blank">labs-l-request@lists.wikimedia.org</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Send Labs-l mailing list submissions to<br>
<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a><br>
<br>
To subscribe or unsubscribe via the World Wide Web, visit<br>
<a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" target="_blank">https://lists.wikimedia.org/mailman/listinfo/labs-l</a><br>
or, via email, send a message with subject or body 'help' to<br>
<a href="mailto:labs-l-request@lists.wikimedia.org" target="_blank">labs-l-request@lists.wikimedia.org</a><br>
<br>
You can reach the person managing the list at<br>
<a href="mailto:labs-l-owner@lists.wikimedia.org" target="_blank">labs-l-owner@lists.wikimedia.org</a><br>
<br>
When replying, please edit your Subject line so it is more specific<br>
than "Re: Contents of Labs-l digest..."<br>
<br>
<br>
Today's Topics:<br>
<br>
1. dimension well my queries for very large tables like<br>
pagelinks - Tool Labs (Marc Miquel)<br>
2. Re: dimension well my queries for very large tables like<br>
pagelinks - Tool Labs (John)<br>
3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)<br>
4. Re: Questions regarding the Labs Terms of use (Ryan Lane)<br>
5. Re: Questions regarding the Labs Terms of use (Pine W)<br>
<br>
<br>
----------------------------------------------------------------------<br>
<br>
Message: 1<br>
Date: Fri, 13 Mar 2015 17:59:09 +0100<br>
From: Marc Miquel <<a href="mailto:marcmiquel@gmail.com" target="_blank">marcmiquel@gmail.com</a>><br>
To: "<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>" <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>
Subject: [Labs-l] dimension well my queries for very large tables like<br>
pagelinks - Tool Labs<br>
Message-ID:<br>
<CANSEGinZBWYsb0Y9r9Yk8AZo3COwzT4NTs7YkFxj=<a href="mailto:naEa9d6%2Bw@mail.gmail.com" target="_blank">naEa9d6+w@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Hello guys,<br>
<br>
I have a question regarding Tool Labs. I am doing research on links, and<br>
although I know very well what I am looking for, I struggle with how to get<br>
it efficiently...<br>
<br>
I would like your opinion because you know the system well, and what's<br>
feasible and what is not.<br>
<br>
Let me explain what I need to do:<br>
I have a list of articles for different languages, and for each article I<br>
need to check its pagelinks to see where it points to and which pages<br>
point at it.<br>
<br>
Right now I run a query for each article id in this list, which ranges from<br>
80,000 articles in some Wikipedias to 300,000 or more in others. I have to<br>
do it several times and it is very time-consuming (several days). I wish I<br>
could just count the total links for each case, but I need to see some of<br>
the links per article.<br>
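One query per article id is usually the expensive part; the same rows can be fetched in far fewer round trips by batching ids into IN (...) lists. A rough sketch in Python (the table and column names assume the replica schema where pagelinks stores pl_from, pl_namespace, pl_title and is joined to page on the target title; adapt as needed):

```python
def chunks(seq, size):
    """Yield successive slices of seq with at most `size` elements."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def inlink_query(page_ids):
    """Build one query that fetches the inbound links for a whole batch
    of page ids, instead of one query per article. int() guards the
    interpolated list against injection."""
    ids = ",".join(str(int(p)) for p in page_ids)
    return ("SELECT pl_from, page_id FROM pagelinks JOIN page "
            "ON pl_namespace = page_namespace AND pl_title = page_title "
            "WHERE page_id IN (%s)" % ids)

# 300,000 ids become 30 queries instead of 300,000 round trips:
# for batch in chunks(article_ids, 10000):
#     cursor.execute(inlink_query(batch))
```

The batch size is a tuning knob: bigger batches mean fewer round trips but longer IN lists for the server to parse.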
<br>
I was thinking about getting all pagelinks and iterating in Python (which<br>
is the language I use for all this). This would be much faster because I'd<br>
save all the queries, one per article, that I am doing now. But the<br>
pagelinks table has millions of rows and I cannot load all of that, because<br>
MySQL would die. I could buffer, but I haven't tried whether that works.<br>
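If the table is streamed rather than loaded (an unbuffered, server-side cursor instead of the default buffered one), the Python side only ever holds one row plus the id sets, so nothing has to materialize millions of rows at once. A sketch of the filtering step, with a small fake row list standing in for the cursor:

```python
def links_within_group(rows, group_ids):
    """Stream over (source_id, target_id) link rows and keep only the
    links whose source AND target are both in the article group, so the
    full pagelinks table never sits in memory. `rows` can be any
    iterable, e.g. an unbuffered MySQL cursor."""
    group = set(group_ids)            # O(1) membership tests
    for source, target in rows:
        if source in group and target in group:
            yield source, target

# Fake rows standing in for a cursor, purely illustrative:
rows = [(1, 2), (1, 99), (3, 2), (99, 1)]
print(list(links_within_group(rows, {1, 2, 3})))  # -> [(1, 2), (3, 2)]
```

Because it is a generator, results can also be counted or written to disk incrementally without building the full list.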
<br>
I am considering creating a personal table in the database with titles and<br>
ids, and inner joining it with pagelinks to obtain just the links for these<br>
300,000 articles. With this I would retrieve only 20% of the database<br>
instead of 100%. That could still be 8M rows sometimes (page_title or<br>
page_id, one of the two per row), or even more... loaded into Python<br>
dictionaries and lists. Would that be a problem...? I have no idea how much<br>
RAM this implies, or how much I can use in Tool Labs.<br>
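The RAM question can be answered empirically before committing to the 8M-row design: build a small sample dict, measure it, and extrapolate. This is very rough (it ignores string interning and overhead shared with other structures), but it gives the order of magnitude:

```python
import sys

def estimate_dict_memory(n_entries, sample_size=100000):
    """Extrapolate the footprint of an id -> title dict from a sample:
    the dict's own hash table plus each key and value object."""
    sample = {i: "Sample_page_title_%d" % i for i in range(sample_size)}
    total = sys.getsizeof(sample)
    total += sum(sys.getsizeof(k) + sys.getsizeof(v)
                 for k, v in sample.items())
    return int(total / sample_size * n_entries)

# Order-of-magnitude figure for 8 million entries:
print("~%.2f GB" % (estimate_dict_memory(8000000) / 1e9))
```

On CPython a dict entry with an int key and a short title string typically lands in the low hundreds of bytes, so 8M entries is on the order of a gigabyte; whether that fits depends on the per-process limits of the environment.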
<br>
I am totally lost when I run into these problems of scale... I thought<br>
about writing to the IRC channel, but this seemed too long and too<br>
specific. Any hint you can give would really help.<br>
<br>
Thank you very much!<br>
<br>
Cheers,<br>
<br>
Marc Miquel<br>
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a href="https://lists.wikimedia.org/pipermail/labs-l/attachments/20150313/59a09113/attachment-0001.html" target="_blank">https://lists.wikimedia.org/pipermail/labs-l/attachments/20150313/59a09113/attachment-0001.html</a>><br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Fri, 13 Mar 2015 13:07:20 -0400<br>
From: John <<a href="mailto:phoenixoverride@gmail.com" target="_blank">phoenixoverride@gmail.com</a>><br>
To: Wikimedia Labs <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>
Subject: Re: [Labs-l] dimension well my queries for very large tables<br>
like pagelinks - Tool Labs<br>
Message-ID:<br>
<CAP-JHpn=ToVYdT7i-imp7+XLTkgQ1PtieORx7BTuVAJw=<a href="mailto:YSbFQ@mail.gmail.com" target="_blank">YSbFQ@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
What kind of queries are you doing? Odds are they can be optimized.<br>
<br>
On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <<a href="mailto:marcmiquel@gmail.com" target="_blank">marcmiquel@gmail.com</a>> wrote:<br>
<br>
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a href="https://lists.wikimedia.org/pipermail/labs-l/attachments/20150313/c60b6d44/attachment-0001.html" target="_blank">https://lists.wikimedia.org/pipermail/labs-l/attachments/20150313/c60b6d44/attachment-0001.html</a>><br>
<br>
------------------------------<br>
<br>
Message: 3<br>
Date: Fri, 13 Mar 2015 17:36:00 +0000<br>
From: Tim Landscheidt <<a href="mailto:tim@tim-landscheidt.de" target="_blank">tim@tim-landscheidt.de</a>><br>
To: <a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a><br>
Subject: Re: [Labs-l] Questions regarding the Labs Terms of use<br>
Message-ID: <<a href="mailto:878uf0vlz3.fsf@passepartout.tim-landscheidt.de" target="_blank">878uf0vlz3.fsf@passepartout.tim-landscheidt.de</a>><br>
Content-Type: text/plain<br>
<br>
(anonymous) wrote:<br>
<br>
> [...]<br>
<br>
> To be clear: I'm not going to make my code proprietary in<br>
> any way. I just wanted to know whether I'm entitled to ask<br>
> for the source of every Labs bot ;-)<br>
<br>
Everyone is entitled to /ask/, but I don't think you have a<br>
right to /receive/ the source :-).<br>
<br>
AFAIK, there are two main reasons for the clause:<br>
<br>
a) WMF doesn't want to have to deal with individual licences<br>
that may or may not have the potential for litigation<br>
("The Software shall be used for Good, not Evil"). By<br>
requiring OSI-approved, tried-and-true licences, the risk<br>
is negligible.<br>
<br>
b) Bots and tools running on an infrastructure financed by<br>
donors, like contributions to Wikipedia & Co., shouldn't<br>
be usable for blackmail. No one should be in a legal<br>
position to demand something "or else ..." The perpetuity<br>
of OS licences guarantees that everyone can be truly<br>
thankful to developers without having to fear that<br>
otherwise they shut down devices, delete content, etc.<br>
<br>
But the nice thing about collaboratively developed open<br>
source software is that it is usually of better quality,<br>
so clandestine code is often not that interesting.<br>
<br>
Tim<br>
<br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 4<br>
Date: Fri, 13 Mar 2015 11:52:18 -0600<br>
From: Ryan Lane <<a href="mailto:rlane32@gmail.com" target="_blank">rlane32@gmail.com</a>><br>
To: Wikimedia Labs <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>
Subject: Re: [Labs-l] Questions regarding the Labs Terms of use<br>
Message-ID:<br>
<<a href="mailto:CALKgCA3Lv-SQoeibEsm7Ckc0gaPJwph_b0HSTx%2BactaKMDuXmg@mail.gmail.com" target="_blank">CALKgCA3Lv-SQoeibEsm7Ckc0gaPJwph_b0HSTx+actaKMDuXmg@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa <<a href="mailto:ricordisamoa@openmailbox.org" target="_blank">ricordisamoa@openmailbox.org</a>><br>
wrote:<br>
<br>
> From <a href="https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use" target="_blank">https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use</a><br>
> (verbatim): "Do not use or install any software unless the software is<br>
> licensed under an Open Source license".<br>
> What about tools and services made up of software themselves? Do they have<br>
> to be Open Source?<br>
> Strictly speaking, do the Terms of use require that all code be made<br>
> available to the public?<br>
> Thanks in advance.<br>
><br>
><br>
As the person who wrote the initial terms and included this I can speak to<br>
the spirit of the term (I'm not a lawyer, so I won't try to go into any<br>
legal issues).<br>
<br>
I created Labs with the intent that it could be used as a mechanism to fork<br>
the projects as a whole, if necessary. A means to this end was including<br>
non-WMF employees in the process of infrastructure operations (which is<br>
outside the goals of the tools project in Labs). Tools/services that<br>
can't be distributed publicly harm that goal. Tools/services that aren't<br>
open source completely break that goal. It's fine if you wish to not<br>
maintain the code in a public git repo, but if another tool maintainer<br>
wishes to publish your code, there should be nothing blocking that.<br>
<br>
Depending on external closed-source services is a debatable topic. I know<br>
in the past we've decided to allow it. It goes against the spirit of the<br>
project, but it doesn't require us to distribute closed-source software in<br>
the case of a fork.<br>
<br>
My personal opinion is that your code should be in a public repository to<br>
encourage collaboration. As the terms are written, though, your code is<br>
required to be open source, and any libraries it depends on must be as well.<br>
<br>
- Ryan<br>
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a href="https://lists.wikimedia.org/pipermail/labs-l/attachments/20150313/8d760f03/attachment-0001.html" target="_blank">https://lists.wikimedia.org/pipermail/labs-l/attachments/20150313/8d760f03/attachment-0001.html</a>><br>
<br>
------------------------------<br>
<br>
Message: 5<br>
Date: Fri, 13 Mar 2015 11:29:47 -0700<br>
From: Pine W <<a href="mailto:wiki.pine@gmail.com" target="_blank">wiki.pine@gmail.com</a>><br>
To: Wikimedia Labs <<a href="mailto:labs-l@lists.wikimedia.org" target="_blank">labs-l@lists.wikimedia.org</a>><br>
Subject: Re: [Labs-l] Questions regarding the Labs Terms of use<br>
Message-ID:<br>
<CAF=dyJjO69O-ye+327BU6wC_k_+AQvwUq0rfZvW0YaV=<a href="mailto:P%2BiCaA@mail.gmail.com" target="_blank">P+iCaA@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="utf-8"<br>
<br>
Question: are there heightened security or privacy risks posed by having<br>
non-open-source code running in Labs?<br>
<br>
Is anyone proactively auditing Labs software for open source compliance,<br>
and if not, should this be done?<br>
<br>
Pine<br>
On Mar 13, 2015 10:52 AM, "Ryan Lane" <<a href="mailto:rlane32@gmail.com" target="_blank">rlane32@gmail.com</a>> wrote:<br>
<br>
-------------- next part --------------<br>
An HTML attachment was scrubbed...<br>
URL: <<a href="https://lists.wikimedia.org/pipermail/labs-l/attachments/20150313/fc853f84/attachment.html" target="_blank">https://lists.wikimedia.org/pipermail/labs-l/attachments/20150313/fc853f84/attachment.html</a>><br>
<br>
------------------------------<br>
<br>
_______________________________________________<br>
Labs-l mailing list<br>
<a href="mailto:Labs-l@lists.wikimedia.org" target="_blank">Labs-l@lists.wikimedia.org</a><br>
<a href="https://lists.wikimedia.org/mailman/listinfo/labs-l" target="_blank">https://lists.wikimedia.org/mailman/listinfo/labs-l</a><br>
<br>
<br>
End of Labs-l Digest, Vol 39, Issue 13<br>
**************************************<br>
</blockquote></div><br></div></div></div>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>