[Labs-l] Labs-l Digest, Vol 39, Issue 13

Marc Miquel marcmiquel at gmail.com
Fri Mar 13 18:46:19 UTC 2015


Hi John,


My queries obtain the "inlinks" and "outlinks" for the articles I have in
a group (x). Then I check (in Python) whether they have inlinks to and
outlinks from another group of articles. Right now I run one query per
article. I wanted to fetch all the links for group (x) at once and then do
this check... but getting all the links for groups as big as 300,000
articles would mean some 6 million links. Is it possible to retrieve that
much at once, or is there a MySQL/RAM limit?

Thanks.

Marc


2015-03-13 19:29 GMT+01:00 <labs-l-request at lists.wikimedia.org>:

> Send Labs-l mailing list submissions to
>         labs-l at lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://lists.wikimedia.org/mailman/listinfo/labs-l
> or, via email, send a message with subject or body 'help' to
>         labs-l-request at lists.wikimedia.org
>
> You can reach the person managing the list at
>         labs-l-owner at lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Labs-l digest..."
>
>
> Today's Topics:
>
>    1. dimension well my queries for very large tables like
>       pagelinks - Tool Labs (Marc Miquel)
>    2. Re: dimension well my queries for very large tables like
>       pagelinks - Tool Labs (John)
>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 13 Mar 2015 17:59:09 +0100
> From: Marc Miquel <marcmiquel at gmail.com>
> To: "labs-l at lists.wikimedia.org" <labs-l at lists.wikimedia.org>
> Subject: [Labs-l] dimension well my queries for very large tables like
>         pagelinks - Tool Labs
> Message-ID:
>         <CANSEGinZBWYsb0Y9r9Yk8AZo3COwzT4NTs7YkFxj=
> naEa9d6+w at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hello guys,
>
> I have a question regarding Tool Labs. I am doing research on links, and
> although I know exactly what I am looking for, I am struggling to get it
> efficiently...
>
> I would like your opinion, because you know the system well and what is
> feasible and what is not.
>
> Let me explain what I need to do:
> I have a list of articles in several languages, and for each article I
> need to check its pagelinks to see which pages it points to and which
> pages point at it.
>
> Right now I run one query per article id in this list, which ranges from
> 80,000 articles in some Wikipedias to 300,000 or more in others. I have to
> do it several times and it is very time consuming (several days). I wish I
> could just count the total number of links in each case, but I actually
> need to look at some of the links per article.
>
> I was thinking about fetching all of pagelinks and iterating over it in
> Python (the language I use for all of this). That would be much faster,
> because it would save all the per-article queries I am doing now. But the
> pagelinks table has millions of rows, and I cannot load all of that at once
> because MySQL (or my client) would die. I could stream the results instead
> of loading them all, but I have not tested whether that works either.
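>
> (To illustrate the streaming idea: a minimal sketch with MySQLdb's
> server-side cursor, which yields rows as they arrive instead of
> materializing the whole result set in client memory; the connection
> details and the per-row handler are placeholders.)
>
>     import os
>     import MySQLdb
>     import MySQLdb.cursors
>
>     conn = MySQLdb.connect(host='enwiki.labsdb', db='enwiki_p',
>                            read_default_file=os.path.expanduser(
>                                '~/replica.my.cnf'),
>                            cursorclass=MySQLdb.cursors.SSCursor)
>     cur = conn.cursor()
>     cur.execute('SELECT pl_from, pl_namespace, pl_title FROM pagelinks')
>     for pl_from, ns, title in cur:       # rows stream one at a time
>         handle_link(pl_from, ns, title)  # hypothetical per-row handler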
>
> I am considering creating a personal table in the database with the titles
> and ids, and inner joining it against pagelinks to obtain only the links
> for these 300,000 articles. That way I would retrieve maybe 20% of the
> table instead of 100%. It could still be 8M rows sometimes (one page_title
> or page_id per row), or even more... loaded into Python dictionaries and
> lists. Would that be a problem...? I have no idea how much RAM this
> implies, or how much I can use in Tool Labs.
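>
> (A minimal sketch of that join, assuming a user database created on the
> same labsdb server as the replica so the join can run server-side; the
> name s51234__links and the group_ids list are placeholders, and conn is
> the replica connection from before.)
>
>     cur = conn.cursor()
>     cur.execute('''CREATE TABLE IF NOT EXISTS s51234__links.my_articles (
>                        page_id INT UNSIGNED NOT NULL PRIMARY KEY
>                    )''')
>     cur.executemany('INSERT IGNORE INTO s51234__links.my_articles '
>                     'VALUES (%s)', [(pid,) for pid in group_ids])
>     conn.commit()
>     cur.execute('''SELECT pl.pl_from, pl.pl_namespace, pl.pl_title
>                    FROM enwiki_p.pagelinks AS pl
>                    JOIN s51234__links.my_articles AS a
>                        ON pl.pl_from = a.page_id''')  # group's outlinks
>
> On the RAM question, a rough estimate: 8M (id, title) pairs held as plain
> Python tuples cost on the order of 1-2 GB, so it should be workable only
> if the structures stay lean (e.g. a set of bare titles) or the rows are
> processed as they stream rather than all kept at once.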
>
> I am totally lost when I run into these problems of scale... I thought
> about asking in the IRC channel, but this seemed too long and too specific.
> Any hint you can give would really help.
>
> Thank you very much!
>
> Cheers,
>
> Marc Miquel
>
> ------------------------------
>
> Message: 2
> Date: Fri, 13 Mar 2015 13:07:20 -0400
> From: John <phoenixoverride at gmail.com>
> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
> Subject: Re: [Labs-l] dimension well my queries for very large tables
>         like pagelinks - Tool Labs
> Message-ID:
>         <CAP-JHpn=ToVYdT7i-imp7+XLTkgQ1PtieORx7BTuVAJw=
> YSbFQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> What kind of queries are you doing? Odds are they can be optimized.
>
> On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <marcmiquel at gmail.com>
> wrote:
>
> > [snip]
>
> ------------------------------
>
> Message: 3
> Date: Fri, 13 Mar 2015 17:36:00 +0000
> From: Tim Landscheidt <tim at tim-landscheidt.de>
> To: labs-l at lists.wikimedia.org
> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
> Message-ID: <878uf0vlz3.fsf at passepartout.tim-landscheidt.de>
> Content-Type: text/plain
>
> (anonymous) wrote:
>
> > [...]
>
> > To be clear: I'm not going to make my code proprietary in
> > any way. I just wanted to know whether I'm entitled to ask
> > for the source of every Labs bot ;-)
>
> Everyone is entitled to /ask/, but I don't think you have a
> right to /receive/ the source :-).
>
> AFAIK, there are two main reasons for the clause:
>
> a) WMF doesn't want to have to deal with individual licences
>    that may or may not have the potential for litigation
>    ("The Software shall be used for Good, not Evil").  By
>    requiring OSI-approved, tried-and-true licences, the risk
>    is negligible.
>
> b) Bots and tools running on an infrastructure financed by
>    donors, like contributions to Wikipedia & Co., shouldn't
>    be usable for blackmail.  No one should be in a legal
>    position to demand something "or else ...".  The perpetuity
>    of OS licences guarantees that everyone can be truly
>    thankful to developers without having to fear that they
>    will otherwise shut down devices, delete content, etc.
>
> But the nice thing about collaboratively developed open
> source software is that it is usually of better quality,
> so clandestine code is often not that interesting.
>
> Tim
>
>
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 13 Mar 2015 11:52:18 -0600
> From: Ryan Lane <rlane32 at gmail.com>
> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
> Message-ID:
>         <
> CALKgCA3Lv-SQoeibEsm7Ckc0gaPJwph_b0HSTx+actaKMDuXmg at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa <
> ricordisamoa at openmailbox.org>
> wrote:
>
> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
> > (verbatim): "Do not use or install any software unless the software is
> > licensed under an Open Source license".
> > What about tools and services made up of software themselves? Do they
> have
> > to be Open Source?
> > Strictly speaking, do the Terms of use require that all code be made
> > available to the public?
> > Thanks in advance.
> >
> >
> As the person who wrote the initial terms and included this clause, I can
> speak to its spirit (I'm not a lawyer, so I won't try to go into any
> legal issues).
>
> I created Labs with the intent that it could be used as a mechanism to fork
> the projects as a whole, if necessary. A means to this end was including
> non-WMF employees in the process of infrastructure operations (which is
> outside the goals of the tools project in Labs). Tools/services that
> can't be distributed publicly harm that goal. Tools/services that aren't
> open source completely break that goal. It's fine if you wish not to
> maintain the code in a public git repo, but if another tool maintainer
> wishes to publish your code, there should be nothing blocking that.
>
> Depending on external closed-source services is a debatable topic. I know
> in the past we've decided to allow it. It goes against the spirit of the
> project, but it doesn't require us to distribute closed-source software in
> the case of a fork.
>
> My personal opinion is that your code should be in a public repository to
> encourage collaboration. As the terms are written, though, your code is
> required to be open source, and any libraries it depends on must be as
> well.
>
> - Ryan
>
> ------------------------------
>
> Message: 5
> Date: Fri, 13 Mar 2015 11:29:47 -0700
> From: Pine W <wiki.pine at gmail.com>
> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
> Message-ID:
>         <CAF=dyJjO69O-ye+327BU6wC_k_+AQvwUq0rfZvW0YaV=
> P+iCaA at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Question: are there heightened security or privacy risks posed by having
> non-open-source code running in Labs?
>
> Is anyone proactively auditing Labs software for open source compliance,
> and if not, should this be done?
>
> Pine
> On Mar 13, 2015 10:52 AM, "Ryan Lane" <rlane32 at gmail.com> wrote:
>
> > [snip]
>
> ------------------------------
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>
>
> End of Labs-l Digest, Vol 39, Issue 13
> **************************************
>

