[Labs-l] Labs-l Digest, Vol 39, Issue 13

Marc Miquel marcmiquel at gmail.com
Fri Mar 13 18:53:36 UTC 2015


I load "page_titles" and "page_ids" from a file and put them in a
dictionary. One option I haven't used yet would be putting them into a
database table and INNER JOINing it with the pagelinks table to obtain only
the links for those articles. Still, if the list is 300,000 articles, even
though that is just 20% of the database, it is still a lot.
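
This is roughly what that option would look like (an untested sketch: the
host, the user database name "s12345__links_p", the credentials path and
the ids file format are all placeholders for whatever Labs provides, and I
assume the MySQLdb driver):

import os

import MySQLdb

# Untested sketch; 'enwiki.labsdb', 's12345__links_p' and the credentials
# path are placeholders.
conn = MySQLdb.connect(host='enwiki.labsdb', db='enwiki_p',
                       read_default_file=os.path.expanduser('~/.my.cnf'))
cur = conn.cursor()

# A personal table holding just my page ids, filled once from the file
# (assumed format: one numeric page id per line).
cur.execute("""CREATE TABLE IF NOT EXISTS s12345__links_p.my_pages (
                   page_id INT UNSIGNED NOT NULL PRIMARY KEY
               )""")
with open('page_ids') as f:
    ids = [(int(line),) for line in f if line.strip()]
cur.executemany("INSERT IGNORE INTO s12345__links_p.my_pages VALUES (%s)",
                ids)
conn.commit()

# One JOIN instead of one query per article: only the links whose source
# page is in my list come back.
cur.execute("""SELECT pl.pl_from, pl.pl_namespace, pl.pl_title
               FROM pagelinks AS pl
               INNER JOIN s12345__links_p.my_pages AS mp
                   ON pl.pl_from = mp.page_id""")
outlinks = {}
for pl_from, ns, title in cur:
    outlinks.setdefault(pl_from, []).append((ns, title))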

Marc

2015-03-13 19:51 GMT+01:00 John <phoenixoverride at gmail.com>:

> Where are you getting your list of pages from?
>
> On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <marcmiquel at gmail.com> wrote:
>
>> Hi John,
>>
>>
>> My queries obtain the "inlinks" and "outlinks" for the articles I
>> have in a group (x). Then I check (using Python) whether they have inlinks
>> and outlinks from another group of articles. Right now I am doing one
>> query per article. I wanted to obtain all the links for group (x) at once
>> and then do this check... But getting all the links for groups as big as
>> 300,000 articles would mean some 6 million links. Is it possible to obtain
>> all of this, or is there a MySQL/RAM limit?
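>>
>> (For context, the two per-article queries I run now look roughly like
>> this; the page id 12345, namespace 0 and title 'Example_page' are
>> stand-ins, and the host and credentials path are placeholders:)
>>
>> import os
>>
>> import MySQLdb
>>
>> conn = MySQLdb.connect(host='enwiki.labsdb', db='enwiki_p',
>>                        read_default_file=os.path.expanduser('~/.my.cnf'))
>> cur = conn.cursor()
>>
>> # Outlinks: the targets this article points to.
>> cur.execute("SELECT pl_namespace, pl_title FROM pagelinks "
>>             "WHERE pl_from = %s", (12345,))
>> outlinks = cur.fetchall()
>>
>> # Inlinks: the source page ids that point at this article's title.
>> cur.execute("SELECT pl_from FROM pagelinks "
>>             "WHERE pl_namespace = %s AND pl_title = %s",
>>             (0, 'Example_page'))
>> inlinks = cur.fetchall()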
>>
>> Thanks.
>>
>> Marc
>>
>> 2015-03-13 19:29 GMT+01:00 <labs-l-request at lists.wikimedia.org>:
>>
>>> Send Labs-l mailing list submissions to
>>>         labs-l at lists.wikimedia.org
>>>
>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>         https://lists.wikimedia.org/mailman/listinfo/labs-l
>>> or, via email, send a message with subject or body 'help' to
>>>         labs-l-request at lists.wikimedia.org
>>>
>>> You can reach the person managing the list at
>>>         labs-l-owner at lists.wikimedia.org
>>>
>>> When replying, please edit your Subject line so it is more specific
>>> than "Re: Contents of Labs-l digest..."
>>>
>>>
>>> Today's Topics:
>>>
>>>    1. sizing my queries well for very large tables like
>>>       pagelinks - Tool Labs (Marc Miquel)
>>>    2. Re: sizing my queries well for very large tables like
>>>       pagelinks - Tool Labs (John)
>>>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>>>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>>>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>>>
>>>
>>> ----------------------------------------------------------------------
>>>
>>> Message: 1
>>> Date: Fri, 13 Mar 2015 17:59:09 +0100
>>> From: Marc Miquel <marcmiquel at gmail.com>
>>> To: "labs-l at lists.wikimedia.org" <labs-l at lists.wikimedia.org>
>>> Subject: [Labs-l] sizing my queries well for very large tables like
>>>         pagelinks - Tool Labs
>>> Message-ID:
>>>         <CANSEGinZBWYsb0Y9r9Yk8AZo3COwzT4NTs7YkFxj=
>>> naEa9d6+w at mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Hello guys,
>>>
>>> I have a question regarding Tool Labs. I am doing research on links,
>>> and although I know very well what I am looking for, I struggle with
>>> how to get it efficiently...
>>>
>>> I'd like your opinion, because you know the system well and know what
>>> is feasible and what is not.
>>>
>>> Let me explain what I need to do:
>>> I have a list of articles in different languages, and for each article
>>> I need to check its pagelinks to see where it points to and what points
>>> at it.
>>>
>>> Right now I run one query per article id in this list, which ranges
>>> from 80,000 articles in some Wikipedias to 300,000 or more in others. I
>>> have to do this several times and it is very time consuming (several
>>> days). I wish I could just count the total number of links for each
>>> case, but I actually need to look at some of the links per article.
>>>
>>> I was thinking about fetching all of pagelinks and iterating over it in
>>> Python (which is the language I use for all of this). This would be
>>> much faster because I'd save all the queries, one per article, that I
>>> am doing now. But the pagelinks table has millions of rows, and I
>>> cannot load all of that at once because MySQL would die. I could fetch
>>> the rows in batches, but I haven't tried whether that works either.
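>>>
>>> (One way to do the batched fetching: a sketch with MySQLdb's
>>> server-side cursor, which streams rows one at a time instead of holding
>>> the whole result set in client memory. I haven't tried this on Tool
>>> Labs, and the host and credentials path are placeholders:)
>>>
>>> import os
>>>
>>> import MySQLdb
>>> import MySQLdb.cursors
>>>
>>> # SSCursor fetches rows from the server as they are consumed, so the
>>> # client never materializes the full result. The scan is still heavy
>>> # on the server, and every row must be read before the connection can
>>> # be reused.
>>> conn = MySQLdb.connect(host='enwiki.labsdb', db='enwiki_p',
>>>                        read_default_file=os.path.expanduser('~/.my.cnf'),
>>>                        cursorclass=MySQLdb.cursors.SSCursor)
>>> cur = conn.cursor()
>>> cur.execute("SELECT pl_from, pl_namespace, pl_title FROM pagelinks")
>>> for pl_from, ns, title in cur:
>>>     pass  # process one row at a time; keep only what is needed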
>>>
>>> I am considering creating a personal table in the database with the
>>> titles and ids, and inner joining it with pagelinks to obtain only the
>>> pagelinks for these 300,000 articles. With this I would retrieve just
>>> 20% of the database instead of 100%. That could still be 8M rows or
>>> more (page_title or page_id, one of the two per row), loaded into
>>> Python dictionaries and lists. Would that be a problem...? I have no
>>> idea how much RAM this implies, nor how much I can use in Tool Labs.
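>>>
>>> (One back-of-the-envelope way to answer my own RAM question: measure a
>>> small sample dictionary shaped like the real data and scale it up.
>>> sys.getsizeof is shallow, so the real figure would be somewhat higher:)
>>>
>>> import sys
>>>
>>> # 100,000 sample entries (page_id -> page_title), scaled up to 8M rows.
>>> sample = dict((i, 'Some_page_title_%d' % i) for i in xrange(100000))
>>> per_entry = (sys.getsizeof(sample) +
>>>              sum(sys.getsizeof(k) + sys.getsizeof(v)
>>>                  for k, v in sample.iteritems())) / float(len(sample))
>>> print('~%d bytes per entry, ~%.1f GB for 8M rows'
>>>       % (per_entry, per_entry * 8e6 / 2 ** 30))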
>>>
>>> I am totally lost when I run into these problems of scale... I thought
>>> about writing to the IRC channel, but I thought it was maybe too long
>>> and too specific. Any hint you can give me would really help.
>>>
>>> Thank you very much!
>>>
>>> Cheers,
>>>
>>> Marc Miquel
>>>
>>> ------------------------------
>>>
>>> Message: 2
>>> Date: Fri, 13 Mar 2015 13:07:20 -0400
>>> From: John <phoenixoverride at gmail.com>
>>> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
>>> Subject: Re: [Labs-l] sizing my queries well for very large tables
>>>         like pagelinks - Tool Labs
>>> Message-ID:
>>>         <CAP-JHpn=ToVYdT7i-imp7+XLTkgQ1PtieORx7BTuVAJw=
>>> YSbFQ at mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
What kind of queries are you doing? Odds are they can be optimized.
>>>
>>> On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <marcmiquel at gmail.com>
>>> wrote:
>>>
>>> > [...]
>>>
>>> ------------------------------
>>>
>>> Message: 3
>>> Date: Fri, 13 Mar 2015 17:36:00 +0000
>>> From: Tim Landscheidt <tim at tim-landscheidt.de>
>>> To: labs-l at lists.wikimedia.org
>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>> Message-ID: <878uf0vlz3.fsf at passepartout.tim-landscheidt.de>
>>> Content-Type: text/plain
>>>
>>> (anonymous) wrote:
>>>
>>> > [...]
>>>
>>> > To be clear: I'm not going to make my code proprietary in
>>> > any way. I just wanted to know whether I'm entitled to ask
>>> > for the source of every Labs bot ;-)
>>>
>>> Everyone is entitled to /ask/, but I don't think you have a
>>> right to /receive/ the source :-).
>>>
>>> AFAIK, there are two main reasons for the clause:
>>>
>>> a) WMF doesn't want to have to deal with individual licences
>>>    that may or may not have the potential for litigation
>>>    ("The Software shall be used for Good, not Evil").  By
>>>    requiring OSI-approved, tried and true licences, the risk
>>>    is negligible.
>>>
>>> b) Bots and tools running on an infrastructure financed by
>>>    donors, like contributions to Wikipedia & Co., shouldn't
>>>    be usable for blackmail.  No one should be in a legal
>>>    position to demand something "or else ..."  The perpetuity
>>>    of OS licences guarantees that everyone can be truly
>>>    thankful to developers without having to fear that they
>>>    would otherwise shut down devices, delete content, etc.
>>>
>>> But the nice thing about collaboratively developed open source
>>> software is that it is usually of better quality, so clandestine
>>> code is often not that interesting.
>>>
>>> Tim
>>>
>>>
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 4
>>> Date: Fri, 13 Mar 2015 11:52:18 -0600
>>> From: Ryan Lane <rlane32 at gmail.com>
>>> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>> Message-ID:
>>>         <
>>> CALKgCA3Lv-SQoeibEsm7Ckc0gaPJwph_b0HSTx+actaKMDuXmg at mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa <
>>> ricordisamoa at openmailbox.org>
>>> wrote:
>>>
>>> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
>>> > (verbatim): "Do not use or install any software unless the software is
>>> > licensed under an Open Source license".
>>> > What about tools and services made up of software themselves? Do they
>>> have
>>> > to be Open Source?
>>> > Strictly speaking, do the Terms of use require that all code be made
>>> > available to the public?
>>> > Thanks in advance.
>>> >
>>> >
>>> As the person who wrote the initial terms and included this clause, I
>>> can speak to the spirit of the term (I'm not a lawyer, so I won't try
>>> to go into any legal issues).
>>>
>>> I created Labs with the intent that it could be used as a mechanism to
>>> fork the projects as a whole, if necessary. A means to this end was
>>> including non-WMF employees in the process of infrastructure operations
>>> (which is outside the goals of the tools project in Labs). Tools/services
>>> that can't be distributed publicly harm that goal. Tools/services that
>>> aren't open source completely break that goal. It's fine if you wish not
>>> to maintain the code in a public git repo, but if another tool maintainer
>>> wishes to publish your code, there should be nothing blocking that.
>>>
>>> Depending on external closed-source services is a debatable topic. I
>>> know in the past we've decided to allow it. It goes against the spirit
>>> of the project, but it doesn't require us to distribute closed-source
>>> software in the case of a fork.
>>>
>>> My personal opinion is that your code should be in a public repository to
>>> encourage collaboration. As the terms are written, though, your code is
>>> required to be open source, and any libraries it depends on must be as
>>> well.
>>>
>>> - Ryan
>>>
>>> ------------------------------
>>>
>>> Message: 5
>>> Date: Fri, 13 Mar 2015 11:29:47 -0700
>>> From: Pine W <wiki.pine at gmail.com>
>>> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>> Message-ID:
>>>         <CAF=dyJjO69O-ye+327BU6wC_k_+AQvwUq0rfZvW0YaV=
>>> P+iCaA at mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Question: are there heightened security or privacy risks posed by having
>>> non-open-source code running in Labs?
>>>
>>> Is anyone proactively auditing Labs software for open source compliance,
>>> and if not, should this be done?
>>>
>>> Pine
>>> On Mar 13, 2015 10:52 AM, "Ryan Lane" <rlane32 at gmail.com> wrote:
>>>
>>> > [...]
>>>
>>> ------------------------------
>>>
>>> _______________________________________________
>>> Labs-l mailing list
>>> Labs-l at lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>>>
>>>
>>> End of Labs-l Digest, Vol 39, Issue 13
>>> **************************************
>>>
>>
>>
>

