[Labs-l] Labs-l Digest, Vol 39, Issue 13

Marc Miquel marcmiquel at gmail.com
Fri Mar 13 19:01:55 UTC 2015


I get them through a selection I make based on other parameters, mostly
related to the content. Choosing these 300,000 articles, which could be
30,000 or even 500,000 in other cases (like the German wiki), is not the
issue. My concern is the link analysis: checking whether these 300,000
articles receive links from another group of articles...
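
A minimal sketch of that cross-group check in Python, assuming the links
for the selected articles have already been fetched as (source, target) ID
pairs; the file names and the load_ids helper are made up for illustration:

    def load_ids(path):
        # One numeric page ID per line; a set gives O(1) membership tests.
        with open(path) as f:
            return {int(line) for line in f if line.strip()}

    group_x = load_ids("group_x_ids.txt")  # e.g. the 300,000 selected articles
    group_y = load_ids("group_y_ids.txt")  # the other group of articles

    def cross_links(links, sources, targets):
        # links: iterable of (from_id, to_id) pairs already fetched.
        # Yields only the links that start in sources and land in targets.
        for frm, to in links:
            if frm in sources and to in targets:
                yield frm, to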

Marc

2015-03-13 19:56 GMT+01:00 John <phoenixoverride at gmail.com>:

> Where are you getting the list of 300k pages from? I want to get a feel
> for the kinds of queries you're running so that we can optimize the
> process for you.
>
> On Fri, Mar 13, 2015 at 2:53 PM, Marc Miquel <marcmiquel at gmail.com> wrote:
>
>> I load "page_titles" and "page_ids" from a file and put them in a
>> dictionary. One option I haven't tried would be putting them into a
>> database and INNER JOINing with the pagelinks table to obtain only the
>> links for those articles. Still, if the list is 300,000 articles, even
>> though that is just 20% of the database, it is still a lot.
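
A sketch of that temporary-table idea, assuming the replica pagelinks
schema of the time (pl_from, pl_namespace, pl_title), a hypothetical user
table s12345__links.my_pages holding the selected page IDs, and pymysql as
the client library; the connection details are placeholders:

    import pymysql

    # Placeholder connection settings; on Tool Labs, credentials live in
    # ~/replica.my.cnf and the replicas are reachable as <wiki>.labsdb.
    conn = pymysql.connect(
        host="enwiki.labsdb",
        db="enwiki_p",
        read_default_file="~/replica.my.cnf",
        charset="utf8",
    )

    with conn.cursor() as cur:
        # The join restricts pagelinks to the selected articles, so only
        # their outlinks ever leave the database server.
        cur.execute("""
            SELECT pl.pl_from, pl.pl_namespace, pl.pl_title
            FROM pagelinks AS pl
            INNER JOIN s12345__links.my_pages AS mp
                ON mp.page_id = pl.pl_from
        """)
        for pl_from, pl_namespace, pl_title in cur:
            pass  # handle one outlink row at a time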
>>
>> Marc
>>>>
>> 2015-03-13 19:51 GMT+01:00 John <phoenixoverride at gmail.com>:
>>
>>> Where are you getting your list of pages from?
>>>
>>> On Fri, Mar 13, 2015 at 2:46 PM, Marc Miquel <marcmiquel at gmail.com>
>>> wrote:
>>>
>>>> Hi John,
>>>>
>>>>
>>>> My queries obtain "inlinks" and "outlinks" for the articles I have in a
>>>> group (x). Then I check (using Python) whether they have inlinks and
>>>> outlinks from another group of articles. Right now I am doing one query
>>>> per article. I wanted to obtain all the links for group (x) and then do
>>>> this check... But getting all the links for groups as big as 300,000
>>>> articles would imply 6 million links. Is it possible to obtain all of
>>>> this, or is there a MySQL/RAM limit?
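
For comparison, a hedged sketch of batching those per-article queries with
an IN (...) list, cutting 300,000 round trips down to a few hundred; the
cursor is set up as in the previous sketch and the chunk size of 1,000 is
a guess:

    def chunked(ids, size=1000):
        # Yield successive slices of at most `size` page IDs.
        for i in range(0, len(ids), size):
            yield ids[i:i + size]

    def outlinks_for(cur, page_ids):
        # One query per 1,000 articles instead of one query per article.
        for chunk in chunked(list(page_ids)):
            placeholders = ",".join(["%s"] * len(chunk))
            cur.execute(
                "SELECT pl_from, pl_namespace, pl_title "
                "FROM pagelinks WHERE pl_from IN (%s)" % placeholders,
                chunk,
            )
            for row in cur.fetchall():
                yield row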
>>>>
>>>> Thanks.
>>>>
>>>> Marc
>>>>
>>>>>>>>
>>>> 2015-03-13 19:29 GMT+01:00 <labs-l-request at lists.wikimedia.org>:
>>>>
>>>>> Send Labs-l mailing list submissions to
>>>>>         labs-l at lists.wikimedia.org
>>>>>
>>>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>>>         https://lists.wikimedia.org/mailman/listinfo/labs-l
>>>>> or, via email, send a message with subject or body 'help' to
>>>>>         labs-l-request at lists.wikimedia.org
>>>>>
>>>>> You can reach the person managing the list at
>>>>>         labs-l-owner at lists.wikimedia.org
>>>>>
>>>>> When replying, please edit your Subject line so it is more specific
>>>>> than "Re: Contents of Labs-l digest..."
>>>>>
>>>>>
>>>>> Today's Topics:
>>>>>
>>>>>    1. dimension well my queries for very large tables like
>>>>>       pagelinks - Tool Labs (Marc Miquel)
>>>>>    2. Re: dimension well my queries for very large tables like
>>>>>       pagelinks - Tool Labs (John)
>>>>>    3. Re: Questions regarding the Labs Terms of use (Tim Landscheidt)
>>>>>    4. Re: Questions regarding the Labs Terms of use (Ryan Lane)
>>>>>    5. Re: Questions regarding the Labs Terms of use (Pine W)
>>>>>
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>>
>>>>> Message: 1
>>>>> Date: Fri, 13 Mar 2015 17:59:09 +0100
>>>>> From: Marc Miquel <marcmiquel at gmail.com>
>>>>> To: "labs-l at lists.wikimedia.org" <labs-l at lists.wikimedia.org>
>>>>> Subject: [Labs-l] dimension well my queries for very large tables like
>>>>>         pagelinks - Tool Labs
>>>>> Message-ID:
>>>>>         <CANSEGinZBWYsb0Y9r9Yk8AZo3COwzT4NTs7YkFxj=
>>>>> naEa9d6+w at mail.gmail.com>
>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>
>>>>> Hello guys,
>>>>>
>>>>> I have a question regarding Tool Labs. I am doing research on links,
>>>>> and although I know very well what I am looking for, I am struggling
>>>>> with how to get it efficiently.
>>>>>
>>>>> I would like your opinion, because you know the system well and know
>>>>> what is feasible and what is not.
>>>>>
>>>>> Here is what I need to do:
>>>>> I have a list of articles for different languages, and for each article
>>>>> I need to check its pagelinks to see where it points to and what points
>>>>> at it.
>>>>>
>>>>> Right now I run one query per article ID in this list, which ranges
>>>>> from 80,000 articles in some Wikipedias to 300,000 or more in others. I
>>>>> have to do this several times, and it is very time-consuming (several
>>>>> days). I wish I could just count the total links for each case, but I
>>>>> need to inspect some of the links per article.
>>>>>
>>>>> I was thinking about fetching all of pagelinks and iterating over it in
>>>>> Python (the language I use for all of this). This would be much faster
>>>>> because it would save all the per-article queries I am doing now. But
>>>>> the pagelinks table has millions of rows, and I cannot load all of that
>>>>> because MySQL would die. I could buffer the results, but I haven't
>>>>> tested whether that works either.
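
One way to keep the client from dying is to stream rows instead of
buffering them all; a sketch using pymysql's server-side cursor (an
assumption on the library; MySQLdb's SSCursor is the equivalent), with
connection details as in the earlier sketch:

    import pymysql
    import pymysql.cursors

    # SSCursor streams rows from the server a batch at a time instead of
    # loading the entire result set into client RAM.
    conn = pymysql.connect(
        host="enwiki.labsdb",
        db="enwiki_p",
        read_default_file="~/replica.my.cnf",
        cursorclass=pymysql.cursors.SSCursor,
    )

    with conn.cursor() as cur:
        cur.execute("SELECT pl_from, pl_namespace, pl_title FROM pagelinks")
        for pl_from, pl_namespace, pl_title in cur:
            pass  # process one link at a time; memory use stays flat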
>>>>>
>>>>> I am considering creating a personal table in the database with titles
>>>>> and IDs, and inner joining it with pagelinks to obtain only the links
>>>>> for these 300,000 articles. That way I would retrieve 20% of the
>>>>> database instead of 100%. It might still be 8M rows sometimes (a
>>>>> page_title or page_id per row), or even more, loaded into Python
>>>>> dictionaries and lists. Would that be a problem? I have no idea how
>>>>> much RAM this implies or how much I can use in Tool Labs.
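
For a rough sense of the RAM involved, a back-of-the-envelope check in
Python; CPython object overhead dominates, so the numbers are only
approximate:

    import sys

    # Approximate cost of holding one (page_id, title) pair as Python objects.
    row = (123456, "Example_title")
    per_row = sys.getsizeof(row) + sys.getsizeof(row[0]) + sys.getsizeof(row[1])
    print(per_row)                    # roughly 130-150 bytes on CPython 3
    print(per_row * 8 * 10**6 / 1e9)  # 8M rows: on the order of 1 GB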
>>>>>
>>>>> I am totally lost when I hit these problems of scale. I thought about
>>>>> writing to the IRC channel, but the question seemed too long and too
>>>>> specific. Any hint you can give would really help.
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Marc Miquel
>>>>>>>>>> -------------- next part --------------
>>>>> An HTML attachment was scrubbed...
>>>>> URL: <
>>>>> https://lists.wikimedia.org/pipermail/labs-l/attachments/20150313/59a09113/attachment-0001.html
>>>>> >
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> Message: 2
>>>>> Date: Fri, 13 Mar 2015 13:07:20 -0400
>>>>> From: John <phoenixoverride at gmail.com>
>>>>> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
>>>>> Subject: Re: [Labs-l] dimension well my queries for very large tables
>>>>>         like pagelinks - Tool Labs
>>>>> Message-ID:
>>>>>         <CAP-JHpn=ToVYdT7i-imp7+XLTkgQ1PtieORx7BTuVAJw=
>>>>> YSbFQ at mail.gmail.com>
>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>
>>>>> What kind of queries are you doing? Odds are they can be optimized.
>>>>>
>>>>> On Fri, Mar 13, 2015 at 12:59 PM, Marc Miquel <marcmiquel at gmail.com>
>>>>> wrote:
>>>>>
>>>>> > [...]
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> Message: 3
>>>>> Date: Fri, 13 Mar 2015 17:36:00 +0000
>>>>> From: Tim Landscheidt <tim at tim-landscheidt.de>
>>>>> To: labs-l at lists.wikimedia.org
>>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>> Message-ID: <878uf0vlz3.fsf at passepartout.tim-landscheidt.de>
>>>>> Content-Type: text/plain
>>>>>
>>>>> (anonymous) wrote:
>>>>>
>>>>> > [...]
>>>>>
>>>>> > To be clear: I'm not going to make my code proprietary in
>>>>> > any way. I just wanted to know whether I'm entitled to ask
>>>>> > for the source of every Labs bot ;-)
>>>>>
>>>>> Everyone is entitled to /ask/, but I don't think you have a
>>>>> right to /receive/ the source :-).
>>>>>
>>>>> AFAIK, there are two main reasons for the clause:
>>>>>
>>>>> a) WMF doesn't want to have to deal with individual licences
>>>>>    that may or may not have the potential for litigation
>>>>>    ("The Software shall be used for Good, not Evil").  By
>>>>>    requiring OSI-approved, tried-and-true licences, the risk
>>>>>    is negligible.
>>>>>
>>>>> b) Bots and tools running on an infrastructure financed by
>>>>>    donors, like contributions to Wikipedia & Co., shouldn't
>>>>>    be usable for blackmail.  No one should be in a legal
>>>>>    position to demand something "or else ...".  The perpetuity
>>>>>    of OS licences guarantees that everyone can be truly
>>>>>    thankful to developers without having to fear that they
>>>>>    might otherwise shut down devices, delete content, etc.
>>>>>
>>>>> But the nice thing about collaboratively developed open
>>>>> source software is that it is usually of better quality,
>>>>> so clandestine code is often not that interesting.
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> Message: 4
>>>>> Date: Fri, 13 Mar 2015 11:52:18 -0600
>>>>> From: Ryan Lane <rlane32 at gmail.com>
>>>>> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
>>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>> Message-ID:
>>>>>         <
>>>>> CALKgCA3Lv-SQoeibEsm7Ckc0gaPJwph_b0HSTx+actaKMDuXmg at mail.gmail.com>
>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>
>>>>> On Fri, Mar 13, 2015 at 8:42 AM, Ricordisamoa <
>>>>> ricordisamoa at openmailbox.org>
>>>>> wrote:
>>>>>
>>>>> > From https://wikitech.wikimedia.org/wiki/Wikitech:Labs_Terms_of_use
>>>>> > (verbatim): "Do not use or install any software unless the software
>>>>> is
>>>>> > licensed under an Open Source license".
>>>>> > What about tools and services made up of software themselves? Do
>>>>> they have
>>>>> > to be Open Source?
>>>>> > Strictly speaking, do the Terms of use require that all code be made
>>>>> > available to the public?
>>>>> > Thanks in advance.
>>>>> >
>>>>> >
>>>>> As the person who wrote the initial terms and included this clause, I
>>>>> can speak to its spirit (I'm not a lawyer, so I won't try to go into
>>>>> any legal issues).
>>>>>
>>>>> I created Labs with the intent that it could be used as a mechanism to
>>>>> fork the projects as a whole, if necessary. A means to this end was
>>>>> including non-WMF employees in the process of infrastructure operations
>>>>> (which is outside the goals of the tools project in Labs).
>>>>> Tools/services that can't be distributed publicly harm that goal.
>>>>> Tools/services that aren't open source completely break it. It's fine
>>>>> if you wish not to maintain the code in a public git repo, but if
>>>>> another tool maintainer wishes to publish your code, there should be
>>>>> nothing blocking that.
>>>>>
>>>>> Depending on external closed-source services is a debatable topic. I
>>>>> know in the past we've decided to allow it. It goes against the spirit
>>>>> of the project, but it doesn't require us to distribute closed-source
>>>>> software in the case of a fork.
>>>>>
>>>>> My personal opinion is that your code should be in a public repository
>>>>> to encourage collaboration. As the terms are written, though, your code
>>>>> is required to be open source, and any libraries it depends on must be
>>>>> as well.
>>>>>
>>>>> - Ryan
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> Message: 5
>>>>> Date: Fri, 13 Mar 2015 11:29:47 -0700
>>>>> From: Pine W <wiki.pine at gmail.com>
>>>>> To: Wikimedia Labs <labs-l at lists.wikimedia.org>
>>>>> Subject: Re: [Labs-l] Questions regarding the Labs Terms of use
>>>>> Message-ID:
>>>>>         <CAF=dyJjO69O-ye+327BU6wC_k_+AQvwUq0rfZvW0YaV=
>>>>> P+iCaA at mail.gmail.com>
>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>
>>>>> Question: are there heightened security or privacy risks posed by
>>>>> having
>>>>> non-open-source code running in Labs?
>>>>>
>>>>> Is anyone proactively auditing Labs software for open source
>>>>> compliance,
>>>>> and if not, should this be done?
>>>>>
>>>>> Pine
>>>>> On Mar 13, 2015 10:52 AM, "Ryan Lane" <rlane32 at gmail.com> wrote:
>>>>>
>>>>> > [...]
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> Labs-l mailing list
>>>>> Labs-l at lists.wikimedia.org
>>>>> https://lists.wikimedia.org/mailman/listinfo/labs-l
>>>>>
>>>>>
>>>>> End of Labs-l Digest, Vol 39, Issue 13
>>>>> **************************************
>>>>>
>>>>
>>>>
>>>
>>
>