All clear, Luca and Nuria. Thanks!
On Thu, Jul 11, 2019 at 2:55 AM Luca Toscano <ltoscano@wikimedia.org> wrote:
>
> Hi Leila and Kate,
>
> adding a few words after Nuria's email to clarify my original intentions.
> My point was that any important and vital file that needs to be preserved
> may be stored in HDFS rather than on stat/notebooks due to the absence of
> backups of the home directories. My concern was that people had a different
> understanding about backups and I wanted to clarify.
> We (as the Analytics team) don't have any good way at the moment to
> periodically scan HDFS and home directories across hosts to find PII data
> that is retained for longer than the allowed period of time. The main reason
> is that we'd need to find a way to check a huge number of files, with
> different names and formats, and figure out whether the data contained in
> them is PII and has been retained for more than X days. It is not an
> impossible task, but it is not easy or trivial either; we'd need a lot more
> staff, in my opinion, to create and maintain something like that :) We
> recently started cleaning up old home directories (i.e. those belonging to
> users who are no longer active), and we established a process with SRE to
> get pinged when a user is offboarded, to verify what data should be kept and
> what should not (I know that both of you are aware of this since you have
> been working with us on several tasks; I am writing it to give other people
> the context :). This is only a starting point, I really hope to have
> something more robust and complete in the future. In the meantime, I'd say
> that every user is responsible for the data that they handle on the
> Analytics infrastructure, periodically reviewing it and deleting it when
> necessary. I don't have a specific guideline/process to suggest, but we can
> definitely have a chat together and decide on something shared among our
> teams!
>
> Let me know if this makes sense or not :)
>
> Thanks,
>
> Luca
>
> On Wed, Jul 10, 2019 at 11:15 PM Nuria Ruiz <nuria@wikimedia.org>
> wrote:
>
> > >I have one question for you: As you allow/encourage for more copies of
> > >the files to exist
> > To be extra clear, we do not encourage keeping data on the notebook
> > hosts at all; they have no capacity to either process or host
> > large amounts of data. The data you are working with is best placed in
> > your /user/your-username directory in Hadoop, so far from encouraging
> > multiple copies, we are rather encouraging you to keep the data outside
> > the notebook machines.
> >
> > Thanks,
> >
> > Nuria
> >
> > On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman <kzimmerman@wikimedia.org>
> > wrote:
> >
> >> I second Leila's question. The issue of how we flag PII data and ensure
> >> it's appropriately scrubbed came up in our team meeting yesterday. We're
> >> discussing team practices for data/project backups tomorrow and plan to
> >> come out with some proposals, at least for the short term.
> >>
> >> Are there any existing processes or guidelines I should be aware of?
> >>
> >> Thanks!
> >> Kate
> >>
> >> --
> >>
> >> Kate Zimmerman (she/they)
> >> Head of Product Analytics
> >> Wikimedia Foundation
> >>
> >>
> >> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia <leila@wikimedia.org> wrote:
> >>
> >>> Hi Luca,
> >>>
> >>> Thanks for the heads up. Isaac is coordinating a response from the
> >>> Research side.
> >>>
> >>> I have one question for you: As you allow/encourage for more copies of
> >>> the files to exist, what is the mechanism you'd like to put in place
> >>> for reducing the chances of PII to be copied in new folders that then
> >>> will be even harder (for your team) to keep track of? Having an
> >>> explicit process/understanding about this will be very helpful.
> >>>
> >>> Thanks,
> >>> Leila
> >>>
> >>>
> >>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano <ltoscano@wikimedia.org>
> >>> wrote:
> >>> >
> >>> > Hi everybody,
> >>> >
> >>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics team
> >>> > thought to reach out to everybody to make it clear that all the home
> >>> > directories on the stat/notebook nodes are not backed up periodically.
> >>> > They run on a software RAID configuration spanning multiple disks of
> >>> > course, so we are resilient to a disk failure, but even if unlikely it
> >>> > might happen that a host could lose all its data. Please keep this in
> >>> > mind when working on important projects and/or handling important data
> >>> > that you care about.
> >>> >
> >>> > I just added a warning to
> >>> > https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
> >>> > If you have really important data that is too big to back up, keep in
> >>> > mind that you can use your home directory (/user/your-username) on HDFS
> >>> > (which replicates data three times across multiple nodes).
> >>> >
> >>> > Please let us know if you have comments/suggestions/etc. in the
> >>> > aforementioned task.
> >>> >
> >>> > Thanks in advance!
> >>> >
> >>> > Luca (on behalf of the Analytics team)
> >>> > _______________________________________________
> >>> > Wiki-research-l mailing list
> >>> > Wiki-research-l@lists.wikimedia.org
> >>> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> >>>
> >>>
> >>> _______________________________________________
> >> Analytics mailing list
> >> Analytics@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>
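
For anyone who wants a starting point for the periodic review of their own home directory that Luca mentions above, here is a minimal sketch in Python. It only lists files that have not been modified for more than a given number of days; it makes no attempt to decide whether a file contains PII, which is the hard part described in the thread. The function name `find_stale_files` and the 90-day figure in the comment are assumptions made for illustration; the thread only says "X days".

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Return files under `root` last modified more than `max_age_days` ago.

    Hypothetical helper for illustration only: it checks file age, not
    content, so it cannot tell whether a file actually holds PII.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.stat().st_mtime
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)


# Example call (90 days is an assumed retention window, not a stated policy):
#   for path in find_stale_files(Path.home(), max_age_days=90):
#       print(path)
```

The output is just a list of candidates to look at by hand; whether to delete a file or move it to HDFS remains up to its owner.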