Hi Leila and Kate,
I'm adding a few words after Nuria's email to clarify my original intentions. My point was that any important file that needs to be preserved should be stored in HDFS rather than on stat/notebooks, because the home directories there are not backed up. My concern was that people had different understandings about backups, and I wanted to clear that up.
We (as the Analytics team) don't currently have a good way to periodically scan HDFS and home directories across hosts to find PII data that is retained longer than the allowed period. The main reason is that we'd need to check a huge number of files, with different names and formats, and figure out whether the data in them is PII and has been retained for more than X days. It is not an impossible task, but it is not easy or trivial either; in my opinion we'd need a lot more staff to create and maintain something like that :)
We recently started cleaning up old home directories (i.e. those belonging to users who are no longer active), and we established a process with SRE to get pinged when a user is offboarded, so that we can verify what data should be kept and what should not (I know that both of you are aware of this since you have been working with us on several tasks, I am writing it to give other people the context :). This is only a starting point, and I really hope to have something more robust and complete in the future.
In the meantime, I'd say that every user is responsible for the data that they handle on the Analytics infrastructure, which means periodically reviewing it and deleting it when necessary. I don't have a specific guideline or process to suggest, but we can definitely have a chat and agree on something shared among our teams!
Let me know if this makes sense or not :)
Thanks,
Luca