I hate this email client. Hate, hate, hate. Thank you, Microsoft, for making my life that little bit worse.

Anyway, you can't rely on the media files being stored in a filesystem. They could be stored in a database or in object storage, so *sync is not available. I don't know how the media files are backed up. If you only want the originals, that's a lot less than 12TB (or whatever the current number for thumbs+origs is).

If you just want to fetch a tarball, wget or curl will automatically restart a dropped connection and supply a Range header if the server supports it. If you want a ready-to-use format, then you're going to need a client which can write individual files. But it's not particularly efficient to stream 120B files over separate TCP connections; you'd need a client which can reuse TCP sessions. No matter how you cut it, you're looking at a custom client. But there's no need to invent a new download protocol or stream format. That's why I suggest tarball and Range. Standards ... they're not just for breakfast.
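To make that concrete, here is a rough sketch of what such a client could look like (Python; the snapshot URL is made up, and it assumes a server that answers Range requests with 206). It keeps track of its place in the bytestream, re-issues a Range request from the last good offset whenever the connection drops, and feeds the reassembled stream into a streaming tar reader so each file lands on disk individually:

    # Rough sketch only: the snapshot URL is hypothetical, and a real tool
    # would want retry limits and sanity checks on member paths.
    import io
    import tarfile
    import requests

    SNAPSHOT_URL = 'https://example.org/snapshots/media-latest.tar'  # made up

    def ranged_chunks(url, session, chunk_size=1 << 16):
        """Yield the resource's bytes in order, re-issuing a Range request
        from the last good offset whenever the connection drops."""
        pos = 0
        while True:
            try:
                resp = session.get(url, headers={'Range': 'bytes=%d-' % pos},
                                   stream=True, timeout=60)
                resp.raise_for_status()          # expect 206 Partial Content
                for chunk in resp.iter_content(chunk_size):
                    pos += len(chunk)
                    yield chunk
                return                           # stream finished cleanly
            except (requests.ConnectionError, requests.Timeout,
                    requests.exceptions.ChunkedEncodingError):
                continue                         # reconnect, resume from pos

    class ChunkReader(io.RawIOBase):
        """Minimal file-like adapter so tarfile can read the chunk stream."""
        def __init__(self, chunks):
            self.chunks, self.buf = chunks, b''
        def readable(self):
            return True
        def read(self, n=-1):
            while n < 0 or len(self.buf) < n:
                try:
                    self.buf += next(self.chunks)
                except StopIteration:
                    break
            if n < 0:
                n = len(self.buf)
            out, self.buf = self.buf[:n], self.buf[n:]
            return out

    session = requests.Session()                 # keep-alive = TCP session reuse
    reader = ChunkReader(ranged_chunks(SNAPSHOT_URL, session))
    # mode='r|*' treats the tar as a forward-only stream, so each member is
    # written out as a separate file; no 12TB tarball ever hits the local disk.
    with tarfile.open(fileobj=reader, mode='r|*') as tar:
        for member in tar:
            tar.extract(member, path='media')

wget -c does the same Range bookkeeping, but only into a single output file; the point of the wrapper above is to keep the resume trick while the tar stream is being split apart.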
________________________________________
From: wikitech-l-bounces@lists.wikimedia.org [wikitech-l-bounces@lists.wikimedia.org] on behalf of Peter Gervai [grinapo@gmail.com]
Sent: Monday, August 15, 2011 5:40 PM
To: Wikimedia developers
Subject: Re: [Wikitech-l] forking media files
On Mon, Aug 15, 2011 at 18:40, Russell N. Nelson - rnnelson <rnnelson@clarkson.edu> wrote:
> The problem is that 1) the files are bulky,
That's expected. :-)
> 2) there are many of them, 3) they are in constant flux,
That is not really a problem: since there are many of them, statistically most of them are not in flux.
> and 4) it's likely that your connection would close for whatever reason part-way through the download.
It seems I did not forget to mention zsync/rsync. ;-)
> Even taking a snapshot of the filenames is dicey. By the time you finish, it's likely that there will be new ones, and possible that some will be deleted. Probably the best way to make this work is to 1) make a snapshot of files periodically,
Since I've been told they're backed up, such a snapshot should naturally exist already.
> 2) create an API which returns a tarball using the snapshot of files, and which also implements Range requests.
I would very much prefer a ready-to-use format instead of a tarball, not to mention that it's pretty resource-consuming to create a tarball just for that.
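Though, to be fair, a tarball built from a frozen snapshot needn't be that resource-consuming: its byte layout is fully determined by the manifest of names and sizes, so the server could synthesize any requested byte range on demand and never write the 12TB archive at all. A rough sketch of that idea (plain ustar assumed: names under 100 characters, files under 8GB; read_file(name, offset, length) is a stand-in for whatever storage backend actually holds the media):

    # Hypothetical sketch of such an API's core. The tar layout facts
    # (512-byte headers, data padded to 512, two zero blocks at the end)
    # are real; everything else is an assumption about how it might look.
    import tarfile

    BLOCK = 512

    def span(size):
        # Bytes one file occupies in the tar: 512-byte header plus data
        # padded up to a 512-byte boundary.
        return BLOCK + ((size + BLOCK - 1) // BLOCK) * BLOCK

    def serve_range(manifest, start, stop, read_file):
        """Yield the bytes of the virtual tarball in [start, stop)."""
        pos = 0
        for name, size in manifest:
            if pos + span(size) <= start:    # member ends before the range
                pos += span(size)
                continue
            if pos >= stop:                  # member starts after the range
                break
            info = tarfile.TarInfo(name)
            info.size = size
            info.mtime = 0                   # fixed metadata => stable offsets
            header = info.tobuf(tarfile.USTAR_FORMAT)
            lo = max(start - pos, 0)         # slice bounds within this member
            hi = min(stop - pos, span(size))
            a, b = lo, min(hi, BLOCK)        # overlap with the header
            if b > a:
                yield header[a:b]
            a, b = max(lo, BLOCK), min(hi, BLOCK + size)   # ...with the data
            if b > a:
                yield read_file(name, a - BLOCK, b - a)
            a, b = max(lo, BLOCK + size), hi                # ...with the padding
            if b > a:
                yield b'\0' * (b - a)
            pos += span(size)
        total = sum(span(s) for _, s in manifest)
        # End-of-archive marker: two zero blocks. (Real tar also pads the
        # archive out to a 10KB record; omitted here for brevity.)
        a, b = max(start, total), min(stop, total + 2 * BLOCK)
        if b > a:
            yield b'\0' * (b - a)

The Content-Length of the whole virtual archive is just the sum of the member spans plus the 1KB end marker, so a resuming client always knows where it is.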
> Of course, this would result in a 12-terabyte file on the recipient's host. That wouldn't work very well. I'm pretty sure that the recipient would need an HTTP client which would 1) keep track of the place in the bytestream and 2) split out files and write them to disk as separate files. It's possible that a program like getbot already implements this.
I'd make the snapshot without tar, especially because partial transfers aren't possible that way.
-- byte-byte, grin
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l