Hi, it looks like the user-store is completely full:
$ df
hemlock:/aux0/user-store 3904398336 3904398336 0 100% /mnt/user-store
It requires cleaning and sorting (especially the dumps that are all over the place), but I especially wonder what is taking up so much space? I guess I'll have to use my home quota for now.
Darkdadaah
On Wed, Dec 22, 2010 at 8:31 AM, Darkdadaah darkdadaah@yahoo.fr wrote:
Hi, it looks like the user-store is completely full:
$ df
hemlock:/aux0/user-store 3904398336 3904398336 0 100% /mnt/user-store
It requires cleaning and sorting (especially the dumps that are all over the place), but I especially wonder what is taking up so much space?
I guess I'll have to use my home quota for now.
I'm looking at this, but I have no brighter idea than just running du on it, which is taking a very long time. So if any other roots have a better idea to figure out what's going on and/or fix it, feel free to kill my du process (running as root) and delete /tmp/userstore-du on hemlock. (I'm also not quite sure what I'd do if I did figure out the culprit, since I don't want to delete users' data without their permission unless it's clearly useless.)
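Roughly, the scan looks like this (the flags here are an approximation, not necessarily the exact command that is running):

    # walk the whole volume and record the size of every directory (in KB)
    # for later inspection
    du -k /mnt/user-store > /tmp/userstore-du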
Hey,
On Wed, Dec 22, 2010 at 3:29 PM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
I'm looking at this, but I have no brighter idea than just running du on it, which is taking a very long time. So if any other roots have a better idea to figure out what's going on and/or fix it, feel free to kill my du process (running as root) and delete /tmp/userstore-du on hemlock. (I'm also not quite sure what I'd do if I did figure out the culprit, since I don't want to delete users' data without their permission unless it's clearly useless.)
It looks like a huge volume, so it makes sense that it would take some time to complete.
Are quotas enabled on the volume? That might give you a quick snapshot of who the biggest user is.
You could try running the du in a subdirectory, which might give you some more ideas about big directories. (You can try: du -k | sort -rn).
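For example, on the full volume that might look like this (assuming GNU du; the depth limit and the number of lines shown are arbitrary choices):

    # summarise only the first directory level and list the largest entries first
    cd /mnt/user-store
    du -k --max-depth=1 . | sort -rn | head -20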
I'm not a root, so I can't help in the problem analysis except by writing here. :)
Gerald
On Wed, Dec 22, 2010 at 3:36 PM, Gerald A geraldablists@gmail.com wrote:
It looks like a huge volume, so it makes sense that it would take some time to complete.
Yes. It's spent almost the whole time so far in /aux0/user-store/osm_hillshading, which looks to be millions of tiny files split up over tens of thousands of directories.
Are quotas enabled on the volume? That might give you a quick snapshot of who the biggest user is.
Not as far as I can tell. quota -v on users who have lots of files there doesn't return any results.
You could try running the du in a subdirectory, which might give you some more ideas about big directories. (You can try: du -k | sort -rn).
I could, but it doesn't seem like it would be much faster than just waiting for a du on the whole thing to complete.
2010/12/22 Aryeh Gregor Simetrical+wikilist@gmail.com:
Yes. It's spent almost the whole time so far in /aux0/user-store/osm_hillshading, which looks to be millions of tiny files split up over tens of thousands of directories.
This has been like that for months and has never been a problem so far. It's also unlikely that this directory has increased in size lately, unless someone apart from me has added stuff there.
Cheers Colin
On 12/22/2010 5:13 PM, Colin Marquardt wrote:
2010/12/22 Aryeh Gregor Simetrical+wikilist@gmail.com:
Yes. It's spent almost the whole time so far in /aux0/user-store/osm_hillshading, which looks to be millions of tiny files split up over tens of thousands of directories.
Well, there's 986 Gigs in /mnt/user-store/stats
FYI, we have ordered a new array with 24 TB of space for stats, user store, etc. We hope to get it installed in January. Things will get better soon.
-- daniel
On 23.12.2010 02:33, Q wrote:
On 12/22/2010 5:13 PM, Colin Marquardt wrote:
2010/12/22 Aryeh Gregor Simetrical+wikilist@gmail.com:
Yes. It's spent almost the whole time so far in /aux0/user-store/osm_hillshading, which looks to be millions of tiny files split up over tens of thousands of directories.
Well, there's 986 Gigs in /mnt/user-store/stats
Hello,
At Friday 24 December 2010 13:43:34, Daniel Kinzler wrote:
FYI, we have ordered a new array with 24 TB of space for stats, user store, etc. We hope to get it installed in January. Things will get better soon.
but this should not stop people from looking into the user-store and removing old data which they no longer need ;-) (like the 7th dump of enwp of the same age).
Sincerely, DaB.
WOW, that is great!
BTW I have deleted some temp files of my own in /mnt/user-store (about 27 GB).
2010/12/24 Daniel Kinzler daniel@brightbyte.de
FYI, we have ordered a new array with 24 TB of space for stats, user store, etc. We hope to get it installed in January. Things will get better soon.
-- daniel
On 23.12.2010 02:33, Q wrote:
On 12/22/2010 5:13 PM, Colin Marquardt wrote:
2010/12/22 Aryeh Gregor <Simetrical+wikilist@gmail.com>:
Yes. It's spent almost the whole time so far in /aux0/user-store/osm_hillshading, which looks to be millions of tiny files split up over tens of thousands of directories.
Well, there's 986 Gigs in /mnt/user-store/stats
Please do not delete those files; they are important for stats.
2010/12/23 Q overlordq@gmail.com
On 12/22/2010 5:13 PM, Colin Marquardt wrote:
2010/12/22 Aryeh Gregor <Simetrical+wikilist@gmail.com>:
Yes. It's spent almost the whole time so far in /aux0/user-store/osm_hillshading, which looks to be millions of tiny files split up over tens of thousands of directories.
Well, there's 986 Gigs in /mnt/user-store/stats
On 12/28/2010 03:52 PM, emijrp wrote:
Well, there's 986 Gigs in /mnt/user-store/stats
Please do not delete those files; they are important for stats.
Following a suggestion made by River, I am currently recompressing all files using the newly-installed "xz" program. Based on an arbitrary sample (all files from 1 January 2011), we can expect a size reduction of about 25% -- or about 250 GB freed in total. The compression time, however, is quite long, and I haven't measured how the decompression time compares with gzip.
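The recompression itself is nothing fancy; it boils down to a loop along these lines (a simplified sketch; the exact paths and xz options I use are illustrative):

    # recompress each gzipped stats file to xz; remove the .gz only if the
    # whole pipeline succeeded (pipefail is a bash feature)
    set -o pipefail
    cd /mnt/user-store/stats
    for f in *.gz; do
        gzip -dc "$f" | xz -6 > "${f%.gz}.xz" && rm "$f"
    done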
Let me know if the xz format causes any problems; I have started with the older files, which are not used as often, but if this is OK with all users, I may in the future recompress new files as soon as they are downloaded.
Hopefully, with the new storage space, we'll be able to store these files in a more convenient way (not only raw files as they are now, but also the same information in a better format).
Frédéric
Hi Frederic, thanks for your work. Have you tested 7z?
We can compress to xz while the new disks arrive. I read that it is about 24 TB, so we can revert to gzip in the future.
2011/1/2 Frédéric Schütz schutz@mathgen.ch
On 12/28/2010 03:52 PM, emijrp wrote:
Well, there's 986 Gigs in /mnt/user-store/stats
Please do not delete those files; they are important for stats.
Following a suggestion made by River, I am currently recompressing all files using the newly-installed "xz" program. Based on an arbitrary sample (all files from 1 January 2011), we can expect a size reduction of about 25% -- or about 250 GB freed in total. The compression time, however, is quite long, and I haven't measured how the decompression time compares with gzip.
Let me know if the xz format causes any problems; I have started with the older files, which are not used as often, but if this is OK with all users, I may in the future recompress new files as soon as they are downloaded.
Hopefully, with the new storage space, we'll be able to store these files in a more convenient way (not only raw files as they are now, but also the same information in a better format).
Frédéric
emijrp:
We can compress to xz while the new disks arrive. I read that it is about 24 TB, so, we can revert to gzip in the future.
I suggested xz pretty much as an emergency fix; we don't want to delete the files, but they do take up a lot of space.
I don't mind switching back to gzip when we have more space; in fact, I already suggested this to schutz in private. However, if xz works okay (and it should, since the interface is identical to gzip) we may as well stay with it.
If it turns out to be too slow, we could consider 7z or something else (rzip, bzip2, ..., or maybe even just gzip -9).
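For anyone who wants to compare candidates quickly, something like this works on a sample file (sketch only; the file name is a placeholder, and only compressors with a gzip-like interface fit this loop, so 7z would need its own syntax):

    # time each compressor on the same sample file and compare output sizes
    f=sample-stats-file            # placeholder: pick a real file
    for cmd in "gzip -9" "bzip2 -9" "xz -6"; do
        echo "== $cmd =="
        time $cmd -c "$f" > "$f.test"
        ls -l "$f.test"
    done
    rm -f "$f.test"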
- river.
On 03.01.2011 01:29, River Tarnell wrote:
If it turns out to be too slow, we could consider 7z or something else (rzip, bzip2, ..., or maybe even just gzip -9).
i found pbzip2 to be nice. bzip2, just faster :)
-- daniel
emijrp:
We can compress to xz while the new disks arrive. I read that it is about 24 TB, so, we can revert to gzip in the future.
PS: Yes, we are installing 24TB of disks (12x 2TB disks), but this is raw space. To begin with, a 2TB disk is only 1,862 GB of real space. Assuming we configure the disks as RAID 50 with two 6-disk legs, that gives (1862*(6-1))*2 = 18,620 GB (18.2TB) usable space. We will reserve some of this for internal use (such as backups), so the total amount available to users will be less than that.
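Spelled out as shell arithmetic, for anyone who wants to plug in different disk sizes (same figures as above):

    # RAID 50 usable space: two RAID-5 legs of 6 disks, ~1862 GB usable per disk
    disks_per_leg=6; legs=2; gb_per_disk=1862
    echo $(( gb_per_disk * (disks_per_leg - 1) * legs ))    # prints 18620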
When budgeting for this upgrade, we assumed 5TB would be used for user-store. In reality, there should be a lot more than this (we originally planned to use 1TB disks); but it won't be a full 24TB.
- river.
On 03.01.2011 01:35, River Tarnell wrote:
When budgeting for this upgrade, we assumed 5TB would be used for user-store. In reality, there should be a lot more than this (we originally planned to use 1TB disks); but it won't be a full 24TB.
indeed. listen to river. sorry for throwing around numbers :P
-- daniel
emijrp wrote:
Hi Frederic, thanks for your work. Have you tested 7z?
It makes no difference to me. River suggested (and installed) xz, so I used it, but 7z would have worked too.
A quick test using my biased data for one day (but it should be representative enough):
$ du -s *
1027260  7z    1004 M, 25.27% saved
1374804  gz    1.4 G,   0% saved
1020692  xz    997 M,  25.75% saved
The difference between xz and 7z is negligible (<1%). I haven't benchmarked anything formally, but 7z was much faster on my system. It looks like this is mainly because the software can use several cores simultaneously.
We can compress to xz while the new disks arrive. I read that it is about 24 TB, so we can revert to gzip in the future.
Is there any particular reason to use gzip? When I use these files, I mostly uncompress them on the fly from Perl, and there is a module to do this with xz too (I haven't tested it, though). I am sure Python and other languages can do the same.
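From the shell, the on-the-fly equivalent is just a pipe (the file and script names below are placeholders):

    # decompress on the fly and feed the text to whatever processes it;
    # the same pattern works for .gz files with "gzip -dc"
    xz -dc /mnt/user-store/stats/some-stats-file.xz | ./process-stats.pl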
Even if we have plenty of space, it makes sense to use xz (or another format that offers good compression) and to benefit from the size reduction, for example if/when these files are backed up or moved around. Also, I'd like to be able to provide the files for download for those people who want local copies [several academic groups have already requested them], and the 25% size reduction is a big bonus here too.
But as I wrote earlier, these files are mostly archived on the toolserver, and I assume that most users don't often dig through the older ones, so the stronger compression should not be a problem.
A better file format (e.g. one file per day, with separate data for each of the 24 hours, and another file with data aggregated per day) is probably what is most needed for "real" uses -- as far as I know, this is how Erik Zachte handles this data. A database would be best, of course, but that requires much more work...
As always, comments are very welcome.
Frédéric
Frederic Schutz wrote:
emijrp wrote:
Hi Frederic, thanks for your work. Have you tested 7z?
It makes no difference to me. River suggested (and installed) xz, so I used it, but 7z would have worked too.
A quick test using my biased data for one day (but it should be representative enough):
$ du -s *
1027260  7z    1004 M, 25.27% saved
1374804  gz    1.4 G,   0% saved
1020692  xz    997 M,  25.75% saved
The difference between xz and 7z is negligible (<1%).
xz has a much saner syntax.
Aryeh Gregor:
I'm looking at this, but I have no brighter idea than just running du on it, which is taking a very long time.
I wrote a tool called summdisk for this, which produces reports on per-user disk usage:
http://lists.wikimedia.org/pipermail/toolserver-announce/2010-September/000343.html
Unfortunately it still takes a very long time to run. According to df, there are 136,110,408 inodes used on the volume[0]; I wonder if people who currently create large numbers of small files could save some accounting space by aggregating them into larger blocks, like OSM's meta-tiles.
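As a sketch of what I mean (illustrative only; the path comes from earlier in this thread, and nothing should be bundled or removed without the owner's agreement):

    # bundle each tile subdirectory into a single tar archive to reclaim inodes;
    # remove the originals only after the archive has been written successfully
    cd /mnt/user-store/osm_hillshading
    for d in */; do
        tar -cf "${d%/}.tar" "$d" && rm -r "$d"
    done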
- river.
[0] of which ~127m or 93% are used by a single user