In a message off-list, Platonides wrote:
I think pretty much everyone using them would want the last dump, so I don't see a problem in keeping just the last two dumps or so world-readable (I chose the number two in case someone started using one dump and wanted to finish with it, and a new one was published in the meantime).
This is incorrect. When I got the idea to analyze external link statistics, I needed all old dumps I could get, and they took a lot of time to download. There was neither disk space nor bandwidth on the toolserver, so I had to use a server of my own. I now have all page.sql.gz and externallinks.sql.gz, but only I can use them, because they are on my server and not on the toolserver. These files now take 160 GB, which is a fraction of a 2 TB disk that cost 100 euro to purchase. We're talking disk space at the cost of a lunch.
If we limited the toolserver to what most people would use, we could just restrict it to dumps of the English and German Wikipedias, since that is what the majority of users would be interested in. That sort of thinking will lead you wrong every time.
How hard can it be to get enough disk space on the toolserver? I think many chapters contribute money to its operation. Is it not enough?
On Mon, Dec 12, 2011 at 11:04 AM, Lars Aronsson lars@aronsson.se wrote:
These files now take 160 GB, which is a fraction of a 2 TB disk that cost 100 euro to purchase. We're talking disk space at the cost of a lunch.
How hard can it be to get enough disk space on the toolserver? I think many chapters contribute money to its operation. Is it not enough?
Entirely different classes of hard-drives involved.
Consumer grade 2TB ($160): http://www.newegg.com/Product/Product.aspx?Item=N82E16822148681
Server Grade (Raid Edition) ($320): http://www.newegg.com/Product/Product.aspx?Item=N82E16822136579
Getting the disks in the servers is an additional cost and space may not be available.
On Mon, Dec 12, 2011 at 11:22 AM, OQ overlordq@gmail.com wrote:
Entirely different classes of hard-drives involved.
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
Look at 10K+ RPM SAS drives, then you're in the right ballpark. It's closer to $2/GB than $0.20/GB.
On Mon, Dec 12, 2011 at 3:24 PM, Aaron Halfaker aaron.halfaker@gmail.com wrote:
Consumer grade 2TB ($160): http://www.newegg.com/Product/Product.aspx?Item=N82E16822148681
Server Grade (Raid Edition) ($320): http://www.newegg.com/Product/Product.aspx?Item=N82E16822136579
Getting the disks in the servers is an additional cost and space may not be available.
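To put the per-GB figures in this exchange side by side, here is a rough sketch of what the 160 GB of dump files would cost in raw disk at each price point. The prices are only the ones quoted in this thread, 2 TB is assumed to mean 2000 GB, and the function name is made up for illustration. It ignores RAID overhead, enclosures, and installation, so treat it as back-of-the-envelope arithmetic only.

```python
# Rough cost of storing the 160 GB of dump files at different price points.
# Per-GB prices come from the figures quoted in this thread (assumption:
# 2 TB = 2000 GB); they are not current market prices.
DUMP_SIZE_GB = 160

price_per_gb = {
    "consumer SATA ($160 / 2 TB)": 160 / 2000,
    "RAID-edition SATA ($320 / 2 TB)": 320 / 2000,
    "10K+ RPM SAS (~$2/GB)": 2.0,
}


def storage_cost(size_gb, usd_per_gb):
    """Return the raw disk cost in USD for size_gb of data (hypothetical helper)."""
    return size_gb * usd_per_gb


costs = {name: storage_cost(DUMP_SIZE_GB, p) for name, p in price_per_gb.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
```

Even at the SAS price the raw disk cost for this one data set is a few hundred dollars, which suggests the real constraint is the one raised elsewhere in the thread: whether the existing hardware has room for more disks at all.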
We could probably get away with 7200 RPM SATA drives if they are primarily for storage and large sequential reads/writes.
-Aaron
On Mon, Dec 12, 2011 at 4:12 PM, OQ overlordq@gmail.com wrote:
Look at 10K+ RPM SAS drives, then you're in the right ballpark. It's closer to $2/GB than $0.20/GB.
On 12-12-2011 23:27, Aaron Halfaker wrote:
We could probably get away with 7200 RPM SATA drives if they are primarily for storage and large sequential reads/writes.
-Aaron
Nice discussion. We have too many books, we need to select. No, just build a bigger library! Now we're about at the part where people compare IKEA with office supplies ;-)
Maarten
On 12/13/2011 12:16 AM, Maarten Dammers wrote:
Nice discussion. We have too many books, we need to select. No, just build a bigger library! Now we're about at the part where people compare IKEA with office supplies ;-)
(Lots of company offices use IKEA furniture. IKEA has had a separate department for this for well over a decade. If that's not good enough for you, why bother with Wikipedia, when you can use a "real" encyclopedia instead?)
The point was that when I needed the toolserver, it didn't provide the space I needed (in this case 160 GB), so I had to use my own server, which uses cheap SATA disks. I was able to complete my task, but not with the help of the toolserver. This means others can't build on my work and people without a server of their own will fail to complete such a project. The idea behind the toolserver was not met by the actual implementation in this case. It doesn't matter if it's gold plated, if it can't be used.
Lars Aronsson wrote:
The point was that when I needed the toolserver, it didn't provide the space I needed (in this case 160 GB), so I had to use my own server, which uses cheap SATA disks. I was able to complete my task, but not with the help of the toolserver. This means others can't build on my work and people without a server of their own will fail to complete such a project. The idea behind the toolserver was not met by the actual implementation in this case. It doesn't matter if it's gold plated, if it can't be used.
Well, sure. The Toolserver is a free service with limited capabilities. I think the issue that you're running into is that the Toolserver doesn't have very clear governance. There are a lot of projects that the Toolserver could get heavily involved in: database dumps, providing high-level full-text access, pageview stats, OpenStreetMap, and so on.
There are (financial) resources available for some of these projects (through grants, chapters, etc.), but not others. Ultimately, however, there doesn't seem to be a clear body that controls where the Toolserver puts its resources (besides maybe WMDE, which seems pretty hands-off). Some people would consider this hands-off approach to be a feature; some people would consider it a bug. Those looking to undertake very large projects with complex and costly needs obviously see it as a bug.
Wikimedia Deutschland seems to be focusing (part of?) its efforts on WikiData: https://meta.wikimedia.org/wiki/WikiData_WMDE. Whether they have additional resources (financial, staff, or otherwise) to devote to other projects is unclear to me. If you can make database dumps a priority for them (or other rich chapters), you can build a platform to do exactly what you want to do. It's not easy work, though.
MZMcBride
Lars Aronsson wrote:
How hard can it be to get enough disk space on the toolserver? I think many chapters contribute money to its operation. Is it not enough?
Probably not terribly difficult, but first you'd need to be able to answer questions such as:
* How much disk space is needed right now?
* How much disk space is going to be needed for the next year or two?
* Will this data be backed up? If so, that doubles the amount of space needed. If not, are Toolserver users/donors/et al. willing to possibly lose this data?
It's easy enough to throw money at the problem, but it'd be nice if someone did a measurement of the size and scope of the problem first. Someone may have already researched all of this, I don't know.
MZMcBride
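The three questions above amount to a small capacity estimate, which could be sketched roughly as follows. The 160 GB starting point is the figure mentioned earlier in the thread; the yearly growth number is a made-up placeholder, and the function name and parameters are hypothetical, since nobody in this discussion has measured the actual growth.

```python
def required_capacity_gb(current_gb, yearly_growth_gb, years, backed_up):
    """Estimate the disk space needed after `years` (hypothetical helper).

    If the data is backed up, a full second copy is assumed, which
    doubles the space needed.
    """
    primary = current_gb + yearly_growth_gb * years
    return primary * 2 if backed_up else primary


# 160 GB is the figure from this thread; 80 GB/year is a placeholder guess.
estimate = required_capacity_gb(
    current_gb=160, yearly_growth_gb=80, years=2, backed_up=True
)
print(f"Estimated capacity needed: {estimate} GB")
```

Plugging in measured numbers for the current size and growth would turn this into the kind of sizing the questions ask for.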
On Tue, Dec 13, 2011 at 3:04 AM, Lars Aronsson lars@aronsson.se wrote:
How hard can it be to get enough disk space on the toolserver? I think many chapters contribute money to its operation. Is it not enough?
Getting the HDD space in the TS isn't as simple as just grabbing a few server-level HDDs and sticking them in. Given the cluster setup and previous conversations about this, I believe they need a new SAN, which I believe was being discussed before River became too busy in real life.