a discussion on the SLA


So, according to some metrics over the last two days we had 3 hours of downtime. But it was spread over two days, so it really should count for more.

So, here's what svtix said about the matter:

In consideration of the downtime experienced in our SVTIX data center on September 13 and 14, I am crediting your account for three days of service. This will be applied to your current invoice.

Now, this seems to be how most of my competitors do it, too. At best, they give you a symbolic apology.

The thing is that if I had taken the SLA payout from my last network outage and, instead of giving those credits, spent the money on a new router and a secondary, redundant upstream, this problem would not have been a big deal at all. Customers would not have experienced downtime.

So yeah, while an SLA is a good way of estimating the cost of a problem and aligning the interests of the owner with the interests of the customers wrt. downtime, I think that when the company is in 'full growth' mode like prgmr.com is, it might hurt more than it helps, by removing some of the working capital that would have otherwise paid for infrastructure upgrades.

17 Comments

Your current SLA says:

I will guarantee 99.5% uptime - this means that if service is down for more than 3.6 hours in any given month, affected customers will get a service credit that is equal to what they paid for the month in which the SLA was not met.
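For reference, the 3.6-hour figure follows from a 30-day month. A minimal sketch of the arithmetic (the function names are made up for illustration, and the 30-day month is an assumption):

```python
def allowed_downtime_hours(uptime_pct: float, days_in_month: int = 30) -> float:
    """Hours of downtime per month that still meet the given uptime percentage."""
    total_hours = days_in_month * 24
    return total_hours * (100.0 - uptime_pct) / 100.0


def achieved_uptime_pct(downtime_hours: float, days_in_month: int = 30) -> float:
    """Uptime percentage for a month with the given hours of downtime."""
    total_hours = days_in_month * 24
    return 100.0 * (total_hours - downtime_hours) / total_hours


# 99.5% of a 30-day month leaves 0.5% of 720 hours as the downtime budget.
print(f"budget: {allowed_downtime_hours(99.5):.1f} hours")
# Three hours of downtime in the month stays just inside that budget.
print(f"3h down: {achieved_uptime_pct(3.0):.2f}% uptime")
```

So roughly three hours of downtime in a month would not, by itself, cross the 99.5% threshold.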

I've always thought this SLA is far too generous and for the reasons you state above, actually quite concerning.

I say either lower the threshold or use a more realistic service credit formula like SVTIX's: credit = $no_of_days_affected + 1. The way things are now, I kinda wish for 3.6hrs/mth downtime each month so I can get me some of that free hosting ... ;-)

Cliff

Best case for client + company is little or no downtime. Neither party can truly calculate what downtime actually costs us in real business, etc so I vote that the SLA be geared towards fixing the root cause rather than temporary fixes.

The pay-later, fix-it-later approach puts us all on a slippery slope!

Thanks for your service Prgmr.com


I find the SLA to be extremely generous, as Cliff said. I'm one of those people who do use and depend on my VPS, but it's not a 24/7 thing. So, going by what you have just said, however long my VPS may have been down, I didn't even notice.

Personally I find it acceptable for stuff like this to happen (it's bound to happen no matter what), provided it's not for too long, and would much prefer paying even for a little downtime knowing that the money is going towards upgrades to the infrastructure to ensure it's a lot less likely to happen again, as opposed to being credited and having it happen continuously.

I disagree -- the SLA and the low cost are what make prgmr.com niche .. if the SLA wasn't there, it would be hard to separate prgmr.com from other dime-a-dozen providers

the root of the problem was that prgmr.com didn't have a 2nd carrier or 3rd carrier, many VPSes for the same price have multi-home bandwidth. I also do recognize the VPS market is very cutthroat and low margin-- that's one of the cons of this business.

But overall, I am quite happy with prgmr.com and I will keep using them for their current price/SLA/service.

Thanks for replying, everyone. We've been having lots of internal discussion on the topic, too. Me, I'm in favour of weakening the refund portion of the SLA and redirecting that money (perhaps officially/by policy with documentation for customers?) towards infrastructure improvement. Chris and Nick seem to be in favour of a stronger SLA with stronger refunds to customers, paying for new infrastructure with loans, if required. (I /hate/ taking on debt in general.)

lambchops511: Out of curiosity, can you point me to another xen vps that is price competitive? From what I've seen, I will have some price competition on my hands shortly, but I don't think I have any at this moment, at least not English-speaking providers that use Xen (rather than OpenVZ... I would consider a KVM provider to be competitive, but not an OpenVZ provider.)

Now, to some people, the difference between $12 and $20/month doesn't really matter, though, for those people, I would think, an occasional $12 refund wouldn't make that much difference, either. I would expect that I lost most of those people last time linode dropped their prices. I think if I am to compete with linode, I need to 1. keep my prices dramatically lower, at least as a percentage, if not as an absolute number, and 2. be about as reliable as they are, and 3. continue to improve my support, and 4. come up with a better provisioning/upgrade/downgrade interface, and then I think the SLA would be #5 on that list of priorities.

That's just my impression, though, and why I'm talking about this. lambchops511: where would you put the SLA in the list of priorities?

I see price as my big differentiator, and I'm gearing up to have a fight on my hands in that arena within the next year or so. (it will be painful, but I think I can deal with it, if I can solve some of my infrastructure problems before then.)

The root of this problem was that I don't have multiple upstreams, and yeah, most of my competition does. It's reasonable to say that this (having a single upstream) is not an acceptable situation. My network is a real weak spot. The root of the last problem was largely that I have a weak router and my incompetence... if I had spotted the attack for what it was earlier and blackholed the IP in question, the downtime would have been much shorter for everyone but the target of the attack. Further, if I had a stronger router, I would have been able to just eat the packets... it was only 200Mbps or so of traffic, and if 200Mbps of traffic takes you down, you've got a problem.


How would you feel about something like 25% of the month back in case of downtime? or maybe giving you a free day for every day that had much downtime at all? Both of those options would put me in a much stronger position to spend on infrastructure improvements.
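To put rough numbers on the options floated in this thread, here is a small sketch comparing them on a $12/month plan. The function names, the 30-day month, and the idea that the 25% option uses the same 3.6-hour trigger as the current SLA are all assumptions made for illustration, not actual policy:

```python
def full_month_credit(price: float, downtime_hours: float,
                      threshold_hours: float = 3.6) -> float:
    """Current SLA: the whole month back once downtime exceeds the threshold."""
    return price if downtime_hours > threshold_hours else 0.0


def quarter_month_credit(price: float, downtime_hours: float,
                         threshold_hours: float = 3.6) -> float:
    """Proposed: 25% of the month back (same trigger assumed)."""
    return 0.25 * price if downtime_hours > threshold_hours else 0.0


def days_plus_one_credit(price: float, days_affected: int,
                         days_in_month: int = 30) -> float:
    """SVTIX-style: one day of service per affected day, plus one."""
    return price * (days_affected + 1) / days_in_month


def free_day_per_bad_day(price: float, days_affected: int,
                         days_in_month: int = 30) -> float:
    """Proposed: one free day for every day that had downtime."""
    return price * days_affected / days_in_month


price = 12.0
# A hypothetical month: 4 hours of downtime spread across 2 days.
print(f"full month:    ${full_month_credit(price, 4.0):.2f}")
print(f"25% of month:  ${quarter_month_credit(price, 4.0):.2f}")
print(f"days + 1:      ${days_plus_one_credit(price, 2):.2f}")
print(f"free day each: ${free_day_per_bad_day(price, 2):.2f}")
```

The per-day schemes pay out far less than either threshold scheme, which is the gap being weighed against infrastructure spending.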

Another option is to give service credits, but somehow spread them out more: say, you can only pay 1/4 of any one bill with service credits. This would be sort of like a loan to us. The problem with that is that it would require some significant work to implement in the billing system.
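A rough sketch of how a capped credit could be drawn down over several bills. This is not the actual billing system; `apply_credits` and its parameters are hypothetical names used only to illustrate the idea:

```python
def apply_credits(credit_balance: float, monthly_bill: float,
                  cap_fraction: float = 0.25, max_months: int = 24):
    """Spend a credit balance over successive bills, using at most
    cap_fraction of each bill. Returns one (cash_due, credit_used)
    pair per month until the balance is gone or max_months is reached."""
    schedule = []
    for _ in range(max_months):
        if credit_balance <= 0:
            break
        credit_used = min(credit_balance, monthly_bill * cap_fraction)
        credit_balance -= credit_used
        schedule.append((monthly_bill - credit_used, credit_used))
    return schedule


# A one-month credit on a $12 plan, capped at a quarter of each bill,
# takes four bills to use up.
for month, (cash, credit) in enumerate(apply_credits(12.0, 12.0), start=1):
    print(f"month {month}: pay ${cash:.2f}, credit ${credit:.2f}")
```

The customer still gets the full credit eventually; the provider keeps most of the cash flow in the months right after the outage.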

I think all of prgmr's customers appreciate how the SLA works at prgmr. I think that is for fiscal and ethical reasons - we like cheap vps and we like it when we can see a real response when a provider stuffs up (unlike most providers who wrap up their SLA in a 1000+ word legalese and have no real penalties).

I see these as prgmr distinctives and don't want to lose them even if you have 2 upstream providers.

If developing a billing system is going to take a lot of time and capital then I would opt for a simple change in SLA to get the improved comms equip.

@luke

I can't give you specifics right now because I'm at work, but I remember there were a couple Xen I saw before that were really low priced as well.

What about give us an option? For those of us that wants the "better" SLA we will pay for it, while those who wants low costs won't. i.e. Give us a +$2/month or +5%/month option to get a "better" SLA.

Sorta how ISPs work too, its the same service, but if you get a "business" vs. "personal" account, different terms of agreement.

@luke, if you are looking for investors, I like your business model -- maybe we can chat offline and work something out?

Hm. that's an idea. At current problem rates the premium would be closer to 10-20% (which is pretty bad, really) but still, that's doable. I've been thinking about giving people the option to pay more for a more premium support level, and a better SLA would go along with that very well, but I'm not really sure about our ability to deliver on that sale with our current resources, and I'm unsure of my ability to hire more resources with the likely small number of premium support accounts that would be bought.

I have considered a 'super premium' plan where you get my phone number and you'd pay some high monthly fee, then I'd charge something silly per call, and really, even if I charge you $100 for the call, which would be silly for what you use a VPS for, especially if I was charging you a monthly, I don't think it's really worth dragging my ass out of bed. Even twice that, really, isn't going to cover being sleep-deprived for the rest of the day, and I think nick hates the phone even more than I do, so I'm not sure it would work out. I guess if I charged enough on the monthly fees and didn't get many calls, it'd be great but I get the feeling that some people really like talking on the phone. I could also do 10:00-18:00 phone support which would (mostly) alleviate the getting woken up part, or at least it would force me into keeping the sort of schedule I should be keeping anyhow (I woke up around 19:00 today. Ugh. right after kingstar closed; I was going to pick up my new chassis.) But I obviously couldn't charge as much for that, maybe just a monthly fee with some amount of phone time included and two bucks a minute thereafter or something? or maybe just the option to terminate if you call me too much :) I dono. It is an idea that I should maybe try. I wonder if there is an open-source asterisk module for pay-by-the-minute phone services? actually, the more I talk about this idea, the more I like it. I'm thinking about trying it for a month.

As for investment, I'm kindof a control freak about that sort of thing. I'm unlikely to be taking investment in the near future. I should write more about my reasoning behind that, though really, it's unlikely to change until I'm ready to cash out, and we're a long ways from that.

(I have taken some steps to get some capacity online, though. We've got a new 32GiB server we set up tonight, and I'll be having Megan email the waiting list today, assuming the new server comes up as expected, and I've got the parts for another server waiting for me at KingStarUSA when they open up at 9:00)

@luke -- well, I'm not very interested in phone service... I'm more interested in a strong SLA ... the next couple of months is a critical period for my project and downtime is unacceptable... well, when is downtime ever acceptable ;)

I mean, the customer base of ppl using xen/bsd vps as opposed to openvz should mean we're quite technically competent. We wouldn't be needing you to install stuff or reboot vps for us? (or am I wrong???)

BTW-- why 32 GiB servers? Why not 64 GiB? are you limited by spindles or cycles? I would think with the new Nehalems, you wouldn't be limited by cycles ... so makes me think spindles...

P.S. are you using 2.5" drives as opposed to 3.5"? The 2.5" give better latencies and IOPS per dollar than the 3.5". Plus they generally last longer too (from my experience).

We use 32GiB servers 'cause a single socket 8 core server (we use AMD) with 4 disks and 32GiB RAM isn't that much more than half what a 64GiB dual socket 16 core server with 8 disks would cost. Also, while we've got one of those on the test bench, we're having trouble getting it to work, so we figure we can just throw up 32GiB servers for now.

We've never gotten complaints on CPUs, it's always disk, and it's all about the random access time there.

As for the 2.5 vs. 3.5, the 2.5 is usually more expensive (certainly per gigabyte... usually on the order of double) and spindle for spindle, last time I looked, seek time on 2.5" disks was worse, at least according to the specs. (of course, I looked at this on sas disks... I should go re-check to make sure the same is still true, and still true of sata disks)

@luke

no -- in real life, 2.5" give better IOPS and lower latencies per dollar ... their disks are smaller, but spin at the same RPM

more disk space is useless if you can't hit it because of bandwidth/latency

It is /almost/ time to buy SSDs, they give MUCH MUCH higher IOPs / dollar than rotational drives ... obviously they suck at GiB / $ .

Moving to 2.5" will also allow higher drive density, allowing more DomUs per server.

>no -- in real life, 2.5" give better IOPS and lower latencies per dollar

this is not the case in terms of capital costs, at least. here, let me try to find two equivalent drives.


ST9500530NS- 500GB seagate 'constellation' 2.5" sata. around $180 each
Average latency 4.16ms
Random read seek time 8.5ms
Random write seek time 9.5ms

ST31000524NS 1tb seagate 'constellation' 3.5" sata- around $130 each.
Average latency 4.16ms
Random read seek time
Random write seek time

(specs from the seagate website)

so according to the specs, the performance is nearly identical.

Now, so what I'm doing right now is that I'm short-stroking the 1TB drive down to 500GB, because you are totally right, more space doesn't matter, in this application, if the performance isn't there. by removing the slower outer tracks, this is going to further improve my speed on the 3.5" drives.

Now, 2.5" drives use less power (5 watts rather than 10, as far as I can tell) and they are smaller, and yeah, two 2.5" drives are going to beat the snot out of one 3.5" drive... but it costs a /whole lot/ more, and as I said, as far as I can tell, 3.5" drives still hold the performance per dollar crown as far as capital costs go. and, uh, if I'm doing my math right, that 5 watts is going to cost me all of $1.50 a month, so yeah, I think 3.5" drives give you more performance per dollar, even over the life of the drive (assuming a 36 month life like most computer hardware.)
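A back-of-the-envelope sketch of that lifetime comparison, per spindle. The prices, wattages and 36-month life are the rough figures quoted above; the per-watt power price is simply what "$1.50 a month for 5 watts" implies and should be treated as an assumption, as should the `lifetime_cost` helper itself:

```python
def lifetime_cost(purchase_price: float, watts: float,
                  dollars_per_watt_month: float, months: int = 36) -> float:
    """Purchase price plus the cost of powering the drive for its service life."""
    return purchase_price + watts * dollars_per_watt_month * months


# Assumed power price: $1.50 per month for every 5 watts drawn.
RATE = 1.50 / 5.0

cost_25 = lifetime_cost(180.0, 5.0, RATE)   # 2.5" drive, ~5 W
cost_35 = lifetime_cost(130.0, 10.0, RATE)  # 3.5" drive, ~10 W
print(f'2.5": ${cost_25:.0f} over 36 months')
print(f'3.5": ${cost_35:.0f} over 36 months')
```

With these particular inputs the two totals land within a few dollars of each other per spindle, so the result swings on the power price and on the drive prices you can actually get.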

gah. for the 3.5" drive it said

Random read seek time 8.5ms
Random write seek time 9.5ms

but there was a less than sign in front of both of them, making the numbers disappear.
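To make the spec comparison concrete, here is a rough sketch that turns the quoted figures into random IOPS and IOPS per dollar. The 4.16ms average latency matches half a revolution at 7200 RPM. The IOPS formula is the usual crude one-request-at-a-time estimate that ignores queueing and transfer time, the seek figures are spec-sheet upper bounds, and the helper names are invented for this example:

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: the time for half a revolution, in ms."""
    return 0.5 * 60_000.0 / rpm


def random_iops(seek_ms: float, rotational_latency_ms: float) -> float:
    """Crude single-drive random IOPS: one seek plus half a turn per request."""
    return 1000.0 / (seek_ms + rotational_latency_ms)


def iops_per_dollar(seek_ms: float, latency_ms: float, price: float) -> float:
    """Random IOPS divided by purchase price."""
    return random_iops(seek_ms, latency_ms) / price


print(f"7200 RPM: {avg_rotational_latency_ms(7200):.2f} ms average latency")

# name: (random read seek ms, average latency ms, approximate price in dollars)
drives = {
    '2.5" ST9500530NS': (8.5, 4.16, 180.0),
    '3.5" ST31000524NS': (8.5, 4.16, 130.0),
}
for name, (seek, latency, price) in drives.items():
    print(f"{name}: {random_iops(seek, latency):.0f} IOPS, "
          f"{iops_per_dollar(seek, latency, price):.2f} IOPS per dollar")
```

On paper the two drives deliver the same IOPS, so the cheaper one wins on purchase price alone; whether the spec sheets can be trusted is a separate question.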

@luke,

You've been lied to ;)

The specs quoted on the Seagate site shouldn't be believed, and they're not comparable across different generations/series, and never mind comparable to different brands as well. They're all benchmarked differently.

As far as power is concerned, I don't think you'll be saving anything substantial .. I mean a few watts is nice... but your drives aren't spinning 24x7

Personally, for the current generation I buy Hitachi. Although it is spec-ed less than Seagate, their firmware is less "aggressive" in optimizing for bandwidth and focuses more on latency... i.e. less focus on buffering and reorganizing disk commands to optimize for bandwidth than latency.

What drive manufacturers should do is put a hard guarantee on IO requests latency.. i.e. 95% of requests once it hits the command buffer will complete within 10 ms.


Slower outer tracks? I thought the outer track was the faster track, because in 1 spin you can get a lot more data than compared to the inner track....


What really kills VPS performance isn't the average latency, but rather the tail end latency. 1/100 disk requests are going to take > 50 maybe even 100 ms... and that /really/ hurts -- since it'll also slow down all the other requests pending on that one.
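A small sketch of the difference between an average, a "95% within 10 ms" style guarantee, and the tail. The latency sample is made up (99 fast requests and one slow one), and `percentile` here is a simple nearest-rank implementation written for this example rather than any drive vendor's method:

```python
import math


def percentile(samples, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def meets_guarantee(samples, pct: float = 95.0, limit_ms: float = 10.0) -> bool:
    """True if pct percent of requests completed within limit_ms."""
    return percentile(samples, pct) <= limit_ms


# Hypothetical request latencies in ms: 1 request in 100 is very slow.
latencies = [8.0] * 99 + [100.0]

print("mean:", sum(latencies) / len(latencies))
print("p95:", percentile(latencies, 95.0))
print("max:", max(latencies))
print("meets 95% <= 10 ms:", meets_guarantee(latencies))
```

This sample passes a 95% guarantee and has a mean under 10 ms, yet one request in a hundred still takes 100 ms, which is the tail that hurts.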


And sorry, I was thinking about consumer drives and not enterprise drives. The cost difference might be a lot more substantial than I thought. Those $200 SSDs are looking mighty cheap now in terms of IOPs ;)

right I have inner/outer track reversed.

I also like the Hitachis for consumer drives, but they don't have enterprise sata, as far as I can tell. I use the 2tb Hitachi consumer drives for backups, though.

"What drive manufacturers should do is put a hard guarantee on IO requests latency.. i.e. 95% of requests once it hits the command buffer will complete within 10 ms."

which would be nice, but doesn't really help.

Well I just signed up, but what attracted me was basically 2 things:

1. Gut feeling of competence/transparency. Any idiot can take out a loan, pay someone $200 for a website using stock photos of 'friendly customer service people waiting for your calls' and setup a hosting company. I mean I could do it if I really wanted to (I don't) - it's not technically demanding so much as it is cut-throat with low margins and lots and lots of competition - I'm lazy and prefer a relatively captive audience. The fact that you seem to weed out a lot of idiot customers this way tells me you probably end up with a smaller, but less troublesome customer base.

2. RAM, RAM, RAM. I'm kinda surprised nobody mentioned RAM. While I haven't scoured the internet for the cheapest deal, you're pretty price-competitive on RAM. Honestly I could care less about disk/transfer. I'm unlikely (and most people are) to hit those limits. RAM on the other hand...

3. As far as the SLA thing goes, there's basically two things people care about when something breaks. Status and... status. When will it be back up? Are you working on it? Blah blah... The other 10% of your customers will be chomping at the bit because "We need it now!". 90% won't care (made up numbers) and probably wouldn't even have noticed if you didn't tell them.

Of course it depends what you're using a VPS for. I let google run all my email and just use these things to run obscure things occasionally. So if it goes down it's not the end of the world for me. I care about price to an extent, but honestly if you're in the $12/month world.. well that's more than I typically spend on food in a given day.. so yeah.

I don't care so much about a strict SLA, especially one that can make you lose lots of money, but I would appreciate more rapid turnaround on support requests especially in the event of outages.

I don't really need or want phone support, but I'd really like IM support, or at least guaranteed quick turnaround on email support, at least within clearly denoted business hours. Right now, stop me if I'm wrong, but there's no stated policy on that.


About this Entry

This page contains a single entry by luke published on September 15, 2010 6:16 PM.
