recent packet loss.

so yeah, uh, this is what I think about the recent packet loss issues.

I mean, the root of the problem (I think) is that we are overwhelming the 1G link between 1435 and 1460 at MPT. Most of this is not prgmr.com traffic; most of it is traffic being trucked from 250 stockton to 55 s. market and pushed out he.net.

Here are the two sides of the link in question


http://panoptes.prgmr.com/cgi-bin/14all.cgi?log=bairin_24

http://panoptes.prgmr.com/cgi-bin/14all.cgi?log=biruwa_2


compared to our he.net link, here:

 http://traffic.he.net/port.php?key=NckClsQDVOBA6eIv5yB2Te14ILiLy9G+42RSVtlPLQda1k7oYS8PVdz1xjzBCjhEfPDtohjwQkNqIlUguWKRSQ==

so more traffic is going over our cross connect than over the he.net link.  This is because of my poor network design.  



As you can see, we aren't pegging it hard or anything; 70% is our 5 minute peak. My belief (now that the cable tested good) is that we are seeing 'microbursts', wherein traffic briefly peaks above 1Gbps even though the 5 minute average stays well under it.
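To make the microburst idea concrete, here is a minimal sketch of how a 5 minute average can sit at 70% while some seconds are over line rate. The per-second numbers are made up for illustration (they are not measurements from our link), and real microbursts can be much shorter than a second.

```python
LINK_MBPS = 1000  # the 1G cross connect

def summarize(per_second_mbps, link_mbps=LINK_MBPS):
    """Return (average utilization as a fraction, seconds where offered load exceeds the link)."""
    average = sum(per_second_mbps) / len(per_second_mbps)
    seconds_over = sum(1 for rate in per_second_mbps if rate > link_mbps)
    return average / link_mbps, seconds_over

# Hypothetical offered load over one 300 second polling window:
# 270 quiet seconds at 650 Mbps, 30 bursty seconds at 1150 Mbps.
load = [650] * 270 + [1150] * 30

utilization, seconds_over = summarize(load)
print(f"5 minute average: {utilization:.0%}, seconds over line rate: {seconds_over}")
```

With these invented numbers the graph would show a calm 70%, yet 30 of the 300 seconds want more than the link can carry, and that excess gets queued or dropped.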

With my switches, HP ProCurve 2824s, I don't know how to verify my theory without a MIRROR port.

I should bring in a server, set up a mirror port on the switch, and measure the throughput every 5 seconds or so rather than every 5 minutes; that would allow me to prove or disprove the microbursts theory.
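Something like the following is what I have in mind for the measuring box. It is only a sketch: it assumes the server runs Linux and sees the mirrored traffic on some interface (the name `eth1` in the usage note is a placeholder), and it reads the kernel's byte counter from `/sys/class/net/<iface>/statistics/rx_bytes`. The function names are mine, not from any tool.

```python
import time

def rates_mbps(samples, interval_s, counter_bits=64):
    """Convert successive byte-counter readings into Mbps per interval.

    The modulo handles a single counter wrap between two readings, which
    matters for 32-bit counters: at 1 Gbps they wrap about every 34 seconds.
    """
    modulus = 1 << counter_bits
    rates = []
    for previous, current in zip(samples, samples[1:]):
        delta_bytes = (current - previous) % modulus
        rates.append(delta_bytes * 8 / interval_s / 1e6)
    return rates

def read_rx_bytes(iface):
    """Read the received-bytes counter Linux exposes for an interface."""
    with open(f"/sys/class/net/{iface}/statistics/rx_bytes") as counter:
        return int(counter.read())

def sample(iface, interval_s=5, count=12):
    """Take count + 1 readings interval_s apart and return the count rates."""
    readings = [read_rx_bytes(iface)]
    for _ in range(count):
        time.sleep(interval_s)
        readings.append(read_rx_bytes(iface))
    return rates_mbps(readings, interval_s)
```

Calling `sample("eth1")` would give a minute of 5 second readings. If the counters come from SNMP as 32-bit values instead, pass `counter_bits=32` to `rates_mbps`.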

The customer at 250 stockton that accounts for 500 or so Mbps of this traffic is in the process of moving off our network, which should be complete within a week. That will resolve the issue for us, at least until our own traffic increases about 3x.
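The "about 3x" is back-of-the-envelope arithmetic, roughly like this. It assumes the 70% peak and the customer's 500 or so Mbps are measured in the same direction on the same link, which is a simplification.

```python
LINK_MBPS = 1000
PEAK_PERCENT = 70          # our 5 minute peak on the cross connect
CUSTOMER_MBPS = 500        # "500 or so", so treat the result as rough

peak_mbps = LINK_MBPS * PEAK_PERCENT // 100     # current peak on the link
own_mbps = peak_mbps - CUSTOMER_MBPS            # what is left once the customer moves
growth_to_return = peak_mbps / own_mbps         # growth factor that puts us back at today's peak

print(f"own traffic: {own_mbps} Mbps, back at today's peak after {growth_to_return:.1f}x growth")
```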

but the problem of this cross connect getting saturated before the he.net uplink will come back when our traffic grows (and as you have noticed, my bandwidth allocations are currently, well, 2007.  I really, really need to start giving customers more bandwidth, which /will/ cause me to use more bandwidth overall.) 

The problem here is that the cross connect is part of a 'router on a stick' configuration, without a big enough stick. I mean, packets from cogent to 1460 (where most of the dom0s are) go from cogent to 250 stockton, over the transport link from egi to suite 1460 at 55 s. market, up the cross connect from 1460 to 1435, down the same cross connect, and then to the server in question.
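Here is a rough sketch of why the stick fills up first: add up the load each flow puts on every hop of its path, per direction. The 500 Mbps figure is the customer traffic mentioned above; the 100 Mbps cogent-to-dom0 flow is a made-up number purely for illustration, and the node names are just labels.

```python
from collections import Counter

def link_load(flows):
    """Sum the Mbps each flow places on every directed hop along its path."""
    load = Counter()
    for mbps, path in flows:
        for hop in zip(path, path[1:]):
            load[hop] += mbps
    return load

flows = [
    # Inbound from cogent to a dom0 in 1460: hairpins up to 1435 and back down.
    (100, ["cogent", "250 stockton", "1460", "1435", "1460"]),
    # Customer traffic trucked from 250 stockton and pushed out he.net.
    (500, ["250 stockton", "1460", "1435", "he.net"]),
]

load = link_load(flows)
print("cross connect up   (1460 -> 1435):", load[("1460", "1435")], "Mbps")
print("cross connect down (1435 -> 1460):", load[("1435", "1460")], "Mbps")
print("he.net uplink      (1435 -> he.net):", load[("1435", "he.net")], "Mbps")
```

In this toy model the cross connect carries everything headed to he.net plus every hairpinned packet, so it is always at least as loaded as the uplink it feeds.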

So, uh, yeah.  three ways to solve this

1. put a router at 55 s. market in suite 1460 and stop using the router on a stick configuration; this is probably the cleanest way to do it, but now we've got another router that can break.

2. get a bigger stick. For like $300-$350 a month (vs the $100-$175 I'm paying for a copper cross connect) I can get an optical link from 1460 to 1435. I then need 10G capable switches in both 1460 and 1435. Figure 2-6 grand, depending on brand, features, and if I go used or not. (I can also just get a second copper link; this will cost about the same monthly, but could be done with our current switches.)

3. traffic engineering.  make sure that all traffic at 250 stockton goes in/out Cogent and all traffic at 55 s. market goes in/out he.net (maybe just prepend so hard that unless a link is completely down, traffic goes in/out the local link?)
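For option 3, the prepend trick leans on one step of BGP best-path selection: all else being equal, the shorter AS path wins. A toy sketch of that one step is below. It assumes each site announces its own prefixes, models a remote network that hears both announcements directly from our upstreams, and uses 64500 (a documentation ASN) as a stand-in for ours; 174 and 6939 are Cogent's and he.net's AS numbers. Real best-path selection has more steps than this, and prepending only steers inbound traffic; outbound would need something like local preference on our side.

```python
OUR_AS = 64500  # placeholder ASN from the documentation range, not our real one

def announce(via, upstream_as, prepend=0):
    """Build the AS path a remote network would see for our prefix via one upstream."""
    return {"via": via, "as_path": [upstream_as] + [OUR_AS] * (1 + prepend)}

def best_path(paths):
    """Pick the shortest AS path; ties go to the first entry (one simplified BGP step)."""
    return min(paths, key=lambda path: len(path["as_path"]))

# A prefix that lives at 55 s. market: announce it plainly via he.net,
# and prepend hard via Cogent so Cogent is only used if he.net is gone.
normal = [announce("cogent", 174, prepend=3), announce("he.net", 6939)]
he_down = [announce("cogent", 174, prepend=3)]

print("both links up:", best_path(normal)["via"])
print("he.net down:  ", best_path(he_down)["via"])
```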



I do have an extra box that could be used as a router right now.   Several, in fact. 

The UnixSurplus guy has some Arista 48 port 1G switches with 4 10GbE SFP+ ports that he might give me a sweetheart deal on (I have optics); he will also rent 'em to me if I want.

so I guess I'm kinda leaning towards the router.  Cheaper.  Of course, I also don't like those HP switches;  I wouldn't mind getting rid of 'em in favour of the aristas.  I wouldn't mind one bit.  But, I generally am not a fan of router on a stick designs, and yeah, it'd save me a good chunk of change.  

 

Anyhow, I need a day off, so I'm taking today off, but I thought customers have a right to know what the hell is going on. Nick is around today and he might set up a router or at least the SPAN port. If not, I'll at least get a server with a MIRROR port set up on that ProCurve on Monday.


About this Entry

This page contains a single entry by luke published on October 13, 2012 2:57 PM.
