srn: April 2015 Archives

For a period of about 16 minutes, from 8:25 to 8:41 PST -0700, cattle and girdle had unreliable network connectivity. We do not provide our own network connectivity at this location and I have no information from our provider on why this happened.

router transition mostly complete

All the BGP announcements and gateways have been moved to the new router. There are still some people routing IPv6 traffic through the old router, which is why it hasn't been brought down. If it is still in use in a couple of hours, I or someone else will reach out to those people individually to figure out what's wrong.

partial network outage

While we were making changes to the BGP announcements, some subset of the internet stopped accepting our routes. For example, from my home ISP and some other providers there were no issues reaching all of our network, but there were issues between our servers at Sacramento and some of our network.

We eventually isolated this to an MTU change to support jumbo frames which should have been OK - the switches and our upstream were configured for jumbo frames and we checked this in advance of making our change.  We believed the MTU not matching our upstream may have been causing some more minor packet loss, which is why we changed it.

We found our BGP session was continually being re-established. Each time the connection timed out, the routes were dropped and immediately picked up again, but the session did not stay up long enough for them to propagate to the transit providers, who apparently do not cache routes.
Since the MTU was supposed to work and there were other changes to the BGP configuration, we only found the problem by process of elimination.
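
We never pinned down exactly how the MTU change broke the session. One common failure mode, offered here only as an assumption, is that TCP sizes its segments from the local interface MTU, so small keepalives get through while full-size route updates are dropped somewhere on a path with a smaller MTU, and the session eventually times out. A minimal Python sketch of that arithmetic, assuming IPv4 and TCP headers of 20 bytes each with no options (the MTU values are illustrative, not our actual configuration):

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one packet on a link with this MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits_path(payload: int, path_mtu: int) -> bool:
    """True if a segment with this payload fits through the path unfragmented."""
    return payload + IPV4_HEADER + TCP_HEADER <= path_mtu


local_mtu = 9000  # hypothetical jumbo-frame setting on the router
path_mtu = 1500   # hypothetical smaller MTU somewhere along the path

mss = tcp_mss(local_mtu)
print(mss, fits_path(mss, path_mtu))  # 8960 False: a full-size segment is too big
print(fits_path(19, path_mtu))        # True: a 19-byte BGP KEEPALIVE still fits
```

If something like this was going on, it would explain why the session could come up and then time out once larger messages were sent.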

When the MTU change was reverted everything started working properly again. Now that this is resolved I am continuing to finalize the network changes.
I expect us to complete decommissioning of our primary router tonight in favor of one with a much more modern version of Linux, as well as much better hardware.

For moving the gateway IPs: testing last night with an unused subnet showed no visible downtime when pinging at a 1 second interval from outside the network to a VM inside the network. Previously we had attempted to add a route from the existing router, but now we are changing which router announces our IPs in advance.

By design, BGP route announcements can come from multiple routers, so changing this should not lead to downtime. We have already moved a portion of our subnets.
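
To make the "no visible downtime" measurement concrete, here is a minimal Python sketch of how a 1 second ping log can be turned into a downtime figure. It is not the actual script used for the test, and the sample data is made up:

```python
def longest_outage(replies: list[bool], interval_s: float = 1.0) -> float:
    """Return the longest run of consecutive lost pings, in seconds."""
    longest = current = 0
    for ok in replies:
        # Extend the current run on a lost ping, reset it on a reply.
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest * interval_s


# Hypothetical sample: True = reply received, False = timeout.
samples = [True, True, False, False, False, True, False, True]
print(longest_outage(samples))  # 3.0
```

Note that with a 1 second interval, an interruption shorter than about a second can fall between two pings and never show up.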

HE.net will also be moving the static routes for the IPs we still have from them to the new router tonight. We will coordinate with their technical support during the transition.

Normally I would give more advance notice, but we are being charged by HE.net for not having completed moving from the old connection to the new one yet and I don't expect the changes to cause any problems.

Update on network maintenance

Due to bugs or otherwise unknown behavior in our current router, it was not adding routes when it claimed to be, so it was not forwarding traffic properly to the new router. There was downtime of about 3 minutes for about 60 hosts and about 20 minutes total for 10 hosts. At one point we restarted the router, which restarted our BGP session, which led to our network being inaccessible for about 10 minutes from the subset of the internet that drops routes very quickly.
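
Since the router reported adding routes that never made it into its table, one safeguard is to read the table back after each change instead of trusting the command's result. A minimal Python sketch, assuming output in the style of `ip route show` on Linux; the helper name and the addresses (documentation ranges) are hypothetical, not our real prefixes:

```python
def route_installed(table: str, prefix: str, next_hop: str) -> bool:
    """Check whether `prefix via next_hop` appears in `ip route show` style text."""
    for line in table.splitlines():
        fields = line.split()
        if (
            len(fields) >= 3
            and fields[0] == prefix
            and fields[1] == "via"
            and fields[2] == next_hop
        ):
            return True
    return False


# Hypothetical table text; on a real system this would come from `ip route show`.
table = """\
default via 192.0.2.1 dev eth0
198.51.100.0/24 via 192.0.2.254 dev eth1
203.0.113.0/24 dev eth2 proto kernel scope link src 203.0.113.1
"""
print(route_installed(table, "198.51.100.0/24", "192.0.2.254"))  # True
print(route_installed(table, "203.0.113.0/24", "192.0.2.254"))   # False
```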

We are probably going to swap around the order of operations to attempt to work around the behavior on the old router, but we won't have a concrete plan until sometime tomorrow.
We weren't able to complete moving the gateways last week because we didn't finish configuring the router in time, so it will be happening starting Friday night at 8:00pm PST. Again, we don't expect downtime per gateway to exceed a minute or so.

We'll be making some other changes to move to the new router, but none of them should result in downtime.

white down again

UPDATE 2015-04-15 06:29 PST -0700: All instances are back up and IPv6 is working.
---
The BIOS error is "HHT Link SYNC Error". We will be finding an alternate chassis to put white into.

13 minute downtime on dish

Luke and I screwed up. I'm sorry. Dish was switched to a new chassis during the Xen security updates and the physical label on the back of the server was not corrected to match. We were attempting to disconnect one of the servers that had been decommissioned and instead pulled power on dish. As soon as this happened we noticed the error and brought it back up.
We are going to be moving gateway IP addresses to a new router, one at a time. Total downtime per gateway should not exceed one minute.

Invoicing problems resolved

We managed to use the blesta API to fix everything. One type of change we had to make, which was removing invoice line items, appeared to expose a bug in blesta that I reported and submitted a fix for. Sorry it took so long but we wanted to make sure that we weren't creating more problems as we were fixing others, as happened last time.

Another update on invoicing problems

Summary: We believe we've identified all the problems and we're working on how best to fix them. They are one-time bugs ultimately related to the import and not something that would have been an ongoing problem. They are also our fault and not the billing system's.

I'm sorry this is taking so long.

More details: We attempted to reproduce the issue where the last invoiced date was not updated and were able to cause different problems, but were not able to exactly reproduce the one we saw.

One of the blesta developers and I concluded that there is a potential problem where, if an invoice creation job is killed in the middle, there could be one duplicate invoice. But there was more than one duplicate invoice, so it could not have been the problem we were seeing. It also would have raised an error in the blesta UI because the invoice creation job would not have been marked as completed.

If the invoice creation dates had not been updated at all, this would have led to an infinite number of duplicate invoices (at least until someone noticed or we ran out of disk space), so this is also not what happened.

Nobody explicitly remembered doing anything to the DB but it seemed like there was no other possibility.

I realized that there was an intermediate database backup between the dates we were looking at. I confirmed that the renewal dates were correct in there. So we must have overwritten the updated service renewal dates with the old service renewal dates.

We believe what happened is that changes were made directly to a stale backup of the service table while someone was working on fixing a different problem, and the stale backup was restored. I also looked through my chat logs and confirmed that someone said they were working on a problem and made changes to the database on that day. So this is a problem with our own procedure and not with blesta.

Preventive action going forwards: don't touch the DB directly now that the import is complete, which means going through either the UI or the API.

Corrective action: find and remove the duplicate invoice lines by either 1. making a list of the people who had their service dates changed by comparing the two database backups, or 2. finding the service ids that show up twice since March 22nd, given we don't have any services on a renewal basis shorter than 1 month. In order to remove the duplicate invoice lines we will probably have to first unapply all the payments from that invoice and then reapply the payments, unless blesta is smart enough to figure it out on its own. We will be testing this on a backup of the database before applying it to production.
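
The two ways of finding the affected services could be sketched roughly as follows. This is a minimal Python illustration on made-up data, not the queries we will actually run; the function names and the shape of the rows are assumptions:

```python
from collections import Counter
from datetime import date


def reverted_services(intermediate: dict[int, date], later: dict[int, date]) -> list[int]:
    """Option 1: service ids whose renewal date moved backwards between backups."""
    return sorted(
        sid for sid, d in later.items()
        if sid in intermediate and d < intermediate[sid]
    )


def repeated_services(lines: list[tuple[int, date]], since: date) -> list[int]:
    """Option 2: service ids invoiced more than once on or after `since`."""
    counts = Counter(sid for sid, invoiced in lines if invoiced >= since)
    return sorted(sid for sid, n in counts.items() if n > 1)


# Hypothetical rows standing in for data pulled from the two backups.
intermediate = {1: date(2015, 4, 22), 2: date(2015, 4, 25)}  # correct renewal dates
later = {1: date(2015, 3, 22), 2: date(2015, 4, 25)}         # after the stale restore
lines = [(1, date(2015, 3, 22)), (1, date(2015, 3, 29)), (2, date(2015, 3, 25))]

print(reverted_services(intermediate, later))       # [1]
print(repeated_services(lines, date(2015, 3, 22)))  # [1]
```

Option 2 only works because no service renews more often than monthly, so two invoice lines for the same service in that window can only be a duplicate.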

Another problem we've found is that there were a number of invoice lines (41) and transactions that have an incorrect association with the invoice table. So how did this happen?

Blesta does not equate the invoice id with the invoice number - these are allowed to be two different values in the case of there being a 'draft' invoice.

A number of dummy invoices were created to absorb payments that weren't originally applied to an invoice. During the creation of these invoices, there was an assumption that the invoice id would equal the invoice number, but there was a bug in our migration process where this was no longer true. The invoice lines and applied transactions were associated with the invoice number, when the actual relationship is with an invoice id. This also left the 41 original invoices as having no associated invoice lines, which is in error.

If there had been a foreign key constraint on the invoice_id columns then this bug would have been prevented.
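
To show what such a constraint does, here is a minimal Python sketch using an in-memory SQLite database. The table and column names are hypothetical and much simpler than blesta's real schema, and SQLite is used only because it ships with Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE invoices (id INTEGER PRIMARY KEY, number INTEGER)")
conn.execute(
    "CREATE TABLE invoice_lines ("
    " id INTEGER PRIMARY KEY,"
    " invoice_id INTEGER NOT NULL REFERENCES invoices(id),"
    " description TEXT)"
)

# One invoice whose id (7) differs from its number (1007).
conn.execute("INSERT INTO invoices (id, number) VALUES (7, 1007)")


def add_line(invoice_id: int, description: str) -> bool:
    """Insert a line; return False if the foreign key rejects it."""
    try:
        conn.execute(
            "INSERT INTO invoice_lines (invoice_id, description) VALUES (?, ?)",
            (invoice_id, description),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(add_line(7, "keyed by invoice id"))         # True
print(add_line(1007, "keyed by invoice number"))  # False
```

A constraint like this only rejects values that match no invoice id at all, so a number that happens to equal some other invoice's id would still be accepted.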

So then, how to fix this? Probably the process will be to unapply the payment, remove the bogus invoice line from its current invoice, add it to the correct invoice, and then apply the payment to the correct invoice. Again, we will do this in a test environment before applying it to production and may need to unapply / reapply all payments before being able to edit the given invoice.

Past due notices

We screwed up and did not customize the dates that blesta sends out reminders about open invoices; it was left at the default time period which was ridiculously short. I apologize for that.

We've disabled the notifications until the invoice bug is fixed (hoping for that to be complete today). Also, whenever they are enabled again, the reminders will be sent out at a more appropriate time interval - no sooner than 10, 20, and 30 days after the invoice is due.
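
As a worked example of the new schedule, here is a minimal Python sketch that computes the earliest reminder dates for a given due date. This is plain date arithmetic, not blesta's configuration, and the names are made up:

```python
from datetime import date, timedelta

REMINDER_OFFSETS_DAYS = (10, 20, 30)  # days after the invoice due date


def reminder_dates(due: date) -> list[date]:
    """Earliest dates on which each reminder may be sent for an invoice."""
    return [due + timedelta(days=n) for n in REMINDER_OFFSETS_DAYS]


# An invoice due April 1st gets reminders no sooner than April 11, April 21 and May 1.
for d in reminder_dates(date(2015, 4, 1)):
    print(d.isoformat())
```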

The suspension job in blesta is off, has been off since the beginning, and will not be turned on until we've had at least a month with no invoice related issues.

About this Archive

This page is an archive of recent entries written by srn in April 2015.
