March 2015 Archives

Update on invoicing problems

I believe we may have found the 'how' of the bug though not the why.

The invoice creation code depends on the last date an invoice was generated for and the time period between invoices. The last date an invoice is generated for is stored in the database.

We have database backups from before and after the first invoice creation job and have confirmed for at least one customer that the last invoiced date did not change. After the second invoice creation job, for at least this one customer the last invoiced date changed to the desired value. If it had not, multiple duplicate invoices would have been created for the affected customer.

Upon successful creation of an invoice, there is a call to update the last date an invoice was generated for. These two actions do not appear to be atomic. It also appears to be possible for the last invoiced date update to fail, and I cannot find any handling of that failure in the caller. I have brought this to Blesta's attention and am waiting for their confirmation.
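To illustrate what "atomic" would mean here, below is a minimal sketch in Python with sqlite3. It is not Blesta's code: the `customers` and `invoices` tables, their columns, and the function name are all hypothetical. The idea is simply that the invoice insert and the last invoiced date update share one transaction, so if the update fails the invoice is rolled back instead of being left behind to be generated again.

```python
import sqlite3
from datetime import date, timedelta


def create_invoice_atomically(conn, customer_id, period_days):
    """Create the next invoice and advance last_invoiced in one transaction.

    Hypothetical schema (not Blesta's):
      customers(id, last_invoiced TEXT as ISO date)
      invoices(customer_id, invoice_date TEXT as ISO date)
    """
    # "with conn" commits if the block finishes and rolls back if it raises.
    with conn:
        row = conn.execute(
            "SELECT last_invoiced FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        last = date.fromisoformat(row[0])
        next_date = last + timedelta(days=period_days)

        conn.execute(
            "INSERT INTO invoices (customer_id, invoice_date) VALUES (?, ?)",
            (customer_id, next_date.isoformat()),
        )
        cur = conn.execute(
            "UPDATE customers SET last_invoiced = ? "
            "WHERE id = ? AND last_invoiced = ?",
            (next_date.isoformat(), customer_id, last.isoformat()),
        )
        # Treat a silently skipped update as a failure rather than ignoring it.
        if cur.rowcount != 1:
            raise RuntimeError("last_invoiced was not updated; invoice rolled back")
    return next_date
```

A unique constraint on (customer_id, invoice_date) would be a second safety net: even if the date update were somehow lost, a later run could not insert the same invoice twice.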

Unfortunately we don't have a concrete reason for why the update of the last invoiced date would have failed the first time and succeeded the second time. We're hoping that Blesta can either give us an explanation or help us reproduce the bug against the database backup from before the first invoice creation job. However, if we can prevent duplicate invoices from being created in the future, this is probably good enough.

To our best knowledge, the duplicate invoices are limited to 161 customers.

After the problem with Blesta is fixed or worked around, we will manually verify the number of invoices created since the initial import from freeside, either with our own code or by having freeside generate invoices again.

Problems with invoicing

Some customers have gotten duplicate charges on their account. We are working on this.

Please note there is no way for us to automatically take money from any of your payment accounts, so until we have it figured out and have sent out an update, please ignore any invoices you receive.

DIY grub2 support (mostly 64-bit) added

Instructions here.

Paravirtualized xen support has been available in grub2 for some time now; I think it's been at least 6 months that I've been talking about deploying it. This week I said "screw it" and actually pulled everything together to get it done. This took a few patches described here.

The biggest, most annoying thing is that 32-bit mode doesn't work under xen 3. I looked and it's just not worth the time and effort to try to make it work. But since our motto is "we don't assume you are stupid" I decided not to hold back 64-bit support from everywhere, despite it potentially being more confusing. This is yet another reason to move from centos 5 to centos 6 on all of our machines.

The pre-generated distributions were designed for pv-grub and will either not work or have hiccups under grub2, which is why using grub2 is DIY. The next time the images are updated, hopefully we'll have centos 6 everywhere so the images can start using it natively. Since legacy grub files are still supported we can make grub2 the default at that time.

Have fun and let us know if you run into any issues or need changes to the default config file.

Billing migration complete

Emails to individual customers with more details will be sent later Sunday because I realized that different emails should be sent to different contact types and the logic for that isn't complete yet.
We expect up to 12 hours of downtime for the billing interface only. No other services will be affected.

mcgrigor down

UPDATE 06:38 PDT -0700: all the domains should be back up. I think it may have been related to a domain being in a reboot loop but am not certain.

The dom0 now has more ram; it will also be getting some swap. xendomains is going to be changed so that it shuts down guests in parallel.
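As a rough illustration of the parallel shutdown idea (xendomains itself is a shell init script, so this is not the actual change), here is a minimal Python sketch. It assumes the guests are shut down with `xm shutdown -w`, which waits for the guest to finish; the function names, the `run` parameter, and the worker count are made up for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def shutdown_guest(name, run=subprocess.call):
    """Ask one guest to shut down and wait for it; returns (name, exit code)."""
    # "xm shutdown -w NAME" blocks until the guest has actually shut down.
    return name, run(["xm", "shutdown", "-w", name])


def shutdown_all(names, run=subprocess.call, max_workers=8):
    """Shut down all named guests concurrently instead of one after another.

    `run` is injectable so the sketch can be exercised without a Xen host.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(lambda n: shutdown_guest(n, run), names))
```

With serial shutdown the total time is the sum of every guest's shutdown time; in parallel it is closer to the slowest guest, which matters when the dom0 needs to go down quickly.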

mcgrigor experienced an OOM condition which caused xend to be killed. This mucked with the networking. As best as I can tell the domains were all OK, but I have not been able to figure out why the networking is having issues in a reasonable amount of time, so I am rebooting mcgrigor.

I/O stutter on hughes

There have been some intermittent problems with I/O on hughes. They happened only while we were zeroing the data on newly allocated logical volumes from the host server.

We had kernel error messages such as

Mar 15 22:04:05 hughes kernel: PCI-DMA: Out of SW-IOMMU space for 65536 bytes at device 0000:03:00.0

This symptom is the same as

but fortunately has not caused a system crash. 

We have not seen this problem with I/O on hughes at any other time. At the time of the next system reboot we will take the opportunity to put in a larger value for swiotlb. If we observe this under normal conditions rather than unusually heavy load we will schedule a reboot within a few hours after that.

We ran into some unexpected problems and were not able to finish before people were too tired to continue. All we had done on the existing billing system was disable some cron jobs and take down some links; those changes have been reverted.

The window for Saturday will again be noon to 11:59 PM.

So what exactly were those problems? One was that our license for blesta was not taking and we weren't sure why. That was traced down to some extremely zealous firewall rules.

A second problem was with paypal credentials. Previously we had been using sandbox credentials to test Blesta. There are different methods of authenticating to paypal and blesta accepts only a subset of those. We were having difficulty with the paypal user interface trying to figure out how to generate the correct type of credentials. That was eventually straightened out.

Third was that a script was not updated, so the migration was performed against a test server and not production. We could have tried copying data from the test server to production; however, since they are different servers with different licenses, it would have risked corrupting the blesta setup. Running the migration takes long enough that everyone would probably be too tired by the time it completed, which is why we decided to roll back the changes and try again this Saturday.

Our billing interface will be going down from about 12:00 PST -0700 for up to 12 hours so that we can migrate to a different system. No other services will be affected. We will send an email directly to customers with more information later today.

Recent security updates

All updates were completed by the embargo end date for the respective advisory.

At least the following two security advisories were covered:

XSA-122 can be thought of as heartbleed in between VMs on the same physical server. Uninitialised ram was being returned to a VM.

XSA-123 is more serious. "Arbitrary code execution, and therefore privilege escalation, cannot be excluded." This appears to be with reference to other VMs running on the same physical server according to

Other recently publicized security advisories do not apply as we do not run any arm or HVM mode VMs and do not allow PCI passthrough.

resolver01 down temporarily

UPDATE 00:24 - it is back.
resolver02 is up; resolver01 is on a host that is getting rebooted. It should be back within an hour.

UPDATE: ticketing system is back up.
The ticketing system is down right now in advance of maintenance later tonight, and will be down sometime between 20:00 and 23:00 PST -0800.

About this Archive

This page is an archive of entries from March 2015 listed from newest to oldest.