On server migrations…
Server migration: there’s a term bound to strike fear into the heart. Horror stories of downtime and lost data abound. This post looks at the good and bad of our migration from dedicated to Cloud hosting.
The story so far
At Freshleaf we provide web hosting, as a logical add-on to our web development services. Most clients prefer to outsource responsibility for their hosting alongside management of the website. Some want to have some access and input, but most want a fully managed service.
Historically, we’ve always re-sold Rackspace’s fully managed dedicated hosting. Although expensive, the support was rock-solid and the whole setup was reliable and trouble-free. Prospective clients would sometimes ask me about downtime - as though that were a necessary evil to be expected sometimes. Not with our hosting, it wasn’t.
Still, times move on. Specifically, in this case, newer PHP versions are released, and older ones are retired. You can’t continue to run websites on a scripting language that no longer has security support; but if you’re on a dedicated box, there's no option to simply "update".
So at the start of last year we took the opportunity to review our hosting requirements, and look at what the options were. Although our hosting doesn’t need to be bleeding-edge, dedicated was starting to feel a little restrictive. Firstly, there was a strong desire not to find ourselves here again – facing a mass migration because the flexibility isn’t there to update things in a granular way. But in addition to that, there was a feeling that maybe we were falling behind best practice. That by staying with essentially the same solution for as long as we have, we were quite probably missing a bunch of technology and process improvements that newer options have to offer.
Current situation & risk assessment
The first step was to review and understand our requirements. The current server was effectively End of Life because it could not be upgraded from PHP 5.6 to PHP 7 or later. We had approximately 50 sites on the server, most of which comprised a staging and a production environment. The sites were of various sizes and spanned several different platforms, some modern and some legacy.
Given the requirement to upgrade PHP version, a move of some kind was clearly necessary - so what were the risks? Well, at first assessment, the risks seemed to be manageable. The applications which were to be moved were not business-critical tools, nor did they contain masses of incredibly sensitive data. All the applications were reasonably downtime-tolerant. The systems were not overly complex, and although many were legacy, we could boast good familiarity with all. We also didn’t expect any compatibility issues – all could be expected to run in the new environment without major surprises – providing, of course, we selected the right environment. There would be some learning curve if we selected a new approach, but that was an acceptable investment if there were deemed to be enough technological advances to justify the switch.
What solution, and with whom?
In the end, Cloud hosting looked like the right approach. It allows for easy compartmentalisation of resources, something that is often expensive with dedicated hardware, which in turn means no more time-consuming mass migrations – an important consideration. Each container runs its own copy of PHP, meaning that each site can be upgraded individually. Containerisation also means that each site is completely sandboxed from other sites and environments, and gives us freedom to experiment with new software or services. It allows better resource management and better monitoring across the system.
Additionally, Cloud is expected to be cheaper, more scalable, and more failure tolerant (more on that later).
Having decided on the approach, we needed to select a provider. We looked at what was offered by each of a shortlist of the top seven providers, and at what cost. We also scored each provider – as best we could initially without trying each service – on a matrix of desirable features:
- Customer support
- Apparent technological capabilities
- Available range of services
- Market share
Based on our scoring, two providers came out clear top. AWS, the accepted market leader, scored equally with the less well-known offering from IBM.
At Freshleaf, we’re not ethical consumerism warriors, but we do try to be true to our values by choosing to do business with organisations that are aligned with our own beliefs and goals. Amazon, and therefore by association AWS, appears to have some business practices which – all other things being equal - we preferred not to support.
Based on those concerns, we opted to go with IBM.
Being completely new to us, getting to grips with IBM Cloud and Kubernetes had a fairly steep learning curve – but that was to be expected, and judged to be acceptable. Once we had a handle on the new setup, the plan for the migration was fairly simple: group sites into types, migrate and upgrade one of each type, test & fix as required, then document the process in order to create a guide for the remaining sites of that type.
Sites were estimated to take ~5hrs to migrate and upgrade, so we’d be looking at a total of ~300 hours, some of which was planning time and time invested in learning a new system. For a necessary hardware and software upgrade plus the process improvements we’d be getting, that seemed a reasonable number.
However, as we got into the process, there seemed to be a roadblock around every corner. The test implementations took a while to set up, but they were the proofs of concept and incorporated a fair amount of learning curve, and we started out sure that things would get simpler. But they didn’t. Everything took longer than expected, and the migration was beset with issues and unexpected complications. We found ourselves way behind schedule, and sites that were estimated to take around 5 hours were taking over 20 hours, increasing the scale of the whole migration.
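To put rough numbers on that overrun, here is a minimal sketch of the arithmetic. The site count (approximately 50) and the per-site figures (5 and 20 hours) come from the text above; the 50 hours of planning and learning overhead is an assumption, inferred as the gap between 50 sites at 5 hours each and the ~300-hour total. The function and variable names are purely illustrative.

```python
# Site count and per-site hours come from the post; the planning overhead
# is an ASSUMPTION inferred from the ~300-hour total (300 - 50 * 5 = 50).
SITES = 50
PLANNING_HOURS = 50  # assumed, not stated in the post


def total_hours(per_site_hours: float, sites: int = SITES,
                overhead: float = PLANNING_HOURS) -> float:
    """Total effort: fixed overhead plus per-site effort times site count."""
    return overhead + per_site_hours * sites


estimated = total_hours(5)    # the original per-site estimate
projected = total_hours(20)   # if every site took 20 hours

print(f"estimated total: {estimated:.0f} h")
print(f"total at 20 h per site: {projected:.0f} h")
print(f"difference: {projected - estimated:.0f} h")
```

On those assumptions, a 15-hour miss per site turns a 300-hour project into one of roughly 1,050 hours if every site ran at the higher figure. It is a rough indication only, but it shows how quickly a per-site error multiplies across the whole estate.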
But that wasn’t even the worst of it. As time went on it increasingly seemed that the new setup was less than optimal. Yes, the test sites ran faster, but we had some stability issues which we’d never experienced before with the old dedicated setup. There were some unexplained hiccups with persistent storage, but the biggest issue turned out to be insufficient redundancy.
At a headline level, Cloud boasts “more resilience”. But there are two important pieces of context that go along with that:
- Cloud hosting – at least the setup we’re dealing with – is inherently less stable than dedicated hardware. The cost of the improved flexibility of Cloud/Kubernetes is complexity: more moving parts, and therefore more to go wrong.
- To attain resilience, you need to configure correctly, and plan for a good deal more redundancy than with dedicated.
The particular issue we ran into with IBM was with the database. Most elements were configured with fail-over, which seemed in the main to work. But setting up database fail-over is complex and time-consuming, and our preferred approach – a managed service – wasn’t offered by IBM.
At this point we had migrated a handful of production sites in addition to the test sites, but we were unhappy with the stability of the new setup. In addition, we had some fairly significant frustrations with IBM’s account management portal, account-level support, and its billing system. Nothing about this was feeling right.
Time to admit it: IBM was the wrong choice. We selected their service over AWS “all other things being equal”, but as it turns out, all other things are not equal. It seems that AWS, as the market leader, has been able to invest significantly more in its infrastructure and services than its competition. And inevitably that manifests itself in the quality and breadth of the offering, including a managed database service, sensible billing that runs on time and can be easily understood, and a support portal that doesn’t send you in frustrating circles.
It became obvious that – time already spent notwithstanding - migration to IBM must halt, and setup with AWS must be investigated.
Where we are now
Some 12 months after we started the process – and after a considerable lull while we assessed the situation and dealt with other priorities – we have nearly completed the migration to AWS. The whole process, mainly the misadventure with IBM, has been quite tough, but we are confident in the new setup on AWS. The headline outcome of the server migration is that we achieved all the goals we outlined at the start:
- Virtually no downtime/service interruption
- Zero data loss
- Successful PHP version upgrade
- Websites seeing 2x performance gain
- Greater flexibility to manage sites in a granular way (including no more major migrations)
- Better monitoring, encryption, failure tolerance
- A number of as-yet not fully realised process and technology improvements
Additionally, we learned a handful of things:
- Sometimes the market leader is the market leader for a very good reason.
- Server migrations are complex, and easily under-estimated. Anyone in development circles will tell you that estimating is hard and under-estimating is easy. However, with something like this, where per-site estimate inaccuracies are multiplied by the number of sites to be migrated, under-estimating can become seriously problematic.
- Planning and risk-assessment for the process should be shared across a number of individuals in the team, and should include top-level research as well as technical and practical planning. This ensures a wider and deeper understanding of the implications. In our case, it’s possible that a broader understanding of the stability vs flexibility implications of Cloud would have flagged up the importance of a managed database service earlier in the process.
- It’s well worth getting as many test implementations in place as you can before committing to any of the big decisions. Test data in test environments won’t uncover all of the real-world gremlins that can crop up with something as complex as a server migration. But it will shake loose a few, and it will also give you a better ‘feel’ for what you’re committing to.