Brazil Datacenter Migration Adventure
The year was 2012. I was working at a company that had every single piece of its hardware co-located in a datacenter in Rio de Janeiro, and I was the infrastructure manager responsible for maintaining it. Making sure that everything was running smoothly was part of my job, as was hunting for new hardware deals and finding a better place to run the hardware, which is where this story begins.
If you have ever been to Rio de Janeiro, you know that services in that city are not the best. Actually, they are far from it, and when it comes to datacenters, it is even worse. I had around 50 servers and some switches in 4 racks, from firewalls to Hadoop clusters, all of them running on JBOD (just a bunch of disks) in cheap chassis: nothing fancy, no redundant power supplies, nothing. We were using a chassis brand called Nilko, made in Brazil, so poor in quality that I once had to go straight from the server assembly room to the hospital because I had cut my finger on a chassis, but that is a story for another post…
As I said, part of my job was also looking for better datacenter deals, and I found a datacenter in São Paulo that offered better service at a lower cost, so we decided to migrate. If you think that migrating a cloud computing architecture from one region to another is complex, imagine carrying the actual servers from one datacenter to another in a moving truck for 600 km overnight, and having to reassemble everything in the new datacenter while service availability is partially degraded.
Wait, a moving truck? Yes. That’s what we could afford, and without insurance! If a cloud computing architecture, where everything is logical, is not trivial to migrate, imagine migrating servers in poor-quality chassis, replicating the firewall rules, and doing everything possible to get both infrastructures operating concurrently as a multi-site active-active setup as fast as possible. Although the logical part is very exciting, this post will focus on the adventures in the physical world; the logical considerations will be posted later.
After working out the logistics, we decided to move 50% of the servers overnight. We chose that time frame for an obvious reason: it was when fleet utilization was at its lowest. As soon as the day started, though, users would wake up and demand would start increasing, eventually requiring both infrastructures to operate at full throttle. Per our Zabbix statistics, that would happen at about noon, which gave us a rough deadline of 12 hours for the whole migration. From 12 AM to 12 PM? Challenge accepted.
If you are not familiar with Brazil’s roadway system, this is how it works: