Creating a distributed data centre architecture using Kubernetes and containers
The seed for this project was first planted when a question was posed while brainstorming ideas around a whiteboard:
What if we could break up our CFC’s data centre into smaller nodes?
If the CFC (Customer Fulfilment Centre) name doesn’t ring a bell, that’s what we call the highly automated warehouses where our groceries are stored, picked, and sent off for delivery to Ocado customers.
The idea was bold and would revolutionise the way we approach on-site data. If possible, this would eradicate the need for the data centres and the network routers entirely, considerably downscaling the system, saving valuable time spent on maintenance, and, of course, cutting costs and energy consumption. Most importantly, it would be an elegant solution, as the resulting nodes could also run every other element of the warehouse from display screens to pick stations independently. However, to do this, the system would have to be fail-safe, flexible, and simple to implement. That solution may have now been created – enter Kubermesh.
Before Kubermesh: OSP (Ocado Smart Platform) CFCs would use a sizeable data centre
Kubermesh: an elegant solution to running our OSP warehouses
Kubermesh is a bare-metal, self-hosted, self-healing, self-provisioning, partial-mesh network Kubernetes cluster. A bit of a mouthful, I know. But before we introduce Kubermesh in more detail, let’s take a step back and look at how Ocado Technology first started using the Kubernetes container management system.
Kubernetes was initially introduced at Ocado Technology as a container management system for the Code for Life project. Code for Life is a non-profit initiative that delivers free, open-source games designed to help teachers deliver the new computing curriculum and introduce children to coding. The Rapid Router game, designed for primary schools, has more than 90,000 users internationally and that number is growing.
When the team started working on a second multiplayer game aimed at teenagers, they realised that the game characters needed to be able to navigate the map and execute actions while the students were logged out. It was not practical to run that part of the game on personal computers, so Kubernetes was used instead. Given Code for Life’s large user base, the team needed a system capable of continuously processing and managing large sets of data to keep the game running smoothly.
The Code for Life website and Rapid Router were already hosted on the Google Cloud Platform, which supports Kubernetes. This made Kubernetes the obvious choice when looking for a container management system to run many students’ code across many containers in a single cluster, hosted on virtual machines in the cloud.
Mike Bryant, an IT team leader who helped implement this system for Code for Life, realised the potential of using Kubernetes outside a cloud platform for the specific purpose of streamlining the data systems at Ocado Technology. This was the beginning of the Kubermesh project. Mike quickly designed a working prototype and started developing it further.
How does Kubermesh work?
Kubermesh arranges the cluster as a partial-mesh network, which could allow us to remove the data centres, the dedicated network equipment, and other machines running around the warehouse, leaving only the computing nodes and the fibre optics connecting them.
Using Kubermesh, different computers in the warehouse can be connected to form a distributed data centre
Our largest grocery CFC in Erith spans 563,000 sq ft and would require 400 of these nodes randomly dotted around the warehouse and wired together to create the mesh. The apps deployed on the nodes could then be strategically placed near other apps they would often communicate with for optimal speed and performance. A node could be any computing equipment typically found in our warehouse, ranging from dedicated servers or Intel NUCs to workstations in pick aisles or PCs used to display engineering-related information on overhead displays.
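In Kubernetes terms, this kind of co-location can be expressed with pod affinity rules. The manifest below is a hypothetical sketch (the application names are invented for illustration, not taken from Kubermesh) showing how a pod could ask the scheduler to place it on the same node as the pods it talks to most often:

```yaml
# Hypothetical example: ask the scheduler to prefer co-locating this pod
# with pods labelled app=pick-station-backend on the same node.
apiVersion: v1
kind: Pod
metadata:
  name: pick-station-ui
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: pick-station-backend
          # the "hostname" topology key means "the same node"
          topologyKey: kubernetes.io/hostname
  containers:
  - name: ui
    image: example/pick-station-ui:latest
```

A `preferred` (rather than `required`) rule keeps the co-location soft: if the preferred node has no spare capacity, the pod still schedules elsewhere.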
The Ocado Smart Platform CFC includes thousands of robots roaming on top of a grid
The self-provisioning element offers remarkable flexibility: an additional node can simply be connected up and, voila, within a few minutes the new resource is incorporated into the cluster. It then presents itself as extra capacity, so containers can be scheduled onto the new node as required.
The benefit of a self-healing system is fairly obvious: dependability. If our CFCs run across nodes in a distributed data centre, we need to know that there is little to no risk of those nodes letting us down. All the pieces of the puzzle within the mesh work independently, so if someone accidentally runs over a node with a forklift truck, for example, the system as a whole will still be online and fully functioning. In fact, it would take over a third of the whole system going offline before functionality would be compromised (that would mean a lot of run-ins with a lot of forklift trucks!).
So how does the whole system keep running efficiently if one node goes down? If this were to happen, the other nodes would notice and re-route their communications around the failure. Any containers lost along with the fallen node would then be recreated, and the necessary copies distributed among the remaining nodes.
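Much of this recovery behaviour comes for free from Kubernetes workload controllers. As an illustrative sketch (the application name and image are invented, not from Kubermesh), a Deployment that declares a desired replica count will have any replica lost with a failed node recreated on a surviving one:

```yaml
# Illustrative sketch: a Deployment declaring desired state.
# If a node dies, Kubernetes notices the missing replicas and
# schedules replacements on the remaining healthy nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: routing-service        # hypothetical warehouse app
spec:
  replicas: 3                  # desired state: three copies of the app
  selector:
    matchLabels:
      app: routing-service
  template:
    metadata:
      labels:
        app: routing-service
    spec:
      containers:
      - name: routing-service
        image: example/routing-service:latest
```

The cluster continuously reconciles actual state against this declared state, which is what makes "run over a node with a forklift" a recoverable event rather than an outage.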
The benefits of using Kubermesh
Implementing Kubermesh at our CFCs would reduce capital cost considerably, both by downscaling the physical hardware required and by reducing maintenance needs thanks to the independent nature of the system. Using Kubernetes within the Kubermesh project makes deploying software faster and easier, increasing the efficiency of the overall system. Kubernetes also exposes a single API through which anyone can manage containers using the tools and services of their choice. This means we could potentially switch to an alternative cloud platform without users of the API noticing any change – the UX remains the same. This eliminates the disruption a change of cloud platform could cause, giving us flexibility and choice when it comes to providers.
Kubermesh can essentially be seen as the glue holding many elements together, one of which is of course Kubernetes, run in OpenStack on bare metal servers. For those of you who want to know more about how Kubermesh works under the hood, we talked to Mike Bryant to find out what other elements Kubermesh incorporates and how they work together.
“For the underlay network, we’re using OSPFv3 on IPv6, provided by the Quagga Software Routing Suite. This streamlines configuration, as we don’t need to configure individual point-to-point links; IPv6 handles this using link-local address auto-configuration. The overlay network is custom (but using flannel for IP allocation), as none of the standard providers currently support an IPv4-over-IPv6 overlay.
“For bootstrapping the self-hosted environment we use bootkube, by CoreOS. Adding nodes is done by iPXE booting over IPv6, then using bootcfg to provision the base software, including the Kubelet. Once up and running, the node uses anycast routing to find the apiserver and bring up the rest of the stack. One of our most visible features is of course der blinkenlichten: for our test environment we use blink(1) lights to communicate node status.”
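To make the underlay description a little more concrete, here is a hedged sketch of what a minimal Quagga `ospf6d` configuration for one mesh node might look like. The interface names and router ID are invented for illustration; the actual Kubermesh configuration lives in the project repository.

```
! Illustrative ospf6d.conf fragment (Quagga) for one mesh node.
! No per-link addressing needs configuring: OSPFv3 runs over the
! IPv6 link-local addresses each interface auto-configures.
router ospf6
 router-id 0.0.0.1
 interface eth0 area 0.0.0.0
 interface eth1 area 0.0.0.0
```

Each fibre link to a neighbouring node just needs its interface listed in the OSPF area, which is what makes wiring an arbitrary partial mesh so low-friction.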
The project is free and open source, so for more in-depth information, visit GitHub.
The combination of the unique case presented by the retail-technology environment at Ocado and some out-of-the-box thinking has produced a platform with incredible potential to streamline our systems while increasing efficiency and dependability.
We’re looking forward to the next stage: implementation.