How to build for the future without any downtime
By Marcus Smedman, Unibet Group CTO
In 2010, Unibet Group decided to move to a new technical architecture. The existing backend services were built as a monolith that could not scale to handle the growing load from an increasing number of active customers, nor could it provide the flexibility to adapt its capabilities to upcoming country-specific requirements.
We knew that entering each country with its own regulations would require a platform that could handle changes and adaptations while providing stability for our strong brands.
A monolith is a computer program (“an application”) built as a single unit that manages many different tasks. It becomes brittle as it grows too large, because a change to one user journey can easily affect another user journey negatively (bugs, performance, etc.). Also, the only way to scale a monolith (that is, to manage more customers and transactions) is to buy more powerful hardware for it to run on. Eventually no more powerful hardware is available on the market, and you max out the capability of the application.
Microservices and the new architecture
The new architecture chosen was an “Event Driven Architecture (EDA) with Microservices” (http://martinfowler.com/articles/microservices.html). The idea of a microservice is to isolate certain functionality (“separation of concerns”) in a smaller application that “serves” its functionality to other applications, i.e. “a service”. Thus, the move from a single large application to many smaller (“micro”) services started in 2010, and today it is fully rolled out with more than 150 microservices running in production. Many companies would like to get there, but it is known to be very hard to implement (see “Microservices envy”, https://www.thoughtworks.com/radar/techniques/microservice-envy). We are now an organisation with 5+ years of experience doing this, and we are good at it!
EDA – Event Driven Architecture
The event driven approach is fundamental to the new architecture. The idea is that a service can communicate with other services in an asynchronous (“async”) fashion where it makes sense, and not only synchronously (“sync”), as was the default in the older architecture.
Example: Compare an SMS (async) with a phone call (sync). Often you do not need to wait for a response before continuing with what you are doing (SMS), but sometimes you must wait for a reply (phone call). Further, sometimes you want to broadcast a piece of information to many recipients as a notification that needs no response at all, similar to how Facebook works.
In EDA you talk about producers and consumers. A producer generates events and places them on an “event bus”, and a consumer registers itself onto the bus and tells it what events it is interested in. This architecture enables a very flexible way of expanding the functionality of the platform, as many times you only need to add a new consumer to gather information it needs and then act upon it within its context to fulfil a business need.
One example is the new Big Data solution in production. How could we get all the data generated by all services to flow into a big “data lake” that can then be mined for interesting analytics? Following the EDA pattern, a new microservice called “Big Data Publisher” was built that registers itself on the event bus as a consumer and says “I am interested in this info and that info”. The Big Data Publisher service then consumes the relevant events and puts all this info into the Data Lake. No other services are impacted, and the risk that rolling out this functionality breaks an existing user journey is kept to a minimum.
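The producer/consumer idea can be illustrated with a minimal in-memory sketch. This is not the platform's actual event bus: the `EventBus` class, the event names (`customer.registered`, `bet.placed`) and the `welcome_notifier` consumer are all hypothetical, and for simplicity the handlers are called in-process rather than asynchronously.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, object]
Handler = Callable[[Event], None]


class EventBus:
    """Toy in-memory event bus (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # Maps an event type to the consumers registered for it.
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        """A consumer tells the bus which events it is interested in."""
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: Event) -> None:
        """A producer places an event on the bus; the producer does not
        know (or care) which consumers receive it."""
        for handler in self._subscribers.get(event_type, []):
            handler(payload)


bus = EventBus()
data_lake: List[Event] = []          # stand-in for the real Data Lake
welcome_messages: List[str] = []     # output of a hypothetical existing consumer


def welcome_notifier(event: Event) -> None:
    welcome_messages.append(f"Welcome, {event['name']}!")


def big_data_publisher(event: Event) -> None:
    data_lake.append(event)


# An existing consumer is already registered.
bus.subscribe("customer.registered", welcome_notifier)

# Adding the new consumer needs no change to producers or other consumers.
bus.subscribe("customer.registered", big_data_publisher)
bus.subscribe("bet.placed", big_data_publisher)

bus.publish("customer.registered", {"customer_id": 1, "name": "Ada"})
bus.publish("bet.placed", {"customer_id": 1, "stake": 10})
```

After the two `publish` calls, `data_lake` holds both events while `welcome_messages` holds only the registration greeting. The point of the design is that the new consumer is added purely by subscribing.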
A load balancer (“LB”) is a computer program that dispatches traffic from one input to several outputs. Having several instances (at least two) of a microservice with an LB in front of them means that the LB balances the load between those instances.
Example: In the platform there is a microservice called “Custard”, which creates, stores and updates a customer profile. In the production environment we have several instances of Custard, all exactly the same, with an LB in front. All calls to Custard are then evenly distributed/balanced between the instances.
Let’s say we have 5 Custard instances and 1000 registrations coming in, and we choose to balance the load evenly between the instances. Each instance then manages 1000/5 = 200 requests. Now, if we find that these 5 instances cannot manage the total load of the 1000 registrations, we just add one new instance of Custard behind the LB. With that, each Custard handles about 167 registrations (1000/6 ≈ 167). If that is not enough, we just add another one. This way of scaling is called “horizontal scaling”, as the scaling is done by adding more and more instances of the same microservice behind the single load balancer.
The opposite approach is “vertical scaling”, which is how you scale a monolith: the only option is to buy more powerful hardware or add more memory to the computer.
Horizontal scaling is very powerful and cost effective, as each instance of the microservice can run on simpler, cheaper and smaller hardware, and there is no (theoretical) limit to how far you can scale.
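The registration arithmetic from the Custard example can be checked with a small sketch. It assumes a simple round-robin policy, which is one common way of balancing evenly. The text does not say which policy the real load balancer uses, and the instance names are made up.

```python
from collections import Counter
from itertools import cycle
from typing import Dict, List


def distribute(requests: int, instances: List[str]) -> Dict[str, int]:
    """Hand out requests to instances in round-robin order and
    return how many requests each instance handled."""
    handled: Counter = Counter()
    targets = cycle(instances)  # endlessly repeats the instance list in order
    for _ in range(requests):
        handled[next(targets)] += 1
    return dict(handled)


# 1000 registrations over 5 instances, then over 6 after scaling out.
five = distribute(1000, [f"custard-{i}" for i in range(1, 6)])
six = distribute(1000, [f"custard-{i}" for i in range(1, 7)])
```

With five instances every instance handles exactly 200 requests. With six, 1000 does not divide evenly, so four instances handle 167 requests and two handle 166.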
Being true by being stateless
In the scalability example, you can see an issue with having several instances of the same service: what if the instances do not act on the same information all the time, or each of them holds its own data and state? If this were allowed, each instance would have its own version of the “truth”. As you may have guessed, that would not work.
This situation is managed by having all instances share one and the same truth, by introducing a shared data store or “database”. A database is a computer program designed for storing, updating, retrieving and deleting information. When all instances of the same service share one database, the instances are just workers that do not own the information, so we can continue to add more workers/instances behind the LB.
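A minimal sketch of the idea, assuming an in-memory dictionary as a stand-in for the shared database. The class names `SharedStore` and `CustardInstance` and their methods are hypothetical and not taken from the real Custard service.

```python
from typing import Dict, Optional

Profile = Dict[str, str]


class SharedStore:
    """Stand-in for the shared database: the single source of truth."""

    def __init__(self) -> None:
        self._profiles: Dict[int, Profile] = {}

    def save(self, customer_id: int, profile: Profile) -> None:
        self._profiles[customer_id] = dict(profile)

    def load(self, customer_id: int) -> Optional[Profile]:
        return self._profiles.get(customer_id)


class CustardInstance:
    """A stateless worker: it keeps no customer data of its own and
    delegates every read and write to the shared store."""

    def __init__(self, name: str, store: SharedStore) -> None:
        self.name = name
        self._store = store

    def update_profile(self, customer_id: int, profile: Profile) -> None:
        self._store.save(customer_id, profile)

    def get_profile(self, customer_id: int) -> Optional[Profile]:
        return self._store.load(customer_id)


store = SharedStore()
custard_1 = CustardInstance("custard-1", store)
custard_2 = CustardInstance("custard-2", store)

# A write handled by one instance is visible through any other instance.
custard_1.update_profile(42, {"email": "ada@example.com"})
seen_by_other_instance = custard_2.get_profile(42)
```

Because neither instance holds state, it does not matter which one the load balancer picks for a given request.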
Redundancy, availability and load balancing
By using the power of the load balancer, we can add and remove instances of a service, and let the load balancer know which instances we have. The LB then distributes the traffic only to the instances we have marked as available. This opens up further opportunities.
Example: We have 5 instances of Custard (all running the same version) behind an LB in production. When we release new functionality, we remove one of the instances from the LB, thus running the platform with 4 Custard instances for a short period of time. We then update the instance we just removed from the load balancer to the new version of Custard and put it back in the load balancer. We do this 4 more times, once per remaining instance, and we have now updated the Custard service from one version to another without disrupting the service as seen from a customer perspective.
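The rolling update in the example above can be sketched as a simple loop. This is only a model of the procedure, not the real deployment tooling: the function name, the instance names and the version strings `1.0` and `1.1` are assumptions made for illustration.

```python
from typing import Dict, List


def rolling_update(versions: Dict[str, str], new_version: str) -> List[int]:
    """Upgrade instances one at a time (updating `versions` in place).

    Returns the number of instances still in the load balancer while
    each individual instance was being upgraded.
    """
    in_rotation = set(versions)
    capacity_during_update: List[int] = []
    for name in sorted(versions):
        in_rotation.remove(name)                        # take it out of the LB
        capacity_during_update.append(len(in_rotation))
        versions[name] = new_version                    # upgrade the idle instance
        in_rotation.add(name)                           # put it back in the LB
    return capacity_during_update


fleet = {f"custard-{i}": "1.0" for i in range(1, 6)}
capacity = rolling_update(fleet, "1.1")
```

For a fleet of five, `capacity` shows that four instances were serving traffic at every step, and afterwards every entry in `fleet` is on the new version.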
Uptime & Releases
By using the approach described in the example above, we did more than 1200 releases to more than 100 microservices during 2015, without a single moment of downtime for any of the user journeys, i.e. without any of our customers being affected.
You might wonder what the benefit of so many releases is. Surely it is more work to handle, and it must mean more risk? The answer is actually very simple: each release contains a very small change, and because a microservice has very simple functionality, testing it is easy and not very time consuming. By releasing more often we avoid the risk of complex verifications, and the change itself is quite manageable. However, each release involves many steps, and we have addressed this issue as well. We have automated the management of the release process, and we call this “the pipeline”.
The automated pipeline
Today we have more than 150 microservices in production, all with separation of concerns, different teams working on them, and different Product Owners managing their backlogs. With so many services to manage, an automation tool is key to making it work. You can even say that you really don’t have a microservices architecture in place until you have a pipeline to manage continuous delivery (https://en.wikipedia.org/wiki/Continuous_delivery) of all new updates to the services. In 2012 we got the first pipeline running.
The pipeline is a number of different programs that interact to automate everything around releasing new versions of a service/application to production. The user interface for the pipeline is a web page, used by developers and others who do the actual deploys into production. The pipeline web page abstracts away the underlying complexity, so you can move a change into production with a single click. When the click is made, a set of automated tests is run, the relevant Jira tickets are automatically generated, the changes to databases are made automatically, and the rollout described in “Redundancy, availability and load balancing” above is also automated.
The pipeline not only makes this possible, it also reduces the risk of human error and cuts the time for a release to a fraction of what it used to be.
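As a rough sketch of how such a pipeline might chain its steps, the snippet below runs named stages in order. The stage names mirror the steps listed above, but the `run_pipeline` function and the stop-at-first-failure behaviour are assumptions. The text does not describe how the real pipeline handles a failed step.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a step that reports success (True) or failure (False).
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run the stages in order and return the names of those that
    completed. Assumption: the run stops at the first failing stage."""
    completed: List[str] = []
    for name, step in stages:
        if not step():
            break
        completed.append(name)
    return completed


# Placeholder steps that always succeed; real steps would call out to
# test runners, the ticketing system, database tooling and the LB.
release_stages: List[Stage] = [
    ("run automated tests", lambda: True),
    ("generate Jira tickets", lambda: True),
    ("apply database changes", lambda: True),
    ("rolling deploy behind the load balancer", lambda: True),
]

completed = run_pipeline(release_stages)
```

Under this assumption a failing test stage prevents anything from reaching production, which is one way a pipeline can reduce the risk of human error.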
Example: To really understand how far we have come, I can share that when we had the monolith back in 2010, we managed to get 24 releases out to production per year. Each release required staff to work in shifts, from 7 AM to 3 PM and from 11 AM to 7 PM, and many times the release had to be rolled back due to a bug found in production. Now, five years later, with all the changes and improvements discussed, we did 1200+ releases last year, and the trend for 2016 points to 2000+!
To summarise, an event driven architecture with load balanced, stateless microservices gives us a very scalable, extensible and resilient technical platform. The pipeline enables us to do thousands of releases to hundreds of microservices each year, all without any downtime for our customers. These are the fundamentals for being able to grow the platform as we do, at an optimised cost and with very high quality.
Copyright Unibet Group 2016. All rights reserved.