The Microservices Retrospective That Made a Serverless Platform

Oftentimes, the solution to a problem starts with a different perspective, thinking outside the box, or re-framing the problem. At Ask Media Group, we’ve challenged ourselves not to focus immediately on finding solutions, but to look at what’s in the way of making it easier to build and maintain software. Some of this can be seen in the questions we ask and how we answer them. What would it look like if every story in a sprint took only one day to do? One answer might be that most stories require more than a day, so it’s not possible. One of our agile teams, however, chose to answer this question differently. They claimed that impediments keep us from creating such small stories, and that answer led them to start identifying and removing those impediments. The result was very satisfying for the team, because planning became easier and re-prioritizing by swapping stories became trivial. It is common, in the retrospective process of Scrum, to answer questions in this way. The most important part of this type of answer is that it leads us to grow, improve and become better practitioners. This year has been like a retrospective on the microservices development platform we introduced last year. In this post, I write about how, as we acted to improve the platform, we accomplished much more than we expected.

While we may have started down the path of the microservices architecture by simply following the lead of others who advocated for it, challenging ourselves to identify impediments opened a door to a whole different way of thinking and solving problems. We are getting a glimpse of what our system will look like when every service in it is micro, and we are leading ourselves forward. It is now normal for all our developers to be directly involved in deploying software to production by simply pushing a button. In some cases, software is continuously deployed as changes are made to the repository.

This past year, we added a new set of microservices that are different from the ones we had dealt with so far. We refer to these new services as jobs. There are many of them and they don’t run all the time. Supporting them required us to look at what had worked with microservices so far and what hadn’t, so that we could take action to improve the platform for this new type of service. As we prepared to increase the number of microservices deployed, we had to decide how we would manage the dramatic change in our software release process. We were used to teams deploying a few monoliths in a highly coordinated manner. How were we going to support developers independently deploying hundreds of services at any time, any number of times in a day? In a peak month this year, we deployed 394 times. While Kubernetes, Docker, GitLab and linkerd helped us with the mechanics of managing a large number of projects and deployments, the idea of breaking up all the monoliths and releasing often and randomly seemed irrational.

Some people were not convinced that turning jobs into microservices was a good idea, so we needed to explain what we intuitively knew. The explanation started with this old article about pets vs cattle. We explained that we are focused on building cattle instead of pets, not on multiplying the number of pets to support. For us all to continue down the microservices path we had started, we all had to see that our architectural transformation was not going to be random and haphazard. Furthermore, we were getting rid of more than pets. We wanted to get rid of monoliths, but not simply by breaking them up. We wanted to decompose the monoliths into collections of services that share common interfaces, like the variety of blocks available in the world of Minecraft. Creating new services should be like creating new Minecraft blocks.

We were able to point to existing microservices to explain how much easier it is to manage releases. Because the services were part of a collection, we were able to manage them collectively. We reasoned that a problem for a collection of a hundred services only has to be solved once, not a hundred times. We further reasoned that when we decompose a monolith into a collection of services, the size of our problem is reduced to the size of a single service, not the total size of all services combined. Also, managing the release of our microservices does not require coordination. In fact, releasing our microservices carries significantly lower risk than releasing our monoliths, and it is much easier to add automated checks and safeguards. Our services were created as building blocks, and this made it easy for developers to independently deploy hundreds of services at any time and any number of times in a day. We did identify one impediment, though: our archetype system, which we used to create services as building blocks, was too hard to use.

To improve how we created new services as building blocks, we first provided a simple web form to create new projects from an archetype. We identified three primary types of building blocks: the server, the worker and the application. We then renamed our microservices development platform to SWAP. From the three types of building blocks, we developed conventions for three types of GitLab projects. A SWAP service project uses the server building block; we consider these the typical type of service. A SWAP job project uses the worker building block; these services typically process batches of data and start in response to a trigger or a schedule. Finally, a SWAP application project uses the application building block; these services involve a user interface.

We implemented the building blocks as Docker images. They are built by the SWAP service, job, or application projects, which are GitLab projects structured by convention. We codified the conventions using archetype projects: working examples from which new projects are created by making a copy of the archetype. We built archetypes for creating SWAP services, SWAP jobs and SWAP applications. By creating new projects from archetypes, we removed the need to communicate changes made to an evolving set of conventions; the conventions are part of the archetypes, and one only has to review the archetypes to determine their current state. Beyond ensuring that projects are created in a conventional way, we also make it easy to update existing projects as conventions change. When conventions change in the archetype projects, we automatically push those changes to all projects associated with each archetype.

Once the idea behind building blocks made sense to one of our agile teams, they ran with it. The team was responsible for many services that were jobs. In a short time, the team migrated many jobs from randomly structured monolithic systems to a collection of SWAP jobs. From that collection, the team was able to take the next step of composing pipelines. Pipelines consist of SWAP jobs linked together, running in sequence or in parallel. Creating pipelines is analogous to constructing something in Minecraft by placing blocks together. The ease with which one can compose microservices into subsystems is one of the key benefits of the microservices architecture, and it was exciting to experience this for ourselves.

One of the advantages of decomposing monoliths is that it provides greater visibility into the parts of the system. As we transform the architecture of our system, we track the cost of compute resources at a granular level and answer the question of how much each subsystem is costing us. We closely watch a bar graph showing the cost per subsystem and have been able to optimize our systems to reduce cost. During this past year, something transformative happened right in front of us without much notice. In our graph, the cost reported for the jobs platform is nearly zero and hasn’t changed since it was added to the graph. Have we not made much progress in developing a jobs platform this past year? That is the question one might ask if one expected building a jobs platform to carry a significant cost: perhaps a waterfall-style design document for a monolithic system requiring many lines of code, lots of resources and lots of planning. Developing monoliths is very eventful, and after many years of building large monoliths in the industry, it would be no surprise if we habitually expected big events and big costs to accompany any major endeavor. The development of the jobs platform, however, has been uneventful.
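The roll-up behind that bar graph is simple. As a minimal sketch (the subsystem names, service names and dollar amounts here are invented for illustration, not our actual costs), granular per-service costs can be aggregated to the per-subsystem totals we chart:

```python
# Hypothetical sketch: roll granular per-service compute costs up to
# per-subsystem totals. All names and numbers are invented.
from collections import defaultdict

# (subsystem, service) -> monthly compute cost in dollars (hypothetical)
service_costs = {
    ("search", "query-api"): 410.0,
    ("search", "index-builder"): 125.0,
    ("jobs-platform", "swap-runner-pool"): 3.5,
    ("websites", "ask-frontend"): 280.0,
}

def cost_per_subsystem(costs):
    """Sum per-service costs into one total per subsystem."""
    totals = defaultdict(float)
    for (subsystem, _service), dollars in costs.items():
        totals[subsystem] += dollars
    return dict(totals)

print(cost_per_subsystem(service_costs))
# {'search': 535.0, 'jobs-platform': 3.5, 'websites': 280.0}
```

With costs recorded per service, any regrouping (per subsystem, per team, per archetype) is a one-line change to the aggregation key.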

As it turns out, the jobs platform is serverless and the cost is not expected to ever go up significantly. Our jobs run in containers that use resources only when the job is running and we achieved this by using GitLab’s support for running CI jobs in containers. GitLab CI provides us with an easy way to compose pipelines from our SWAP jobs. Additionally, the GitLab web application provides a powerful user interface for developing and running our SWAP jobs. We only have to create SWAP job archetypes and allow an unbounded number of SWAP job projects to be created from those archetypes without incurring any significant cost per SWAP job created. Our only major effort is to convert all our jobs to SWAP jobs. Once the SWAP jobs are created, we have building blocks from which we can easily compose many solutions using GitLab and some SWAP glue.
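As a sketch of what this composition looks like (the job names, image name and entry point below are hypothetical, not our actual configuration), a pipeline of SWAP jobs can be expressed in a GitLab CI configuration, where stage order gives sequencing and jobs within a stage run in parallel:

```yaml
# Hypothetical .gitlab-ci.yml composing SWAP jobs into a pipeline.
# Stages run in sequence; jobs within a stage run in parallel.
stages:
  - extract
  - transform
  - load

extract-clicks:                    # hypothetical SWAP job
  stage: extract
  image: registry.example.com/swap/worker:latest   # worker building block (assumed name)
  script:
    - run-job extract-clicks       # entry point assumed to be provided by the worker image

extract-searches:                  # runs in parallel with extract-clicks
  stage: extract
  image: registry.example.com/swap/worker:latest
  script:
    - run-job extract-searches

transform-events:                  # waits for the extract stage to finish
  stage: transform
  image: registry.example.com/swap/worker:latest
  script:
    - run-job transform-events

load-warehouse:
  stage: load
  image: registry.example.com/swap/worker:latest
  script:
    - run-job load-warehouse
```

Each job runs in its own container, so resources are consumed only while a job is running, and a schedule or trigger on the project kicks off the whole pipeline.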

When we developed the jobs platform, we wanted a platform that could easily support our existing 250 jobs and an unbounded number of future jobs. The problem with running jobs on servers separate from the monolith is that the jobs don’t share resources, which multiplies the amount of resources required. That wasn’t going to work for us, because we wanted to keep our cost bounded and low. As in a retrospective, we took action to make it better by ensuring that resources are allocated when jobs are running and freed when they are not. This was a major impediment, and we cleared it by finding a way to start a Docker container only when we wanted to run a job and shut it down when the job completed. This made the platform serverless.

Our jobs platform leverages GitLab to run the worker building blocks created for our SWAP jobs, which allows the resources of a few runners to be shared among the jobs. It further reduces the cost of unused resources by leveraging Kubernetes to provide GitLab with an auto-scalable pool of runners. With GitLab auto-scaling in place, the platform dynamically requests additional runners to run SWAP jobs during peak times and releases those runners when they are no longer needed. We avoid the higher piecemeal cost of paying per request, which is the cost structure offered by serverless solutions like AWS Lambda. Instead, we pay for only the few minutes per day, on average, that our virtual machines are actually up, rather than the 24 hours a day of uptime they would have incurred if SWAP jobs were not serverless.
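The back-of-the-envelope math makes the savings concrete. In this sketch, the hourly VM rate and the minutes of runner uptime per day are illustrative assumptions, not our actual rates or usage:

```python
# Illustrative cost comparison: an always-on runner VM vs. an
# auto-scaled runner that is up only a few minutes per day.
# All numbers are assumptions for the sake of the example.
HOURLY_VM_RATE = 0.10       # dollars per hour for one runner VM (assumed)
MINUTES_USED_PER_DAY = 5    # average uptime with auto-scaling (assumed)
DAYS = 30                   # one month

always_on = HOURLY_VM_RATE * 24 * DAYS                       # up 24 hours a day
auto_scaled = HOURLY_VM_RATE * (MINUTES_USED_PER_DAY / 60) * DAYS

print(f"always-on:   ${always_on:.2f}/month")    # prints $72.00/month
print(f"auto-scaled: ${auto_scaled:.2f}/month")  # prints $0.25/month
```

Even with generous assumptions, paying only for minutes of actual work is orders of magnitude cheaper than keeping dedicated job servers up around the clock.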

Going forward, we will continue to watch the successes of others and draw inspiration from them, but we have tasted the power of the retrospective, and that means we will be paving our own path. The retrospective process led us down the path towards a serverless jobs platform and emboldened us to change deliberately. Now, our agile teams are clearing impediments and turning our public websites into building blocks. Each website project will be bound to a SWAP application archetype, so that we can easily develop, deploy and maintain as many websites as needed to optimize our business success. We are building new SWAP jobs and automating decision-making with them by incorporating models created by our data scientists. To manage our growing collection of microservices, we’re building a SWAP console that allows us to view our collections and make changes collectively, such as deploying a suite of services with the click of a button. Finally, our microservices retrospective is beginning to affect how we organize our teams. We’re starting to form working groups that allow any developer to lead in areas that interest them and allow developers with shared interests to work together. By asking ourselves what is impeding our agile teams from succeeding, we are leading ourselves to favor small, self-organizing teams and to take Scrum more seriously. It seems we are a continuous improvement company. Let’s see where all this leads us next year.

by Chenglim Ear, Principal Engineer