Ask Media Group has made the shift from on-prem data centers to the public cloud. In doing so, we’ve rearchitected numerous legacy systems from monoliths to microservices. Key to our migration was building an underlying container-based architecture leveraging GitLab Runners, Kubernetes, Docker, Linkerd, and AWS.
In the webcast below, co-hosted by GitLab and AWS, Chenglim Ear, Principal Software Engineer, talks about how we accomplished this and the lessons we learned along the way.
As a high-profile internet technology business, we take our responsibility for information security very seriously. IAC’s Chief Information Security Officer establishes a framework and expectations for all businesses, and each is scored based on their fulfillment of requirements across the various categories that make up the framework. Employees are strongly incentivized to meet these expectations. Security awareness—the process of educating all employees on the importance of cybersecurity—is one specific area on this scorecard, and we host an annual security awareness month to meet this objective.
This year, as security awareness month approached, we asked employees what they remembered learning from the previous year. It quickly became obvious that aside from a few security tropes, much had been forgotten (or possibly never learned!), which was a problem. As a lean organization, the importance of each individual understanding their part in our overall security process cannot be overstated. We deliberately do not dedicate security roles to individuals, requiring that everyone carry an extra bit of burden in the interest of having a strong collective security posture. Our development teams have embraced DevOps practices, so expanding this to incorporate security (i.e. DevSecOps) fits naturally with our culture.
To make the lessons from security awareness month memorable, we put our creative hats on. We changed the traditional security awareness process into a month-long game for all. Eschewing security-related signs and banners in the office, we opted for activities and games that mandated participation. We awarded points (positive AND negative) for quizzes, online course completion, attendance at presentations, and on-site tasks that required full engagement. Games included challenges like capturing selfies in front of unlocked and unattended computers, along with secret challenges that tested awareness when people didn’t think they were being watched. We involved the entire organization in efforts like bug hunts, identifying MFA-capable services that didn’t have MFA enabled, and sleuthing out data exfiltration and cleartext passwords that had escaped notice. We even brought in outside “spies” to test our physical security. And finally, we prominently displayed a leaderboard in the office and online, with daily status updates communicated via Slack. At the end of the competition, we awarded prizes to the top 6 point-recipients, and remediation classes to everyone with a negative score!
Security Crimes—taking selfies in front of unlocked, unsupervised computers
The approach was not uncontroversial. Initial complaints included concerns that the games were tilted in favor of the technical staff, that being called out might be bad for morale, and that the staff had enough “real work” to do without having this requirement. Despite that feedback, the competitive instinct (or desire for prizes and to avoid remediation classes) drove a level of participation that took the organizers by surprise and essentially resulted in a points-race on a daily basis.
When security awareness month came to an end, we realized we had a lot to think about as an organization. Increased participation demonstrated that individuals are willing to care about security when it is presented in the right format.
Security is not simply a checklist. It’s dynamic, constantly evolving, and requires ongoing vigilance.
One policy does not fit all. For example, we found that 3 folks get more phishing emails than everyone else. The top recipient gets 5x more than the next person. This is useful information because rather than buy expensive software services for everyone, we can pilot solutions with the target group.
Physical security is susceptible to social engineering. Collectively, we failed most of the games focused in this area. Two spies made it into the office–one sat down and worked at a desk for an hour, the other walked the office twice unescorted and without being stopped or questioned. In both cases, the staff commented that the spies didn’t look untrustworthy! Chivalry or niceness simply obviated the physical security measures in place. This again highlighted the necessity for “lock your screen when you step away” policies which seem tedious but can limit the effectiveness of socially engineered physical intrusions into our office space.
Don’t assume everyone knows and understands security best practices. We found that we had a number of secrets improperly stored in plain text/code. We thought the best practice was rather obvious, but the feedback we received suggested that the policy wasn’t understood or evangelized, highlighting the need for continuing education. It also presented an opportunity to create automated solutions to police these practices.
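As one illustration of what such an automated check could look like, here is a minimal sketch of a scanner that flags lines resembling cleartext secrets. The rule names and patterns are assumptions made for this example, not the tooling we actually run, and a real scanner would need a broader, tuned rule set.

```python
import re

# Hypothetical rules for illustration only.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hardcoded_credential": re.compile(
        r"(?i)\b(password|passwd|secret|api_key|token)\b\s*[:=]\s*['\"][^'\"]{4,}['\"]"
    ),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}


def find_secrets(text):
    """Return (line_number, rule_name) pairs for lines that look like cleartext secrets."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings
```

A check like this could run against every commit, so that a finding blocks the change before the secret ever reaches the repository.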
Security IS everyone’s responsibility. We discovered that we had some applications that were Multi-Factor Authentication (MFA) capable but on which MFA had never been enabled. Every member of the organization, technical or otherwise, must challenge themselves to think about the security implications, and continuously go back to determine whether capabilities that were previously unavailable have since been implemented.
With everything we learned during our security games, we welcome the opportunities for remediation and initiatives we have identified. Most of all, we look forward to once again engaging our colleagues and continuing to keep them on their toes with security awareness.
Oftentimes, the solution to a problem starts with a different perspective, thinking outside the box, or re-framing the problem. At Ask Media Group, we’ve challenged ourselves to not focus immediately on finding solutions, but to look at what’s in the way of making it easier to build and maintain software. Some of this can be seen by the questions we ask and how we answer them. What would it look like if every story in a sprint only took one day to do? The answer to that question might be that most stories require more than a day, so it’s not possible. On the other hand, one of our agile teams chose to answer this question differently. They instead claimed that impediments keep us from creating such small stories and this answer led them to start identifying the impediments and removing them. The result was very satisfying for this agile team because planning became easier and re-prioritizing by swapping stories became trivial. It is common, in the retrospective process of Scrum, to answer questions in this way. The most important part of this type of answer is that it leads us to grow, improve and become better practitioners. This year has been like a retrospective on the microservices development platform we introduced last year. In this post, I write about how, as we acted to improve the platform, we accomplished much more than we expected.
While we may have started down the path of the microservices architecture by simply following the lead of others who advocated for it, we challenged ourselves to identify impediments, and that opened a door to a whole different way of thinking and solving problems. We are getting a glimpse into what it will look like when every service in our system is micro and we are leading ourselves forward. It’s normal now for all our developers to be directly involved in the deployment of software to production by simply pushing a button. In some cases, software is continuously deployed as changes are made to the repository.
This past year, we added a new set of microservices that are different from the ones we have dealt with so far. We refer to these new services as jobs. There are many of them and they don’t run all the time. It required us to look at what worked with microservices so far and what didn’t work, so that we could take action to improve the platform to support this new type of service. As we prepared to increase the number of microservices deployed, we had to decide how we would manage the dramatic change in our software release process. We were used to teams independently deploying a few monoliths in a highly coordinated manner. How were we going to support developers independently deploying hundreds of services at any time and any number of times in a day? In a peak month this year, we deployed 394 times. While Kubernetes, Docker, GitLab and linkerd helped us with the mechanics of managing a large number of projects and deployments, the idea of breaking up all the monoliths and releasing often and randomly seemed irrational.
Some people were not convinced that changing jobs to microservices was a good idea, so we needed to explain what we intuitively knew. The explanation started with this old article about pets vs cattle. We explained that we are focused on building cattle instead of pets and not on multiplying the number of pets to support. In order for us all to continue down the microservices path we started, we had to all see that our architectural transformation was not going to be random and haphazard. Furthermore, we were getting rid of more than pets. We wanted to get rid of monoliths, but not simply by breaking them up. We wanted to decompose the monoliths into collections of services that share common interfaces like the variety of blocks available in the world of Minecraft. Creating new services should be like creating new Minecraft blocks.
We were able to point to existing microservices to explain how much easier it is to manage releases. Because the services were part of a collection, we were able to manage them collectively. We reasoned that a problem for a collection of a hundred services only has to be solved once and not a hundred times. We further reasoned that when we decompose a monolith into a collection of services, the size of our problem is reduced to the size of a single service and not the total size of all services combined. Also, managing the release of our microservices does not require coordination. In fact, the release of our microservices has significantly lower risk compared to the release of our monoliths and it is much easier to add automated checks and safeguards. Our services were created as building blocks and this made it easy for developers to independently deploy hundreds of services at any time and any number of times in a day. We did identify, however, that our archetype system, which we used to create services as building blocks, was too hard to use.
To improve how we created new services as building blocks, we first provided a simple web form to create new projects from an archetype. We identified three primary types of building blocks: the server, the worker and the application. We then renamed our microservices development platform to SWAP. From the three types of building blocks, we focused on developing conventions for three types of GitLab projects. A SWAP service project uses the server building block. We consider these the typical type of service. A SWAP job project uses the worker building block. These types of services typically process batches of data and start in response to a trigger or a schedule. Finally, a SWAP application project uses the application building block. These types of services involve a user interface.
We implemented the building blocks as Docker images. They were built by the SWAP service, job, or application projects, which are GitLab projects that are conventionally structured. We codified the conventions using archetype projects, which are working examples from which new projects are created (by making a copy of the archetype). We built archetypes for creating SWAP services, SWAP jobs and SWAP applications. By creating new projects from archetypes, we removed the need to communicate changes made to an evolving set of conventions. The conventions are simply a part of the archetypes and one simply has to review the archetypes to determine the state of the conventions. In addition to ensuring that projects are created in a conventional way, we also ensure that it is easy for existing projects to be updated with changes to existing conventions. As conventions change in the archetype projects, we automatically push them to all projects associated with each archetype.
Once the idea behind building blocks made sense to one of our agile teams, they ran with it. The team was responsible for a lot of services that were jobs. In a short time, the agile team migrated many jobs from randomly structured monolithic systems to a collection of SWAP jobs. From a collection of SWAP jobs, the team was able to take the next step of composing pipelines. Pipelines consist of SWAP jobs linked together and running in a sequence or in parallel. Creating pipelines is analogous to constructing something in Minecraft by placing blocks together. The ease with which one can compose microservices into subsystems is one of the key benefits of the microservice architecture and it was exciting to experience this for ourselves.
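To make the idea of composition concrete, here is a small Python analogy. It is not how GitLab CI or SWAP is implemented. The function name `run_pipeline` and the shape of `stages` are assumptions for this sketch, which treats each job as a plain callable, runs stages one after another, and runs the jobs inside a stage in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(stages):
    """Run stages in order; jobs inside one stage run in parallel.

    `stages` is a list of lists of zero-argument callables (the "jobs").
    Returns one list of results per stage, in job order.
    """
    results = []
    with ThreadPoolExecutor() as pool:
        for stage in stages:
            futures = [pool.submit(job) for job in stage]
            # .result() blocks until the job finishes and re-raises its exception,
            # so a failed job stops the pipeline before the next stage starts.
            results.append([future.result() for future in futures])
    return results
```

Because every job exposes the same interface, a new pipeline is just a new arrangement of existing blocks.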
One of the advantages of decomposing monoliths is that it provides greater visibility into the parts of the system. As we are transforming the architecture of our system, we are tracking the cost for compute resources at a granular level and answering the question of how much each subsystem is costing us. We closely watch a bar graph showing the cost per subsystem and have been able to optimize our systems to reduce cost. During this past year, something transformative happened right in front of us without much notice. In our graph, there is a cost report for a jobs platform that is nearly zero and hasn’t changed since it was added to the graph. Have we not made much progress in developing a jobs platform this past year? This would be the question that one might ask if one were expecting a significant cost to building a jobs platform. Perhaps one was expecting a waterfall-style design document for a monolithic system that required many lines of code, lots of resources and lots of planning. Developing monoliths is very eventful and after many years of building large monoliths in the industry, it would be no surprise if we habitually expect to see big events and cost associated with any major endeavor. The development of the jobs platform, however, has been uneventful.
As it turns out, the jobs platform is serverless and the cost is not expected to ever go up significantly. Our jobs run in containers that use resources only when the job is running and we achieved this by using GitLab’s support for running CI jobs in containers. GitLab CI provides us with an easy way to compose pipelines from our SWAP jobs. Additionally, the GitLab web application provides a powerful user interface for developing and running our SWAP jobs. We only have to create SWAP job archetypes and allow an unbounded number of SWAP job projects to be created from those archetypes without incurring any significant cost per SWAP job created. Our only major effort is to convert all our jobs to SWAP jobs. Once the SWAP jobs are created, we have building blocks from which we can easily compose many solutions using GitLab and some SWAP glue.
When we developed the jobs platform, we wanted a platform that can easily support our existing 250 jobs and an unbounded number of future jobs. The problem with running jobs in servers separate from the monolith is that we aren’t sharing resources between the jobs and are therefore multiplying the amount of resources required. This wasn’t working for us because we wanted to keep our cost bounded and low. As in a retrospective, we took action to make it better by ensuring that resources are allocated when jobs are running and freed up when they are not. This was a major impediment and we cleared it by finding a way to start up a Docker container only when we wanted to run a job and shut it down when the job completed. This made the platform serverless.
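The core mechanic, a container that exists only for the duration of a job, can be sketched in a few lines of Python. This is an illustration that calls the Docker CLI directly, whereas our platform relies on GitLab to start the containers. The function names, and any image name passed to them, are hypothetical.

```python
import subprocess


def docker_run_command(image, args=(), env=None):
    """Build a `docker run` command for a one-off job container."""
    command = ["docker", "run", "--rm"]  # --rm removes the container when it exits
    for name, value in sorted((env or {}).items()):
        command += ["-e", f"{name}={value}"]
    return command + [image, *args]


def run_job(image, args=(), env=None, runner=subprocess.run):
    """Start a container for one job, wait for it to finish, return its exit code."""
    completed = runner(docker_run_command(image, args, env), check=False)
    return completed.returncode
```

The `runner` parameter exists only so the sketch can be exercised without a Docker daemon. In normal use it defaults to `subprocess.run`.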
Our jobs platform leverages GitLab to run the worker building blocks created for our SWAP jobs and this allows the resources of a few runners to be shared between the jobs. It further reduces the cost of unused resources by leveraging Kubernetes to provide GitLab with an auto-scalable pool of runners. With GitLab auto-scaling in place, the platform is able to dynamically request additional runners to run SWAP jobs during peak time and to release those runners when they are no longer needed. We avoid the higher piecemeal cost of paying per request, which is the cost structure offered through serverless solutions like AWS Lambda. Instead, we just pay for the average few minutes per day of uptime for virtual machines that otherwise would have incurred the cost for 24 hours a day of uptime if SWAP jobs were not serverless.
Going forward, we will continue to watch the successes of others and draw inspiration from them, but we have tasted the power of the retrospective and that means we will be paving our own path. The retrospective process led us down the path towards a serverless jobs platform and emboldened us to deliberately change. Now, our agile teams are clearing impediments and turning our public websites to building blocks. Each website project will be bound to a SWAP application archetype, so that we can easily develop, deploy and maintain as many websites as needed to optimize our business success. We are building new SWAP jobs and automating decision-making with them by incorporating models created by our data scientists. To manage our growing collection of microservices, we’re building a SWAP console that allows us to view our collections and to make changes collectively, such as deploying a suite of services with the click of a button. Finally, our microservices retrospective is beginning to affect how we organize our teams. We’re starting to form working groups that allow any developer to lead in areas that interest them and allow developers with shared interests to work together. By asking ourselves what is impeding our agile teams from succeeding, we are leading ourselves to favor small, self-organizing teams and taking Scrum more seriously. It seems we are a continuous improvement company. Let’s see where all this leads us next year.
At Ask Media Group, we’ve transitioned from building monoliths to building microservices, and that transition has been an eye-opening experience. The ability to quickly build new services and understand existing ones is one of the key benefits in practice when working with microservices. It changes the way we work and the way we work together. For example, it has empowered our data scientists to quickly and easily build systems, leveraging models they’ve created for classifying queries. It has also enabled non-scientists to help our data scientists build the services. With a platform in place for building microservices, we’ve opened the door to exploring new ways to collaborate on a smart, fault-tolerant and highly scalable system.
When developers join the company, they are given a guide to our micro-component development platform, called Kram, which is used to develop microservices and more. In this guide, we give developers instructions for easily building a microservice and deploying it to production, all within GitLab. Developers can immediately begin to contribute a new service that can be deployed to production as soon as they start.
Micro-components are developed and deployed with pushbutton automation
We’ve moved away from lots of code in a few big things to lots of little things, and so the software development domain has changed. There’s a new set of best practices for developing software that we are discovering. In some ways, it’s like going from sculpting masterpieces out of large blocks of stone to building amazing sand castles out of grains of sand. One very important practice that we’ve employed is the practice of watching everything. We watch when services start and stop, and when they fail. We watch the failures and response time of every request, and we watch the amount of traffic each service is getting. This is done using technologies like OpenShift Origin, a flavor of Kubernetes, and linkerd, a proxy built on top of Twitter’s Finagle. Every microservice consists of OpenShift pods and each pod is automatically paired with a linkerd container that is used to proxy incoming and outgoing requests. This allows us to watch the requests and send the metrics to our monitoring system. We have real-time graphs for everything and an alert system that monitors the metrics for failure conditions.
We use Grafana to visualize our production metrics per microservice.
While we have the usual process of incident escalation in place, a new practice for problem resolution has emerged from our experience building microservices–the practice of building self-healing services. Because our microservices are built on top of OpenShift Origin, services are deployed as Docker containers and mechanisms exist for watching the health of these containers. By leveraging the hooks provided by OpenShift, a service can let OpenShift know when it’s unhealthy and whether it requires a restart or just a temporary traffic shift away from the container. Since the functionality of restarting and shifting traffic is already handled by OpenShift, a service developer can focus on building the core functionality of the service.
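As a rough sketch of what those hooks can look like from the service's side, the example below exposes two HTTP endpoints as a plain WSGI app. It assumes HTTP-based liveness and readiness probes, which treat a success status as healthy and an error status as unhealthy. The `/healthz` and `/ready` paths and the `HealthState` class are hypothetical names chosen for this example.

```python
class HealthState:
    """Flags the service flips as its own condition changes."""

    def __init__(self):
        self.alive = True  # False: the process is broken and should be restarted
        self.ready = True  # False: temporarily unable to serve; shift traffic away


def make_health_app(state):
    """Return a WSGI app exposing hypothetical /healthz and /ready probe endpoints."""

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path == "/healthz":
            healthy = state.alive
        elif path == "/ready":
            healthy = state.alive and state.ready
        else:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not found"]
        status = "200 OK" if healthy else "503 Service Unavailable"
        start_response(status, [("Content-Type", "text/plain")])
        return [b"ok" if healthy else b"unhealthy"]

    return app
```

The service only reports its state. Deciding when to restart a container or stop routing traffic to it stays with the platform.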
In practice, we find that much of the operational functionality of a service can be automated, including the shifting of traffic away from bad pods to good pods. Our deployment process is literally a click of a button, and we achieve safety not by requiring tickets and approval for a deployment, but by automated checks and monitoring of all deployments. Because our services are built following some conventions, such as which port the servers listen on, we are able to build common tools for controlling and interacting with the services. Activities such as sending test traffic to a service and tuning its resource usage are enhanced with these tools. With all the supporting automation and tools in our micro-component development platform, a new service can literally be built and safely deployed to production by one person in a few minutes.
Building microservices is transformative and it aligns well with agile development practices. When we started, we only meant to build a microservices framework, but soon discovered that building microservices is not simply about following a new architecture. As we built our microservices, we practiced a new way of building systems and developed new practices from our discoveries. While we attempted to address issues and concerns that surfaced, it changed us and we began to see and think about things differently. To this day, it continues to transform all of us and help us be more agile.
After we successfully built our microservices on Kram, we started to promote Kram as a micro-component development platform instead of simply a microservices framework, to bring attention to its transformative potential. We have the opportunity to not simply follow a new architecture, but to embrace a new way of thinking. We can leverage the Kram platform to build any component the microservices way. For our developers who use Kram, there’s opportunity for exploration and innovation. New developers who join the company have an opportunity to be a part of the innovations and to work on the cutting edge, using Kubernetes, Docker, linkerd and all the new technology that we’ll introduce as we leverage the Kram platform. Learning the new technologies and practices of tomorrow is a cornerstone of our success at Ask Media Group.
Ask Media Group manages over 15 web properties, and coordinates performance marketing programs across most of these sites. Result? LOTS of data.
The Business Intelligence team at Ask is tasked with making sense of all this data. That means wrangling over 200 million web log events daily (over 1 TB of raw data), cleansing, structuring, and combining data sources to give users a holistic view of the business. Here’s how we do it.
Workflow of Data
The bulk of our data is generated by users interacting with our web sites. Our homegrown logging system tracks everything on the page, as well as how the users interact with the site, as a stream of JSON objects. These objects are enriched in-flight with extra goodies like geographic and device information. These objects are streamed into Amazon S3, where they are copied every minute into our Snowflake data warehouse. From there, we cleanse the data and transform it into a traditional dimensional model for analysis. It takes 45 minutes for an event occurring on one of our websites to be available for our business team to analyze.
But that’s only the beginning of the story. Ask boasts a world class SEM program managing a portfolio of hundreds of millions of keywords. The data involved for those campaigns gets imported daily into the data warehouse as well. We process revenue data from the various types of ads we run on our own properties. We import information from our content management system to track our content lifecycle and performance. We also manage internal metadata for our revenue reporting and A/B testing.
Once all this data comes into our data warehouse, we create different data marts to serve various parts of the business. For example, we merge our SEM data with our web log data and revenue data to get a complete view of a user’s experience on a property—how they enter our site based on our marketing efforts, how they interacted with the site, and how we were able to monetize the session.
Using the Data
Our internal users have a variety of options to interact with the data warehouse. Snowflake is accessible via web browser, ODBC driver, JDBC driver, Python connector, Spark connector, R connector: you name it, we’ve got it. This allows our users to work with the front-end tool that best fits their needs.
The BI team specifically supports Looker and Alteryx for our users, as well. Looker is a great browser-based data visualization tool that provides fantastic semantic modeling capabilities so users can focus more on answering questions than writing SQL queries. Alteryx functions as our reporting workhorse—large reports containing thousands of data points go to business teams every hour. It also helps us handle the trickier database pulls, as well as providing a front end for our metadata management workflows.
Ultimately our job is to remove any and all barriers between business and data, and we are constantly evaluating the best ways to do just that.
We can’t overstate the value of data scientists being able to push code into production. In some organizations, a data scientist has to wait for an engineer to translate their prototype code into production, leading to delays, disagreements, and disappointment. Barriers to production are barriers to impact, and a good data scientist is too valuable to hamstring in this way.
At Ask Media Group, we have enabled our data scientists to push software into the same environment that contains the rest of our production systems and platform. However, empowering data scientists and imbuing them with independence does not mean replacing engineering excellence. It just means that we have found a better way for data science and engineering to collaborate.
As an analogy, consider Hadoop/HDFS, which hides the complexity of parallel processing over huge datasets, and instead lets the data scientist focus on their big data task. Much like with Hadoop, our method for data science and engineering collaboration is an instantiation of the idea of separation of concerns. We define an abstraction—the data science side of the abstraction focuses on the application, commonly a data product. The engineering side of the abstraction manages the details of the production environment. The power of this separation is that both data scientists and engineers do what they love and thus have the opportunity to excel.
For example, one of our data products categorizes user queries into topics such as Vehicles, Travel, Health, etc. More precisely, it’s a multi-label categorization system with 24 top-level categories and 10-20 subsequent second-level categories each, for a total of nearly 300 potential labels. For this system to be useful to the business, it needs to have a response time under 10 milliseconds, scale to 2,000 requests per second, and achieve a precision of 90% with a coverage greater than 70%.
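To pin down the two quality targets, here is a small sketch of how precision and coverage could be computed for a multi-label system that is allowed to return no label at all. The definitions are assumptions made for this example: precision is the share of assigned labels that are correct, and coverage is the share of queries that receive at least one label.

```python
def precision_and_coverage(predicted, gold):
    """Compute (precision, coverage) for multi-label predictions.

    `predicted` and `gold` are lists of label sets, one per query.
    An empty predicted set means the system declined to answer.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    answered = sum(1 for labels in predicted if labels)
    assigned = sum(len(labels) for labels in predicted)
    correct = sum(len(p & g) for p, g in zip(predicted, gold))
    precision = correct / assigned if assigned else 0.0
    coverage = answered / len(predicted) if predicted else 0.0
    return precision, coverage
```

Under these definitions the two numbers trade off against each other: declining to answer uncertain queries raises precision and lowers coverage.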
Our initial Python prototype, built with our go-to modeling toolkit (scikit-learn), involved complex features and learning; although it met our precision target, it didn’t hit the response time requirement. So we leveraged a technique called uptraining, using the very precise prototype to automatically annotate a large number of queries with highly accurate labels. In turn, we used that data to build a faster, simpler system with bag-of-words and word embedding (word2vec) features for multi-label logistic regression. With the resulting system, we are able to achieve both our response time and precision targets.
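The uptraining loop can be sketched with scikit-learn on toy data. Everything specific here is an assumption for illustration: the queries, the random forest standing in for the slow prototype, and the choice to train the student on the original labels plus the automatically labeled queries. The word2vec features are left out to keep the example short.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data; the real system has far more queries and labels.
gold_queries = ["used car prices", "cheap flights to paris", "flu symptoms",
                "car rental at the airport", "travel vaccine side effects"]
gold_topics = [["Vehicles"], ["Travel"], ["Health"],
               ["Vehicles", "Travel"], ["Travel", "Health"]]
unlabeled_queries = ["best car deals", "paris hotels", "cold symptoms"]

binarizer = MultiLabelBinarizer()
y_gold = binarizer.fit_transform(gold_topics)

# Teacher: stands in for the slow but precise prototype.
teacher = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
teacher.fit(gold_queries, y_gold)

# Uptraining step: the teacher annotates the unlabeled queries automatically.
y_silver = np.asarray(teacher.predict(unlabeled_queries)).astype(int)

# Student: fast bag-of-words features with one logistic regression per label.
student = make_pipeline(
    CountVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
student.fit(gold_queries + unlabeled_queries, np.vstack([y_gold, y_silver]))

# Label tuples predicted by the fast student for a new query.
predicted = binarizer.inverse_transform(student.predict(["car symptoms"]))
```

Only the student is needed at serving time, which is where the response-time gain comes from.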
To push a system like this into production, we need to concern ourselves with a lot more than precision and raw speed. In our production environment, apps run as services within containers on OpenShift, using linkerd to route requests. To hide the complexity of the environment, our data engineering team has provided a service creation abstraction on top of OpenShift that deals with deploying, versioning, scaling, routing, monitoring, and logging—all aspects of production systems for which a data scientist may have limited expertise.
Using this service creation abstraction, we can simply wrap the Python query categorization code in Flask and use a few Docker commands to make sure that the model and all necessary libraries are loaded into the image. After code review and software/performance testing for our service, a simple click in GitLab CI pushes our fast, highly accurate, scalable multi-label query topic prediction service to production.
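A minimal sketch of such a Flask wrapper might look like the following. The `/categorize` route, the JSON shape, and the keyword-based `categorize` function are hypothetical stand-ins for this example. In a real service, `categorize` would call the trained model loaded into the image.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def categorize(query):
    """Hypothetical stand-in for the trained model's prediction call."""
    keywords = {"car": "Vehicles", "flight": "Travel", "symptoms": "Health"}
    return sorted({topic for word, topic in keywords.items() if word in query.lower()})


@app.route("/categorize", methods=["POST"])
def categorize_endpoint():
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        return jsonify(error="expected a JSON body with a non-empty 'query' string"), 400
    return jsonify(query=query, topics=categorize(query))
```

Keeping the model call behind a single function means the HTTP layer does not change when the model is retrained or replaced.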
What’s awesome about this process is that a data scientist can finish preparing a model in the morning, and deploy it that afternoon. This means that our data scientists are free to focus on leveraging data science to create business impact without production concerns getting in their way.