We can’t overstate the value of data scientists being able to push code into production. In some organizations, a data scientist has to wait for an engineer to translate their prototype code into production, leading to delays, disagreements, and disappointment. Barriers to production are barriers to impact, and a good data scientist is too valuable to hamstring in this way.
At Ask Media Group, we have enabled our data scientists to push software into the same environment that contains the rest of our production systems and platform. However, empowering data scientists and imbuing them with independence does not mean replacing engineering excellence. It just means that we have found a better way for data science and engineering to collaborate.
As an analogy, consider Hadoop/HDFS, which hides the complexity of parallel processing over huge datasets and lets the data scientist focus on their big data task. Much like Hadoop, our method for data science and engineering collaboration is an instance of the idea of separation of concerns. We define an abstraction with two sides: the data science side focuses on the application, commonly a data product, while the engineering side manages the details of the production environment. The power of this separation is that both data scientists and engineers do what they love and thus have the opportunity to excel.
For example, one of our data products categorizes user queries into topics such as Vehicles, Travel, and Health. More precisely, it’s a multi-label categorization system with 24 top-level categories, each with 10 to 20 second-level categories, for a total of nearly 300 potential labels. For this system to be useful to the business, it needs to respond in under 10 milliseconds, scale to 2,000 requests per second, and achieve a precision of 90% with a coverage greater than 70%.
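To make the quality targets concrete, here is a minimal sketch of how precision and coverage could be measured for a multi-label system. The definitions are assumptions for illustration, not necessarily the ones we use internally: precision is taken as the share of predicted labels that are correct, and coverage as the share of queries that receive at least one label. The function names and the tiny evaluation set are hypothetical.

```python
def precision_and_coverage(predictions, gold):
    """Return (precision, coverage) for parallel lists of label sets.

    Assumed definitions (illustrative only):
      precision = correct predicted labels / all predicted labels
      coverage  = queries with at least one predicted label / all queries
    """
    predicted_labels = 0
    correct_labels = 0
    covered_queries = 0
    for predicted, truth in zip(predictions, gold):
        if predicted:
            covered_queries += 1
        predicted_labels += len(predicted)
        correct_labels += len(predicted & truth)
    precision = correct_labels / predicted_labels if predicted_labels else 0.0
    coverage = covered_queries / len(predictions) if predictions else 0.0
    return precision, coverage


def meets_targets(precision, coverage, min_precision=0.90, min_coverage=0.70):
    """Check the targets from the text: 90% precision, coverage above 70%."""
    return precision >= min_precision and coverage > min_coverage


# Hypothetical evaluation data: one set of labels per query.
example_predictions = [{"Vehicles"}, {"Travel", "Health"}, set(), {"Health"}]
example_gold = [{"Vehicles"}, {"Travel"}, {"Travel"}, {"Health"}]

example_precision, example_coverage = precision_and_coverage(
    example_predictions, example_gold
)
```

In this toy data, three of the four predicted labels are correct and three of the four queries receive a label, so the system would clear the coverage target but miss the precision target.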
Our initial Python prototype, built with our go-to modeling toolkit (scikit-learn), involved complex features and learning; although it met our precision target, it didn’t hit the response time requirement. So we leveraged a technique called uptraining, using the very precise prototype to automatically annotate a large number of queries with highly accurate labels. In turn, we used that data to build a faster, simpler system with bag-of-words and word embedding (word2vec) features for multi-label logistic regression. With the resulting system, we are able to achieve both our response time and precision targets.
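The uptraining idea can be sketched in a few lines. This is an illustration under loud assumptions, not our production code: the "teacher" below is a hypothetical keyword lookup standing in for the slow, precise prototype, the "student" is a hand-rolled one-vs-rest logistic regression over bag-of-words features only (no word2vec), and the queries are made up. In practice, both models would more likely be built with scikit-learn.

```python
import math
from collections import defaultdict

LABELS = ["Vehicles", "Travel", "Health"]


def slow_precise_teacher(query):
    """Hypothetical stand-in for the slow, highly precise prototype."""
    rules = {"car": "Vehicles", "truck": "Vehicles", "flight": "Travel",
             "hotel": "Travel", "flu": "Health", "diet": "Health"}
    words = query.lower().split()
    return {label for word, label in rules.items() if word in words}


def bag_of_words(query):
    return set(query.lower().split())


def train_student(queries, label_sets, labels, epochs=50, lr=0.5):
    """One-vs-rest logistic regression on bag-of-words features, fit with SGD."""
    weights = {label: defaultdict(float) for label in labels}
    bias = {label: 0.0 for label in labels}
    for _ in range(epochs):
        for query, teacher_labels in zip(queries, label_sets):
            words = bag_of_words(query)
            for label in labels:
                z = bias[label] + sum(weights[label][w] for w in words)
                p = 1.0 / (1.0 + math.exp(-z))
                # Gradient of the log-likelihood for a 0/1 target.
                grad = (1.0 if label in teacher_labels else 0.0) - p
                bias[label] += lr * grad
                for w in words:
                    weights[label][w] += lr * grad
    return weights, bias


def predict(model, query, threshold=0.5):
    """Return every label whose predicted probability reaches the threshold."""
    weights, bias = model
    words = bag_of_words(query)
    predicted = set()
    for label in weights:
        z = bias[label] + sum(weights[label].get(w, 0.0) for w in words)
        if 1.0 / (1.0 + math.exp(-z)) >= threshold:
            predicted.add(label)
    return predicted


# Step 1: the teacher annotates a pool of unlabeled queries (made up here).
unlabeled = ["cheap car insurance", "used truck prices", "flight to paris",
             "hotel deals paris", "flu symptoms", "low carb diet",
             "car hotel package", "weather today"]
teacher_labels = [slow_precise_teacher(q) for q in unlabeled]

# Step 2: the fast student is trained on the teacher's annotations.
student = train_student(unlabeled, teacher_labels, LABELS)
```

The design choice worth noting is that the student never sees hand-labeled data. Its quality is bounded by the teacher's precision and by how well the automatically annotated pool covers real traffic, which is why a large pool of queries matters.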
To push a system like this into production, we need to concern ourselves with a lot more than precision and raw speed. In our production environment, apps run as services within containers on OpenShift, using linkerd to route requests. To hide the complexity of the environment, our data engineering team has provided a service creation abstraction on top of OpenShift that deals with deploying, versioning, scaling, routing, monitoring, and logging—all aspects of production systems for which a data scientist may have limited expertise.
Using this service creation abstraction, we can simply wrap the Python query categorization code in Flask and use a few Docker commands to make sure that the model and all necessary libraries are loaded into the image. After code review and software/performance testing for our service, a single click in GitLab CI pushes our fast, highly accurate, scalable multi-label query topic prediction service to production.
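Wrapping the categorization code in Flask might look roughly like the sketch below. Everything specific here is an assumption for illustration: the `/categorize` route, the `q` parameter, the response shape, and the `categorize` function, which is a keyword placeholder where the trained model would be loaded and called.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def categorize(query):
    """Hypothetical placeholder for the trained multi-label model."""
    keyword_topics = {"car": "Vehicles", "flight": "Travel", "flu": "Health"}
    words = query.lower().split()
    return sorted({topic for word, topic in keyword_topics.items() if word in words})


@app.route("/categorize")
def categorize_endpoint():
    # The route name and the "q" parameter are assumptions for this sketch.
    query = request.args.get("q", "")
    if not query:
        return jsonify(error="missing query parameter 'q'"), 400
    return jsonify(query=query, topics=categorize(query))
```

During development, an app like this can be exercised with Flask's built-in test client or development server. In the setup described above, the service abstraction, not the data scientist, would handle how the container is deployed, scaled, and routed.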
What’s awesome about this process is that a data scientist can finish preparing a model in the morning, and deploy it that afternoon. This means that our data scientists are free to focus on leveraging data science to create business impact without production concerns getting in their way.