Automated machine learning workflows at ZapLabs
Real estate is perhaps one of the most subjective domains of commerce: a sale can turn on many characteristics of the buyer and the seller. The nuances of this trade are colored in equal measure by the personalities of the buyers and those of the real estate agents. Our role in Data Engineering is to facilitate the modeling of these vagaries through robust and flexible infrastructure that augments decision-making with the most accurate and timely predictions and analysis from our Data Science team. A key aspect of these efforts is establishing and improving the workflow through which a data scientist or a data engineer models, implements and deploys machine learning models and data pipelines. As state-of-the-art technologies grew in number and ability, the need for a vehicle that integrates model training, versioning and the serving of predictions became evident.
To this end, we have formulated a workflow for automated training and prediction of machine learning models, realized using Git version control, Jenkins and a suite of services from AWS. Below we describe how we built this tool for capturing the process of iterative machine learning in our group. We follow the persona of a user who iteratively develops and tunes a model by training it.
To use this framework, a user starts from a bootstrapped repository that we developed with a specific layout, which automates this process. The framework provides hooks that the user overrides in a singleton Python class that encapsulates the model to be trained. The entry points in this class are designated functions with names like train and get_parameters. The framework then reads these parameters in from the mounted S3 file locations. All of these settings are provided in a configuration file, in either cfg or YAML format. We are constantly enriching the set of functionalities supported through these configurations to enable more sophisticated learning tasks.
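To make the shape of these hooks concrete, here is a minimal sketch of what a user-supplied model class and its cfg file might look like. The names BaseModel, PriceModel, load_config and the [training] section with its keys are all hypothetical and chosen for illustration; only the train and get_parameters entry points come from the framework as described above, and their signatures here are assumptions. The sketch also does not enforce the singleton behavior.

```python
import configparser
from abc import ABC, abstractmethod

# Hypothetical cfg contents; section and key names are illustrative only,
# not the framework's real schema.
EXAMPLE_CFG = """
[training]
input_path = /mnt/s3/training-data
output_path = /mnt/s3/training-output
"""


class BaseModel(ABC):
    """Hypothetical base class exposing the hooks a user overrides."""

    def __init__(self, config):
        self.config = config

    @abstractmethod
    def get_parameters(self):
        """Return the parameters to start the training run with."""

    @abstractmethod
    def train(self, parameters):
        """Train the model and return the tuned parameters."""


class PriceModel(BaseModel):
    """Toy user model; a real one would fit an actual estimator."""

    def get_parameters(self):
        return {"learning_rate": 0.1, "max_depth": 4}

    def train(self, parameters):
        # A real implementation would read training data from
        # self.config["training"]["input_path"] and fit a model.
        # Here we only pretend that tuning halved the learning rate.
        tuned = dict(parameters)
        tuned["learning_rate"] = parameters["learning_rate"] / 2
        return tuned


def load_config(text):
    """Parse cfg-style configuration text into a ConfigParser."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


config = load_config(EXAMPLE_CFG)
model = PriceModel(config)
tuned = model.train(model.get_parameters())
```

In this sketch the framework, not the user, would be the one calling load_config and train; the user only writes the subclass and the configuration file.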
The EC2 instances are spun up, and they pull the same Docker image that was created earlier in the pipeline. This step places all externally stored data at the designated drive mount points, where the framework expects to find it and pass it to the model being trained. Inside the singleton model class, users are free to perform any transformation they like; in the end they return a pandas DataFrame to the framework, which writes it out to a location specified in the configuration. The framework also accepts any other training artifacts, which it writes to the same output location as the hyperparameters.
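The write-out step can be sketched roughly as follows. This is an illustration under stated assumptions, not the framework's code: write_outputs and the file names are hypothetical, a temporary directory stands in for the mounted S3 output location, and a list of dict rows stands in for the pandas DataFrame so that the example runs with the standard library alone (with pandas, the CSV step would presumably be a DataFrame.to_csv call).

```python
import csv
import json
import tempfile
from pathlib import Path


def write_outputs(rows, artifacts, output_dir):
    """Write hyperparameter rows and extra artifacts to one output location.

    rows: non-empty list of dicts, standing in for the returned DataFrame.
    artifacts: mapping of file name to a JSON-serializable payload.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Hyperparameters go to a CSV file (hypothetical file name).
    params_path = out / "hyperparameters.csv"
    with params_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    # Any other training artifacts land next to the hyperparameters.
    for name, payload in artifacts.items():
        (out / name).write_text(json.dumps(payload))
    return params_path


# A temporary directory stands in for the mounted S3 output location.
output_dir = tempfile.mkdtemp()
rows = [{"learning_rate": 0.05, "max_depth": 4}]
params_path = write_outputs(rows, {"metrics.json": {"rmse": 1.25}}, output_dir)
written = params_path.read_text().splitlines()
```

Keeping the artifacts beside the hyperparameters means a downstream consumer only needs one location from the configuration to find everything a training run produced.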
Such a framework allows many ad hoc training tasks to scale to arbitrarily large amounts of data, as long as the underlying infrastructure can support it. It also puts control directly in the hands of users at all levels of expertise. To an uninitiated user, who doesn’t know or care about the mechanism, this is a framework that will read training data from an S3 location and train their model on it, as long as they extend a Python class. After training, it will write the tuned hyperparameters and any training artifacts to another S3 location. If they want data read from or written to a database, there are connectors for that too. We are continually working on adding more adapters to bring this flexibility to other use cases.
A parallel dimension to this process is that of acquiring model training statistics and logs. For these purposes, we use ModelDB and are extending it to support the entities in our workflow. This lets everyone track how a model has changed over time. It completes the circle of tracing a model training run back and forth between the beginning (i.e. the training data) and the end (i.e. the tuned hyperparameters), weaving the various stages of this lifecycle together in a single visual dashboard. As this effort matures, it will provide clear and comprehensive coverage of machine learning projects spanning months and years. Logging is handled through Elastic Filebeat, which is embedded in the EC2 instances where models are trained. These logs are shipped to our on-premises Elasticsearch cluster. We also pull various system monitoring summary statistics for the instances from AWS CloudWatch into Elasticsearch.
Once the training of a model reaches a satisfactory level, we provide the trained model to various downstream systems in two forms: as a live service using AWS SageMaker, and as a batch process that can be invoked for predictions over large datasets. The batch process is realized through a similar flavor of the framework, which reads in the hyperparameters and other parameters, launches the model on EC2 instances as a Docker container and serves predictions to a configured destination. The same Jenkins pipeline also provides a configuration for AWS SageMaker and creates a Docker container for SageMaker to deploy as an HTTP endpoint.
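For readers unfamiliar with what such a container has to do, the following is a minimal sketch of the serving side. A custom SageMaker inference container is expected to listen on port 8080 and answer GET /ping health checks and POST /invocations prediction requests; everything else here is an assumption made for illustration, including the toy linear model, the PARAMS values, the JSON request shape and the helper names. It omits error handling for malformed requests.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical tuned parameters; in the workflow described above these
# would be read from the training output location.
PARAMS = {"slope": 2.0, "intercept": 1.0}


def predict(features, params=PARAMS):
    """Toy linear model standing in for the real trained model."""
    return [params["slope"] * x + params["intercept"] for x in features]


def handle_invocation(body):
    """Turn a JSON request body into a JSON response body."""
    features = json.loads(body)["features"]
    return json.dumps({"predictions": predict(features)})


class Handler(BaseHTTPRequestHandler):
    def _send(self, status, payload):
        data = payload.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        # Health check route polled by the hosting service.
        if self.path == "/ping":
            self._send(200, "{}")
        else:
            self._send(404, "{}")

    def do_POST(self):
        # Prediction route.
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length).decode("utf-8")
            self._send(200, handle_invocation(body))
        else:
            self._send(404, "{}")


def serve(port=8080):
    """Blocking entry point; the container would call this on startup."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Keeping the prediction logic in handle_invocation, separate from the HTTP handler, lets the same function back the batch path as well: handle_invocation('{"features": [1, 2]}') returns the predictions 3.0 and 5.0 without starting a server.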
There is tremendous scope for further improvement and for adding functionality to this pipeline. We want to support new paradigms like online training and to make the framework more seamless in its usage and packaging. The automation and tooling around our data science operations are meant to capture the process that underlies them. Next, we aim to make this platform as flexible as possible by supporting more libraries and tools, and by instrumenting downstream systems appropriately to allow for effective monitoring. This capability also provides a mechanism for running many useful and well-known but non-distributed libraries in a distributed manner, easing the limits on input data size and processing power. This opens up exciting new avenues for adopting more ambitious learning tasks, as well as the opportunity to revisit some existing analyses. I’ll be back soon to report on our continued progress!
Ravi is a Senior Data Engineer at ZapLabs. A recent transplant from New York—where he studied Information Retrieval and Machine Learning—he is passionate about building predictive and scalable systems. At ZapLabs he focuses on building data pipelines and infrastructure that support and enrich the company’s data science capabilities.