25/09/2019 • Stijn Van den Enden

Fast machine learning with TensorFlow and Kubernetes

Training machine learning models can take up a lot of time if you don’t have the hardware to support it. With neural networks, for instance, you need to calculate the contribution of every neuron to the total error during each training step. This can result in thousands of calculations per step. Complex models and large datasets make for a long training process, and evaluating such models at scale can slow down your application’s performance. Not to mention the hyperparameters you need to tune, which may force you to restart the whole process a few times over.

In this blog post, I want to talk about how you can tackle these issues by making maximum use of your resources. In particular, I want to talk about TensorFlow, a framework designed for parallel computing, and Kubernetes, a platform that can scale your application up or down with demand.

TensorFlow

TensorFlow is an open-source library for building and training machine learning models. Originally a Google project, it has had many successes in the field of AI. It is available in multiple layers of abstraction, which allows you to quickly set up predefined machine learning models. TensorFlow was designed to run on distributed systems: the computations it requires can be run in parallel across multiple devices, because they are expressed underneath as data flow graphs. These graphs represent a series of mathematical operations, with multidimensional arrays (tensors) flowing along their edges. DeepMind used this power to create AlphaGo Zero, using 64 GPU workers and 19 CPU parameter servers to play 4.9 million games of Go against itself in just 3 days.
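
To make the “data flow graph” idea concrete, here is a tiny sketch (the numbers are just an illustration): each operation becomes a node in the graph, and the tensors flow along its edges.

import tensorflow as tf

# Two constant tensors: the data flowing along the graph's edges
a = tf.constant([[1.0, 2.0]])
b = tf.constant([[3.0], [4.0]])

# A matrix multiplication node that consumes both tensors and produces a new one
c = tf.matmul(a, b)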

Kubernetes

Kubernetes is a Google project as well: an open-source platform for managing containerized applications at scale. With Kubernetes, you can easily add more nodes and get more out of your available hardware. You can compare Kubernetes to cash registers at the supermarket: whenever there’s a long queue of customers waiting, the store quickly opens up a new register to handle a few of those customers. In reality, each register is an instance (a virtual machine or container) running your service, the customers are consumers of that service, and Kubernetes is the manager deciding when to open a new register.

The power of Kubernetes is in its ease of use. You don’t need to add newly created instances to the load balancer; it’s done automatically. You don’t need to connect a new instance to file storage or networks; Kubernetes does it for you. And if an instance doesn’t behave like it should, Kubernetes kills it off and immediately spins up a new one.

Distributed Training

As I mentioned before, you can reduce the time it takes to train a model by running its computations in parallel over different hardware units. Even with a limited configuration, you can cut your training time considerably by distributing it over multiple devices. TensorFlow lets you use CPUs, GPUs and even TPUs (Tensor Processing Units), chips designed specifically to run TensorFlow operations. You need to define a Strategy and make sure you create and compile your model within the scope of that strategy.

import tensorflow as tf

# Replicate the model on every GPU available on this machine
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
  model.compile(loss='mse', optimizer='sgd')

The MirroredStrategy above allows you to distribute training over multiple GPUs on the same machine. The model is replicated on every GPU, and variable updates are mirrored across all replicas so they stay in sync.
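
As a minimal sketch of what training then looks like (with a small synthetic dataset created here purely for illustration), you simply call fit as usual and the strategy handles the replication behind the scenes:

import numpy as np

# Hypothetical toy data; any NumPy arrays or a tf.data.Dataset will do
X_train = np.random.rand(1000, 1)
y_train = 3 * X_train + 2

# model was created and compiled inside mirrored_strategy.scope() above
model.fit(X_train, y_train, batch_size=64, epochs=5)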

A more interesting variant of this strategy is the MultiWorkerMirroredStrategy. It gives you the opportunity to distribute the training over multiple machines (workers), each of which may use multiple GPUs. This is where Kubernetes can help fast-track your machine learning: you can create service nodes with Kubernetes according to your need for parameter servers and workers. Parameter servers keep track of the model parameters, while workers calculate the updates to those parameters. (Strictly speaking, MultiWorkerMirroredStrategy only uses workers, which synchronize their updates with each other; parameter servers come into play with parameter server style training.) In general, you can reduce the bandwidth between the members of the cluster by adding more parameter servers. To make the setup run, you need to set an environment variable TF_CONFIG, which defines the role of each node and the layout of the rest of the cluster.

import os
import json

# Every node gets the same 'cluster' description, but a different 'task'
os.environ["TF_CONFIG"] = json.dumps({
    'cluster': {
        'worker': ['worker-0:5000', 'worker-1:5000'],
        'ps': ['ps-0:5000']
    },
    'task': {'type': 'ps', 'index': 0}
})

To make the setup easier, there’s a GitHub repository with a template for Kubernetes. Note that it doesn’t set TF_CONFIG itself, but passes its contents as parameters to the training script. These parameters are then used to define which devices take part in the distributed training.

import tensorflow as tf

# The cluster definition is identical on every node
cluster = tf.train.ClusterSpec({
    "worker": ['worker-0:5000', 'worker-1:5000'],
    "ps": ['ps-0:5000']})

# The task is node-specific: this node is parameter server 0
server = tf.train.Server(
    cluster, job_name='ps', task_index=0)

The ClusterSpec specifies the workers and parameter servers in the cluster; it has the same value on all nodes. The Server contains the definition of the current node’s task, and hence a different value per node.
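
Once TF_CONFIG is set on every node, the training script itself barely changes. Below is a minimal sketch, assuming a recent TensorFlow 2.x where the strategy lives directly under tf.distribute (in older versions it sits under tf.distribute.experimental, and it only expects 'worker' entries in TF_CONFIG). The build_dataset() helper is hypothetical and stands in for your own input pipeline:

import tensorflow as tf

# Every worker runs this same script; TF_CONFIG tells the strategy which role it has
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss='mse', optimizer='sgd')

# build_dataset() is a hypothetical helper returning a tf.data.Dataset shared by all workers
model.fit(build_dataset(), epochs=10)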

TensorFlow Serving

For distributed inference, TensorFlow contains a package for hosting machine learning models. It is called TensorFlow Serving and it has been designed to quickly set up and manage machine learning models. All it needs is a SavedModel representation. SavedModel is a format for saving trained TensorFlow models in a way that allows them to be easily loaded and restored. A SavedModel can be serialized into a directory, making it portable and easy to share. You can quickly create one with the built-in tf.saved_model.save function.

import os
import tensorflow as tf

# TensorFlow Serving expects the layout <model_name>/<version>/
model_version = '1'
model_name = 'my_model'
model_path = os.path.join('/path/to/save/dir/', model_name, model_version)

# X_train, y_train, X_valid, y_valid are your training and validation data
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(loss='mse', optimizer='sgd')
model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))
tf.saved_model.save(model, model_path)

You can use the SavedModel CLI to inspect SavedModel files. Once you have these files in place, TensorFlow Serving can expose the model through a gRPC or RESTful interface. The Docker image tensorflow/serving provides the easiest path towards a running server. There are multiple versions of this image, including one for GPU usage. Besides choosing the right image, you only need to provide the path to the directory you just created and the name of your model.

$ docker run -t --rm -p 8500:8500 -p 8501:8501 \
   -v "/path/to/save/dir/my_model:/models/my_model" \
   -e MODEL_NAME=my_model \
   tensorflow/serving

Obviously, with Kubernetes you can now create a deployment for this image and scale the number of replicas up or down automatically. Put a LoadBalancer Service in front of it, and your users will be redirected to the right node without anyone noticing. Because inference requires much less computation than training, you don’t have to distribute it amongst multiple nodes. Note that the save directory path also contains a “version” directory. This is a convention TensorFlow Serving uses to watch the directory for new versions of a SavedModel: when it detects a new one, it loads it automatically, ready to be served. With TensorFlow Serving and Kubernetes, you can handle any amount of load for your classification, regression or prediction models.
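
To give an idea of what inference against such a deployment looks like, here is a minimal client sketch using the REST endpoint on port 8501. The host name tf-serving-service is a hypothetical Kubernetes Service name; substitute your own:

import json
import requests

# Hypothetical Service name in front of the tensorflow/serving pods
url = 'http://tf-serving-service:8501/v1/models/my_model:predict'

# The REST API expects a JSON body with an 'instances' list
payload = {'instances': [[1.0], [2.0], [3.0]]}
response = requests.post(url, data=json.dumps(payload))
print(response.json()['predictions'])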

🚀 Takeaway

You can save a lot of time by distributing the computations your machine learning project needs. By combining a highly scalable library like TensorFlow with a flexible platform like Kubernetes, you can make optimal use of your resources and your time.

Of course, you can speed things up even more if you have a knowledgeable Kubernetes team at your side, or somebody to help tune your machine learning models. If you’re ready to ramp up your machine learning, we can do exactly that! Interested, or do you have questions? Shoot me an email at stijn.vandenenden@aca-it.be!