Presentation: Training Deep Learning Models at Scale on Kubernetes
Share this on:
Abstract
Deep Learning has recently become very important for all kinds of AI applications from conversational chatbots to self-driving cars. In this talk, we will talk about how we use deep learning for natural language processing, utilize Tensorflow for training deep learning models, run Tensorflow on top of Kubernetes, and use GPUs.
We have a need to train deep learning models for each conversational bot that we deploy on our platform. Training individual bots on one-off systems using ad-hoc processes is no longer a feasible solution as it does not scale with the number of bots in our system. In order to address the above requirements, we have built a framework for running long running jobs that leverages our existing Kubernetes infrastructure. We have designed our jobs framework to have the following key benefits.
- Jobs can be executed either on a fixed schedule or a manual trigger or an automated trigger ( i.e some other event in our system can trigger a job)
- High availability of job workers.
- Scale up (or down) the number of workers for each job type based on need.
- We can assign specific attributes to specific workers. For example, we ensure that our training workers are always executed on GPU nodes so that they can take full advantage of the GPU resources available in our infrastructure.
- Simplified job management. This includes the ability to monitor, audit and debug each job that was executed. Further, using our systems for centralized logging and monitoring, we can quickly understand key results from the job. For example, in case of model training jobs, we can quickly look at the confusion matrix to understand if the trained model should be promoted to our production systems.
In the talk, we will present how we have leveraged Kubernetes to realize each of the above benefits.
Similar Talks





Tracks
Monday, 5 November
-
Microservices / Serverless Patterns & Practices
Evolving, observing, persisting, and building modern microservices
-
Practices of DevOps & Lean Thinking
Practical approaches using DevOps & Lean Thinking
-
JavaScript & Web Tech
Beyond JavaScript in the Browser. Exploring WebAssembly, Electron, & Modern Frameworks
-
Modern CS in the Real World
Thoughts pushing software forward, including consensus, CRDT's, formal methods, & probabilistic programming
-
Modern Operating Systems
Applied, practical, & real-world deep-dive into industry adoption of OS, containers and virtualization, including Linux on Windows, LinuxKit, and Unikernels
-
Optimizing You: Human Skills for Individuals
Better teams start with a better self. Learn practical skills for IC
Tuesday, 6 November
-
Architectures You've Always Wondered About
Next-gen architectures from the most admired companies in software, such as Netflix, Google, Facebook, Twitter, & more
-
21st Century Languages
Lessons learned from languages like Rust, Go-lang, Swift, Kotlin, and more.
-
Emerging Trends in Data Engineering
Showcasing DataEng tech and highlighting the strengths of each in real-world applications.
-
Bare Knuckle Performance
Killing latency and getting the most out of your hardware
-
Socially Conscious Software
Building socially responsible software that protects users privacy & safety
-
Delivering on the Promise of Containers
Runtime containers, libraries, and services that power microservices
Wednesday, 7 November
-
Applied AI & Machine Learning
Applied machine learning lessons for SWEs, including tech around TensorFlow, TPUs, Keras, PyTorch, & more
-
Production Readiness: Building Resilient Systems
More than just building software, building deployable production ready software
-
Developer Experience: Level up your Engineering Effectiveness
Improving the end to end developer experience - design, dev, test, deploy, operate/understand.
-
Security: Lessons Attacking & Defending
Security from the defender's AND the attacker's point of view
-
Future of Human Computer Interaction
IoT, voice, mobile: Interfaces pushing the boundary of what we consider to be the interface
-
Enterprise Languages
Workhorse languages found in modern enterprises. Expect Java, .NET, & Node in this track