You are viewing content from a past/completed QCon

Presentation: Training Deep Learning Models at Scale on Kubernetes

Track: Emerging Trends in Data Engineering

Location: Bayview AB

Duration: 4:10pm - 5:00pm

Day of week: Tuesday

Level: Intermediate - Advanced

Persona: Backend Developer, Data Engineering, Developer, ML Engineer

Abstract

Deep learning has recently become important for all kinds of AI applications, from conversational chatbots to self-driving cars. In this talk, we will discuss how we use deep learning for natural language processing, use TensorFlow to train deep learning models, run TensorFlow on top of Kubernetes, and make use of GPUs.

We need to train deep learning models for each conversational bot that we deploy on our platform. Training individual bots on one-off systems using ad hoc processes is no longer feasible, as it does not scale with the number of bots in our system. To address these requirements, we have built a framework for running long-running jobs that leverages our existing Kubernetes infrastructure. We designed our jobs framework to provide the following key benefits:

  1. Jobs can be executed on a fixed schedule, by a manual trigger, or by an automated trigger (i.e., some other event in our system can trigger a job).
  2. High availability of job workers.
  3. The number of workers for each job type can be scaled up (or down) based on need.
  4. We can assign specific attributes to specific workers. For example, we ensure that our training workers are always executed on GPU nodes so that they can take full advantage of the GPU resources available in our infrastructure.
  5. Simplified job management. This includes the ability to monitor, audit, and debug each job that was executed. Further, using our systems for centralized logging and monitoring, we can quickly understand key results from a job. For example, for model training jobs, we can quickly look at the confusion matrix to decide whether the trained model should be promoted to our production systems.
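
As a rough illustration of how scheduled execution and GPU placement can be expressed in Kubernetes, the sketch below builds a CronJob manifest as a Python dictionary. It is not the framework described in this talk. The function name, the bot identifier, the container image, the `job-type` label, and the `accelerator: nvidia-gpu` node label are all hypothetical, and the `nvidia.com/gpu` resource assumes a cluster where the NVIDIA device plugin is installed.

```python
import json


def training_cronjob(bot_id, schedule, image, gpus=1):
    """Build a Kubernetes CronJob manifest (as a dict) for one bot's training job.

    All names and labels here are illustrative assumptions, not the speakers' setup.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {
            "name": f"train-{bot_id}",
            "labels": {"job-type": "training"},  # hypothetical label
        },
        "spec": {
            "schedule": schedule,           # standard cron syntax
            "concurrencyPolicy": "Forbid",  # do not start a run while one is active
            "jobTemplate": {
                "spec": {
                    "backoffLimit": 2,      # retry a failed run at most twice
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            # Hypothetical node label used to pin work to GPU nodes.
                            "nodeSelector": {"accelerator": "nvidia-gpu"},
                            "containers": [
                                {
                                    "name": "trainer",
                                    "image": image,
                                    "args": ["--bot-id", bot_id],
                                    # Request GPUs through the device-plugin resource.
                                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                                }
                            ],
                        }
                    },
                }
            },
        },
    }


manifest = training_cronjob("bot-42", "0 2 * * *", "example.com/trainer:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be applied with `kubectl apply -f`, and a one-off manual run of the same job could be started with `kubectl create job --from=cronjob/train-bot-42 <job-name>`.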

In the talk, we will present how we have leveraged Kubernetes to realize each of the above benefits.
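
To make the confusion-matrix check mentioned in the last benefit concrete, here is a minimal sketch of one possible promotion rule: compute per-class recall from a confusion matrix and promote the model only if every class meets a threshold. The rule, the function names, the threshold values, and the example matrix are assumptions for illustration, not the criteria the speakers use.

```python
def per_class_recall(confusion):
    """Return recall for each class.

    confusion[i][j] is the number of examples whose true class is i
    and whose predicted class is j.
    """
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        # A class with no examples gets recall 0.0 rather than a division error.
        recalls.append(row[i] / total if total else 0.0)
    return recalls


def should_promote(confusion, min_recall):
    """Hypothetical rule: promote only if every class reaches min_recall."""
    return all(r >= min_recall for r in per_class_recall(confusion))


# Made-up two-class example: rows are true labels, columns are predictions.
confusion = [
    [48, 2],
    [5, 45],
]

print(per_class_recall(confusion))
print(should_promote(confusion, min_recall=0.85))
print(should_promote(confusion, min_recall=0.95))
```

With this example matrix, the lower threshold is met by both classes while the higher one is not, so only the first check would promote the model.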

Speaker: Deepak Bobbarjung

Founding Engineer @PassageAI

Deepak Bobbarjung is the founding engineer at Passage.AI. His expertise is in building scalable enterprise-grade software systems. Previously he was one of the lead engineers at Maginatics (now DellEMC), where he worked on key aspects of the Maginatics File System including Disaster Recovery, Snapshots and File System Management. Prior to that, he was at VMware, where he worked on the VMware Site Recovery Manager and VMware Converter products. He earned his PhD in computer science from Purdue University, where his doctoral thesis was titled ‘Highly Available Storage Systems’.

Speaker: Mitul Tiwari

CTO @PassageAI

Mitul Tiwari is the CTO and Co-founder of Passage.AI. His expertise lies in building data-driven products using AI, Machine Learning and big data technologies. Previously he was head of People You May Know and Growth Relevance at LinkedIn, where he led technical innovations in large-scale social recommender systems. Prior to that, he worked at Kosmix (now Walmart Labs) on web-scale document and query categorization, and its applications. He earned his PhD in Computer Science from the University of Texas at Austin and his undergraduate degree from the Indian Institute of Technology, Bombay. He has also co-authored more than twenty publications in top conferences such as KDD, WWW, RecSys, VLDB, SIGIR, CIKM, and SPAA.

Tracks

  • Practices of DevOps & Lean Thinking

    Practical approaches using DevOps and a lean approach to delivering software.

  • Microservices Patterns & Practices

    What's the last mile for deploying your service? Learn techniques from the world's most innovative shops on managing and operating Microservices at scale.

  • Bare Knuckle Performance

    Killing latency and getting the most out of your hardware

  • Architectures You've Always Wondered About

    Next-gen architectures from the most admired companies in software, such as Netflix, Google, Facebook, Twitter, & more

  • Machine Learning for Developers

    AI/ML is more approachable than ever. Discover how deep learning and ML are being used in practice. Topics include: TensorFlow, TPUs, Keras, PyTorch & more. No PhD required.

  • Production Readiness: Building Resilient Systems

    Making systems resilient involves people and tech. Learn about strategies being used from chaos testing to distributed systems clustering.

  • Regulation, Risk and Compliance

    With so much uncertainty, how do you bulkhead your organization and technology choices? Learn strategies for dealing with uncertainty.

  • Languages of Infrastructure

    This track explores languages being used to code the infrastructure. Expect practices on toolkits and languages like Cloudformation, Terraform, Python, Go, Rust, Erlang.

  • Building & Scaling High-Performing Teams

    To have a high-performing team, everybody on it has to feel and act like an owner. Organizational health and psychological safety are foundational underpinnings to support ownership.

  • Evolving the JVM

    The JVM continues to evolve. We’ll look at how things are evolving. Covering Kotlin, Clojure, Java, OpenJDK, and Graal. Expect polyglot, multi-VM, performance, and more.

  • Trust, Safety & Security

    Privacy, confidentiality, safety and security: learning from the frontlines.

  • JavaScript & Transpiler/WebAssembly Track

    JavaScript is the language of the web. Latest practices for JavaScript development and how transpilers are affecting the way we work. We’ll also look at the work being done with WebAssembly.

  • Living on the Edge: The World of Edge Compute From Device to Application Edge

    Applied, practical & real-world deep-dive into industry adoption of OS, containers and virtualization, including Linux on.

  • Software Supply Chain

    Securing the container image supply chain (containers + orchestration + security + DevOps).

  • Modern CS in the Real World

    Thoughts pushing software forward, including consensus, CRDTs, formal methods & probabilistic programming.

  • Tech Ethics: The Intersection of Human Welfare & STEM

    What does it mean to be ethical in software? Hear how the discussion is evolving and what is being said in ethics.

  • Optimizing Yourself: Human Skills for Individuals

    Better teams start with a better self. Learn practical skills for IC.

  • Modern Data Architectures

    Today’s systems move huge volumes of data. Hear how places like LinkedIn, Facebook, Uber and more built their systems and learn from their mistakes.