Run.ai partners with Nvidia to target inference – TechCrunch

Run.ai, the well-funded service for orchestrating AI workloads, made a name for itself over the last few years by helping its users make the most of their GPU resources, locally and in the cloud, to train their models. But it’s no secret that training models is one thing and getting them into production is another, which is why so many of these projects still fail. So it’s no wonder that the company, which sees itself as an end-to-end platform, is now going beyond training to help its customers run their inference workloads as efficiently as possible, whether on a private or public cloud or at the edge. The company’s platform now also integrates with Nvidia’s Triton Inference Server software, thanks to a close partnership between the two companies.

“One of the things we’ve noticed over the last 6-12 months is that companies are starting to move from building and training machine learning models to actually having those models in production,” Omri Geller, co-founder and CEO of Run.ai, told me. “We have started to invest a lot of resources internally in order to master this challenge as well. We believe we cracked the training part and built the right resource management there, so now we are focused on helping organizations manage their compute resources for inference as well.”

Photo credit: NVIDIA

The idea is to make it as easy as possible for companies to deploy their models. Run.ai promises a two-step deployment process that requires no YAML files to be written. Thanks to Run.ai’s early adoption of containers and Kubernetes, it is now able to hand off these inference workloads to the most efficient hardware, and with Run.ai’s new Nvidia integration in its Atlas platform, users can even deploy multiple models, or instances of the same model, on the Triton Inference Server, with Run.ai’s platform, which is also part of Nvidia’s LaunchPad program, handling automatic scaling and prioritization on a per-model basis.
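For a sense of what such an integration abstracts away: Triton itself is configured per model via a `config.pbtxt` file. A minimal sketch, with a hypothetical model name and backend, that runs two GPU instances of the same model might look like this:

```
name: "my_model"                # hypothetical model name
platform: "onnxruntime_onnx"    # assumed backend; depends on the model format
max_batch_size: 8
instance_group [
  {
    count: 2                    # two instances of the same model...
    kind: KIND_GPU              # ...served on GPU
  }
]
```

Writing and tuning files like this for every model is exactly the kind of work a higher-level orchestration layer takes off the user’s hands.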

While inference doesn’t require the same massive computational resources required to train a model, Nvidia’s Manuvir Das, the company’s vice president of enterprise computing, noted that these models are getting larger and deploying them on a CPU is simply not feasible. “We built this thing called the Triton Inference Server, which is all about running your inference not only on CPUs but also on GPUs — because the performance of the GPU has started to matter for inference,” he explained. “Previously, you needed the GPU to do the training, and once you had the models, you could easily deploy them to CPUs. But the models kept getting bigger and more complex. So you have to actually run them on the GPU.”
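Das’s point about model size can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses illustrative parameter counts, not figures tied to any specific model, to estimate the raw memory needed just to hold a model’s weights at fp16 precision:

```python
def weight_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Raw memory needed to hold the weights alone (fp16 = 2 bytes/param)."""
    return num_params * bytes_per_param / 1024**3

# A few illustrative model sizes:
for name, params in [("100M-param model", 100_000_000),
                     ("1B-param model", 1_000_000_000),
                     ("10B-param model", 10_000_000_000)]:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of weights at fp16")
```

Activations, batching, and latency targets add to this further, which is why serving the larger models on CPU quickly stops being practical.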

And as Geller added, models tend to get more complex over time. After all, he noted, there is a direct correlation between the computational complexity of models and their accuracy, and thus between that complexity and the problems companies can solve with these models.

Although Run.ai’s initial focus was on training, the company has been able to apply many of the technologies it developed for training to inference as well. The resource-sharing systems it built for training, for example, also apply to inference, where certain models may need more resources to run in real time.

Now, you’d think these are capabilities that Nvidia might want to build into its Triton Inference Server as well, but Das noted that the company isn’t approaching the market that way. “Anyone doing data science at scale needs a really good end-to-end MLOps platform to do it,” he said. “This is what Run.ai does well. And then what we do underneath, we provide the low-level constructs to make really good use of the GPU individually, and then when we integrate them properly, you get the best of both things. This is one of the reasons why we worked well together, because the separation of duties was clear to both of us from the start.”

It’s worth noting that alongside the Nvidia partnership, Run.ai also announced a number of other updates to its platform today. These include new inference-focused metrics and dashboards, the ability to deploy models on fractions of a GPU, and automatic scaling based on each model’s individual latency service-level agreement. The platform can now also scale deployments down to zero, reducing costs as a result.
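The latency-SLA scaling and scale-to-zero behavior described above can be sketched as a simple control loop. Everything here, from the function name to the thresholds, is an illustrative assumption and not Run.ai’s actual implementation:

```python
def desired_replicas(observed_p95_ms: float, sla_ms: float,
                     current: int, max_replicas: int = 8) -> int:
    """Decide how many replicas a model deployment should have.

    - No traffic (observed latency of 0) scales the deployment to zero.
    - Latency over the SLA adds a replica, up to a cap.
    - Latency well under the SLA removes a replica to free GPU resources.
    """
    if observed_p95_ms == 0:          # idle: scale to zero to save cost
        return 0
    if observed_p95_ms > sla_ms:      # SLA violated: scale up
        return min(current + 1, max_replicas)
    if observed_p95_ms < 0.5 * sla_ms and current > 1:  # over-provisioned
        return current - 1
    return current

print(desired_replicas(120.0, 100.0, current=2))  # over SLA -> scale up to 3
print(desired_replicas(0.0, 100.0, current=2))    # idle -> scale to 0
```

A real controller would also smooth the metric over a window and add cooldowns to avoid flapping, but the core decision is this simple.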
