Why distributed AI is key to driving AI innovation

The future of AI is distributed, said Ion Stoica, co-founder, executive chairman and president of Anyscale, on day one of VB Transform. That’s because model complexity shows no sign of slowing down.

“In recent years, the computational requirements to train a state-of-the-art model have grown, depending on the data set, between 10 and 35 times every 18 months,” he said.
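To get a feel for what that rate means, here is a back-of-the-envelope sketch of the compounding. It assumes the 10x to 35x growth per 18 months holds steady over a five-year window, which is an illustrative assumption and not a claim from the talk; the function name `growth_factor` is likewise made up for this example.

```python
def growth_factor(rate_per_period: float, months: float, period_months: float = 18.0) -> float:
    """Total multiplicative growth after `months`, if compute needs
    multiply by `rate_per_period` once every `period_months`."""
    return rate_per_period ** (months / period_months)


if __name__ == "__main__":
    # Low and high ends of the range quoted above, compounded over 5 years (60 months).
    for rate in (10, 35):
        total = growth_factor(rate, months=60)
        print(f"{rate}x every 18 months -> about {total:,.0f}x over 5 years")
```

Under that assumption, the 10x rate compounds to roughly 2,000x over five years, and the 35x rate to well over 100,000x.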

Just five years ago, the largest models fit on a single GPU. Fast forward to today, and just fitting the parameters of the most advanced models requires hundreds or even thousands of GPUs. PaLM, Google’s Pathways Language Model, has 530 billion parameters, and that is only about half the size of the largest models, which have more than 1 trillion parameters. The company uses more than 6,000 GPUs to train its latest models.

Even if these models stopped growing and GPUs continued to improve at the same rapid pace as in previous years, it would still take about 19 years for a single GPU to become powerful enough to run these cutting-edge models, Stoica added.

“Basically, this is a huge gap, growing month by month, between the needs of machine learning applications and the capabilities of a single processor or a single server,” he said. “There is no other way to support these workloads than to distribute them. It’s that simple. Writing these distributed applications is difficult. It’s even more difficult than before.”

The unique challenges of scaling applications and workloads

There are multiple phases in building a machine learning application, from data labeling and preprocessing to training, hyperparameter tuning, serving, reinforcement learning and so on, and each of these phases must scale. Typically, each step requires a different distributed system. To build end-to-end machine learning pipelines or applications, you now have to stitch these systems together and also manage each of them. You also have to develop against a variety of APIs. All of this adds enormous complexity to an AI/ML project.

The mission of the open-source Ray distributed computing project, and of Anyscale, is to make scaling these distributed computing workloads easier, Stoica said.

“With Ray, we’ve tried to provide a compute framework that you can build these applications on end-to-end,” he said. “Anyscale basically provides a hosted, managed Ray, and of course security features and tools to simplify the development, deployment and management of these applications.”

Hybrid stateful and stateless computation

The company recently released a serverless product that abstracts away the required functions, so developers don’t have to worry about where those functions run, which reduces the burden on them when scaling. But with a transparent infrastructure, functions are limited in what they can do: they perform computations, write the data back to S3, for example, and then they’re gone. Many applications, however, require stateful operators.

For example, training, which involves a lot of data, would become far too expensive if that data were written back to S3 after each iteration, or even just moved from GPU memory into machine memory. The overhead of reading the data back in, and typically serializing and deserializing it as well, would be too high.

“Ray was also built from day one around these types of operators that can maintain and continuously update state, which we call ‘actors’ in software engineering lingo,” he said. “Ray has always supported this dual mode of stateless and stateful computation.”
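The difference between the two modes can be sketched in plain Python. This is not Ray code: the dictionary `external_store` is a hypothetical stand-in for storage such as S3, and `stateless_step` and `StatefulWorker` are names invented for this illustration. In Ray itself, the `@ray.remote` decorator can be applied to a function to make it a remote task, or to a class to make it an actor.

```python
import pickle

# Hypothetical stand-in for external storage such as S3: key -> serialized bytes.
external_store = {}


def stateless_step(key, increment):
    """Stateless style: load the state, update it, write it back, keep nothing."""
    weights = pickle.loads(external_store[key])   # deserialize on every call
    weights = [w + increment for w in weights]
    external_store[key] = pickle.dumps(weights)   # serialize on every call


class StatefulWorker:
    """Actor-like style: the state lives in the object between calls."""

    def __init__(self, weights):
        self.weights = list(weights)

    def step(self, increment):
        # Update in place; no round trip through external storage.
        self.weights = [w + increment for w in self.weights]


initial = [0.0] * 4
external_store["model"] = pickle.dumps(initial)
worker = StatefulWorker(initial)

# Three "training iterations" in each style.
for _ in range(3):
    stateless_step("model", 1.0)
    worker.step(1.0)

stateless_result = pickle.loads(external_store["model"])
print(stateless_result == worker.weights)
```

Both styles end with the same values, but the stateless version pays for a serialize and deserialize round trip on every iteration, which is the overhead described above once the state is the size of a real model.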

What inning is AI implementation in?

It is tempting to say that AI implementation has finally reached the walking stage, propelled along the AI transformation journey by the recent acceleration in digital growth, but we’ve only seen the tip of the iceberg, Stoica said. There is still a gap between the current market size and the opportunity, similar to big data some 10 years ago.

“It takes time, because the time [needed] is not just for tool development,” he said. “It’s training people. Training experts. That takes even longer. If you look at big data and what happened, many universities started offering data science degrees eight years ago. And of course there are a lot of courses now, AI courses, but I think you’re going to see more and more applied AI and data courses, which there aren’t a lot of today.”

Learn more about how distributed AI is helping companies improve their business strategy, and catch all Transform sessions by registering for a free Virtual Passport.
