In 2015, Google essentially created the modern-day machine learning ecosystem when it open sourced a small research project from Google Brain called TensorFlow. It quickly exploded in popularity and made the company the steward of mainstream AI products.
But the story is very different today, where Google has lost the hearts and minds of developers — to Meta.
Once an omnipresent machine learning tool, Google’s TensorFlow has since fallen behind Meta’s machine learning tool, PyTorch. First developed at Facebook and open sourced in beta form in 2017, PyTorch is increasingly seen as the leader.
The chorus is the same in interviews with developers, hardware experts, cloud providers, and people close to Google’s machine learning efforts. TensorFlow has lost the war for the hearts and minds of developers. A few of those people even used the exact phrase unprompted: “PyTorch has eaten TensorFlow’s lunch.”
Through a series of tactical missteps, development decisions, and outmaneuvering in the open source community by Meta, experts say Google’s chance to guide the future of machine learning on the Internet may be slipping away. PyTorch has since become the go-to machine learning development tool for casual developers and scientific researchers alike.
Now, under the shadow of PyTorch, Google has been quietly building out a machine learning framework, called JAX (at one point an acronym for “Just After eXecution,” but officially no longer stands for anything), that many see as the successor to TensorFlow.
Google Brain and Google’s DeepMind AI subsidiary have widely ditched TensorFlow in favor of JAX, paving the way for the rest of Google to follow suit, people close to the project told Insider. A Google representative confirmed to Insider that JAX now has almost universal adoption within Google Brain and DeepMind.
Initially, JAX faced substantial pushback from within, people close to Google’s machine learning efforts said. Googlers were accustomed to using TensorFlow, these people said. As unwieldy as it was, it was an uncomfortable unifying factor among Googlers. JAX’s approach was much simpler, but nonetheless changed how Google built software internally, they said.
The tool is now expected to become the underpinning of all of Google’s products that use machine learning in the coming years, much in the same way TensorFlow did in the late 2010s, people with knowledge of the project say.
And JAX appears to have broken out of the insular Google sphere: Salesforce told Insider it had adopted it in its research teams.
“JAX is a feat of engineering,” said Viral Shah, a co-creator of the Julia programming language, which experts frequently compare to JAX. “I think of JAX as a separate programming language that happens to be instantiated through Python. If you stick to the rules that JAX wants you to, then it can do its magic, which is amazing in what it can do.”
Google is now hoping to strike gold again while learning from the mistakes it made in developing TensorFlow. But experts say that will be an enormous challenge, as it now has to unseat an open source tool that has won the hearts and minds of developers.
Meta did not provide a comment at time of publication.
The twilight of TensorFlow and the rise of PyTorch
PyTorch’s engagement on a must-read developer forum is quickly catching up to TensorFlow’s, according to data provided to Insider. Engagement data from Stack Overflow shows that TensorFlow’s popularity, measured by its share of questions asked on the forum, has stagnated in recent years, while PyTorch’s engagement continues to rise.
TensorFlow started off strong, exploding in popularity following its launch. Companies like Uber and Airbnb and organizations like NASA quickly picked it up and began using it for some of their most complex projects that required training algorithms on massive data sets. It had been downloaded 160 million times by November 2020.
But Google’s feature-creeping and constant updates increasingly made TensorFlow unwieldy and unfriendly to users, even those within Google, developers and people close to the project say. Google had to frequently update its framework with new tools as the machine learning field advanced at a blistering pace. And the project sprawled internally as more and more people were involved, leading to a lack of focus on the parts that originally made TensorFlow the go-to tool, people close to the project said.
This kind of frantic game of cat-and-mouse is a frequent problem for many companies that are first to market, experts told Insider. Google, for example, wasn’t the first company to build a search engine; it was able to learn from the mistakes of predecessors like AltaVista or Yahoo.
PyTorch, meanwhile, launched its full production version in 2018 out of Facebook’s artificial intelligence research lab. While both TensorFlow and PyTorch were built on top of Python, the preferred language of machine learning experts, Meta heavily invested in catering to the open source community. PyTorch, too, benefited from a level of focus on doing a small number of things well that the TensorFlow team had lost, people close to the TensorFlow project say.
“We mostly use PyTorch; it has the most community support,” Patrick von Platen, a research engineer at machine learning startup Hugging Face, said. “We think PyTorch is probably doing the best job with open source. They make sure the questions are answered online. The examples all work. PyTorch always had a very open source first approach.”
Some of the largest organizations—including those that relied on TensorFlow—spun up projects running on PyTorch. It wasn’t long before companies like Tesla and Uber were running their most difficult machine learning research projects on PyTorch.
Each additional feature, at times added to copy the elements that made PyTorch popular, made TensorFlow increasingly bloated for its original audience of researchers and users. One such example was the addition of “eager execution” in 2017, a mode that runs operations immediately, the way ordinary Python code does, and makes it substantially easier for developers to analyze and debug their code.
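For readers who want a concrete picture of what “eager execution” changes, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not TensorFlow’s or PyTorch’s actual API: the `DeferredNode` class and `const` helper are hypothetical names invented for this example. The deferred style builds a description of the computation first and runs it later, while the eager style produces a value on every line, so intermediate results can be printed or inspected in a debugger.

```python
# Toy illustration only: DeferredNode and const are hypothetical helpers,
# not part of TensorFlow, PyTorch, or JAX.


class DeferredNode:
    """A node in a toy computation graph; nothing is computed until run()."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const", "add", or "mul"
        self.inputs = inputs  # child nodes for "add" and "mul"
        self.value = value    # payload for "const" nodes

    def __add__(self, other):
        # Records the addition instead of performing it.
        return DeferredNode("add", (self, other))

    def __mul__(self, other):
        # Records the multiplication instead of performing it.
        return DeferredNode("mul", (self, other))

    def run(self):
        # Walks the recorded graph and only now computes a number.
        if self.op == "const":
            return self.value
        left, right = (node.run() for node in self.inputs)
        return left + right if self.op == "add" else left * right


def const(value):
    """Wraps a plain number as a graph node."""
    return DeferredNode("const", value=value)


# Deferred ("graph") style: the expression below is only a description.
graph = (const(2.0) + const(3.0)) * const(4.0)
deferred_result = graph.run()  # the arithmetic happens here

# Eager style: each line computes a real value immediately,
# so `a` can be inspected before the next step runs.
a = 2.0 + 3.0
eager_result = a * 4.0
```

Both styles arrive at the same number. The difference is when the work happens, which is why the eager style tends to be easier to step through and debug.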
Enter JAX, the future of machine learning at Google
As the battle between PyTorch and TensorFlow played out, a small research team inside Google worked on a new framework that would make it easier to access the custom-built chips — called tensor processing units, or TPUs — that underlie its approach to artificial intelligence and were accessible only through TensorFlow.
The team’s researchers included Roy Frostig, Matthew James Johnson, and Chris Leary. Frostig, Johnson, and Leary released a paper in 2018 titled “Compiling machine learning programs via high-level tracing,” describing what would eventually become JAX.
Adam Paszke, one of the original authors of PyTorch during a prior stint at Facebook, also began working with Johnson in 2019 as a student, and joined the JAX team full-time in early 2020.
The new project, JAX, offered a more straightforward design for handling one of the most complex problems in machine learning: spreading the work of a large problem across multiple chips. Rather than run individual pieces of code for distinct chips, JAX automatically distributes the work. The demand came from an impressive perk of working at Google: immediate access to as many TPUs as you need to do whatever you need.
It solved a fundamental problem Google’s researchers faced when working on increasingly large problems and needing more and more computational power.
Catching wind of JAX, developers and researchers inside Google began adopting the skunkworks project. It offered a way to skip much of the developer unfriendliness of TensorFlow and quickly spread complex technical problems across multiple TPUs, people familiar with the project said.
Google’s largest challenge with JAX is pulling off Meta’s strategy with PyTorch
At the same time, both PyTorch and TensorFlow started in the same way. They were first research projects, then curiosities, then the standard in machine learning research. Then researchers took them out of academia and into the rest of the world.
JAX, however, faces several challenges. The first is that it still relies on other frameworks in many ways. JAX doesn’t offer an easy way to load and pre-process data, developers and experts say, requiring TensorFlow or PyTorch to handle much of the setup.
JAX’s underlying compiler, XLA, is also highly optimized for Google’s TPUs. It works with more traditional GPUs and CPUs as well, though people close to the project said it still had a way to go before GPU and CPU optimization reached parity with TPUs.
A Google spokesperson said the emphasis on TPUs resulted from organizational and strategic confusion from 2018 to 2021 that led to underinvestment and suboptimal prioritization for GPU support, as well as a lack of collaboration with massive GPU provider Nvidia, both of which are rapidly improving. Google’s own internal research was also largely focused on TPUs, leading to a lack of good feedback loops for GPU usage, the spokesperson said.
That improvement will be critical going forward as companies look to spread their work across many different kinds of machine learning-focused hardware, said Andrew Feldman, CEO of Cerebras Systems, a $4 billion startup building large machine learning-focused chips.
“Anything done to advantage one hardware over another will immediately be recognized as bad behavior, and it will be rejected in the open source community,” he said. “No one wants to be locked into a single hardware vendor. That is why the machine learning frameworks emerged. Machine learning practitioners wanted to be sure that their models were portable, that they could take them to any hardware platform they chose and not be locked in to only one.”
At the same time, PyTorch itself is now almost 6 years old, well past the age at which TensorFlow first started showing signs of slowing down. It’s not clear if Meta’s project will meet a fate similar to that of its Google-backed predecessor, but it could mean that the time is right for something new to emerge. And several experts and people close to the project pointed to Google’s size, cautioning critics to never count out the search giant.