DeepSeek has dominated AI headlines in both the US and China recently, shaking up the industry with its latest breakthroughs. Following the launch of its V3 model, positioned as a rival to GPT-4o, and the R1 model challenging o1, DeepSeek made waves in Silicon Valley’s AI landscape and disrupted the tech stock market.
Amid discussions around Ilya Sutskever’s “End of the Pretraining Era” theory and DeepSeek’s game-changing high-performance models, a critical question emerges: is the demand for AI computational power reaching an inflection point?
This topic takes center stage in the second session of our January 17th event. Zack Li, CTO and Co-founder of Nexa AI, joins Xiyue Xiang, Principal Member of Technical Staff at AMD and former Senior Engineering Manager at SambaNova Systems, for an in-depth dialogue.
From the perspectives of chip and device-level AI, they’ll explore shifts in pretraining and inference, the evolving demand for AI compute, software-hardware co-optimization, opportunities in edge intelligence, and the future trajectory of the industry.
The following is the full transcript of their engaging dialogue.
Zhaoyang Wang: Welcome to our event and the first panel discussion.
The topic of this panel is "Pre-training will end—What comes next?" As many of you in the industry may know, this phrase originated from Ilya Sutskever. Today, we’ll explore what lies ahead, and I’m thrilled to have two distinguished guests with us. Xiyue is from AMD, and Zack is from Nexa AI, an incredible on-device AI startup. Let’s start with some introductions.
Xiyue Xiang: Thank you for inviting me; it’s a pleasure to be here. My Chinese name is Xiyue, and my English name is Anderson. I’m a Principal Member of Technical Staff at AMD. My primary focus is leveraging AI technologies to tackle challenges in silicon engineering. I’m also involved in developing several AI-driven products for our SoCs. Before joining AMD, I was part of the founding team at SambaNova Systems, where I worked as a senior manager leading a team that delivered system firmware and FPGA prototypes for AI accelerators. That’s a bit about me.
Zhaoyang Wang: Thank you, Xiyue. And Zack?
Zack Li: Sure, hi everyone! I’m Zack Li, the CTO and Co-founder of Nexa AI. Prior to Nexa, I worked at Google and Amazon for several years, focusing on device AI. Nexa AI is a Stanford spin-off startup specializing in on-device AI models and AI infrastructure. One of our key products is Octopus, an on-device AI agent model. We also developed OmniVision and OmniAudio in collaboration with Google and AMD. These models, which were open-sourced, became quite popular on Hugging Face. We’ve also introduced the Nexa SDK, which supports running on-device AI models on laptops and mobile devices, achieving over 5,000 GitHub stars in the last three months. Our customers include HP, Lenovo, Samsung, and more. That’s an overview of what we do.
Zhaoyang Wang: It’s great to have two experts from different perspectives—one from the chip industry and the other from AI models. Let’s dive into the topic of pre-training. As pre-training is coming to an end, perhaps we can start with you, Xiyue. Since not everyone here may be familiar with the chip industry, could you walk us through some basics, such as the differences between chips used for training and those used for inference?
Xiyue Xiang: Absolutely. As you know, the more data we use for model training, the smarter the model becomes. Training, at its core, is a throughput game. This means the infrastructure must process massive amounts of input data at high speed and in real time. Different stages of training are compute-, memory-, or network-intensive, and each places specific demands on the hardware.
From a compute perspective, during the forward pass, the system performs heavy matrix multiplication and addition. During the backward pass, it calculates gradients for each layer's weights. These tasks are highly compute-intensive, which is why silicon vendors focus on adding more FLOPS or TOPS to their chips.
From a memory perspective, during the forward pass, the model weights must be loaded, and during the backward pass, gradients are calculated and stored. Increasing batch size to boost throughput intensifies memory demands, so chips need to provide both large memory capacity and extremely high memory throughput.
When it comes to distributed training, chips must also support fast networking to distribute input data and forward activations across devices while handling operations like AllReduce for weight updates. This requires high-speed interconnects and scalable solutions.
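To make these three pieces concrete, here is a minimal PyTorch sketch of the training-side pattern Xiyue describes: a matmul-heavy forward pass, a backward pass that materializes per-layer gradients, and an AllReduce that averages gradients across workers in data-parallel training. It is an illustrative sketch only, not any vendor's actual stack.

```python
# Minimal sketch of one training step: forward matmuls, backward gradients,
# and (in multi-device runs) an AllReduce to synchronize gradients.
import torch
import torch.distributed as dist

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),   # forward pass: large matrix multiplies
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(32, 4096)          # a small batch; larger batches raise memory pressure
loss = model(x).pow(2).mean()      # forward pass: compute-bound
loss.backward()                    # backward pass: gradients computed and stored per layer

# In data-parallel training, gradients are averaged across workers before the update.
if dist.is_available() and dist.is_initialized():
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()

optimizer.step()                   # weight update
```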
For inference, however, the requirements differ significantly. Inference is a latency game, meaning the system must process data in real time to provide immediate responses. For example, in autonomous driving, real-time object detection can be a life-or-death matter. Most inference tasks are memory-bound for large models, but they can also be compute-bound if the entire model fits on-chip, which is rare.
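As a rough illustration of why decoding large models is usually memory-bound, here is a back-of-envelope calculation; the model size and bandwidth figures are assumptions chosen purely for illustration.

```python
# Back-of-envelope: each generated token has to stream (roughly) all model
# weights from memory, so memory bandwidth sets a floor on per-token latency.
params = 7e9                 # hypothetical 7B-parameter model
bytes_per_param = 2          # fp16 weights
bandwidth = 1e12             # assumed 1 TB/s of memory bandwidth (HBM-class)

weight_bytes = params * bytes_per_param
latency_floor_ms = weight_bytes / bandwidth * 1e3
print(f"~{latency_floor_ms:.1f} ms per token just to read the weights")  # ~14 ms
```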
Overall, chip vendors have gained a deeper understanding of AI workloads and are tailoring their products to meet the unique demands of different markets.
Zhaoyang Wang: That’s fascinating. Based on my understanding, many AI practitioners use the same chips for both training and inference. Could you elaborate on whether that’s true? And if pre-training is indeed ending, how do chip companies plan to adapt?
Xiyue Xiang: Sure. I believe inference will play a more prominent role going forward. Drawing from my experience at SambaNova, we initially focused on building training solutions but eventually pivoted to the inference market due to its growing importance.
For hyperscalers, it often makes sense to use the same hardware and software stack for both training and inference because it’s cost-effective and practical. However, for customers focused solely on inference, a more cost-effective solution might be preferable. That’s my perspective.
Zhaoyang Wang: Okay, thanks. My next question is for Zack. When we talk about the idea of pretraining coming to an end, there’s also an underlying theory that computing power is shifting from the training phase to the inference phase. Specifically, inference can be divided into two parts: one in the cloud and the other, which is your focus, on devices. Do you think this trend will become a reality? What does this mean for your company? I believe your company is about two years old now, correct?
Zack Li: Yes, that’s correct—two years old.
For Nexa AI, we have been focused on device AI from the very beginning, investing heavily in research and development for inference. We’ve developed an SDK as well as a quantization solution that shrinks model sizes to fit on devices like tablets or mobile phones. This trend aligns perfectly with our company’s vision and R&D efforts.
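To illustrate the kind of compression involved, the sketch below shows the basic idea behind post-training int8 weight quantization. It is a generic example, not Nexa's actual pipeline, which would use more sophisticated per-channel or group-wise schemes.

```python
# Store weights in int8 with a per-tensor scale, cutting memory roughly 4x
# versus fp32, at the cost of a small reconstruction error.
import torch

w = torch.randn(4096, 4096)                      # fp32 weights: ~64 MB
scale = w.abs().max() / 127.0                    # symmetric per-tensor scale
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)  # ~16 MB

w_dequant = w_int8.float() * scale               # dequantize at compute time
error = (w - w_dequant).abs().mean().item()
print(f"mean absolute quantization error: {error:.5f}")
```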
We’ve noticed that while models are becoming smaller and smarter, there’s still a significant gap in infrastructure. Take PyTorch, for example. There isn’t yet a mature infrastructure to efficiently run large models on mobile devices. We’re betting on this gap, and we’ve developed toolkits to compress and deploy models on edge devices.
Zhaoyang Wang: That’s interesting. Could you share more about your advancements in technology for smaller models?
Zack Li: Sure. To enable models to run efficiently on devices, the toolkits need to be lightweight and versatile. Let me ask: how many of you have used PyTorch or worked with basic tensor operations? If you have, please raise your hand.
—Okay, that’s more people than I expected.
Now, how many of you have tried using tools like Hugging Face on your laptops to run models? Please raise your hand.
—It seems like only a few people have.
That’s the problem. If you’ve used PyTorch or cloud-based solutions, you’ll know they often come with large toolkits—typically over 1 GB to download. These toolkits are also not universally compatible across different backends. For instance, PyTorch requires a Metal backend for Apple devices, CUDA for NVIDIA devices, and so on. This lack of scalability is a major challenge.
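As a small illustration of that fragmentation, even a "portable" PyTorch script ends up branching per vendor backend. The sketch below assumes a recent PyTorch build and leaves NPUs out entirely, since they typically require separate toolkits.

```python
# Backend selection differs per vendor even within a single framework,
# and each backend ships its own large runtime.
import torch

if torch.cuda.is_available():             # NVIDIA (or ROCm builds of PyTorch on AMD)
    device = torch.device("cuda")
elif torch.backends.mps.is_available():   # Apple silicon via the Metal (MPS) backend
    device = torch.device("mps")
else:
    device = torch.device("cpu")           # fallback; NPUs usually need separate toolkits

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
print(device, model(x).shape)
```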
To address this, we’ve developed a streamlined toolkit that we provide to enterprise customers. These toolkits allow models—whether downloaded from us or platforms like Hugging Face—to be compressed and scaled across a wide variety of devices, including mobile phones, laptops, robotics, and even autonomous systems.
Zhaoyang Wang: Cool. A follow-up question: You’ve mentioned that your company, as a two-year-old startup, is betting on smaller models and the growing importance of inference. However, competition must be a factor. Big players like Meta, Google, and OpenAI are also building their own large models, and smaller models often come from distilling these larger ones. Some argue that only companies with the capability to create the best large models can also produce the best small models. How do you view this competition?
Zack Li: That’s a great question, and a tough one.
So I’ll share an interesting story. In May 2024, we were invited to Apple Park to give a talk to their executives. Just a week later, at WWDC, Apple announced Apple Intelligence, their on-device AI solution. They’ve been putting a lot of effort into it, offering different model checkpoints, from smaller on-device versions to larger ones. There’s even been internal talk that the on-device models are distilled from larger checkpoints.
When it comes to on-device AI, big players like Apple, Google, and Microsoft have significant advantages in computational power and access to data. But their approach often involves scaling down cloud-based solutions to fit on devices, aiming to create models that can handle a broad range of tasks. On the other hand, we focus on specific, real-world use cases for on-device AI, making our models highly specialized and optimized for our customers’ needs.
For example, we’ve developed the Octopus model, which is specifically designed for on-device AI agents. It achieves GPT-4-level function-calling accuracy and performs exceptionally well in reasoning and question answering. This specialization allows us to deliver more targeted and effective solutions.
Another key difference is infrastructure. Large companies like Apple design their systems primarily for their own hardware. Apple Intelligence, for instance, is essentially a way to promote their latest iPhones. They have no incentive to support older models or Android devices. In contrast, we’ve built an infrastructure that works across platforms—whether it’s Android, iOS, macOS, Windows, or Linux.
Our solutions are compatible with a wide range of devices, including those with lower bandwidth. Big companies typically focus on their own ecosystems, like Apple’s Mac ecosystem or Google’s Pixel ecosystem, and don’t prioritize cross-platform compatibility. That’s where we see a huge opportunity to make a difference.
Zhaoyang Wang: That’s very interesting. I think it means you need to strike a balance between becoming a domain expert and building a platform that is scalable across different ecosystems.
Zack Li: Exactly, scalable to different ecosystems.
Zhaoyang Wang: This also reflects a broader trend in AI. The boundaries are becoming less clear—software engineers need to understand hardware, and chip designers need to know how algorithms operate. My next question builds on this. OpenAI is replacing its GPT series with the new o1 models, and we’re also seeing DeepSeek-like models. These are incredibly advanced models, and their success seems to come from maximizing hardware capabilities through infrastructure designed specifically for their models. This suggests that hardware and software are becoming more intertwined in this competition. Xiyue, how does AMD approach software development to strengthen its position in this evolving space?
Xiyue Xiang: Before diving into AMD’s strategy, I’d like to take a step back to frame the question better. Right now, there are two main challenges when it comes to scaling AI capabilities. The first is enhancing the scalability of the AI models themselves, and the second is reducing the cost of training these models.
To illustrate this, everyone likely knows that OpenAI exhausted most high-quality training datasets while training GPT-4. Whether the famous scaling laws still hold is unclear, but a few months later, they released GPT-4 Turbo, which introduced multimodality, and then they added a new dimension: test-time compute. Test-time compute essentially allows a model to "think longer" before providing an answer, enabling it to refine and validate its responses and opening a new axis for scaling. This is the core idea behind models like o1 and o3.
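To show one deliberately simplified form of test-time compute, the sketch below implements best-of-n sampling against a verifier. The generate and score functions are hypothetical placeholders, and this is not a claim about how o1 or o3 actually work.

```python
# Spend more inference compute per query by sampling several candidate
# answers and keeping the best-scoring one.
import random

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for a sampled model completion (hypothetical)."""
    return f"candidate answer {random.randint(0, 9)} to: {prompt}"

def score(prompt: str, answer: str) -> float:
    """Placeholder for a verifier or reward model (hypothetical)."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt) for _ in range(n)]       # more compute per query...
    return max(candidates, key=lambda a: score(prompt, a))  # ...traded for a better answer

print(best_of_n("What is 17 * 24?"))
```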
From a cost perspective, consider the announcement of DeepSeek V3 last December. They managed to train a 600-billion-parameter model for $5 million—dramatically less than the $500 million typically required to train a model of similar size with H100 GPUs. They achieved this through innovations like mixture-of-experts architectures and mixed-precision training, which allowed them to use much cheaper compute resources while delivering comparable results.
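Mixed-precision training is one of the cost levers mentioned here. As a generic illustration, the sketch below uses standard PyTorch automatic mixed precision; it is not DeepSeek's FP8 recipe, just the common low-precision-with-loss-scaling pattern.

```python
# Compute-heavy ops run in a low-precision format while a gradient scaler
# keeps small fp16 gradients from underflowing.
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
model = torch.nn.Linear(2048, 2048).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)   # no-op on CPU

x = torch.randn(16, 2048, device=device)
with torch.autocast(device_type=device,
                    dtype=torch.float16 if use_cuda else torch.bfloat16):
    loss = model(x).pow(2).mean()        # matmuls run in reduced precision

scaler.scale(loss).backward()            # scale loss to avoid gradient underflow
scaler.step(optimizer)                   # unscale, then apply the optimizer step
scaler.update()
```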
These challenges can’t be solved without end-to-end optimization across both software and hardware. Simply having a powerful chip isn’t enough. I’ve seen many great companies build excellent chips, but they often struggle to create scalable and efficient software. To put this into perspective, you can design a chip with one petaflop of computational power, but if your software is inefficient, you might only use 20% of that capability—leaving 80% wasted.
When it comes to AMD, we are addressing this by scaling our software capabilities in three major ways. First, we’ve developed our own open-source software platform called ROCm (Radeon Open Compute), which is designed to program GPUs and AI accelerators like the MI300. Second, we are scaling through strategic acquisitions. For example, we acquired Silo AI last year and Xilinx in 2022, and we’re likely to continue pursuing similar deals. Finally, and perhaps most importantly, we’re committed to building a robust ecosystem by enabling seamless integration with popular frameworks like PyTorch and TensorFlow. We’re also working closely with major AI infrastructure vendors to ensure our hardware and software solutions are both scalable and efficient.
Zhaoyang Wang: Cool. And now for Zack.
In terms of maximizing hardware utilization for machine learning workflows and training, how does Nexa AI better “squeeze” the full potential out of hardware?
Zack Li: That’s a great question.
I’ve noticed that some companies adopt a hardware-software co-design approach. I think this is a great business model because it allows them to gain more profit by selling hardware directly. However, it also comes with challenges, such as managing the logistics chain and dealing with manufacturing processes.
At Nexa AI, since my core team is primarily composed of algorithm and AI experts, we focus more on the model side. We develop toolkits that allow developers to deploy models on a variety of devices—whether it’s a laptop, a mobile phone, or robotics. I can also offer some feedback from having tried different chip companies’ toolkits, though I don’t want this to come across as harsh. I’m not a mean person, and Nexa AI is actively collaborating with AMD. We were just at CES 2025, where AMD invited us to showcase our work at their booth.
Over the years, we have tried Nvidia’s software, Intel’s, and AMD’s. What I’ve come to realize is that software is becoming a critical factor for chip companies to attract both customers and developers, especially individual developers. Take Intel’s OpenVINO, for example. If you buy an Intel desktop or laptop, you can fully leverage their NPU (Neural Processing Unit) through OpenVINO, which is open source. I’ve seen three or four startups build their software stacks entirely on Intel’s NPU. Similarly, I’ve seen startups basing their operations on AMD’s GPUs.
So, software efficiency is shaping up to be a key differentiator, particularly for startups. While specs like FLOPs and RAM are important, chip companies often have very similar products in those areas. The software stack, therefore, becomes the deciding factor for many developers and customers when choosing hardware.
Zhaoyang Wang: From your perspective, what makes good software?
Zack Li: Well, I’m not sure how many NVIDIA or AMD folks are in the audience, but personally I think their toolkits are pretty good, and we definitely want to work more closely with them to improve things further.
Zhaoyang Wang: I’ll stop asking tougher questions for now. Maybe we can discuss the future: what’s the next big thing in chips, and what can we expect in terms of technological breakthroughs?
Xiyue Xiang: When we talk about what might happen on the chip design side, I’d like to tackle this from five perspectives: process technology, compute, memory, network, and packaging.
Process Technology
First, process technology. Everyone’s saying Moore’s Law is coming to an end, and maybe that’s true. But there’s still an undeniable trend that process nodes will continue to shrink. This allows us to pack more transistors and reduce power consumption, albeit at a slower pace than before. This is evident from TSMC and Intel’s advancements. I believe this trend will continue until disruptive technologies, like quantum computing, become mature.
Compute
Second is compute. Chip vendors and startups are putting significant effort into designing specialized compute elements to support various precisions and sparsity. They’re also exploring emerging architectures, such as dataflow architectures, to overcome the limitations of the traditional von Neumann system.
Memory
Third, let’s talk about memory, specifically HBM (High Bandwidth Memory). HBM has been adopted to address memory bandwidth and latency issues, which are critical in the AI era. I believe HBM will continue to scale in terms of performance, density, and capacity. However, it’s incredibly expensive. To balance cost and performance, chip vendors are likely to explore tiered memory hierarchies that combine SRAM, HBM, and DDR memory. This approach makes sense for optimizing trade-offs between cost, bandwidth, and latency.
Network
Fourth is networking. Networking scalability has two dimensions: scaling up and scaling out.
Scaling up involves enhancing the performance of a single system or node.
Scaling out means replicating multiple systems to solve a single problem, which requires innovation in transport protocols like RoCE (RDMA over Converged Ethernet), NVLink, or emerging counterparts like UCIe (Universal Chiplet Interconnect Express). It also requires switch vendors to create more scalable and economical solutions to build larger networks.
Packaging
Finally, packaging. We currently have 2.5D packaging (CoWoS) and 3D packaging (TSV-based technology). Recently, Broadcom announced something called 3.5D packaging, though the details aren’t entirely clear yet. My guess is that it’s a combination of 2.5D and 3D technologies, enabling the stitching together of multiple dies to form a larger chip. This aligns with the trend toward chiplet-based systems-on-module (SoM), which is driving the evolution of packaging technology.
Zhaoyang Wang: Very good. My last question is for Zack.
In terms of on-device AI, more and more people believe it will be the next big thing. That also implies potential fundamental changes in business models. For example, with cloud-based AI, most monetization is tied to how many tokens you use—sending data to the cloud, processing it, and then receiving the output, all of which you pay for. However, with on-device AI, when someone buys a smartphone, they’ve essentially already paid for the compute power, since everything happens locally on the device. They wouldn’t need to pay for tokens sent to the cloud. Do you, as a startup founder, see opportunities for new business models here?
Zack Li: The business model for on-device AI—how to commercialize or monetize it—is definitely different from cloud-based solutions. First of all, I agree that on-device AI is gaining momentum. Apple Intelligence has helped generate public awareness of on-device AI, showcasing its capabilities. Additionally, Jennifer Li, General Partner at a16z, mentioned that 2025 would be the year edge AI takes off. We’ve even highlighted this quote in our office to keep the team motivated.
Now, in terms of monetization, on-device AI requires a different approach. Unlike the token-based model in cloud AI, on-device AI monetization typically involves close collaboration with chip companies like AMD and with OEMs such as smartphone and laptop manufacturers. The model is generally to charge per device or per installation for edge deployments. This isn’t unique to us—several other on-device AI companies are using this approach as well.
Zhaoyang Wang: To wrap up this panel, since you two come from different perspectives within this industry, I’ll give you a chance to ask each other a question. Xiyue, you go first.
Xiyue Xiang: Given the clear trend of more AI capabilities moving from the cloud to the edge, what do you think is the biggest opportunity in 2025?
Zack Li: That’s a great question. I think the biggest opportunity lies in creating a solution that is scalable across a variety of hardware platforms. The key difference between the cloud and hardware ecosystems is that in the cloud, you can use a single toolkit like CUDA, but on edge devices, the hardware landscape is much more fragmented. A single laptop, for example, might have a CPU, GPU, and NPU from different vendors, and it’s not easy to create a scalable solution that efficiently leverages all these components.
Anyone who can solve this problem in a clever way will have a significant advantage. That’s why we’re investing heavily in this area—to ensure AI models can run efficiently across diverse hardware. Right now, only 1 in 10 people might have experience using PyTorch, and maybe only 1 in 100 have used a toolkit to run models on their laptops. I hope that by the end of this year, we’ll see 10 or even 20 out of 100 people running large edge AI models on their devices.
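To make the dispatch problem concrete, a toy backend selector might look like the sketch below. The backend names and probe functions are hypothetical stand-ins, not Nexa's SDK or any vendor's real API.

```python
# One model request, several possible backends (CPU, GPU, NPU) from different
# vendors, each with its own runtime; pick the best one available at runtime.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Backend:
    name: str
    available: Callable[[], bool]   # does the runtime/driver exist on this machine?
    priority: int                   # preference when several backends are present

BACKENDS: List[Backend] = [
    Backend("npu", lambda: False, priority=0),   # e.g. a vendor NPU runtime (stubbed out)
    Backend("gpu", lambda: False, priority=1),   # e.g. CUDA/ROCm/Metal (stubbed out)
    Backend("cpu", lambda: True,  priority=2),   # always present as a fallback
]

def pick_backend() -> Backend:
    candidates = [b for b in BACKENDS if b.available()]
    return min(candidates, key=lambda b: b.priority)

print(f"running on: {pick_backend().name}")
```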
Zhaoyang Wang: Cool. So Zack, what’s your question for Xiyue?
Zack Li: Sure. Xiyue, with the shift you mentioned from cloud to edge, do you think we’ll soon see hardware capable of supporting personalized AI that can fully comprehend and learn on the device itself? If so, when might this happen?
Xiyue Xiang: First of all, I completely agree that there is a strong demand for personalized AI solutions. I would love for my phone to learn my habits and provide tailored recommendations when I’m making decisions. Secondly, for a seamless AI experience, we need AI systems that can remember and adapt based on our personal experiences, as this defines who we are and how we approach problems.
From these two perspectives, the demand is undeniable. Moreover, I’ve noticed that mainstream AI frameworks are beginning to support on-device training. For example, PyTorch, TensorFlow Lite, and ONNX have all started to enable this functionality. Google is actively working on it, and Apple has also begun supporting these efforts. This shows that vendors recognize the strong demand for on-device training and see it as a viable approach.
In my view, it’s very likely that we’ll see mature products in this space roll out within the next couple of years.
Zhaoyang Wang: Thank you all!