"AGI is just another open source away."
Hugging Face Secrets and the Rotating Throne of Open-Source AI
At our March 20th meetup panel, “AGI is just another open source away,” we were joined by Yifeng Yin (former Hugging Face engineer, now stealth startup founder), Junping Du (founder and CEO of Datastrato), and moderator Zhaoyang Wang.
Drawing from first-hand experience inside Hugging Face and deep involvement in the data infrastructure community, the panel unpacked the reality behind the fast-paced evolution of open-source foundation models. Beyond the surface-level model rankings, they explored the dynamics of distribution power, trust in enterprise AI, and the strategic rotations happening in China and beyond.
💡 Key Takeaways:
• Open-source success is cyclical: Models like LLaMA, Qwen, and DeepSeek take turns dominating—it’s a fast-moving rotation, not a winner-takes-all.
• Distribution > product: A strong community and social reach often matter more than having the best model.
• Open source builds trust: Especially for enterprises handling sensitive data, open models win on flexibility, security, and compliance.
• China’s open-source scene is rising fast, with potential for more unexpected players to enter the game.
Below is the full transcript:
Zhaoyang Wang: Do you know what the most popular model on Hugging Face is?
Audience: Qwen.
Zhaoyang Wang: It’s actually DeepSeek R1. Last month, for the first time, DeepSeek R1 received more stars than any other model. That’s a remarkable achievement. Here’s another interesting story. A friend from Hugging Face asked me something last October. He asked if I knew which was the most popular model on Hugging Face last year.
Audience: Mistral?
Zhaoyang Wang: That was my guess too. But it was actually BERT—Google’s BERT. Even with the rise of large language models, BERT remained the most downloaded model on Hugging Face last year. This tells us two things. First, open source has a long history: BERT stays popular even while everyone is talking about newer models. Second, it brings us to the main question I want to discuss today, which I’ll introduce with a phrase about open source. Before we dive in, let me give some context. I borrowed the phrase from several researchers: “AGI is just another open source away.” It’s meant as a tribute to open source, because open source is incredibly important.
Now, maybe you can briefly introduce yourselves to the audience.
Junping Du: I’m Junping Du, founder and CEO of Datastrato. I started the company in 2023, but I’ve been contributing to open source for 14 or 15 years. I was at Hortonworks, then VMware, and later I joined Tencent Cloud, where I worked on cloud data warehouses, BI, and data lakes. After that, I left to start my own company focused on building an open-source data platform for next-generation AI—that’s what we’re working on now.
Yifeng Yin: I’m Yifeng Yin. I used to work at Hugging Face and now I’ve started my own business.
Zhaoyang Wang: Great. The first question I want to ask is about DeepSeek. DeepSeek R1 has been out for about three months, and the initial global hype has started to fade, so I think it’s a good time to talk about it from an insider’s perspective. What makes DeepSeek’s approach to open source different, if anything?
Yifeng Yin: Open source is a very old game; people have been playing it for as long as software could be open-sourced. It's one of the best ways to challenge incumbents. That's what Google did: it open-sourced Android. Boom. You have two players in the field.
Now in the age of AI, some things stay the same, but others have evolved. Open source is still a battlefield of giants—it’s often winner-takes-all. If you look at players like Meta, Alibaba, or even open-source contributors in Bitcoin and foundational models, they all share something in common: they don’t need to make money directly from the open-source models.
This is key. They’re less commercially constrained. But open source is still commercially viable. Just look at how much Meta’s stock went up after launching Llama, or how Alibaba’s stock rose after launching Qwen. They didn’t profit from the models themselves, but from the positive PR, the community engagement, and the talent they attracted. That’s how they sustain and grow.
Now back to DeepSeek—they also don’t need to worry about monetization. They’re backed by a quantitative trading firm. Through open source, DeepSeek managed to establish global distribution overnight without spending a dime on advertising. That’s the brilliance of it—they focused entirely on doing one thing right, and the hype converted into visibility. Their low-cost distribution brought them recognition worldwide. Their parent company now has access to some of the best data, and the value of its assets has grown. What makes DeepSeek different from others? People recognize them as the leader of the current era—just as Qwen dominated the last era, and the cycle continues.
It’s a rotating throne, really. So while the game hasn’t fundamentally changed, DeepSeek plays it well. It’s the same open-source game we’ve seen since the beginning of modern software.
Junping Du: I think it's a game of large competitors joining the foundation model race. But DeepSeek actually started as a smaller startup with some rich companies behind it. If you look at open-source history, many successful companies started small—like Databricks. It began as a modest project but eventually became the industry standard. The same thing happened with Android. It didn’t start at Google, but once Google acquired it, Android ended up dominating all non-iPhone markets. That’s another big success story.
So yes, I believe open-source foundation models will shape the future. I think at least half of the market—or more—will be occupied by them. Whether they come from startups or tech giants, the key lies in how they function and scale.
Open source makes it easy to build ecosystems. When you open source your work, you share knowledge—not just code. You share weights, algorithms, frameworks—everything from tooling to best practices. That’s essential for application building and talent development. Engineers don’t like black-box systems—they prefer tools they can understand, modify, and control.
That’s the biggest difference. Open source isn’t just about transparency; it’s about empowerment. This is just the beginning. We’re nowhere near the end.
Zhaoyang Wang: I also discussed with my friend how companies now use open source differently. Before, second-place companies would open source to challenge leaders. Now, companies use open source strategically to become market leaders themselves, creating free ecosystems as part of their path to dominance.
Junping Du: Maybe I can add to that: the competition among foundation models is really about acquiring users. You have to acquire users first. How do you get one or two million users, or even more? I think only OpenAI has achieved that kind of user base, and you have to find ways to attract that many users.
Zhaoyang Wang: So yeah, the next question is also about open source in this industry. Before DeepSeek, there were many campaigns about "Open AI" versus "Closed AI." If we're talking about AGI, it means everyone will be impacted by decisions made by OpenAI and Sam Altman. That's why people are excited about this new wave of open source - everyone can control their own destiny. What's your take on this new mindset in the industry?
Yifeng Yin: Talking about OpenAI versus closed AI: Hugging Face has deep roots in the open-source community, and this conversation really picked up about two years ago. So, how do you disrupt a closed-source giant? You need two things.
First, think about Meta. Just a couple years ago, it was still branding itself as a metaverse company—remember that uncanny selfie of Zuckerberg in the metaverse? But then Meta launched LLaMA, and suddenly proved to the world that open source can deliver powerful models too. That moved a lot of investor attention—and money—away from OpenAI and toward open-source efforts. It actually hurt OpenAI’s cash flow.
Second, open-sourcing makes it easier to build ecosystems among developers. Customers don't care whether something is closed or open source; they use products, not technology. But developers are different. They care a lot. And many of them have the influence—and the money—to shape markets through their choices. So you can't kick OpenAI out of the market, but open source can capture valuable market segments.
Junping Du: Yeah, it’s like we need both the iPhone and Android. Both ecosystems matter. OpenAI can move fast without caring about many restrictions since they're closed - they don't have various API constraints. Meanwhile, open foundation models build ecosystems and demonstrate security features, becoming more trustworthy to the community and AI users. Many enterprises have internal policies against using OpenAI or third-party APIs. They must deploy their own solutions because they can't send sensitive business data over the internet to vendors. For the long-term question of how your foundation model gains trust, open source models have advantages. That's why even OpenAI has different opinions now compared to a year ago, when Sam Altman prioritized control over performance.
Zhaoyang Wang: Next I'll ask specific questions to each of you personally. First, Yifeng, my question goes to your experience working as a machine learning engineer. What was it like to work at Hugging Face? From the outside world, Hugging Face seems very cute and almost like a non-profit organization, though it's actually a profitable company. What's it like from the inside?
Yifeng Yin: We used a lot of emojis in our daily messaging—which kind of reflects the vibe there. From a traditional commercial perspective, Hugging Face might seem like the last company that would ever make a profit. But it actually does, through a very unique business model.
Hugging Face is essentially the Amazon of models—an e-commerce-style platform. The sellers are those who host models, and the buyers are those who use them. But models, especially open-source ones, are a very unique type of commodity. The sellers don’t charge anything, so there’s no reason to take a cut from transactions—because the transaction cost is zero. And any percentage of zero is still zero. As a community builder, I don’t think it would even be ethical to charge for that. But once you download a model, you often need hosting, deployment, or integration support—and that’s where Hugging Face can help, through its libraries and tools.
So the core business model is: make money through post-download, value-added services—kind of like after-sales support in e-commerce. That’s how Hugging Face becomes profitable. Working there was probably the best experience I've had as an employee. I don't expect to spend much of my life as an employee, but those two years were the best. There isn't much internal competition, and you have the freedom to work the way you like.
What really sets Hugging Face apart is that it moved early. It was already doing this before GitHub even realized, “Oh, we need to do this too.” The whole story of Hugging Face’s success is basically: move fast before the big players notice. By the time they catch on, you’ve already established yourself as the default in the space.
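To make the “post-download, value-added services” point concrete, here is a minimal sketch of the workflow Hugging Face’s libraries support: pulling an open model from the Hub and running it locally. The model ID below is only an example; any Hub model ID works the same way.

```python
# Minimal sketch: fetch an open model from the Hugging Face Hub and run it locally.
# Requires `pip install transformers torch`; the model ID is just an example.
from transformers import pipeline

# The first call downloads the weights from the Hub and caches them on disk;
# later calls reuse the local cache.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

result = generator("Open source matters because", max_new_tokens=40)
print(result[0]["generated_text"])
```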
Zhaoyang Wang: You mentioned that you cannot talk about your stealth startup yet. But how does your experience at Hugging Face influence your next chapter? What are the takeaways?
Yifeng Yin: First of all, remote work works. Why would I want to set up an office in any specific city? It's just extra cost. When you have an office in a city, you're restricting your talent pool to basically 30 miles around that office. But without an office, your talent pool is pretty much anyone with an internet connection. That's huge for us - we need that level of talent. You can expand your talent pool from a 30-mile radius to the entire world.
Second, distribution channels are really important. At Hugging Face, we had one of the most valuable assets: community access. We were right at the center of the open-source ML community. Anytime we wanted to promote something, we could just tweet it, and the entire community would see it. That kind of distribution makes everyone jealous. Even a slightly inferior product can be more popular if it has better distribution channels. So as an entrepreneur, I'll probably spend most of my time building channels rather than just developing deeper tech.
Zhaoyang Wang: Junping, what’s the story behind Datastrato? You’re running a business, but still heavily involved in open source. How are you leveraging it?
Junping Du: Actually, we’re not just leveraging open source—we’re contributing to it. In fact, we’re the main contributors behind the Apache Gravitino project. We created the project, and it became the foundation of our company.
Let me explain a bit about why we started a company like this. Over my 15+ years in the industry, I’ve had two key moments where I witnessed major shifts in cloud technology.
The first was in 2009. At the time, cloud computing was still very new—Amazon S3 was only a few years old, and Google was still pushing App Engine. Many people didn’t believe in cloud back then, including some senior tech leaders. It was still early days: management was very complex, and it was hard to imagine a single company running all of that infrastructure for the world. So, while the idea of cloud made sense in theory, the actual adoption and execution were a different story, due to real-world complexity, differing industry demands, and fragmented infrastructure. Even with the best virtualization technology of the time, adoption was slow.
The second opportunity came when I joined Hortonworks, working with the Hadoop ecosystem. We saw the rise of big data, but again, the transformation didn’t happen inside our company—it was happening at others, like Databricks. When I joined Hortonworks, it was still a small startup. Later on, I left my job to pursue something I believed in—a new trend I wanted to be part of.
That trend is this: data is no longer just consumed in traditional ways, like transactional databases or batch queries. Today, data is everywhere—on IoT devices, across clouds, and in real-time environments. It’s not only being queried—it’s being used for training, decision-making, and even influencing autonomous jobs.
We need a new way to handle this: a “data versioning application” layer that makes data access more rational and consistent—no matter where it’s stored or how it’s accessed. Whether it’s a query engine or a training workload with frameworks like Ray, or agent-based applications, the system should support this more dynamic, agentic style of data interaction.
That’s what we’re building now. And we strongly believe this must be based on an open data architecture. Open data has evolved—from the early Hadoop days, to Spark, and now to AI workloads and competition around foundational models. We see Datastrato as building the next generation of standards and infrastructure for this open data and AI era.
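To illustrate the kind of unified access layer Junping describes, here is a purely illustrative sketch; the class and method names are hypothetical and are not Apache Gravitino’s actual API. The idea is that a SQL engine, a Ray training job, or an agent all resolve data through the same catalog, regardless of where the data physically lives.

```python
# Purely illustrative sketch of a unified catalog layer; names are hypothetical
# and do not correspond to Apache Gravitino's real API.
from dataclasses import dataclass

@dataclass
class TableRef:
    source: str     # e.g. "iceberg", "hive", "mysql"
    location: str   # where the data physically lives
    format: str     # e.g. "parquet", "jdbc"

class UnifiedCatalog:
    """Maps logical names to physical data, wherever it is stored."""

    def __init__(self) -> None:
        self._tables: dict[str, TableRef] = {}

    def register(self, name: str, ref: TableRef) -> None:
        self._tables[name] = ref

    def resolve(self, name: str) -> TableRef:
        # A SQL engine, a Ray training job, or an agent all call the same method,
        # so access stays consistent no matter where the data lives.
        return self._tables[name]

catalog = UnifiedCatalog()
catalog.register("sales.orders", TableRef("iceberg", "s3://lake/sales/orders", "parquet"))
print(catalog.resolve("sales.orders"))
```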
Zhaoyang Wang: You mentioned you don’t want to miss the next big trend—and that this trend is also bringing a new type of business model. Can you talk more about that?
Junping Du: Yes, I think business models will evolve too. We’re building a foundation of open, modular software—things like open standards for catalog services. But beyond that, we’re working on creating a “smart catalog” system that delivers more value to high-end users. The software itself remains open to everyone, but higher-value customers often want intelligent capabilities—more than just raw access. We’re developing metadata models that help make data management smarter. That’s where the new value lies. It’s similar to what OpenAI does: offering different models with different performance levels and pricing. The more advanced the model, the more it costs. I believe software will go in the same direction—smarter software will be priced differently. The same software foundation can offer different tiers or levels of intelligence, and I think that’s what will make the business ecosystem more sustainable and healthy in the long run.
Zhaoyang Wang: So, for the last question—let’s talk about China and the open-source community. What are your thoughts or predictions for what’s coming next?
Yifeng Yin: Probably Llama 4. Like I said earlier, this is a very typical cycle. Every model takes the throne for about three months, then gets knocked off—and maybe comes back a year later. That’s just how the top-tier open-source game works. But there’s more than just general foundation models. I think in every vertical—like spatial intelligence, computer vision, or even text-based models—there’s room for one or two dominant players at a time. So we’ll probably see a rotation of leaders across these domains.
The open-source market is big and diverse enough to allow for multiple winners. It’s not always a winner-takes-all scenario. In the general-purpose open-source model space, my guess—based on what I’ve seen from insiders—is that the next big wave will be Llama 4; after that, maybe Qwen, then DeepSeek; and after DeepSeek, probably Llama again. It just keeps rotating. That’s the reality now—like it or not, that’s how the game is played.
Junping Du: I agree—it’s definitely a competitive cycle. But surprises can always happen. There are still a lot of major players in the industry who haven’t fully joined the race or haven’t been paying enough attention yet. Take DeepSeek as an example. In the beginning, when they released versions like V1 or V2, not many people noticed. But once they launched R1, suddenly everyone knew about them. So I think there’s still a real possibility for newcomers to break through and join the competition.
Zhaoyang Wang: Okay, thanks! That wraps up our panel.