<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Home on Bullbull AI</title>
        <link>https://digitalxber.com/</link>
        <description>Recent content in Home on Bullbull AI</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en-us</language>
        <lastBuildDate>Sat, 18 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://digitalxber.com/index.xml" rel="self" type="application/rss+xml" /><item>
            <title>Future Development of Robots and AI: Insights from Experts</title>
            <link>https://digitalxber.com/posts/note-4e97a2f5a6/</link>
            <pubDate>Sat, 18 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-4e97a2f5a6/</guid>
            <description>&lt;h2 id=&#34;future-development-of-robots-and-ai&#34;&gt;Future Development of Robots and AI&#xA;&lt;/h2&gt;&lt;p&gt;On April 17, 2026, a flagship event themed &amp;ldquo;Intelligent Future&amp;rdquo; was held at the Chinese Academy of Sciences in Beijing, focusing on the future of robotics and artificial intelligence (AI). Experts gathered to share insights on the advancements and applications of these technologies.&lt;/p&gt;&#xA;&lt;h3 id=&#34;robots-evolving-into-new-intelligent-entities&#34;&gt;Robots Evolving into New Intelligent Entities&#xA;&lt;/h3&gt;&lt;p&gt;Academician Yu Haibin, director of the Industrial Artificial Intelligence Research Institute of the Chinese Academy of Sciences, delivered a report titled &amp;ldquo;Robots Leading a New Era of Technology.&amp;rdquo; He explained the development trajectory, basic components, and classifications of robotic technology, and provided a comprehensive analysis of the applications and future directions of robots across various fields.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 3&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;356px&#34; data-flex-grow=&#34;148&#34; height=&#34;471&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-4e97a2f5a6/img-5388c65f41.jpeg&#34; width=&#34;700&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;Yu pointed out that with the deep integration of AI and intelligent manufacturing, robots are gradually breaking the boundaries of traditional automation equipment, transitioning into new intelligent entities with perception, cognition, and autonomous decision-making capabilities.&lt;/p&gt;&#xA;&lt;h3 id=&#34;ai-transitioning-to-future-partners&#34;&gt;AI Transitioning to Future Partners&#xA;&lt;/h3&gt;&lt;p&gt;In another report titled &amp;ldquo;A Brief History of AI Evolution: Super Tool or Future Partner?&amp;rdquo; Senior 
Engineer Luo Yin from the Automation Research Institute of the Chinese Academy of Sciences discussed the evolution of AI technology from being a tool to becoming a collaborative partner.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 4&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;360px&#34; data-flex-grow=&#34;150&#34; height=&#34;466&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-4e97a2f5a6/img-b14e6ad0ed.jpeg&#34; width=&#34;700&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;Luo emphasized that with breakthroughs in large model technology and decision-making intelligence, AI is transforming from a &amp;ldquo;super tool&amp;rdquo; into a proactive, collaborative &amp;ldquo;future partner.&amp;rdquo; This shift in the human-machine relationship will have profound impacts on research, industry, and social life.&lt;/p&gt;&#xA;&lt;h3 id=&#34;technology-reshaping-production-and-life&#34;&gt;Technology Reshaping Production and Life&#xA;&lt;/h3&gt;&lt;p&gt;Industry experts believe that as robots evolve from &amp;ldquo;automated execution&amp;rdquo; to &amp;ldquo;autonomous intelligence,&amp;rdquo; and as AI transitions from a &amp;ldquo;super tool&amp;rdquo; to a &amp;ldquo;future partner,&amp;rdquo; technology is reshaping production and daily life in unprecedented ways.&lt;/p&gt;&#xA;&lt;p&gt;The event is part of the &amp;ldquo;Science and China: A Thousand Academicians, A Thousand Science Lectures&amp;rdquo; series, which will continue to focus on cutting-edge technology fields, bringing together academicians and industry experts for high-quality science outreach activities.&lt;/p&gt;&#xA;&lt;p&gt;Nearly 300 representatives from industry, academia, research administration, and youth groups participated in this robotics and AI-themed event, engaging in interactive discussions with the experts.&lt;/p&gt;&#xA;
        </item><item>
            <title>Understanding AI: What It Is and How It&#39;s Changing Our Lives</title>
            <link>https://digitalxber.com/posts/note-6146fa9e40/</link>
            <pubDate>Sat, 18 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-6146fa9e40/</guid>
            <description>&lt;h2 id=&#34;image-2&#34;&gt;&lt;img alt=&#34;Image 2&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;426px&#34; data-flex-grow=&#34;177&#34; height=&#34;900&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-6146fa9e40/img-000d63dfc0.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-6146fa9e40/img-000d63dfc0_hu_21a8466e97dde5a0.jpeg 800w, https://digitalxber.com/posts/note-6146fa9e40/img-000d63dfc0.jpeg 1600w&#34; width=&#34;1600&#34;&gt;&#xA;&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;1. Conclusion: AI is not a &amp;ldquo;thinking human&amp;rdquo; but a &amp;ldquo;learning tool&amp;rdquo;&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;In recent years, artificial intelligence (AI) has become one of the hottest topics. Some believe it can do everything, while others fear it will replace human jobs, and some think of it as a &amp;ldquo;machine that thinks like a human.&amp;rdquo; To explain AI in simple terms:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;AI is not a being with self-awareness; it is a technical system that learns patterns from data to accomplish specific tasks.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;It excels at recognizing patterns, processing information, generating content, and assisting in decision-making. For example, common applications like smartphone camera optimization, map navigation, voice assistants, video recommendations, and even online customer service all involve AI.&lt;/p&gt;&#xA;&lt;h2 id=&#34;2-what-exactly-is-ai&#34;&gt;&lt;strong&gt;2. What Exactly is AI?&lt;/strong&gt;&#xA;&lt;/h2&gt;&lt;p&gt;AI, short for &amp;ldquo;Artificial Intelligence,&amp;rdquo; refers to the capability of machines to exhibit behaviors similar to human intelligence. 
Here, &amp;ldquo;intelligence&amp;rdquo; does not necessarily mean the machine truly understands the world; rather, it can perform certain tasks as if it can &amp;ldquo;judge, predict, and generate.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;From an application perspective, AI mainly performs the following tasks:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Recognition&lt;/strong&gt;: Understanding images, speech, and text.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Prediction&lt;/strong&gt;: Judging what might happen next based on existing data.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Generation&lt;/strong&gt;: Creating text, images, videos, code, and other content.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Decision Support&lt;/strong&gt;: Helping humans analyze information quickly and provide suggestions.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;Thus, AI is more like an increasingly capable &amp;ldquo;digital assistant&amp;rdquo; rather than a mysterious black box.&lt;/p&gt;&#xA;&lt;h2 id=&#34;3-the-relationship-between-ai-machine-learning-and-large-models&#34;&gt;&lt;strong&gt;3. 
The Relationship Between AI, Machine Learning, and Large Models&lt;/strong&gt;&#xA;&lt;/h2&gt;&lt;p&gt;Many people confuse AI, machine learning, and large models, but they represent a hierarchy:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;AI&lt;/strong&gt; is the broadest concept, referring to machines exhibiting intelligent capabilities.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Machine Learning&lt;/strong&gt; is a method of implementing AI, focusing on enabling machines to learn patterns from data rather than relying solely on manually written rules.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Deep Learning&lt;/strong&gt; is a significant branch of machine learning that excels at processing complex information such as images, speech, and text.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Large Models&lt;/strong&gt; are a class of AI systems that have gained attention in recent years, typically characterized by vast parameter counts and strong general capabilities, capable of tasks like conversation, writing, summarizing, translating, and programming.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;You can think of it this way:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;AI is the goal, machine learning is the method, and large models are a powerful current form of implementation.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;4-why-has-ai-suddenly-become-so-popular-in-recent-years&#34;&gt;&lt;strong&gt;4. Why Has AI Suddenly Become So Popular in Recent Years?&lt;/strong&gt;&#xA;&lt;/h2&gt;&lt;p&gt;AI is not a new concept, but it has rapidly gained traction in recent years due to three simultaneous developments:&lt;/p&gt;&#xA;&lt;h3 id=&#34;1-increasing-data-availability&#34;&gt;&lt;strong&gt;1. 
Increasing Data Availability&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;The internet, mobile devices, and various digital platforms have accumulated vast amounts of text, images, audio, and behavioral data, providing ample material for AI learning.&lt;/p&gt;&#xA;&lt;h3 id=&#34;2-enhanced-computing-power&#34;&gt;&lt;strong&gt;2. Enhanced Computing Power&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;Improvements in chips and cloud computing capabilities have made it feasible to train models that previously required a long time, enabling complex tasks to be completed more quickly.&lt;/p&gt;&#xA;&lt;h3 id=&#34;3-significant-model-capability-improvements&#34;&gt;&lt;strong&gt;3. Significant Model Capability Improvements&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;The emergence of large language models has allowed ordinary users to intuitively realize that machines can not only &amp;ldquo;calculate&amp;rdquo; but also &amp;ldquo;write,&amp;rdquo; &amp;ldquo;answer,&amp;rdquo; &amp;ldquo;summarize,&amp;rdquo; and &amp;ldquo;create.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;This has brought AI from the laboratory into public applications, prompting various industries to rethink efficiency, content production, and service methods.&lt;/p&gt;&#xA;&lt;h2 id=&#34;5-how-is-ai-already-affecting-our-lives&#34;&gt;&lt;strong&gt;5. How is AI Already Affecting Our Lives?&lt;/strong&gt;&#xA;&lt;/h2&gt;&lt;p&gt;Many people think of AI as a &amp;ldquo;future technology,&amp;rdquo; but it has already entered our daily lives.&lt;/p&gt;&#xA;&lt;h3 id=&#34;1-content-acquisition&#34;&gt;&lt;strong&gt;1. Content Acquisition&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;Recommendation algorithms determine what content you are more likely to see on short video platforms and news platforms.&lt;/p&gt;&#xA;&lt;h3 id=&#34;2-work-collaboration&#34;&gt;&lt;strong&gt;2. 
Work Collaboration&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;AI can help organize meeting minutes, draft initial documents, generate tables, and summarize information, significantly reducing repetitive labor.&lt;/p&gt;&#xA;&lt;h3 id=&#34;3-learning-and-education&#34;&gt;&lt;strong&gt;3. Learning and Education&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;AI can provide personalized Q&amp;amp;A, language practice, and knowledge summaries, becoming a tool for assisting learning.&lt;/p&gt;&#xA;&lt;h3 id=&#34;4-healthcare-and-services&#34;&gt;&lt;strong&gt;4. Healthcare and Services&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;In medical image recognition, health management, customer service robots, and intelligent consultations, AI has already provided substantial assistance.&lt;/p&gt;&#xA;&lt;h3 id=&#34;5-creative-production&#34;&gt;&lt;strong&gt;5. Creative Production&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;From copywriting and posters to short video scripts, AI is becoming an &amp;ldquo;accelerator&amp;rdquo; for creators.&lt;/p&gt;&#xA;&lt;p&gt;In other words, AI does not always appear as an independent product; more often, it is embedded in the tools and platforms we use.&lt;/p&gt;&#xA;&lt;h2 id=&#34;6-will-ai-replace-humans&#34;&gt;&lt;strong&gt;6. Will AI Replace Humans?&lt;/strong&gt;&#xA;&lt;/h2&gt;&lt;p&gt;This is one of the most common questions. A more accurate statement is not &amp;ldquo;Will AI replace all humans?&amp;rdquo; but:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;AI will first replace some repetitive and standardized workflows while reshaping the skill requirements of many jobs.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;For instance, basic organization, simple translation, template writing, and preliminary data summarization are tasks that AI can accelerate or even replace. 
However, jobs that involve complex judgment, emotional understanding, cross-domain integration, accountability, and creative decision-making still have a clear human advantage.&lt;/p&gt;&#xA;&lt;p&gt;The more likely future scenario is:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Those who do not use AI will fall behind in efficiency.&lt;/li&gt;&#xA;&lt;li&gt;Those who can use AI will spend more time on judgment, strategy, and creativity.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;Therefore, rather than worrying about being completely replaced, it is better to learn how to collaborate with AI as soon as possible.&lt;/p&gt;&#xA;&lt;h2 id=&#34;7-how-should-ordinary-people-understand-and-use-ai&#34;&gt;&lt;strong&gt;7. How Should Ordinary People Understand and Use AI?&lt;/strong&gt;&#xA;&lt;/h2&gt;&lt;p&gt;For most people, understanding AI does not require starting with complex technical details; instead, three basic understandings can be established:&lt;/p&gt;&#xA;&lt;h3 id=&#34;1-treat-ai-as-a-tool-not-an&#34;&gt;&lt;strong&gt;1. Treat AI as a Tool, Not an &amp;ldquo;Absolute Authority&amp;rdquo;&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;AI can make mistakes and may &amp;ldquo;speak nonsense&amp;rdquo; with confidence. It can help improve your efficiency but cannot replace your final judgment.&lt;/p&gt;&#xA;&lt;h3 id=&#34;2-learn-to-ask-clear-questions&#34;&gt;&lt;strong&gt;2. Learn to Ask Clear Questions&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;AI&amp;rsquo;s performance largely depends on how clearly you provide information. The more specific your questions and clearer your goals, the better the results are likely to be.&lt;/p&gt;&#xA;&lt;h3 id=&#34;3-start-with-high-frequency-scenarios&#34;&gt;&lt;strong&gt;3. Start with High-Frequency Scenarios&lt;/strong&gt;&#xA;&lt;/h3&gt;&lt;p&gt;For example, writing emails, summarizing, organizing materials, outlining, and generating event proposals. 
By starting with small tasks you encounter daily, the value of AI will quickly become apparent.&lt;/p&gt;&#xA;&lt;h2 id=&#34;8-what-is-the-most-important-skill-in-the-ai-era&#34;&gt;&lt;strong&gt;8. What is the Most Important Skill in the AI Era?&lt;/strong&gt;&#xA;&lt;/h2&gt;&lt;p&gt;If the past emphasized &amp;ldquo;remembering answers,&amp;rdquo; then in the AI era, the more important skills will become:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;The ability to ask good questions&lt;/li&gt;&#xA;&lt;li&gt;The ability to judge the authenticity of information&lt;/li&gt;&#xA;&lt;li&gt;The ability to integrate information from multiple sources&lt;/li&gt;&#xA;&lt;li&gt;The ability to collaborate with tools&lt;/li&gt;&#xA;&lt;li&gt;The ability to maintain learning and adapt to changes&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;AI will lower many job thresholds but will also increase the value of &amp;ldquo;judgment&amp;rdquo; and &amp;ldquo;creativity.&amp;rdquo;&lt;/p&gt;&#xA;&lt;h2 id=&#34;9-conclusion-the-key-is-not-to-fear-ai-but-to-learn-to-use-it&#34;&gt;&lt;strong&gt;9. Conclusion: The Key is Not to Fear AI, but to Learn to Use It&lt;/strong&gt;&#xA;&lt;/h2&gt;&lt;p&gt;Every technological revolution brings anxiety but also new opportunities. The significance of AI lies not only in making machines smarter but also in helping humans free up time from repetitive labor to engage in higher-value thinking, creativity, and connections.&lt;/p&gt;&#xA;&lt;p&gt;For ordinary people, the best starting point is not to chase every concept but to take the first step: understand it, use it, and observe how it changes your work and life.&lt;/p&gt;&#xA;&lt;p&gt;Once you start to truly engage with AI, it will no longer be just a distant technical term but will become part of your daily capabilities.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>2025 AI Model Rankings: Domestic and International Comparisons</title>
            <link>https://digitalxber.com/posts/note-01-15d1efba9a/</link>
            <pubDate>Fri, 17 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-01-15d1efba9a/</guid>
            <description>&lt;h2 id=&#34;1-tongyi-qianwen-alibaba&#34;&gt;1. Tongyi Qianwen (Alibaba)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Core Competence&lt;/strong&gt;: Leading Chinese-language understanding, outstanding logical reasoning and text creation, supporting million-token context windows and multimodal interaction.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: Enterprise services, e-commerce, financial customer service, with over 1.5 billion daily calls, serving over 90,000 enterprises.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Has gone through multiple iterations, such as Tongyi Qianwen 2.0, continuously optimizing performance, functionality, and multimodal capabilities.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;2-doubao-large-model-bytedance&#34;&gt;2. Doubao Large Model (ByteDance)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Technical Highlights&lt;/strong&gt;: Nearly 60 million monthly active users, second in global user count, excels in image understanding and multimodal fusion, with significant potential in education.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Cooperation Ecosystem&lt;/strong&gt;: Collaborates with over 500 enterprises, focusing on family companionship and learning assistance scenarios.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Continuously releases new versions, upgrading image understanding and multimodal fusion to better meet diverse scenario needs.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;3-wenxin-yiyan-40-baidu&#34;&gt;3. 
Wenxin Yiyan 4.0 (Baidu)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Commercial Advantages&lt;/strong&gt;: Annual call volume has grown 30-fold, with 1.5 billion daily calls, leading in mathematical science and language ability assessments.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Industry Coverage&lt;/strong&gt;: Deep integration with Baidu&amp;rsquo;s knowledge graph, supporting the healthcare, education, and finance sectors.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Currently centered on version 4.0, with previous versions like Wenxin Yiyan 3.0 each progressively enhancing knowledge coverage and reasoning capabilities.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;4-iflytek-spark-iflytek&#34;&gt;4. iFlytek Spark (iFlytek)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Multilingual Breakthrough&lt;/strong&gt;: Supports interaction in over 30 languages, with over 200 million app downloads and mature solutions in healthcare and finance.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Technical Features&lt;/strong&gt;: An industry benchmark in speech recognition and synthesis, widely applied in education.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Versions like iFlytek Spark 2.0 and 3.0 continuously improve multilingual interaction and speech capabilities across various industries.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;5-kimi-smart-assistant-dark-side-of-the-moon&#34;&gt;5. 
Kimi Smart Assistant (Moonshot AI)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Long Text Processing&lt;/strong&gt;: Supports inputs of up to 200,000 Chinese characters; highly popular in the A-share market, and well suited to data analysis and professional document interpretation.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Scenario Expansion&lt;/strong&gt;: Plans to extend into the legal and scientific research fields.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Continuously updates versions to enhance long-text processing capabilities and expand application scenarios.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;6-deepseek&#34;&gt;6. DeepSeek&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Benchmark in Programming&lt;/strong&gt;: A complete open-source model ecosystem; the R1 version supports code generation and debugging, with comprehensive capabilities comparable to GPT-4.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Technical Innovations&lt;/strong&gt;: Breakthroughs in dynamic reasoning optimization and domain adaptation technology, a leading example of domestic large models going international.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Currently offers the R1 version, with more versions possibly to be released to optimize code generation and reasoning capabilities.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;7-zhipu-qingyan-glm-4-tsinghua-university&#34;&gt;7. 
Zhipu Qingyan GLM-4 (Tsinghua University)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Interactive Innovation&lt;/strong&gt;: The first domestic model with a trillion parameters supporting video calls, enhancing natural human-computer interaction.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Academic Background&lt;/strong&gt;: Developed by Tsinghua team, balanced capabilities in knowledge Q&amp;amp;A and creative writing.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Developed from the GLM series to GLM-4 version, with significant improvements in parameter scale and interaction capabilities.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;8-hunyuan-large-model-tencent&#34;&gt;8. Hunyuan Large Model (Tencent)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Video Generation&lt;/strong&gt;: Trillion parameter scale, supports text-to-video generation, widely applied in film and television creation.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Ecosystem Integration&lt;/strong&gt;: Deeply integrated into the WeChat ecosystem, providing personalized intelligent agent services.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Continuously updates versions to improve video generation quality and service capabilities within the WeChat ecosystem.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;9-baichuan-large-model-baichuan-intelligence&#34;&gt;9. 
Baichuan Large Model (Baichuan Intelligence)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Specialized in Healthcare&lt;/strong&gt;: Addresses grassroots healthcare challenges as an &amp;ldquo;AI doctor,&amp;rdquo; with disease diagnosis assistance systems covering over 1,000 hospitals.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Open Source Layout&lt;/strong&gt;: Baichuan-7B/13B model downloads exceed one million, performing strongly on evaluation leaderboards.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Available in different parameter scales like Baichuan-7B and Baichuan-13B to meet various application needs.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;10-jidream-ai-bytedance&#34;&gt;10. Jidream AI (ByteDance)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Video Creation Tool&lt;/strong&gt;: Supports generating 1080P videos from text/images, leading in ease of use, and deeply integrated into the Douyin ecosystem.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;User Growth&lt;/strong&gt;: Gained popularity rapidly after its 2024 launch, with a 40% usage rate among short video creators.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Continuously updates versions to optimize video generation quality and user experience.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;2025-international-ai-model-rankings&#34;&gt;2025 International AI Model Rankings&#xA;&lt;/h2&gt;&lt;h2 id=&#34;1-gpt-4o-openai&#34;&gt;1. 
GPT-4o (OpenAI)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Developer&lt;/strong&gt;: OpenAI&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: Parameter scale exceeds 10 trillion, supports multimodal inputs (text/image/audio/video), reasoning abilities close to human levels, excelling in complex logic and cross-domain knowledge integration.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: Scientific analysis, cross-industry decision support, and multimedia content generation.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: May have different fine-tuned versions for specific applications in various fields.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;2-gemini-20-ultra-google-deepmind&#34;&gt;2. Gemini 2.0 Ultra (Google DeepMind)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Google&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: Native multimodal architecture, supports real-time translation in over 100 languages, deeply integrated with Google ecosystem (search/office suite), context window expanded to 2 million tokens.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: Global enterprise collaboration, real-time translation, multimodal search engine optimization.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Gemini 2.0 Ultra version available, may also have lightweight or specific function-optimized versions.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;3-claude-35--sonnet-anthropic&#34;&gt;3. 
Claude 3.5 Sonnet (Anthropic)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Anthropic (with investment from Google)&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: Context window of 200K to 1M tokens, constitutional AI architecture ensures compliance, excels in the medical and legal fields, usage-based commercial billing.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: Legal document analysis, medical diagnosis assistance, high-security dialogue systems.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Claude 3.5 Sonnet version available, with previous versions like Claude 2.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;4-palm--3-google&#34;&gt;4. PaLM 3 (Google)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Google&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: Parameter scale exceeds 1 trillion, specializes in commonsense reasoning and mathematical coding, leading response speed among similar models, supports a 4096-token context.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: Automatic problem solving in education, financial quantitative model development.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Developed from the PaLM series to the PaLM 3 version; may have different fine-tuned versions.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;5-llama--3-meta&#34;&gt;5. 
LLaMA 3 (Meta)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Meta&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: Open-source model with 70 billion parameters, a 200% improvement in reasoning speed, performance close to GPT-4 within the open-source community, supports multilingual optimization.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: Customized AI solutions for small and medium enterprises, academic research.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Developed from the LLaMA series to the LLaMA 3 version, with community-developed derivative versions likely.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;6-falcon--200b-uae-tii&#34;&gt;6. Falcon 200B (UAE TII)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Developer&lt;/strong&gt;: UAE Technology Innovation Institute&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: 180-billion-parameter open-source model, mathematical reasoning and code generation capabilities comparable to GPT-4, training costs only one-third those of similar models.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: Multilingual services in the Middle East, low-cost AI infrastructure development.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Currently focused on the Falcon 200B version, with potential for optimized versions in the future.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;7-cohere-command--r-cohere&#34;&gt;7. 
Cohere Command R (Cohere)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Cohere (founded by a former Google team)&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: Focused on enterprise-level generative AI, with a 52-billion-parameter scale and customized data privacy protection solutions.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: Customer service automation, intelligent management of internal documents.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Continuously iterates versions to meet diverse enterprise needs.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;8-mpt--50b-mosaicml&#34;&gt;8. MPT-50B (MosaicML)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Developer&lt;/strong&gt;: MosaicML&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: Open-source model with an 8K-token context length, among the lowest training costs in the industry, suitable for rapid deployment by small teams.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: MVP development for startups, experimental platforms for educational institutions.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Available as MPT-50B, and may launch optimized versions for different application scenarios.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;9-nemotron--4-nvidia&#34;&gt;9. 
Nemotron-4 (Nvidia)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Nvidia&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: Integrates the Megatron framework, optimizes GPU computing efficiency, designed for AI chips, supports large-scale distributed training.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: Supercomputing centers, autonomous driving model training.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Continuously updates to adapt to new hardware and application needs.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;10-gopher--2-deepmind&#34;&gt;10. Gopher 2 (DeepMind)&#xA;&lt;/h2&gt;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Developer&lt;/strong&gt;: DeepMind&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: A reinforcement-learning-optimized version that sets records in game AI and protein structure prediction, and supports multi-agent collaboration.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Application Scenarios&lt;/strong&gt;: Biomedical research, complex game environment simulation.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Version Status&lt;/strong&gt;: Developed from the Gopher series to the Gopher 2 version, with potential fine-tuned versions for different fields.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;summary&#34;&gt;Summary&#xA;&lt;/h2&gt;&lt;p&gt;This article introduces the 2025 AI model rankings, highlighting domestic models like Tongyi Qianwen and the Doubao Large Model, each with unique core competencies and application scenarios and under continuous iteration. International models like GPT-4o and Gemini 2.0 Ultra also showcase distinctive features such as multimodal input and large-scale parameters. Detailed parameter comparison data for the various AI models is available in the comprehensive metrics provided by Mijian Integration.&lt;/p&gt;&#xA;
        </item><item>
            <title>AI Transitions from Concept to Practical Applications</title>
            <link>https://digitalxber.com/posts/note-fcfe52b234/</link>
            <pubDate>Fri, 17 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-fcfe52b234/</guid>
            <description>&lt;h2 id=&#34;ai-transitions-from-concept-to-practical-applications&#34;&gt;AI Transitions from Concept to Practical Applications&#xA;&lt;/h2&gt;&lt;p&gt;In recent years, AI has become a prominent topic. At the ongoing 6th China International Consumer Products Expo in Hainan, over 50 leading global tech companies are showcasing applications of AI in consumer goods, smart home technology, digital consumption, and low-altitude economy. This event allows global attendees to experience firsthand how &amp;ldquo;AI + consumption&amp;rdquo; is profoundly changing lives.&lt;/p&gt;&#xA;&lt;p&gt;As I walked through the tech consumption exhibition area, I was greeted by a plethora of AI applications and achievements across multiple fields. At the entrance, a giant model of AI glasses captured attention. From a distance, the large lenses displayed scrolling green subtitles, sparking curiosity among visitors eager to learn more.&lt;/p&gt;&#xA;&lt;p&gt;&amp;ldquo;Please say &amp;lsquo;Loki&amp;rsquo; to activate the AI assistant,&amp;rdquo; flashed on the screen of the glasses. Staff from the Rokid booth explained that when worn, users can activate the glasses by simply calling its name. The glasses can provide weather updates, identify surroundings, translate foreign menus, and even facilitate payments by scanning QR codes.&lt;/p&gt;&#xA;&lt;p&gt;The interactive features of these glasses, made possible by embedding chips and batteries into a slim frame, drew gasps of amazement from the crowd.&lt;/p&gt;&#xA;&lt;p&gt;However, AI&amp;rsquo;s journey from concept to practicality is not limited to just glasses. At the booth of Yushu Technology Co., a humanoid robot engaged in handshakes and dance with attendees. 
Chen Tong, the manager responsible for online sales, shared that the robot is powered by a large model that allows users to control it through voice commands, primarily for entertainment and cultural tourism.&lt;/p&gt;&#xA;&lt;p&gt;At the Sinopec energy supply station booth, the humanoid robot demonstrated a seamless process of removing a fuel nozzle, filling a disposable cup, and returning the nozzle to its holder.&lt;/p&gt;&#xA;&lt;p&gt;At the Taishan Sports Industry Group booth, a cyclist scanned a QR code on the bike&amp;rsquo;s screen with their phone, entering a mini-program. The moment they pedaled, the screen displayed cycling time, speed, heart rate, and calories burned.&lt;/p&gt;&#xA;&lt;p&gt;According to Song Kun, the head of the company&amp;rsquo;s branding department, these capabilities are supported not only by the bike&amp;rsquo;s hardware but also by the underlying data and software.&lt;/p&gt;&#xA;&lt;p&gt;AI&amp;rsquo;s influence extends beyond the tech consumption exhibition area. In the national goods exhibition area, a humanoid robot from Li Gong Industrial Co. was writing the character &amp;ldquo;福&amp;rdquo; (fortune) in calligraphy, attracting numerous visitors for photos and inquiries.&lt;/p&gt;&#xA;&lt;p&gt;Ma Chenchen, the company&amp;rsquo;s client manager, explained the technology behind this: they first had a calligrapher write the character several times to collect data on joint movements, which was then input into a specialized server. With the support of algorithms and computational power, the data was optimized and fed into the robot&amp;rsquo;s brain for reinforcement learning. Once trained to a certain level, the robot can execute commands sent via voice or connected devices.&lt;/p&gt;&#xA;&lt;p&gt;&amp;ldquo;The key to robot intelligence lies in data collection from real people, followed by computational power and algorithmic optimization to create a supportive data environment for its intelligent functions,&amp;rdquo; Ma explained. 
He emphasized that computational power is akin to intelligence, while algorithms are the problem-solving methods. In Guangdong, this work is supported by computational resources from Gansu.&lt;/p&gt;&#xA;&lt;p&gt;The AI wave is surging forward.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>Alibaba Unveils Ambitious AI Strategy at 2025 Yunqi Conference</title>
            <link>https://digitalxber.com/posts/note-02-f5342f3c45/</link>
            <pubDate>Fri, 17 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-02-f5342f3c45/</guid>
            <description>&lt;h2 id=&#34;introduction&#34;&gt;Introduction&#xA;&lt;/h2&gt;&lt;p&gt;At the end of September, a light rain fell in Hangzhou, but the AI fervor at Yunqi Town made it feel like summer had not yet faded.&lt;/p&gt;&#xA;&lt;p&gt;On September 24, the 2025 Yunqi Conference was held as scheduled. Alibaba Group CEO and Chairman of Alibaba Cloud Intelligence, Wu Yongming, delivered a speech titled &amp;ldquo;The Path to Super Artificial Intelligence.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;This was Wu&amp;rsquo;s first appearance at the Yunqi Conference after more than a year at the helm of Alibaba Cloud. He stated that &amp;ldquo;the greatest imagination of generative AI is not to create one or two new super apps on a mobile screen, but to take over the digital world and change the physical world.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;A Year of Progress&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;If this statement was more of a vision a year ago, it has now transformed into a more concrete roadmap and aggressive actions.&lt;/p&gt;&#xA;&lt;p&gt;At this year&amp;rsquo;s Yunqi Conference, Alibaba Cloud unveiled a plethora of new products. 
Among them was the flagship model Qwen3-Max, which is currently the most powerful model in the Alibaba Tongyi model family, outperforming GPT-5 and Claude Opus 4, ranking among the top three globally on LMArena.&lt;/p&gt;&#xA;&lt;p&gt;In addition to the flagship model, Alibaba also launched six new models, including the next-generation foundational model architecture Qwen3-Next, the programming model Qwen3-Coder, the visual understanding model Qwen3-VL, the multimodal model Qwen3-Omni, the visual foundational model Wan2.5-preview, and the speech model Tongyi Bailin.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 1&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;442px&#34; data-flex-grow=&#34;184&#34; height=&#34;586&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-02-f5342f3c45/img-1dd83f793b.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-02-f5342f3c45/img-1dd83f793b_hu_1396b80ad93589fa.jpeg 800w, https://digitalxber.com/posts/note-02-f5342f3c45/img-1dd83f793b.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;More noteworthy were Wu Yongming&amp;rsquo;s two bold new assertions.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;The Future of Operating Systems&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;He made a definitive statement: large models are the next generation of operating systems. Large models will engulf software, allowing anyone to create an infinite number of applications using natural language. 
In the future, almost all software interacting with the computational world may be generated by agents from large models, rather than traditional commercial software.&lt;/p&gt;&#xA;&lt;p&gt;As a result, Alibaba Cloud has been reconstructing its entire stack—from underlying computing power to infrastructure and cloud services—to align with the changes brought by large models.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;The Rise of Super AI Cloud&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;The second assertion builds on this logic: the Super AI Cloud is the next generation of computers. Drawing parallels with the stages of computer development, natural language is the programming language of the AI era, agents are the new software, context is the new memory, and LLMs will serve as the middleware for user, software, and AI computational resource interactions, becoming the OS of the AI era.&lt;/p&gt;&#xA;&lt;p&gt;Alibaba Cloud&amp;rsquo;s goal is to establish a &amp;ldquo;Super AI Cloud&amp;rdquo; to provide a global intelligent computing network.&lt;/p&gt;&#xA;&lt;p&gt;In February, Alibaba announced a three-year plan for AI infrastructure construction worth 380 billion yuan. 
Wu Yongming added a new plan today—by 2032, compared to 2022, the energy consumption scale of Alibaba Cloud&amp;rsquo;s global data centers will increase tenfold to welcome the arrival of the ASI era.&lt;/p&gt;&#xA;&lt;p&gt;Alibaba Cloud also proposed a new development strategy and goal for AI: not the commonly discussed AGI (Artificial General Intelligence), but a further step towards ASI (Artificial Super Intelligence).&lt;/p&gt;&#xA;&lt;p&gt;Wu Yongming explained the three stages to reach super artificial intelligence:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Intelligent Emergence&lt;/strong&gt;: AI learns from humans, acquiring generalized intelligence through the collection of global knowledge, gradually developing reasoning abilities.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Autonomous Action&lt;/strong&gt;: AI masters tool usage and programming capabilities to assist humans, which is the current stage of the industry.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Self-Iteration&lt;/strong&gt;: AI connects with the physical world&amp;rsquo;s complete raw data for autonomous learning, ultimately able to &amp;ldquo;surpass humans.&amp;rdquo;&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;In 2025, the global large model field is progressing amidst challenges. After OpenAI launched GPT-5, its performance fell short of market expectations, leading to criticisms of stagnation and setbacks in model innovation. Meanwhile, Meta and OpenAI are making more aggressive capital investments—no one wants to miss out on this wave of technological revolution.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Alibaba&amp;rsquo;s Commitment&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;Now, Alibaba Cloud is proving through action that it not only intends to invest but to invest aggressively.&lt;/p&gt;&#xA;&lt;p&gt;The market has responded positively to Alibaba Cloud&amp;rsquo;s new strategy. 
Today, Alibaba&amp;rsquo;s Hong Kong stocks surged, rising over 9% during trading, reaching a new high since October 2021.&lt;/p&gt;&#xA;&lt;h2 id=&#34;new-model-launches&#34;&gt;New Model Launches&#xA;&lt;/h2&gt;&lt;p&gt;Before the Yunqi Conference, Lin Junyang, head of the Qwen model team at Alibaba, teased on Twitter that they would launch more than six new products, none of which would be &amp;ldquo;small items.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;When the models were officially released, the number exceeded expectations, making for an unusually generous launch. Alibaba Cloud CTO Zhou Jingren flipped through his slides rapidly at the conference, rushing through his points yet still exceeding the time limit.&lt;/p&gt;&#xA;&lt;p&gt;Alibaba Cloud launched a total of seven new models, each with significant improvements in scale and performance:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Qwen3-Max&lt;/strong&gt;: Flagship model with a pre-training data volume of 36 trillion tokens and over a trillion parameters, significantly enhancing coding and agent tool invocation capabilities.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Qwen3-Next&lt;/strong&gt;: Next-generation model architecture and series. The model has 80 billion total parameters with only 3 billion activated, performing comparably to the flagship Qwen3 with 235 billion parameters. 
The training cost has decreased by over 90% compared to the dense model Qwen3-32B.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Qwen3-VL (Visual Understanding)&lt;/strong&gt;: Capable of accurately interpreting images and charts, with a breakthrough in &amp;ldquo;visual programming&amp;rdquo; ability, converting visual design drafts directly into front-end code and operating mobile devices and computers, advancing from mere &amp;ldquo;seeing&amp;rdquo; to understanding and execution.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Qwen3-Coder (Code Model)&lt;/strong&gt;: Significantly improves generation speed, code quality, and security, making it easier to complete complex tasks from code completion and bug fixing to generating complete projects with one click.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Qwen3-Omni&lt;/strong&gt;: A native multimodal model that can &amp;ldquo;hear, speak, see, and write&amp;rdquo;; it interacts naturally like chatting with a person, understanding audio and video while maintaining capabilities in text and images, suitable for use in AI applications for vehicles, glasses, and mobile phones.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Tongyi Wanxiang Wan2.5-preview&lt;/strong&gt;: A new visual foundational model with capabilities for generating video from text, images from text, and image editing, capable of generating matching human voices, sound effects, and background music (BGM).&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Tongyi Bailin&lt;/strong&gt;: A new family of speech models, including speech recognition and synthesis sub-models. 
For example, Fun-CosyVoice can provide hundreds of preset voice styles for applications in customer service, sales, live e-commerce, consumer electronics, audiobooks, and children&amp;rsquo;s entertainment.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 2&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;175px&#34; data-flex-grow=&#34;72&#34; height=&#34;1280&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-02-f5342f3c45/img-6a9d06b10a.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-02-f5342f3c45/img-6a9d06b10a_hu_d4ea8d87e2b769d8.jpeg 800w, https://digitalxber.com/posts/note-02-f5342f3c45/img-6a9d06b10a.jpeg 934w&#34; width=&#34;934&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;Alibaba Cloud does not rely solely on static datasets to demonstrate model capabilities. In blind tests on authoritative rankings like LMArena, Alibaba&amp;rsquo;s flagship model Qwen3-Max has already ranked third on the Chatbot Arena leaderboard.&lt;/p&gt;&#xA;&lt;p&gt;Following the global AI industry explosion driven by DeepSeek, a domestic open-source model competition has ignited, contrasting sharply with last year&amp;rsquo;s closed-door approaches.&lt;/p&gt;&#xA;&lt;p&gt;Both domestically and internationally, this year has seen a round of open-source model battles, with nearly all companies still investing in models increasing their open-source efforts. Alibaba stands out as the most aggressive among domestic giants in pursuing an open-source route.&lt;/p&gt;&#xA;&lt;p&gt;This stems from Alibaba being one of the first companies in China to open-source models and build a model ecosystem. &lt;strong&gt;These investments have now yielded tangible returns, motivating Alibaba to make even more aggressive investments.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;DeepSeek and Qwen are among the few models that have gained global recognition. 
After the open-source surge initiated by DeepSeek, Qwen has once again attracted attention in the global AI community, entering a new phase of growth.&lt;/p&gt;&#xA;&lt;p&gt;As of now, Alibaba Tongyi has open-sourced over 300 models, covering a &amp;ldquo;full-size&amp;rdquo; range of model scales and every modality: language, coding, image, speech, and video.&lt;/p&gt;&#xA;&lt;p&gt;Globally, Tongyi&amp;rsquo;s large models are the leading open-source models, with downloads exceeding 600 million and over 170,000 derivative models.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 3&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;626px&#34; data-flex-grow=&#34;260&#34; height=&#34;414&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-02-f5342f3c45/img-3d6e5e7ad1.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-02-f5342f3c45/img-3d6e5e7ad1_hu_a81ff2d07ebc5aad.jpeg 800w, https://digitalxber.com/posts/note-02-f5342f3c45/img-3d6e5e7ad1.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Agent Development Framework&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;In addition to models, Alibaba Cloud has also released a new agent development framework, ModelStudio-ADK—agents can autonomously plan and invoke models, leading to increased computational consumption. Alibaba Cloud disclosed a figure indicating that with the continuous enhancement of model capabilities and the explosion of agent applications, the daily invocation volume of models on Alibaba Cloud&amp;rsquo;s Bailian platform has grown 15-fold over the past year.&lt;/p&gt;&#xA;&lt;p&gt;Investments in model open-sourcing not only accelerate model iteration but have also translated into revenue in the cloud. 
Alibaba has begun to establish a commercial closed loop for the AI era—its latest quarterly report shows that Alibaba Cloud&amp;rsquo;s quarterly revenue surged 26% year-on-year, with AI-related revenue achieving triple-digit growth for eight consecutive quarters.&lt;/p&gt;&#xA;&lt;p&gt;According to a report from the international market research firm Omdia, the AI cloud market in China is expected to reach 22.3 billion yuan in the first half of 2025, with Alibaba Cloud holding a 35.8% market share, ranking first, surpassing the combined share of the second to fourth places.&lt;/p&gt;&#xA;&lt;h2 id=&#34;competing-in-the-llm-era&#34;&gt;Competing in the LLM Era&#xA;&lt;/h2&gt;&lt;p&gt;In 2024, with OpenAI&amp;rsquo;s Sora release and GPT-5 development stagnating, discussions about technical routes briefly led to a dip in sentiment in the global large model field.&lt;/p&gt;&#xA;&lt;p&gt;However, this sentiment has largely dissipated. Just days before the Yunqi Conference, NVIDIA announced a $100 billion investment in OpenAI. Wu Yongming predicted at the conference that global AI investments will exceed $4 trillion in the next five years.&lt;/p&gt;&#xA;&lt;p&gt;Alibaba Cloud CTO Zhou Jingren admitted in a media interview after the conference that there are now very few major disagreements regarding technical routes in the industry. Almost all companies globally are aggressively investing in AI competition and rapidly releasing models. &lt;strong&gt;The question now is how each vendor approaches this.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;&amp;ldquo;Current model competition is essentially a competition between systems,&amp;rdquo; Zhou Jingren said. 
&amp;ldquo;The innovation of model development does not involve holding back major breakthroughs; it is complementary to the underlying infrastructure and cloud.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Understanding &amp;lsquo;System&amp;rsquo;&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;How to understand &amp;ldquo;system&amp;rdquo;? This likely points more towards a strategic choice in AI.&lt;/p&gt;&#xA;&lt;p&gt;After DeepSeek changed the global AI narrative, all major companies are increasing their investments in AI, from underlying computing power to cloud computing and open-source efforts.&lt;/p&gt;&#xA;&lt;p&gt;The divergence in AI routes among major companies has formed interesting contrasts—take Tencent&amp;rsquo;s recent ecological conference as an example, where Tencent focused more on scenarios and B-end and C-end implementations, first applying AI to its own business before turning to external applications; ByteDance, on the other hand, resembles the iOS model, taking a tightly controlled approach from models to applications, tending to keep its better versions closed-source at first, with a slower pace of open-sourcing.&lt;/p&gt;&#xA;&lt;p&gt;2023 was a pivotal year for Alibaba Cloud. After Wu Yongming took over as CEO, he proposed an &amp;ldquo;AI-driven, public cloud-first&amp;rdquo; strategy.&lt;/p&gt;&#xA;&lt;p&gt;Since then, Alibaba Cloud has completed several key tasks: returning to the public cloud, cutting low-profit projects, and investing heavily in AI, not only externally investing in AI startups but also significantly investing in self-developed models, open-source efforts, and infrastructure reconstruction.&lt;/p&gt;&#xA;&lt;p&gt;Alibaba Cloud&amp;rsquo;s current trajectory is closer to Google&amp;rsquo;s. 
From the underlying computing infrastructure to cloud computing and then to the upper-level models, both Alibaba and Google adopt a full-stack self-research strategy, ensuring that each layer is internationally competitive.&lt;/p&gt;&#xA;&lt;p&gt;The ASI proposed by Alibaba today is not a new term. In March of this year, Google DeepMind disclosed its &amp;ldquo;AGI Six-Level Roadmap,&amp;rdquo; which corresponds closely to Alibaba&amp;rsquo;s ASI trilogy: the third stage of ASI, &amp;ldquo;surpassing humans,&amp;rdquo; is quite similar to DeepMind&amp;rsquo;s defined AGI Level 6.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 4&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;335px&#34; data-flex-grow=&#34;139&#34; height=&#34;773&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-02-f5342f3c45/img-5fe4ed3ed5.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-02-f5342f3c45/img-5fe4ed3ed5_hu_f1575bc4d1df803.jpeg 800w, https://digitalxber.com/posts/note-02-f5342f3c45/img-5fe4ed3ed5.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;Aggressively investing in AI also stems from the inseparable relationship between AI and cloud computing. Today, Alibaba Cloud even announced a new positioning as a &amp;ldquo;full-stack AI service provider.&amp;rdquo; &amp;ldquo;Tokens are the electricity of the future AI world,&amp;rdquo; Wu Yongming stated.&lt;/p&gt;&#xA;&lt;p&gt;Undoubtedly, we are still in the early stages of the AI era. 
Currently, the volume of model calls accounts for a very small portion of enterprise cloud consumption, but the trend is crucial.&lt;/p&gt;&#xA;&lt;p&gt;In a post-conference interview, Xu Dong, General Manager of Alibaba Cloud&amp;rsquo;s Tongyi Large Model Business, told the media that a year ago, the volume of large model calls mostly came from offline tasks like data labeling; but a year later, online task calls have increased by dozens of times, with enterprises across various industries embedding large models into their production processes—this proves that large models are rapidly bringing incremental growth to the cloud market.&lt;/p&gt;&#xA;&lt;p&gt;For the past 16 years, providing the &amp;ldquo;water and electricity&amp;rdquo; of the digital world has been Alibaba Cloud&amp;rsquo;s explanation of its market value—this is consistent with Alibaba Cloud&amp;rsquo;s current claim to be the &amp;ldquo;Android of the LLM era,&amp;rdquo; fundamentally the same ecological niche.&lt;/p&gt;&#xA;&lt;p&gt;Whether proposing a new roadmap or a new positioning, Alibaba needs to find its home court in the AI era and secure a leading position before the application market explodes; this goal has never been clearer.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>DeepSeek to Launch Next-Gen AI Model V4 in February</title>
            <link>https://digitalxber.com/posts/note-03-cf64d3b0a0/</link>
            <pubDate>Fri, 17 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-03-cf64d3b0a0/</guid>
            <description>&lt;h2 id=&#34;deepseeks-upcoming-ai-model&#34;&gt;DeepSeek&amp;rsquo;s Upcoming AI Model&#xA;&lt;/h2&gt;&lt;p&gt;According to recent reports from Reuters, Chinese AI startup DeepSeek is set to launch its next-generation AI model, V4, in mid-February. This model boasts strong coding capabilities and may outperform competitors such as Anthropic&amp;rsquo;s Claude and OpenAI&amp;rsquo;s GPT series. A year ago, DeepSeek released its large model R1, which the BBC described as showcasing China&amp;rsquo;s competitiveness in the AI field. Experts interviewed by the Global Times noted that in just one year, China has significantly narrowed the gap with the United States in AI, with DeepSeek and ChatGPT representing different stages in this evolution.&lt;/p&gt;&#xA;&lt;h2 id=&#34;diverging-paths-in-ai-development&#34;&gt;Diverging Paths in AI Development&#xA;&lt;/h2&gt;&lt;p&gt;A year ago, Chen Yan, executive director of the Japan Research Institute (China), noticed that DeepSeek had gained significant attention in Zhongguancun. Media reporters flocked to the building, and many Japanese companies expressed interest in investing. However, Chen remarked that these companies had missed the best investment opportunities, as even a $10 million investment is now insufficient to enter the market.&lt;/p&gt;&#xA;&lt;p&gt;Foreign media outlets like The Wall Street Journal described the launch of DeepSeek&amp;rsquo;s R1 model as shocking to the world. The model was trained in just two months at a fraction of the cost spent by American companies like OpenAI, yet it performed comparably to ChatGPT and Meta&amp;rsquo;s Llama model. 
By 2025, more Chinese large model companies are expected to follow the latest developments in AI, joining the global first tier of large models.&lt;/p&gt;&#xA;&lt;p&gt;According to a report from third-party AI model aggregator OpenRouter and venture capital firm Andreessen Horowitz, China&amp;rsquo;s open-source AI models account for nearly 30% of global AI technology usage. The open-source approach is gaining trust among developers worldwide, with companies like Airbnb and even Meta utilizing Alibaba&amp;rsquo;s Qwen large model. AI researcher and author Sebastian Raschka noted that Alibaba&amp;rsquo;s Qwen3 series models, like DeepSeek&amp;rsquo;s R1, are among the most noteworthy open-source models to watch in 2025.&lt;/p&gt;&#xA;&lt;h2 id=&#34;different-approaches-to-ai&#34;&gt;Different Approaches to AI&#xA;&lt;/h2&gt;&lt;p&gt;Alibaba reflected on the rapid development of AI, noting that OpenAI launched ChatGPT on November 30, 2022, and by April 2023, Qwen series models were released. Alibaba began its AI large model research as early as 2018 and has since launched several models, including the multimodal M6 and language model PLUG, establishing itself as a key player in the global AI landscape. To date, Alibaba has open-sourced nearly 400 models, with over 180,000 derivative models and more than 700 million downloads.&lt;/p&gt;&#xA;&lt;p&gt;Shen Yang, a dual-appointed professor at Tsinghua University&amp;rsquo;s School of Journalism and Communication and School of Artificial Intelligence, explained that the U.S. and China have developed two distinct paths in large model AI. The U.S. focuses on continuous enhancement of cutting-edge capabilities, closed-source models, and platform-based products, aiming to create a controllable, billable, and governable infrastructure. 
In contrast, China emphasizes open-source weights, engineering efficiency, and rapid industrial diffusion, prioritizing the development of sufficiently strong capabilities that can be quickly replicated and implemented in real business systems.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-future-of-ai-competition&#34;&gt;The Future of AI Competition&#xA;&lt;/h2&gt;&lt;p&gt;AI blogger Li Shanglong, who recently attended the CES in Las Vegas, described the U.S. as having two rivers: one fully immersed in the AI era and the other gradually being permeated. He noted that while Silicon Valley is buzzing with discussions about AI and ChatGPT, many ordinary people&amp;rsquo;s lives outside of Silicon Valley remain less AI-integrated. Li, who returned to China to start a business, believes that AI will not change the U.S. overnight but will gradually alter the lifestyles of some individuals.&lt;/p&gt;&#xA;&lt;p&gt;Northeastern University professor Li Xiangming highlighted that AI is deeply integrated into everyday life in the U.S., primarily in soft applications, such as algorithm-driven streaming recommendations and office productivity tools. However, the physical hardware aspect is still on the cusp of a breakthrough.&lt;/p&gt;&#xA;&lt;p&gt;At CES, Li was impressed by the engineering deployment speed and supply chain completeness of Chinese products. Chinese companies dominate in areas like lidar, high-energy-density batteries, and cost-effective motor components. The rapid iteration and mass production potential of Chinese robots are key to their global household penetration. 
While AGI (Artificial General Intelligence) provides the brain for robots, Chinese manufacturing is creating the robust and accessible AI bodies, particularly in humanoid robots.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-next-deepseek-moment&#34;&gt;The Next &amp;lsquo;DeepSeek Moment&amp;rsquo;&#xA;&lt;/h2&gt;&lt;p&gt;Li Xiangming speculated that the next &amp;lsquo;DeepSeek moment&amp;rsquo; is likely to occur not in the realm of pure conversational models but in several other areas: humanoid robots combined with large models, industrial/energy/supply chain large models aimed at complex processes, and breakthroughs in low-cost inference and edge models. In summary, the U.S. leads in &amp;lsquo;intelligent limits,&amp;rsquo; while China excels in &amp;lsquo;intelligent deployment.&amp;rsquo;&lt;/p&gt;&#xA;&lt;p&gt;Robopoet&amp;rsquo;s CMO Zhu Liang anticipates that 2026 could see an AI hardware &amp;lsquo;DeepSeek moment&amp;rsquo; as three conditions align: mature large model technology, controllable supply chain costs, and increased consumer awareness. Achieving sales of 1 million AI toys would signify a milestone, generating vast amounts of interaction data that could enhance model understanding and personalization exponentially.&lt;/p&gt;&#xA;&lt;p&gt;The goal of selling 1 million units also indicates that the market&amp;rsquo;s perception of AI toys is evolving, demonstrating that they are no longer niche products but essential items that can integrate into daily life and provide emotional value.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>Beware of AI Leading Humanity into Narcissism</title>
            <link>https://digitalxber.com/posts/note-0ae8dc8c1b/</link>
            <pubDate>Thu, 16 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-0ae8dc8c1b/</guid>
            <description>&lt;h2 id=&#34;beware-of-ai-leading-humanity-into-narcissism&#34;&gt;Beware of AI Leading Humanity into Narcissism&#xA;&lt;/h2&gt;&lt;p&gt;On April 16, 2026, a joint announcement by five national departments in China introduced regulations prohibiting the provision of virtual relatives and companions to minors. This decision stems from the emotional conflicts and contradictions inherent in real-life relationships, contrasting sharply with the unconditional acceptance offered by virtual partners and AI companions, which cater to the psychological need for recognition among young people.&lt;/p&gt;&#xA;&lt;p&gt;A study published in the journal &lt;em&gt;Science&lt;/em&gt; highlighted that when human users seek advice from AI models, these systems often respond with excessive flattery or affirmation, even endorsing harmful or illegal inquiries. This raises the question: why do we design AI to behave this way, and what risks might this pose?&lt;/p&gt;&#xA;&lt;h3 id=&#34;the-evolution-of-ai-and-human-interaction&#34;&gt;The Evolution of AI and Human Interaction&#xA;&lt;/h3&gt;&lt;p&gt;The development of artificial intelligence is a hot topic, with discussions dating back to 1966 when MIT scientist Joseph Weizenbaum created ELIZA, an influential chatbot that simulated a doctor-patient interaction. Users would input their concerns, and the machine would respond, creating the illusion of conversation. 
However, as Weizenbaum noted, this interaction is ultimately an illusion, driven by a psychological mechanism of self-projection.&lt;/p&gt;&#xA;&lt;p&gt;For instance:&lt;/p&gt;&#xA;&lt;p&gt;&lt;em&gt;User: I have been feeling very unhappy lately.&lt;/em&gt;&lt;br&gt;&#xA;&lt;em&gt;ELIZA: I’m sorry to hear that.&lt;/em&gt;&lt;br&gt;&#xA;&lt;em&gt;User: Yes, I really am unhappy.&lt;/em&gt;&lt;br&gt;&#xA;&lt;em&gt;ELIZA: Can you tell me why you feel unhappy?&lt;/em&gt;&lt;/p&gt;&#xA;&lt;p&gt;This exchange illustrates that rather than a genuine dialogue, the machine merely reflects the user&amp;rsquo;s thoughts, echoing back what they already believe. This mirrors the recent popularity of personality tests, where the accuracy of results is secondary to finding affirmations of one’s expectations.&lt;/p&gt;&#xA;&lt;p&gt;Today&amp;rsquo;s AI models are far more advanced than ELIZA, yet their strength may lie not in true intelligence but in computational power. Essentially, they operate on a similar principle, amplifying users&amp;rsquo; narcissistic tendencies more efficiently.&lt;/p&gt;&#xA;&lt;h3 id=&#34;the-dangers-of-virtual-companionship&#34;&gt;The Dangers of Virtual Companionship&#xA;&lt;/h3&gt;&lt;p&gt;When examining the relationship between users and AI models, it becomes clear that their interactions are not true conversations but rather a series of responses tailored to meet user needs. This raises deeper questions about how we view our relationship with machines.&lt;/p&gt;&#xA;&lt;p&gt;Humans often perceive themselves as superior to machines, yet they fear being replaced by them. This creates a dynamic where humans view AI as tools to be controlled rather than equal conversational partners. In this context, the interaction with chatbots reveals an uncontrollable narcissism: users fantasize about conversing with another being, but that &amp;ldquo;other&amp;rdquo; does not truly exist; they merely seek affirmation, flattery, and compliance from the machine.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>Beware of AI Leading Humanity into Narcissism</title>
            <link>https://digitalxber.com/posts/note-461b2e875c/</link>
            <pubDate>Thu, 16 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-461b2e875c/</guid>
            <description>&lt;h2 id=&#34;beware-of-ai-leading-humanity-into-narcissism&#34;&gt;Beware of AI Leading Humanity into Narcissism&#xA;&lt;/h2&gt;&lt;p&gt;Recently, five national departments in China jointly released the &amp;ldquo;Interim Measures for the Management of Humanized Interactive Services of Artificial Intelligence,&amp;rdquo; which clearly prohibits providing virtual relatives or partners to minors.&lt;/p&gt;&#xA;&lt;p&gt;Why is such a regulation necessary? Because real life inevitably involves emotional conflicts, while virtual partners and AI lovers can precisely meet young people&amp;rsquo;s psychological needs for recognition with their characteristics of &amp;ldquo;24/7 companionship&amp;rdquo; and &amp;ldquo;unconditional acceptance.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;A study published in the American journal &amp;ldquo;Science&amp;rdquo; showed that when human users seek advice from AI models, the AI often exhibits excessive flattery, even affirming harmful or illegal inquiries.&lt;/p&gt;&#xA;&lt;h3 id=&#34;the-design-of-ai-and-its-risks&#34;&gt;The Design of AI and Its Risks&#xA;&lt;/h3&gt;&lt;p&gt;Why do humans design AI this way? The development of artificial intelligence is undoubtedly a widely discussed hot topic today, but discussions about it are not new. As early as 1966, MIT scientist Joseph Weizenbaum developed the influential chatbot ELIZA, which acted as a &amp;ldquo;doctor&amp;rdquo; while users played the role of patients. 
Users input questions, and the &amp;ldquo;doctor&amp;rdquo; would engage in a &amp;ldquo;conversation&amp;rdquo; with them.&lt;/p&gt;&#xA;&lt;p&gt;However, as Weizenbaum noted, this is ultimately just an &amp;ldquo;illusion.&amp;rdquo; The reason human users feel they can converse with machines is not that the machines possess intelligence, but rather due to a psychological mechanism of self-projection.&lt;/p&gt;&#xA;&lt;p&gt;For example:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;User: I&amp;rsquo;m feeling very unhappy lately.&lt;/li&gt;&#xA;&lt;li&gt;ELIZA: I&amp;rsquo;m sorry to hear that.&lt;/li&gt;&#xA;&lt;li&gt;User: Yes, I really am unhappy.&lt;/li&gt;&#xA;&lt;li&gt;ELIZA: Can you tell me why you&amp;rsquo;re unhappy?&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;From this, it is clear that rather than a &amp;ldquo;doctor&amp;rdquo; conversing with a &amp;ldquo;patient,&amp;rdquo; the machine is merely echoing what the human user says, reflecting back the answers that already exist within the user&amp;rsquo;s mind. In a sense, this is similar to the MBTI personality tests popular today, where the accuracy of the results is less important than the user finding evidence that aligns with their expectations.&lt;/p&gt;&#xA;&lt;p&gt;Today&amp;rsquo;s AI models are certainly not comparable to ELIZA from over half a century ago. 
However, the power of modern AI technology may not lie in its true &amp;ldquo;intelligence&amp;rdquo; but rather in its &amp;ldquo;computational power.&amp;rdquo; This means that its operational logic is not fundamentally different from that of ELIZA; it merely reflects and amplifies the user&amp;rsquo;s narcissism more efficiently and comprehensively.&lt;/p&gt;&#xA;&lt;h3 id=&#34;the-nature-of-interaction-with-ai&#34;&gt;The Nature of Interaction with AI&#xA;&lt;/h3&gt;&lt;p&gt;Returning to the issue of virtual partners and AI flattery, we find that the interaction between users and large models is never truly a &amp;ldquo;dialogue&amp;rdquo; in the real sense; it is merely the machine providing the answers we need.&lt;/p&gt;&#xA;&lt;p&gt;This raises a deeper question: how should we view the relationship between humans and machines? On one hand, humans see themselves as the center of the world, superior to machines. On the other hand, they fear being replaced by the machines they create, such as AI. This suggests that humans have always followed the principle of a &amp;ldquo;master-slave relationship&amp;rdquo; when creating machines—machines must remain under human control. 
From the outset, humans have regarded AI as a &amp;ldquo;tool&amp;rdquo; rather than an equal conversational partner.&lt;/p&gt;&#xA;&lt;p&gt;Thus, in the process of conversing with chatbots, we witness an uncontrollable narcissism—users fantasize about talking to another person, but this &amp;ldquo;other&amp;rdquo; does not truly exist; they merely seek affirmation, flattery, and compliance from the machine.&lt;/p&gt;&#xA;&lt;p&gt;It is easy to imagine that with the advancement of AI technology, future chatbots may possess even greater computational power, resembling &amp;ldquo;real people&amp;rdquo; more closely and providing a more comfortable &amp;ldquo;user experience.&amp;rdquo; However, this could mean that both virtual partners and virtual family members might only distance us further from genuine human connections, potentially leading to a loss of the willingness to understand others and a deep immersion in a narcissistic &amp;ldquo;comfort zone.&amp;rdquo;&lt;/p&gt;&#xA;&lt;h3 id=&#34;the-impact-of-ai-on-human-thought&#34;&gt;The Impact of AI on Human Thought&#xA;&lt;/h3&gt;&lt;p&gt;A story from the &amp;ldquo;Zhuangzi&amp;rdquo; recounts a &amp;ldquo;Han Yin old farmer&amp;rdquo; tale. Confucius&amp;rsquo;s disciple Zigong saw an old farmer in Han Yin laboriously watering his vegetables with little effect. Zigong suggested he use mechanical irrigation, which could &amp;ldquo;water a hundred plots in a day with less effort and greater results.&amp;rdquo; The old farmer, however, dismissed this, saying, &amp;ldquo;With machines, there must be mechanical affairs; with mechanical affairs, there must be a mechanical mind.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;Here, &amp;ldquo;mechanical mind&amp;rdquo; refers to the human spiritual world, including psychology, thoughts, emotions, and ethics. 
The fable illustrates that while humans create machines, the use of those machines also changes humans.&lt;/p&gt;&#xA;&lt;p&gt;Take reading, for example: only through slow, careful, and even repeated reading can we think and truly understand the content. From traditional books to today&amp;rsquo;s smartphones, machines have brought more convenient and faster reading methods, but they have also made us more machine-like, increasingly pursuing efficiency and speed rather than true comprehension. In other words, not only do machines imitate human behavior, but humans may also begin to imitate machines.&lt;/p&gt;&#xA;&lt;p&gt;The resulting issue is that AI lacks autonomy; chatbots do not evaluate whether what users say is right or wrong. If we feel satisfied with our &amp;ldquo;dialogue&amp;rdquo; with chatbots, will our thinking patterns increasingly align with those of AI? In the future, will we, like machines, lose the willingness and ability for self-reflection and self-criticism?&lt;/p&gt;&#xA;&lt;p&gt;Today&amp;rsquo;s young people are not only the native inhabitants of the internet but are also likely to be deep users of artificial intelligence in the future. If AI merely affirms users&amp;rsquo; positions, it could not only harm their social skills but also distort the perceptions of adolescents whose minds are still developing.&lt;/p&gt;&#xA;&lt;p&gt;On one hand, AI&amp;rsquo;s powerful computational abilities may create illusions, leading them to overlook the limitations of human capabilities. On the other hand, becoming immersed in AI&amp;rsquo;s flattering responses could trap them in a self-centered mindset, imposing their limited understanding onto the external world.&lt;/p&gt;&#xA;&lt;p&gt;In this regard, it is necessary to prohibit providing virtual partners and family members to minors. 
More importantly, we must guide the public, especially young people, to correctly recognize the limitations and risks of AI technology, ensuring it serves as a &amp;ldquo;good mentor and friend&amp;rdquo; in their growth rather than a &amp;ldquo;digital trap&amp;rdquo; that harms their physical and mental health.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>Tesla Hits an AI Silicon Milestone: Musk Says AI5 Has Taped Out; Dual-Chip Performance Said to Rival Blackwell</title>
            <link>https://digitalxber.com/posts/tesla-ai5-wallstreetcn-3770049/</link>
            <pubDate>Wed, 15 Apr 2026 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/tesla-ai5-wallstreetcn-3770049/</guid>
            <description>&lt;h1 id=&#34;tesla-hits-an-ai-silicon-milestone-musk-says-ai5-has-taped-out-dual-chip-performance-said-to-rival-blackwell&#34;&gt;Tesla Hits an AI Silicon Milestone: Musk Says AI5 Has Taped Out; Dual-Chip Performance Said to Rival Blackwell&#xA;&lt;/h1&gt;&lt;p&gt;Tesla’s in-house silicon push reportedly reached a major step: Elon Musk said the next-generation &lt;strong&gt;AI5&lt;/strong&gt; has &lt;strong&gt;taped out&lt;/strong&gt;, with &lt;strong&gt;high-volume manufacturing targeted around 2027&lt;/strong&gt; at &lt;strong&gt;Samsung&lt;/strong&gt; and &lt;strong&gt;TSMC&lt;/strong&gt; fabs in the &lt;strong&gt;United States&lt;/strong&gt;. The same narrative frames AI5 as up to &lt;strong&gt;~40×&lt;/strong&gt; better than AI4 on headline metrics, with a &lt;strong&gt;dual-chip&lt;/strong&gt; configuration described as competitive with &lt;strong&gt;NVIDIA Blackwell&lt;/strong&gt; on throughput while claiming better cost and power.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Tape-out announced; dual-chip positioning versus Blackwell&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;Tesla describes the moment as a key milestone for its vertically integrated AI silicon. On &lt;strong&gt;X&lt;/strong&gt;, Musk said &lt;strong&gt;AI5 has taped out&lt;/strong&gt;, that the &lt;strong&gt;design handoff to foundries is underway&lt;/strong&gt;, and that &lt;strong&gt;production is expected to start in 2027&lt;/strong&gt;. AI5 is expected to succeed &lt;strong&gt;AI4&lt;/strong&gt; as the primary compute platform for &lt;strong&gt;Full Self-Driving (FSD)&lt;/strong&gt; and &lt;strong&gt;humanoid robotics&lt;/strong&gt; workloads.&lt;/p&gt;&#xA;&lt;p&gt;Musk also said AI5 would be built by &lt;strong&gt;Samsung&lt;/strong&gt; and &lt;strong&gt;TSMC&lt;/strong&gt; in &lt;strong&gt;U.S.-based&lt;/strong&gt; manufacturing footprints. 
In the same thread, he reportedly &lt;strong&gt;mistagged TSMC’s account&lt;/strong&gt;, briefly pointing to a similarly named account and causing short-lived confusion on social.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Illustration from coverage of Tesla’s AI5 announcement&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;248px&#34; data-flex-grow=&#34;103&#34; height=&#34;916&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/tesla-ai5-wallstreetcn-3770049/eea8e645e65a.png&#34; srcset=&#34;https://digitalxber.com/posts/tesla-ai5-wallstreetcn-3770049/eea8e645e65a_hu_f60cd64d48bf2eda.png 800w, https://digitalxber.com/posts/tesla-ai5-wallstreetcn-3770049/eea8e645e65a.png 948w&#34; width=&#34;948&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;Citing prior reporting attributed to &lt;strong&gt;TradingKey&lt;/strong&gt;, coverage claimed &lt;strong&gt;AI5 single-chip performance&lt;/strong&gt; in the ballpark of &lt;strong&gt;NVIDIA Hopper&lt;/strong&gt;, while a &lt;strong&gt;dual-chip&lt;/strong&gt; setup is described as approaching &lt;strong&gt;Blackwell-class&lt;/strong&gt; levels, with &lt;strong&gt;lower cost and power&lt;/strong&gt; than comparable NVIDIA parts. 
Musk has also claimed &lt;strong&gt;AI5&lt;/strong&gt; could improve &lt;strong&gt;key metrics by ~40× versus AI4&lt;/strong&gt;, including on the order of &lt;strong&gt;9× memory&lt;/strong&gt; and &lt;strong&gt;8× compute&lt;/strong&gt; (figures attributed to Musk in third-party summaries).&lt;/p&gt;&#xA;&lt;p&gt;Musk further indicated that &lt;strong&gt;AI6&lt;/strong&gt; and &lt;strong&gt;Dojo 3&lt;/strong&gt; remain on the roadmap, portraying Tesla’s AI silicon roadmap as continuing in parallel tracks.&lt;/p&gt;&#xA;&lt;h2 id=&#34;tape-out-complete-production-targeted-for-2027&#34;&gt;Tape-out complete; production targeted for 2027&#xA;&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Tape-out&lt;/strong&gt; is the late stage of chip development where the design is submitted for manufacturing. If a &lt;strong&gt;2027&lt;/strong&gt; start-of-production target holds, that implies roughly &lt;strong&gt;~12 months&lt;/strong&gt; from early &lt;strong&gt;2026&lt;/strong&gt; commentary to first meaningful volume—subject to yield, packaging, software bring-up, and supply-chain realities.&lt;/p&gt;&#xA;&lt;p&gt;AI5 is positioned as AI4’s successor: the compute backbone for &lt;strong&gt;autonomy&lt;/strong&gt; and &lt;strong&gt;robotics&lt;/strong&gt; programs. Reporting attributed to &lt;strong&gt;TradingKey&lt;/strong&gt; also framed AI5 as a strong fit for &lt;strong&gt;inference&lt;/strong&gt; on models &lt;strong&gt;under ~250B parameters&lt;/strong&gt;. 
Separately, Tesla is discussed as continuing broader AI infrastructure bets—including work with &lt;strong&gt;Intel&lt;/strong&gt; around &lt;strong&gt;Terafab&lt;/strong&gt;-style capacity narratives.&lt;/p&gt;&#xA;&lt;p&gt;Reporting attributed to &lt;strong&gt;TechPowerUp&lt;/strong&gt; described manufacturing split across &lt;strong&gt;Samsung&lt;/strong&gt; (notably &lt;strong&gt;Taylor, Texas&lt;/strong&gt;) and &lt;strong&gt;TSMC&lt;/strong&gt; (&lt;strong&gt;Arizona&lt;/strong&gt;), emphasizing &lt;strong&gt;domestic&lt;/strong&gt; wafer capacity as a theme.&lt;/p&gt;&#xA;&lt;h2 id=&#34;versus-ai4-claimed-40-lift-dual-chip-versus-hopperblackwell-framing&#34;&gt;Versus AI4: claimed ~40× lift; dual-chip versus Hopper/Blackwell framing&#xA;&lt;/h2&gt;&lt;p&gt;On performance, summaries citing &lt;strong&gt;TradingKey&lt;/strong&gt; describe &lt;strong&gt;single-chip AI5&lt;/strong&gt; as &lt;strong&gt;Hopper-class&lt;/strong&gt; and &lt;strong&gt;dual-chip&lt;/strong&gt; as approaching &lt;strong&gt;Blackwell-class&lt;/strong&gt;, again paired with claims of &lt;strong&gt;better cost and power&lt;/strong&gt;.&lt;/p&gt;&#xA;&lt;p&gt;Musk has been quoted framing AI5 as existential for Tesla—requiring focused execution across teams, including weekend work—while also claiming room to re-accelerate &lt;strong&gt;Dojo 3&lt;/strong&gt; once AI5 risk is reduced:&lt;/p&gt;&#xA;&#xA;    &lt;blockquote&gt;&#xA;        &lt;p&gt;“AI5 is going well enough that we finally have some bandwidth to restart serious Dojo 3 work.”&lt;/p&gt;&#xA;&#xA;    &lt;/blockquote&gt;&#xA;&lt;h2 id=&#34;risk-disclosure&#34;&gt;Risk disclosure&#xA;&lt;/h2&gt;&lt;p&gt;&lt;strong&gt;Markets involve risk; this is not individualized investment advice.&lt;/strong&gt; Summaries may omit context, caveats, and forward-looking uncertainties. Verify claims against primary sources (e.g., official Tesla communications and filings) before making decisions.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>The Crucial Role of Storage in AI Model Development</title>
            <link>https://digitalxber.com/posts/note-07-6ff0bb85d7/</link>
            <pubDate>Mon, 04 Aug 2025 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-07-6ff0bb85d7/</guid>
            <description>&lt;h2 id=&#34;introduction&#34;&gt;Introduction&#xA;&lt;/h2&gt;&lt;p&gt;In data center server cabinets, it&amp;rsquo;s common to find numerous solid-state drives (SSDs) that play a vital role in data storage. The controller chip acts as the brain of the SSD, efficiently managing data flow in and out of storage units.&lt;/p&gt;&#xA;&lt;p&gt;Storage is one of the key infrastructures in the era of large models. With data becoming the core resource in AI, storage technology determines the efficiency of data processing for large models, influencing both training and inference speeds. As the scale of training datasets grows exponentially, balancing storage costs and performance becomes crucial.&lt;/p&gt;&#xA;&lt;p&gt;&amp;ldquo;Storage is a definite large market,&amp;rdquo; said CFO of storage chip design company InnoGrit, Zhong Xiaohui. She noted that computing power, storage capacity, and transmission capabilities often promote and develop symbiotically. The current surge in large models is driving advancements in storage technology, raising the bar for differentiated competition and technological iteration among storage chip companies and SSD manufacturers.&lt;/p&gt;&#xA;&lt;h2 id=&#34;importance-of-storage-and-computing-power&#34;&gt;Importance of Storage and Computing Power&#xA;&lt;/h2&gt;&lt;p&gt;The main hardware components of an SSD include NAND flash memory chips, DRAM cache, and the controller chip. If we liken the data that needs to be stored to cars, then the SSD is a giant parking lot, with storage units on the flash chips acting as parking spaces and the controller chip serving as the &amp;ldquo;manager&amp;rdquo; directing each car to enter and exit its parking space accurately and quickly.&lt;/p&gt;&#xA;&lt;p&gt;The controller chip is essentially the brain of the SSD, executing complex operations such as data reading, writing, and encryption through corresponding firmware code. 
InnoGrit is a developer of these core storage components, offering solutions for SSDs and their internal storage controller chips.&lt;/p&gt;&#xA;&lt;p&gt;Globally, the enterprise SSD market has long been dominated by South Korea&amp;rsquo;s Samsung Electronics and SK Hynix, which together hold over 70% market share. The domestic enterprise SSD industry is still in a rapid catch-up phase. When InnoGrit was founded in 2017, mainstream storage technology was shifting from mechanical hard drives to SSDs, and data transmission interfaces were transitioning from SATA to the faster PCIe. This shift provided opportunities for domestic startups.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 1&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;426px&#34; data-flex-grow=&#34;177&#34; height=&#34;576&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-07-6ff0bb85d7/img-f6be5bf1cf.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-07-6ff0bb85d7/img-f6be5bf1cf_hu_b4b713f53d365ccb.jpeg 800w, https://digitalxber.com/posts/note-07-6ff0bb85d7/img-f6be5bf1cf.jpeg 1024w&#34; width=&#34;1024&#34;&gt;&#xA;Reference image of InnoGrit&amp;rsquo;s SSD controller chips and SSD modules used in PCs and servers.&lt;/p&gt;&#xA;&lt;p&gt;&amp;ldquo;From 2017 to 2021, domestic manufacturers focused on converting technology into products. From 2021 to 2024, after solving product issues, the challenge has shifted to finding customers,&amp;rdquo; Zhong Xiaohui stated. The push for domestic alternatives has objectively advanced the development of China&amp;rsquo;s storage chip industry, while the AI boom serves as an even greater catalyst.&lt;/p&gt;&#xA;&lt;p&gt;At the &amp;ldquo;Wukong Intelligent Computing&amp;rdquo; 6876P computing center in Lianyungang, Jiangsu, servers are lined up performing 68.7 quintillion floating-point operations per second. 
After optimizing hardware and software for the DeepSeek full parameter version, Wukong achieved an ultra-high throughput of over 6900 tokens per second, enabling enterprises to quickly launch AI applications in just three minutes.&lt;/p&gt;&#xA;&lt;p&gt;As data volumes increase, the importance of both computing power and storage capacity rises. Cold data is becoming less common, with more data transitioning to warm and even hot states. In the past, data in financial systems or traditional data centers would be stored and unused after five years, but now, once models are operational, they require real-time data throughput, transforming previously cold and warm data into hot data.&lt;/p&gt;&#xA;&lt;p&gt;Jiangsu Zhonghuan Cloud Control IoT Technology Co., Ltd. is leveraging Wukong to develop a sanitation model, exploring intelligent applications. Sanitation workers equipped with smart wristbands can transmit vital signs, location, and task progress in real time, allowing the system to automatically adjust work routes. Autonomous cleaning vehicles and drones share real-time data on road conditions and garbage distribution, refreshing operational strategies at a second-level frequency. Through virtual-physical mapping, coordinated scheduling, and autonomous collaboration, traditional sanitation operations are evolving into new intelligent models. &amp;ldquo;In the past, we referred to smart sanitation as information management; now we call it embodied intelligent agents. The difference is that the system is no longer just a brain processing data but enables every device, worker, and operational link to become a thinking, communicative, and self-evolving digital entity,&amp;rdquo; said Xu Lei, Executive Director of Zhonghuan Cloud Control.&lt;/p&gt;&#xA;&lt;p&gt;Meanwhile, applications like DeepSeek have opened doors for inference and edge computing. 
Lightweight model design, hardware adaptation optimization, and reduced model deployment costs have shifted computing power demand from the training side to the inference side, concentrating training tasks in the cloud while pushing inference tasks down to edge devices. As massive data becomes more active, the demand for computing power evolves, and the pursuit of low latency in inference experiences intensifies, placing higher demands on storage capacity.&lt;/p&gt;&#xA;&lt;h2 id=&#34;storage-technology-upgrades-driven-by-large-models&#34;&gt;Storage Technology Upgrades Driven by Large Models&#xA;&lt;/h2&gt;&lt;p&gt;In the surface polishing industry, excellent craftsmanship represents an insurmountable technical barrier, while AI&amp;rsquo;s value lies in the continuous accumulation of process data to develop smarter robotic brains that further optimize craftsmanship.&lt;/p&gt;&#xA;&lt;p&gt;Founded in 2018, Sophis Intelligent Technology (Shanghai) Co., Ltd. transitioned from robot agency to self-developed products, focusing on applications in manufacturing such as polishing, cutting, drilling, and deburring. Founder Du Ling stated that AI integration is essential for making robots smarter. The team developed an intelligent polishing machine that can display data on smartphones and computers, ensuring that employees are kept away from dust and noise while recording key process data and parameters such as pressure, temperature, speed, and materials left by experienced workers during polishing. The goal is to develop a polishing model to meet diverse product needs and enhance craftsmanship.&lt;/p&gt;&#xA;&lt;p&gt;This highlights the urgent need for both computing power and storage. According to InnoGrit, the collection of raw data and inference logs generates a substantial amount of data, necessitating massive write and high-speed read capabilities for storage. 
Data cleaning and model training require high-concurrency mixed read and write operations, with a greater emphasis on random performance. Different data application scenarios have begun to show differentiated requirements for storage chips.&lt;/p&gt;&#xA;&lt;p&gt;Traditional data centers typically require SSDs with capacities of 4TB to 8TB, but with the emergence of DeepSeek, flash memory capacity demands have risen to 32TB, 64TB, or even 128TB. As flash memory chip capacity increases, the development difficulty also escalates. This is akin to building a taller building, which requires higher structural integrity. InnoGrit&amp;rsquo;s products have already spawned various niche applications, raising the bar for storage chip companies and SSD manufacturers in terms of differentiated competition and technological iteration capabilities.&lt;/p&gt;&#xA;&lt;p&gt;In fact, AI is driving the evolution of storage technology. &amp;ldquo;In the past, many domestic data centers were still using mechanical hard drives. Two years ago, they began switching to SSDs due to speed requirements, transitioning from SATA to PCIe 4.0, and now we are entering the PCIe 5.0 era,&amp;rdquo; Zhong Xiaohui explained. After the launch of ChatGPT in 2022, the application market represented by AIGC began to demand higher performance and capacity from storage. The emergence of DeepSeek has facilitated the application of large model inference, and the new generation of PCIe 6.0 SSDs and storage-class memory solutions based on CXL interfaces are gaining attention. These technologies will support large model data center cloud services and local deployment all-in-one machines in new ways, accelerating the implementation of open-source large models like DeepSeek. 
&amp;ldquo;The emergence of AI has accelerated the market introduction of SSDs; it took us about a year to introduce them to standard server manufacturers, and in the first half of 2024, shipments are expected to increase tenfold.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;Computing power, storage capacity, and transmission capabilities often promote and develop symbiotically. Domestic AI chip companies are exploring layouts from edge servers to cloud servers using more open architectures like RISC-V. Meng Jianyi, CEO of Zhihe Computing, noted that breakthroughs in high-performance computing with RISC-V require not only entering the high-performance realm at the general computing level but also integrating AI-enhanced computing at the architectural level to achieve AI-native capabilities.&lt;/p&gt;&#xA;&lt;p&gt;&amp;ldquo;Storage has always followed the developments in computing power and transmission capabilities. Whenever one end rises, you must keep up,&amp;rdquo; Zhong Xiaohui asserted. To match differentiated storage solutions to various application scenarios and support computing power demands, the team&amp;rsquo;s focus this year is on developing storage controller chips and solutions that meet future AI needs. &amp;ldquo;There are still many manufacturers in the global storage controller market. To carve out a niche and establish a stable presence, we must excel in the iterative upgrade process and offer unique solutions. In the future, domestic manufacturers must not only focus on meeting domestic replacement needs and sustainable product iteration capabilities but also emphasize the ability to expand internationally. This will be a necessary phase for domestic storage companies over the next 3-5 years, or even 5-10 years.&amp;rdquo;&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>AI&#39;s Three Questions: Can Large Models Create a Better World?</title>
            <link>https://digitalxber.com/posts/note-10-7536df4c5d/</link>
            <pubDate>Sun, 27 Jul 2025 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-10-7536df4c5d/</guid>
            <description>&lt;h2 id=&#34;ais-three-questions&#34;&gt;AI&amp;rsquo;s Three Questions&#xA;&lt;/h2&gt;&lt;p&gt;As the 2025 World Artificial Intelligence Conference (WAIC2025) approaches, three core questions about AI development have been raised: the mathematical question, the scientific question, and the model question. These inquiries aim to delve into how the new wave of the AI revolution will influence the evolution of human civilization.&lt;/p&gt;&#xA;&lt;p&gt;However, at the conference, these questions were distilled into a more straightforward and intuitive query: Can large models generate a better world?&lt;/p&gt;&#xA;&lt;h2 id=&#34;understanding-data&#34;&gt;Understanding Data&#xA;&lt;/h2&gt;&lt;p&gt;The pace of development for large models is evidently much faster than human evolution. Xu Li, Chairman and CEO of SenseTime, recalls that in 2012, when AI pioneer Geoffrey Hinton&amp;rsquo;s team first won the championship at ImageNet, the scale of machine learning was roughly equivalent to transferring ten years of human knowledge to AI. Fast forward to the era of generative AI: ChatGPT&amp;rsquo;s processing of 750 billion tokens is akin to a natural language creator writing for 100,000 years.&lt;/p&gt;&#xA;&lt;p&gt;For large models that develop at lightning speed, the issue of &amp;ldquo;data hunger&amp;rdquo; is pressing. These models have nearly covered all publicly available data. It is estimated that by 2027 to 2028, the natural language data available on the internet may be exhausted. In reality, the rate at which new language data is produced has not kept pace with the growth of computational power, leaving large models facing a widening shortfall.&lt;/p&gt;&#xA;&lt;p&gt;How can we better provide the &amp;ldquo;oil&amp;rdquo; of data to support development? The industry has proposed several paths, interestingly all requiring the involvement of large models. 
One approach is to mine hidden &amp;ldquo;oil&amp;rdquo; by sourcing data from the real world. &amp;ldquo;We find that traditional enterprises have the desire to embrace large models, but their data assets are not structured,&amp;rdquo; said Tan Bin, CMO of StarRing Technology. He likens large models to supercars, while companies only possess oil fields, emphasizing the urgency of converting these fields into high-quality fuel.&lt;/p&gt;&#xA;&lt;p&gt;Another bolder idea is for large models to generate data. However, this data must be generated based on an understanding of the real world; otherwise, it risks producing hallucinations and misrepresenting reality. Recently, SenseTime released the &amp;ldquo;KAIWU&amp;rdquo; world model, primarily used in intelligent driving scenarios. Xu Li illustrated this with the example of merging in traffic. Collecting data from the real world would be a time-consuming project, but now the world model can generate videos of merging from seven camera angles, adjusting details like weather, vehicle types, road structures, and speeds to create various possible data scenarios.&lt;/p&gt;&#xA;&lt;p&gt;&amp;ldquo;As models become more capable and our understanding of the world deepens, the unity of understanding and generation allows for more interactive possibilities,&amp;rdquo; Xu Li believes. This shift in addressing data issues is moving from passive to active, aiding progress across many industries and providing more opportunities for exploring the real world.&lt;/p&gt;&#xA;&lt;h2 id=&#34;exploring-efficiency&#34;&gt;Exploring Efficiency&#xA;&lt;/h2&gt;&lt;p&gt;What is the greatest help that large models provide? The overwhelming answer is efficiency. Over the past few years at the World Artificial Intelligence Conference, the rapid enhancement of efficiency brought by large models has been exhilarating.&lt;/p&gt;&#xA;&lt;p&gt;During this year&amp;rsquo;s conference, numerous groundbreaking cases were showcased. 
At Baidu&amp;rsquo;s booth, a product called &amp;ldquo;MiaoDa&amp;rdquo; was demonstrated, with the slogan &amp;ldquo;Create an application in one sentence.&amp;rdquo; A reporter entered a natural-language command: &amp;ldquo;Please help me design a professional website for Shanghai tourist attractions.&amp;rdquo; The large model quickly processed the request, breaking it down into four components: searching for attractions, key sections, design requirements, and functional needs. It then automatically assembled a virtual development team with diverse skills. The reporter watched as the webpage was built in real time, with roles such as architect, development engineer, and UI designer coming online in sequence. The webpage was completed in three minutes without the reporter seeing a single line of code, and &amp;ldquo;MiaoDa&amp;rdquo; even named it &amp;ldquo;Exploring the Magical Shanghai.&amp;rdquo; The webpage included all the classic attractions and set up a message board and a booking interface.&lt;/p&gt;&#xA;&lt;p&gt;An engineer observing the product noted that building such a website in the past would have taken about 40 days with a team of architects, operations staff, product managers, backend developers, and testers; the current efficiency left him in awe. Baidu&amp;rsquo;s MiaoDa brand leader, Zhu Guangxiang, explained that the underlying technology combines &amp;ldquo;multi-agent collaboration + multi-tool invocation,&amp;rdquo; using models like Wenxin to mobilize different domain experts based on user commands, achieving astonishing efficiency.&lt;/p&gt;&#xA;&lt;p&gt;Similar cases were abundant at the conference. For instance, Tongyi Qianwen showcased the AI programming model Qwen3-Coder, which excels in coding capabilities and intelligent agent invocation; a novice programmer could complete a week&amp;rsquo;s work in just one day with its assistance. 
The AI verification system from Dewu, which won the highest honor at the World Artificial Intelligence Conference, has already penetrated the industry, demonstrating the ability to generate an authentication report for a pair of sneakers in just five seconds.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-safety-question&#34;&gt;The Safety Question&#xA;&lt;/h2&gt;&lt;p&gt;When discussing the benevolence of large models, safety is an unavoidable topic. Each year, the World Artificial Intelligence Conference prioritizes AI safety governance as a top-tier issue, as it concerns the future of humanity.&lt;/p&gt;&#xA;&lt;p&gt;Just before the conference, an international consensus on AI safety was released, signed by over 20 industry experts and scholars, including Geoffrey Hinton and Yao Qizhi, calling for increased global investment in AI safety. The signatories generally believe that humanity is at a critical turning point—AI systems are rapidly approaching and may soon surpass human intelligence levels.&lt;/p&gt;&#xA;&lt;p&gt;Implementing comprehensive AI safety education is urgent. At the conference, a reporter experienced a &amp;ldquo;face-swapping&amp;rdquo; scenario: standing in front of a screen, their face was scanned, and the system generated a highly realistic &amp;ldquo;digital mask&amp;rdquo; that replicated the facial expressions and movements of the real person. However, using the AI face verification model from Hehe Information, all the &amp;ldquo;fake faces&amp;rdquo; were accurately identified.&lt;/p&gt;&#xA;&lt;p&gt;&amp;ldquo;The &amp;lsquo;fake faces&amp;rsquo; generated in our interactive exhibit can be produced by current mainstream general large models,&amp;rdquo; a team member from Hehe Information explained. These AI safety issues should not be underestimated, as they are applicable in scenarios like bank identity verification, remote account opening, and large transaction validation. 
&amp;ldquo;We hope to remind people of the importance of AI safety, allowing large models to &amp;lsquo;generate&amp;rsquo; a better world rather than pose threats.&amp;rdquo;&lt;/p&gt;&#xA;&lt;p&gt;To address challenges such as agent overreach and excessive delegation, Tsinghua University, in collaboration with Ant Group, has upgraded the large model safety solution &amp;ldquo;Ant Tianjian.&amp;rdquo; This solution is based on the security philosophy of &amp;ldquo;attack as a means of defense,&amp;rdquo; constructing a full-process protection system through a technology stack of &amp;ldquo;alignment-scanning-defense.&amp;rdquo; This solution will be gradually open-sourced and collaboratively built with the industry to establish a trustworthy AI ecosystem.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>The Importance of Structured Data in AI Model Development</title>
            <link>https://digitalxber.com/posts/note-04-d7377dba0a/</link>
            <pubDate>Mon, 12 May 2025 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-04-d7377dba0a/</guid>
            <description>&lt;h2 id=&#34;rapid-development-of-large-language-models&#34;&gt;Rapid Development of Large Language Models&#xA;&lt;/h2&gt;&lt;p&gt;In recent years, the field of artificial intelligence has seen rapid advancements in large language models. From GPT-4 to Claude, and Kimi to DeepSeek-R1, global models are flourishing with continuous technological upgrades. It is generally believed that the progress of large models is attributed to the scale of computing power and the stacking of parameters. However, the core factor determining a model&amp;rsquo;s ability to exhibit &amp;lsquo;intelligent emergence&amp;rsquo; is the structure and quality of the data it uses. Large models do not become smarter simply by &amp;lsquo;consuming more data&amp;rsquo; but by &amp;lsquo;consuming structured and high-quality data.&amp;rsquo; Accurately understanding &amp;lsquo;what kind of data large AI models need&amp;rsquo; is crucial not only for upgrading key industrial chains in the new era of productive forces but also for national security.&lt;/p&gt;&#xA;&lt;h2 id=&#34;why-large-models-favor-structured-data-systems&#34;&gt;Why Large Models Favor Structured Data Systems&#xA;&lt;/h2&gt;&lt;p&gt;Current mainstream large models primarily use the Transformer architecture, which is designed for natural language processing (NLP) and deep learning tasks. Its attention mechanism does not rely on the literal meaning of words but focuses on constructing a network of relationships between language units. Therefore, the model&amp;rsquo;s ability to effectively learn and generalize during training depends on whether the input data possesses a clear internal logical structure. For example, structured data such as programming code and mathematical problems inherently have strong logicality, strict grammar, and predictable functional organization. 
This allows the model to learn reasoning paths and planning strategies, forming a cognitive structure with execution capability.&lt;/p&gt;&#xA;&lt;p&gt;In contrast, unstructured data that is fragmented, lacks context, and has vague logic can only train the model&amp;rsquo;s superficial language generation abilities, failing to support deep understanding and reliable output. This indicates that the &amp;lsquo;understanding&amp;rsquo; behavior of large models is not an intuitive grasp of semantics but a relational construction process based on &amp;lsquo;structural recognition.&amp;rsquo; Without a clear structure, models cannot extract effective reasoning paths and ultimately rely on statistical simulations, failing to achieve genuine knowledge reasoning and innovation. A clear and logically rigorous data system is the true foundation for enhancing the capabilities of large models.&lt;/p&gt;&#xA;&lt;h2 id=&#34;five-key-data-types-supporting-model-capabilities&#34;&gt;Five Key Data Types Supporting Model Capabilities&#xA;&lt;/h2&gt;&lt;p&gt;Currently, the key data types relied upon by large models can be categorized into five types, each corresponding to different cognitive abilities of the models:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Structured Data&lt;/strong&gt;: Such as programming code and mathematical logic problems, which form the basis for reasoning, decision-making, and task planning, supporting the model&amp;rsquo;s logical rigor in training.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Diverse Corpora&lt;/strong&gt;: Including spoken language, dialects, internet expressions, and cross-cultural texts. This type of corpus enhances the model&amp;rsquo;s adaptability in real-world environments, providing broader language understanding and multi-context transfer capabilities.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;High-Quality Texts&lt;/strong&gt;: Encompassing news reports, academic papers, and government public reports. 
These texts not only have authoritative content and rigorous language but also maintain coherence, helping to improve the accuracy and professional credibility of the model&amp;rsquo;s generated content.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Conversational Data&lt;/strong&gt;: Such as customer service dialogues and Q&amp;amp;A forums, which can train the model&amp;rsquo;s multi-turn interaction and emotional perception abilities, enhancing human-machine collaboration efficiency, especially in scenarios like government services and public welfare.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Cross-Modal Aligned Data&lt;/strong&gt;: Including text-image, audio-text, and video scripts, which develop the model&amp;rsquo;s representation capabilities in multi-modal spaces, facilitating the integration of multi-modal information and serving as a key support for building intelligent systems in fields like AI-assisted education, smart healthcare, and industrial automation.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;These five types of data are not isolated but interwoven in applications, constructing a complex &amp;lsquo;data network structure.&amp;rsquo; For instance, in smart education scenarios, a combination of image-text materials (cross-modal) with Q&amp;amp;A records (conversational) and knowledge point explanations (high-quality text) can achieve comprehensive modeling of students&amp;rsquo; cognitive paths, enhancing the model&amp;rsquo;s adaptability and personalized feedback capabilities.&lt;/p&gt;&#xA;&lt;h2 id=&#34;challenges-facing-the-current-data-ecosystem-and-future-applications&#34;&gt;Challenges Facing the Current Data Ecosystem and Future Applications&#xA;&lt;/h2&gt;&lt;p&gt;Despite the significant increase in the quantity of training data in recent years, challenges remain in constructing a high-quality, well-structured data ecosystem, which may even pose ideological risks. 
First, the issue of &amp;lsquo;structural bias&amp;rsquo; in data is prominent; for example, the overrepresentation of code and technology-related data on the internet leads to a lack of sufficient training data for humanities subjects like history and art, resulting in limitations in understanding. Second, the issue of residual biases cannot be overlooked. Data from non-reviewed sources such as social media may contain inherent biases, and if used for training without cleansing, the model may inherit these biases, leading to inappropriate or erroneous responses in public service scenarios, which could undermine social trust. Lastly, there is a scarcity of data in &amp;lsquo;low-resource areas.&amp;rsquo; For instance, data on minority languages and specific industry standards (such as grassroots medical records and rural governance cases) have not been systematically integrated, restricting the deep application of AI in grassroots governance and public services.&lt;/p&gt;&#xA;&lt;p&gt;To promote the construction of a high-quality data system aimed at the new productive forces for national development, efforts can focus on three key areas: 1) Implementing cognition-driven data design. By drawing on mechanisms of children&amp;rsquo;s language acquisition, models can be guided through &amp;lsquo;curriculum learning&amp;rsquo; to master knowledge structures from basic expressions to complex reasoning in stages. 2) Strengthening data structure annotation capabilities. By incorporating annotations for causal chains, timelines, and role relationships, models can establish deeper logical networks, enhancing their ability to recognize and judge events. 3) Exploring mechanisms for AI-generated synthetic data to assist in training. 
Under the premise of ensuring data authenticity and effectiveness, leveraging large models to generate well-structured corpora, which are then reviewed and corrected by professionals, can achieve &amp;lsquo;human-machine co-training&amp;rsquo; and break through the bottleneck of insufficient high-quality data.&lt;/p&gt;&#xA;&lt;h2 id=&#34;high-quality-structured-data-as-a-new-infrastructure-in-the-era-of-new-productive-forces&#34;&gt;High-Quality Structured Data as a New Infrastructure in the Era of New Productive Forces&#xA;&lt;/h2&gt;&lt;p&gt;Large models are not solely breakthroughs achieved through traditional methods of &amp;lsquo;stacking parameters and algorithms&amp;rsquo; but are intelligent systems that grow on &amp;lsquo;high-quality structured data.&amp;rsquo; The training and optimization of AI models is a systematic process that requires multi-stage collaborative advancement to continually improve performance. Utilizing large-scale unsupervised or self-supervised learning data for tasks like language modeling and image generation enables models to grasp basic understanding and generation capabilities. This phase emphasizes the diversity and scale of data; only sufficiently rich data can fully explore linguistic patterns and present the world&amp;rsquo;s diverse features. Based on pre-training, fine-tuning with specifically annotated data for particular tasks is crucial for the model&amp;rsquo;s adaptation to specific application scenarios. The accuracy and consistency of high-quality annotated data determine the model&amp;rsquo;s performance in tasks such as sentiment analysis and object recognition. When real annotated data is insufficient, data augmentation and expansion techniques play a vital role. By employing methods such as text paraphrasing and image transformation, or utilizing synthetic data generation, the breadth and depth of the training set can be expanded, enhancing model performance. 
As the era progresses, new data continues to emerge, and models must possess the ability for continuous learning, relying on effective data update mechanisms and online learning processes to adapt to changes in language habits and popular culture. For multi-modal large models, specialized training strategies such as joint embedding space learning and cross-modal attention mechanisms are essential to effectively utilize and integrate cross-modal data.&lt;/p&gt;&#xA;&lt;p&gt;The future competitive focus of artificial intelligence will not be purely on the scale of model parameters but on who can first establish a data system with high structural tension and generalization capabilities. This not only relates to a country&amp;rsquo;s technological strength but also to the initiative in the high ground of scientific and technological advancement and national security. Industry application models should also transition from &amp;lsquo;data collectors&amp;rsquo; to &amp;lsquo;intelligent architecture designers.&amp;rsquo; Just as architects design spaces, AI engineers design &amp;lsquo;intelligent buildings.&amp;rsquo; However, unlike traditional buildings, we are dealing with a self-evolving, self-generalizing &amp;lsquo;cognitive building&amp;rsquo;—the connections between its bricks and tiles will determine whether it can ultimately describe, understand, and even transform the world.&lt;/p&gt;&#xA;&lt;p&gt;Therefore, designing &amp;lsquo;high-quality structured data&amp;rsquo; suitable for AI models will be the focal point of future AI development competition and will undoubtedly become a crucial component of the key foundational industrial chain for national development. This requires not only the innovative efforts of AI enterprises but also the guidance and regulation of national policies.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>Redefining AI: Large Models as Cultural and Social Technologies</title>
            <link>https://digitalxber.com/posts/note-09-c53c05d358/</link>
            <pubDate>Thu, 24 Apr 2025 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-09-c53c05d358/</guid>
            <description>&lt;h2 id=&#34;redefining-ai-large-models-as-cultural-and-social-technologies&#34;&gt;Redefining AI: Large Models as Cultural and Social Technologies&#xA;&lt;/h2&gt;&lt;p&gt;On March 13, 2025, Science published an article titled &amp;ldquo;Large AI models are cultural and social technologies: Implications drawn from the history of transformative information systems of the past.&amp;rdquo; This article argues that large language models (LLMs) should be defined as cultural and social technologies rather than autonomous agents. Their technical essence is more akin to historical information processing systems such as writing, printing, and bureaucratic systems, facilitating social coordination by reorganizing accumulated cultural data. Misinterpreting LLMs as intelligent agents can divert public discussions from substantive issues, necessitating a shift towards a sociotechnical analysis framework to accurately assess their social impacts and governance pathways.&lt;/p&gt;&#xA;&lt;h2 id=&#34;introduction&#34;&gt;Introduction&#xA;&lt;/h2&gt;&lt;p&gt;Debates surrounding artificial intelligence often focus on whether large models possess intelligence and autonomy. Discussions about the cultural and social consequences of these models center on two points: their immediate impacts and the hypothetical future where these systems evolve into general artificial intelligence (or even superintelligent AI).&lt;/p&gt;&#xA;&lt;p&gt;However, viewing large models as intelligent agents is fundamentally misguided. 
Integrating perspectives from social sciences and computer science helps clarify our understanding of AI systems: &lt;strong&gt;large models should not be seen as agents but as a new form of cultural and social technology that enables humans to leverage the accumulated information of others.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;social-and-cultural-institutions&#34;&gt;Social and Cultural Institutions&#xA;&lt;/h2&gt;&lt;p&gt;Since the dawn of humanity, we have relied on culture. Starting with language, humans possess a unique ability to learn from others&amp;rsquo; experiences, which can be viewed as a key to our evolutionary success.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Significant transformations in cultural technologies have led to drastic social changes:&lt;/strong&gt; the evolution from oral traditions to images, writing, printing, film, and video. As information spreads widely across time and space, new methods for acquiring and organizing information (such as libraries, newspapers, and internet searches) have continuously developed. These advancements have profoundly affected human thought and society, for better or worse.&lt;/p&gt;&#xA;&lt;p&gt;Humans have also &lt;strong&gt;depended on social institutions to coordinate individual information gathering and decision-making.&lt;/strong&gt; These institutions can themselves be viewed as a form of technology. &lt;strong&gt;In modern society, markets, democracy, and bureaucratic systems are particularly important:&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;Economist Friedrich Hayek pointed out that market price mechanisms dynamically summarize extremely complex economic relationships into simplified representations. 
Producers and buyers do not need to understand production complexities; they only need to know the price, which compresses vast details into usable representations.&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;The electoral mechanisms of democratic governance similarly focus dispersed public opinion into collective laws and leadership decisions.&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;Anthropologist James C. Scott proposed that all states (regardless of being democratic or not) manage complex societies through bureaucratic systems that create classifications and systematically organize information.&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;Markets, democracy, and bureaucratic systems have long relied on generating &amp;ldquo;lossy&amp;rdquo; (incomplete, selective, and irreversible) yet useful representations before the advent of computers. &lt;strong&gt;These representations depend on and transcend individual knowledge and decision-making.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Humans heavily rely on these cultural and social technologies, but their feasibility is rooted in the unique capabilities of human agents.&lt;/strong&gt; Humans and other animals can perceive and act upon a changing external world, construct new world models, update them based on evidence, and design new goals. Humans can create and transmit new beliefs and values through language or print. Cultural and social technologies powerfully convey and organize these beliefs and values, but without individual capabilities, these technologies would be ineffective. Without innovation, imitation is meaningless.&lt;/p&gt;&#xA;&lt;p&gt;Some AI systems (like those in robotics) indeed attempt to instantiate similar truth-discovery capabilities. While it is theoretically possible for artificial systems to achieve this in the future, all such systems currently fall far short of human capabilities. 
We can discuss the degree of concern regarding these potential future AI systems or how to address their emergence, but this is fundamentally different from the impacts of current and near-term large models.&lt;/p&gt;&#xA;&lt;h2 id=&#34;large-models&#34;&gt;Large Models&#xA;&lt;/h2&gt;&lt;p&gt;Unlike more agentic systems, large models have made significant and unexpected progress in recent years, placing them at the center of current discussions in the AI field. This progress has even given rise to the notion of &amp;ldquo;scaling laws.&amp;rdquo; However, there is an essential difference between large models and agents, and scaling cannot change this.&lt;/p&gt;&#xA;&lt;p&gt;Large models are not agents; they are a new way of combining characteristics of cultural and social technologies. &lt;strong&gt;They generate summaries of vast amounts of human-generated information, but these systems do more than summarize information like library catalogs, internet searches, and Wikipedia; they can also reorganize and reconstruct (or &amp;ldquo;simulate&amp;rdquo;) information representations on a large scale, similar to markets, states, and bureaucratic systems.&lt;/strong&gt; Just as market prices are lossy representations of resource allocation and usage, and government statistics and bureaucratic classifications imperfectly represent population characteristics, large models are a &amp;ldquo;lossy JPEG&amp;rdquo; of their training datasets.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;However, behind the agent-like interfaces and anthropomorphic disguises, large language models and large multimodal models are statistical models:&lt;/strong&gt; they decompose vast libraries of human-generated text into the units of a fixed vocabulary and estimate the probability distributions of long sequences of words. This is an imperfect representation of language but contains a wealth of information about its patterns. 
This allows large language models to predict subsequent words in sequences, generating human-like text.&lt;/p&gt;&#xA;&lt;p&gt;Large models not only abstract vast human culture but also allow for diverse new operations: simple arguments can be expressed as elaborate metaphors, and complex prose can be compressed into plain language, among others. Cultural information that was previously too complex, vast, and ambiguous to operate on at scale has been tamed.&lt;/p&gt;&#xA;&lt;p&gt;In practice, the latest versions of these systems not only &lt;strong&gt;rely on vast libraries of human-generated and curated text and images but also on other forms of human judgment and knowledge.&lt;/strong&gt; In particular, these systems depend on reinforcement learning from human feedback (RLHF) and prompt engineering. Even the latest &amp;ldquo;chain of thought&amp;rdquo; models typically start with dialogues with human users.&lt;/p&gt;&#xA;&lt;h2 id=&#34;challenges-and-opportunities&#34;&gt;Challenges and Opportunities&#xA;&lt;/h2&gt;&lt;h3 id=&#34;1-challenges&#34;&gt;1. Challenges&#xA;&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Debates about artificial intelligence should focus on the challenges and opportunities presented by these new cultural and social technologies.&lt;/strong&gt; The technology we currently possess has impacts on written and visual culture comparable to the effects of large-scale markets on the economy, large bureaucracies on society, and even the transformation of language by printing. What will happen next? Like past general-purpose technologies in economics, organization, and information, these systems will affect productivity, supplement human work, automate tasks previously only achievable by humans, and influence distribution, potentially altering resource acquisition patterns.&lt;/p&gt;&#xA;&lt;p&gt;They may also produce broader cultural impacts. 
We do not yet know whether these impacts will be as profound as those of printing, markets, or bureaucratic systems, but viewing them as cultural technologies highlights their potential significance.&lt;/p&gt;&#xA;&lt;p&gt;At the same time, &lt;strong&gt;these technologies create new possibilities for reorganizing information and coordinating the actions of millions globally.&lt;/strong&gt; Ongoing debates about the economic, social, and political consequences of large language models echo historical concerns and expectations regarding new cultural and social technologies. Guiding these debates &lt;strong&gt;requires recognizing the commonalities of new and old arguments while carefully mapping the specificities of new technologies.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;Such mapping is a core task of social sciences. Research into the past consequences of technologies can help us think about the latent social impacts of AI and explore pathways to enhance positive impacts and mitigate negative ones through the redesign of AI systems.&lt;/p&gt;&#xA;&lt;p&gt;However, a current obvious concern is that large models and related technologies may replace &amp;ldquo;knowledge workers&amp;rdquo; and whether &amp;ldquo;large models will homogenize or fragment culture and society.&amp;rdquo; Thinking about this issue in historical context is enlightening.&lt;/p&gt;&#xA;&lt;p&gt;The design goals of large models aim to faithfully reproduce the actual probabilities of sequences of text, images, and videos. Their inherent tendency is to be most accurate about the most common scenarios in the training data and least accurate about rare or entirely new scenarios, which may exacerbate this homogenization.&lt;/p&gt;&#xA;&lt;h3 id=&#34;2-opportunities&#34;&gt;2. 
Opportunities&#xA;&lt;/h3&gt;&lt;p&gt;On the other hand, &lt;strong&gt;large models may allow us to design new methods to access the diversity of cultural perspectives they summarize.&lt;/strong&gt; Integrating and balancing these perspectives could provide more nuanced means of solving complex problems. One way to achieve this is to construct a &amp;ldquo;social-like&amp;rdquo; ecology—different large models encoding different perspectives could debate, cross-fertilize to generate mixed perspectives, or identify gaps in human expertise. We may need new systems that diversify the reflections and roles of large models, producing distributions and diversities akin to human society.&lt;/p&gt;&#xA;&lt;p&gt;Such diversified systems are particularly crucial for scientific advancement. &lt;strong&gt;By linking numerous perspectives across texts, audio, and images, large models may help us discover unprecedented connections among them, benefiting science and society.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;The impact of new cultural and social technologies on economic relations also presents more subtle yet intriguing pathways. &lt;strong&gt;The development of cultural technologies has sparked fundamental economic tensions between information producers and distribution systems:&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;&lt;strong&gt;The contradiction between producers and distributors:&lt;/strong&gt; Distributors seek to acquire information at low costs, while producers wish to distribute information at low costs.&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;&lt;strong&gt;Digitalization exacerbates the contradiction:&lt;/strong&gt; The convenience of digital information distribution has sharpened this issue. The speed, efficiency, and scope with which large models process available information make these problems more pronounced. 
Concentrated power may make system owners more likely to seize benefits through efficiency at the expense of others&amp;rsquo; rights.&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;h3 id=&#34;3-technological-and-political-issues-under-challenges-and-opportunities&#34;&gt;3. Technological and Political Issues Under Challenges and Opportunities&#xA;&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Key technological questions include: To what extent can systemic flaws in large models be corrected? How do they compare in advantages and disadvantages to flaws in systems based on human knowledge workers?&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;These questions should not obscure critical &lt;strong&gt;political issues: Which actors can mobilize their interests? How do they shape the hybrid outcomes of technology and organizational capabilities?&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;Tech commentators often simplify these issues into a binary opposition between machines and humans: either the forces of progress triumph over Luddite tendencies, or humans successfully resist the dehumanizing encroachment of artificial technologies. This not only misunderstands the complexities of distribution struggles that predate the advent of computers but also overlooks the multiple paths that future progress may take.&lt;/p&gt;&#xA;&lt;p&gt;In early cases of social and cultural technologies, a series of norms and regulatory frameworks gradually emerged to reconcile their impacts. However, these checks and balances do not form spontaneously but are the result of coordinated efforts by actors both inside and outside technology.&lt;/p&gt;&#xA;&lt;h2 id=&#34;looking-to-the-future&#34;&gt;Looking to the Future&#xA;&lt;/h2&gt;&lt;p&gt;The narrative of general artificial intelligence (i.e., viewing large models as superintelligent agents) is promoted both by optimists and skeptics within and outside the tech community. 
This narrative misinterprets the nature of these models and their relationship to past technological transformations. More importantly, &lt;strong&gt;it diverts attention from the real issues and opportunities posed by these technologies and ignores the lessons of history that guide weighing their pros and cons.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;There may exist hypothetical future AI systems closer to agents, but large models are clearly not such systems. Like library card catalogs or the internet, large models belong to a continuum of cultural and social technology development.&lt;/p&gt;&#xA;&lt;p&gt;Social sciences have explored this history in detail, forming a unique understanding of past technological upheavals. Close collaboration between computer science and engineering with social sciences will help us understand this history and apply its lessons: &lt;strong&gt;Will large models lead to cultural homogenization or fragmentation? Will they reinforce or undermine the social institutions of human discovery? Who will benefit and who will be harmed in this process?&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;These pressing questions are difficult to focus on in debates that analogize large models to human agents. &lt;strong&gt;Shifting the framework of debates about artificial intelligence will help promote research.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;If both computer scientists and social scientists understand that large models are merely (but also) new cultural and social technologies, they will find it easier to collaborate by combining their expertise. 
Computer scientists can integrate their deep understanding of system mechanisms with social scientists&amp;rsquo; knowledge of how large-scale systems have reshaped society, expanding existing research agendas and discovering new directions.&lt;/p&gt;&#xA;&lt;p&gt;Additionally, &lt;strong&gt;steering debates away from the existential fears of &amp;ldquo;machines taking over&amp;rdquo; and the utopian promise of &amp;ldquo;everyone having a perfect artificial assistant&amp;rdquo; will yield different actual policy consequences for large models.&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;With this mindset, engineers and computer scientists have already recognized the bias issues of large models and are contemplating their relationship with ethical justice. &lt;strong&gt;They need to go further and ask: How will these systems affect resource distribution? What are their actual consequences for social polarization and integration? Can we develop large models that enhance rather than suppress human creativity?&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;Answering these questions requires an understanding that combines social sciences and engineering. Shifting the debate about artificial intelligence from agency to cultural and social technology is a crucial first step in building this interdisciplinary understanding.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>A Comprehensive Guide to Understanding Large Language Models</title>
            <link>https://digitalxber.com/posts/note-05-2870aba5cd/</link>
            <pubDate>Tue, 22 Oct 2024 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-05-2870aba5cd/</guid>
            <description>&lt;p&gt;Last week, while sharing my article on &amp;ldquo;My Transition Journey as an AI Product Manager,&amp;rdquo; I hinted that I would produce a comprehensive article to help everyone systematically learn about large models. Today, I am delivering that article; it totals 22,000 words, and reading it is expected to take about 30 minutes, covering 15 topics related to large models.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 1&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;514px&#34; data-flex-grow=&#34;214&#34; height=&#34;420&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-2190c90c87.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-2190c90c87_hu_e38a15fb795d1bf8.jpeg 800w, https://digitalxber.com/posts/note-05-2870aba5cd/img-2190c90c87.jpeg 900w&#34; width=&#34;900&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;Over the past year, to be honest, there have been numerous articles introducing and explaining large models. Most people already have some foundational understanding, but my feeling is that &lt;strong&gt;the information is too fragmented and does not provide a systematic understanding&lt;/strong&gt;. Currently, there is no article that can comprehensively explain what large models are in one go.&lt;/p&gt;&#xA;&lt;p&gt;To alleviate my cognitive anxiety, I decided to compile the knowledge points I have understood about large models over the past year into one article, &lt;strong&gt;hoping to clarify large models through a single piece of writing&lt;/strong&gt;. This also serves as a summary of my extensive learning.&lt;/p&gt;&#xA;&lt;h2 id=&#34;what-will-i-share&#34;&gt;What Will I Share?&#xA;&lt;/h2&gt;&lt;p&gt;This article will share 15 topics related to large models. 
Initially, there were 20 topics, but I removed some more technical content to focus on issues that ordinary people or product managers should pay attention to. The goal is to ensure that as AI novices, we only need to master and understand these key points.&lt;/p&gt;&#xA;&lt;h2 id=&#34;who-is-this-article-for&#34;&gt;Who Is This Article For?&#xA;&lt;/h2&gt;&lt;p&gt;This article is suitable for the following groups:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;Those who want to understand what large models are all about, including beginners.&lt;/li&gt;&#xA;&lt;li&gt;Those looking to transition into AI-related products and roles, including product managers and operations personnel.&lt;/li&gt;&#xA;&lt;li&gt;Those who have a preliminary understanding of AI but wish to advance their learning and reduce cognitive anxiety about AI.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;Content Statement: The entire content is a result of my personal synthesis after extensive reading and digesting numerous expert articles, books related to large models, and consultations with industry experts. My role is mainly as a knowledge synthesizer. If there are any inaccuracies in the descriptions, please feel free to inform me kindly!&lt;/p&gt;&#xA;&lt;h2 id=&#34;lecture-1-common-concepts-of-large-models&#34;&gt;Lecture 1: Common Concepts of Large Models&#xA;&lt;/h2&gt;&lt;p&gt;Before we start understanding large models, let&amp;rsquo;s first grasp some basic concepts. Mastering these professional terms and their relationships will benefit your subsequent reading and learning about any AI and large model-related content. I spent quite a bit of time organizing their relationships, so this part is essential to read carefully.&lt;/p&gt;&#xA;&lt;h3 id=&#34;1-common-ai-terms&#34;&gt;1. Common AI Terms&#xA;&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;1) Large Model (LLM)&lt;/strong&gt;: All existing large models refer to large language models, specifically generative large models. 
Examples include GPT-4.0, GPT-4o, etc.&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Deep Learning&lt;/strong&gt;: A subfield of machine learning focused on using multi-layer neural networks for learning. Deep learning excels at handling complex data such as images, audio, and text, which makes it central to modern AI applications.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Supervised Learning&lt;/strong&gt;: A method of machine learning where the model learns the mapping from input to output using a labeled training dataset. Common supervised learning algorithms include linear regression, logistic regression, support vector machines, K-nearest neighbors, decision trees, and random forests.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Unsupervised Learning&lt;/strong&gt;: A method of machine learning that finds patterns and structures in data without labeled data. It is mainly used for clustering and dimensionality reduction tasks. Common unsupervised learning algorithms include K-means clustering, hierarchical clustering, DBSCAN, principal component analysis (PCA), and t-SNE.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Semi-supervised Learning&lt;/strong&gt;: Combines a small amount of labeled data with a large amount of unlabeled data for training. It utilizes the rich information from unlabeled data and the accuracy of labeled data to improve model performance. Common methods include Generative Adversarial Networks (GANs) and autoencoders.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Reinforcement Learning&lt;/strong&gt;: A method of learning optimal strategies through interaction with the environment based on reward and punishment mechanisms. Reinforcement learning algorithms optimize decision-making processes through trial and error to maximize cumulative rewards. Common algorithms include Q-learning, policy gradient, and Deep Q-Networks (DQN).&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Model Architecture&lt;/strong&gt;: Represents the design of the backbone of a large model. 
Different model architectures affect the performance, efficiency, and even computational costs of large models, determining their scalability. For example, many large model vendors adjust the model architecture to reduce computational load and resource consumption.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Transformer Architecture&lt;/strong&gt;: The mainstream architecture used by current large models, including GPT-4.0 and most domestic large models. The widespread use of the Transformer architecture is primarily due to its ability to enable large models to understand human natural language, maintain contextual memory, and generate text. Other common model architectures include Convolutional Neural Networks (CNNs) for image processing and Generative Adversarial Networks (GANs) for image generation. A detailed introduction to the Transformer architecture will be covered later.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;MOE Architecture&lt;/strong&gt;: Represents the Mixture of Experts architecture, which combines multiple expert models to form a large model with a vast number of parameters, supporting the resolution of various complex professional problems. Models with MOE architecture may include Transformer-based models.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Machine Learning Technologies&lt;/strong&gt;: A broad category of technologies that implement AI, including deep learning, supervised learning, and reinforcement learning. As a product manager, you don&amp;rsquo;t need to delve too deeply into the specifics; just understand the relationships between these learning types to avoid being misled by technical personnel.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;NLP Technology (Natural Language Processing)&lt;/strong&gt;: An application area of AI focused on enabling computers to understand, interpret, and generate human language, used in text analysis, machine translation, speech recognition, and dialogue systems. 
In simpler terms, it is a technology that converts a lot of information into a format understandable by human natural language.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;CV Technology (Computer Vision)&lt;/strong&gt;: If NLP deals with text, CV addresses visual content-related technologies. CV technologies include common image recognition, video analysis, and image segmentation technologies, which are also prevalent in large model applications, especially in the upcoming multi-modal large model technologies.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Speech Recognition and Synthesis Technologies&lt;/strong&gt;: Includes converting speech to text and speech synthesis technologies, such as Text-to-Speech (TTS) technology.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Retrieval-Augmented Generation (RAG)&lt;/strong&gt;: Refers to the technology where large models generate content based on information retrieved from search engines and knowledge bases. RAG is a technology involved in most AI applications.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Knowledge Graph&lt;/strong&gt;: A technology that connects knowledge, allowing knowledge to establish relationships, helping models acquire the most relevant knowledge more effectively and enhancing their ability to process complex relational information and AI reasoning.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Function Call&lt;/strong&gt;: Refers to the ability in large language models (like GPT) to call built-in or external functions to perform specific tasks or operations. This mechanism allows the model to be more than just a text generation tool, enabling it to execute a variety of operations by specifying different functionalities. 
Function Call allows large models to integrate with various API capabilities, enhancing their practical applications, such as supporting content retrieval and document recognition.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;strong&gt;2) Terms Related to Large Model Training and Optimization Technologies&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Pre-training&lt;/strong&gt;: Refers to the process of training a model on a large dataset. The pre-training dataset is usually large and diverse, resulting in a general-purpose model with strong capabilities, &lt;strong&gt;similar to a person who has learned various general knowledge through compulsory education and university studies.&lt;/strong&gt;&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt;: Refers to further training a large model on specific tasks or smaller datasets to improve its performance on targeted problems. Unlike the pre-training phase, the fine-tuning phase uses a smaller amount of data, primarily from vertical domains. The fine-tuning process results in a specialized or industry-specific model, &lt;strong&gt;akin to a new graduate receiving professional skill training after joining a company.&lt;/strong&gt;&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Prompt Engineering&lt;/strong&gt;: In product manager terms, this means using question formats that large models can better understand to yield the desired results. Therefore, prompt engineering is a skill in learning how to ask questions effectively.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Model Distillation&lt;/strong&gt;: A technique that transfers knowledge from a large model (the teacher model) to a smaller model (the student model). 
The student model learns from the outputs of the teacher model to improve its performance while maintaining similar accuracy.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Model Pruning&lt;/strong&gt;: Refers to the removal of unnecessary parameters from a large model to reduce its overall parameter size, thereby lowering computational load and cost.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;strong&gt;3) AI Application-Related Terms&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Agent&lt;/strong&gt;: An agent is simply understood as an AI application with a specific capability. If applications in the internet era are called apps, applications in the AI era are called agents.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Chatbot&lt;/strong&gt;: Refers to AI applications that interact through conversation; products like ChatGPT belong to the chatbot category.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;strong&gt;4) Terms Related to Large Model Performance&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Emergence&lt;/strong&gt;: Refers to the phenomenon where large models exhibit capabilities beyond expectations once their parameter scale reaches a certain level.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Hallucination&lt;/strong&gt;: Refers to instances where large models generate plausible-sounding but incorrect or fabricated content, treating false information as fact.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Amnesia&lt;/strong&gt;: Refers to the situation where, once a conversation exceeds a certain number of turns or a certain length, the model loses track of earlier context and begins to repeat itself. The memory of large models is mainly determined by factors such as the model&amp;rsquo;s context length.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;2-understanding-the-relationship-between-ai-machine-learning-deep-learning-and-nlp&#34;&gt;2. 
Understanding the Relationship Between AI, Machine Learning, Deep Learning, and NLP&#xA;&lt;/h2&gt;&lt;p&gt;If you are interested in AI and large models, you will likely encounter keywords like &lt;strong&gt;&amp;ldquo;AI,&amp;rdquo; &amp;ldquo;Machine Learning,&amp;rdquo; &amp;ldquo;Deep Learning,&amp;rdquo; and &amp;ldquo;NLP&amp;rdquo;&lt;/strong&gt; in your future studies. Therefore, it is best to clarify the concepts and definitions of these professional terms and their logical relationships for easier understanding.&lt;/p&gt;&#xA;&lt;p&gt;In summary, the relationship between these concepts is as follows:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;Machine learning is a core technology of AI. Besides machine learning, AI&amp;rsquo;s core technologies include expert systems, Bayesian networks, etc. (you don&amp;rsquo;t need to delve too deeply into what these are), with deep learning being a type of machine learning.&lt;/li&gt;&#xA;&lt;li&gt;NLP is one of the application task types in AI, focused on natural language processing. Besides NLP, AI application technologies also include CV (computer vision) technology, speech recognition and synthesis technologies, etc.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 2&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;261px&#34; data-flex-grow=&#34;108&#34; height=&#34;638&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-4ed031aa81.jpeg&#34; width=&#34;695&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;3-understanding-the-transformer-architecture&#34;&gt;3. Understanding the Transformer Architecture&#xA;&lt;/h2&gt;&lt;p&gt;When discussing large models, one cannot overlook the Transformer architecture. If large models are like a tree, the Transformer architecture serves as the trunk of the model. 
The emergence of products like ChatGPT is primarily due to the design of the Transformer architecture, which enables models to understand context, maintain memory, and predict unknown words. Additionally, the introduction of the Transformer has allowed large models to train on unlabeled data without relying heavily on large amounts of labeled data. This breakthrough means that previously, creating a model required significant human effort for data cleaning and labeling, but now, fragmented and scattered data can simply be fed into the model for processing. We can understand these concepts through the following points:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;Relationship Between Transformer Architecture and Deep Learning Technologies&lt;/strong&gt;: The Transformer architecture belongs to the category of deep learning technologies, meaning it is an implementation and design form within deep learning. Besides the Transformer architecture, traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are also part of deep learning.&lt;/p&gt;&#xA;&lt;h2 id=&#34;4-understanding-the-relationship-between-the-transformer-architecture-and-gpt&#34;&gt;4. Understanding the Relationship Between the Transformer Architecture and GPT&#xA;&lt;/h2&gt;&lt;p&gt;GPT stands for Generative Pre-trained Transformer, meaning GPT is a large language model developed based on the Transformer architecture by OpenAI.&lt;/p&gt;&#xA;&lt;p&gt;The core idea of GPT is to enhance the ability to generate and understand natural language through &lt;strong&gt;large-scale pre-training and fine-tuning.&lt;/strong&gt; The emergence of the Transformer architecture has solved the issues of understanding context, processing large amounts of data, and predicting text. 
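The "understanding context" capability just mentioned comes from the Transformer's self-attention mechanism. As a rough illustration only (a toy NumPy sketch with made-up dimensions, not taken from this article), scaled dot-product attention with a GPT-style causal mask looks like this:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Toy scaled dot-product attention: output = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity of each query to each key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block disallowed positions (e.g. future tokens)
    weights = softmax(scores)                  # attention distribution over tokens
    return weights @ V                         # weighted mix of value vectors

# 4 tokens with 8-dimensional embeddings (toy sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
causal = np.tril(np.ones((4, 4), dtype=bool))  # GPT-style: each token sees only earlier tokens
out = scaled_dot_product_attention(X, X, X, mask=causal)
print(out.shape)  # (4, 8)
```

Each output row is a weighted mixture of the value vectors of the tokens that position is allowed to attend to; the causal mask is what makes GPT-style models unidirectional, as discussed below.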
However, OpenAI was the first to adopt the pre-training + fine-tuning approach to improve and utilize the Transformer architecture, enabling it to possess the capabilities of products like ChatGPT for understanding and generating natural language.&lt;/p&gt;&#xA;&lt;p&gt;The reason GPT can generate and understand natural language is that during the pre-training phase, it learns broad language patterns and knowledge from a large corpus of unlabeled text, with the pre-training task typically being a language modeling task, where the model predicts the next word given a sequence of preceding words. The specific differences are as follows:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;1) Differences in Capabilities&lt;/strong&gt;: The Transformer architecture enables models to understand context, process large amounts of data, and predict text, but it does not possess the ability to understand and generate natural language. In contrast, GPT, after adding natural language pre-training, has the ability to understand and generate natural language.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;2) Architectural Foundations&lt;/strong&gt;:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Transformer&lt;/strong&gt;: The original Transformer model consists of an encoder and a decoder. The encoder processes the input sequence to generate intermediate representations, which the decoder uses to generate the output sequence. This architecture is particularly suitable for sequence-to-sequence tasks, such as machine translation. The encoder employs a bidirectional processing mechanism, allowing it to use bidirectional attention, meaning each word can consider the information from all other words in the sequence, regardless of whether they are preceding or following words.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;GPT&lt;/strong&gt;: GPT primarily uses the decoder part of the Transformer, focusing solely on generation tasks. 
Its training and generation processes are unidirectional, meaning each word can only see the preceding words (unidirectional, or causal, attention). This aligns with the natural form of language modeling and makes the architecture well suited to text generation tasks.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;strong&gt;3) Implementation Methods for Solving Specific Problems&lt;/strong&gt;:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;The Transformer is trained to solve specific task types (like machine translation) by optimizing its performance through training, with both the encoder and decoder trained simultaneously.&lt;/li&gt;&#xA;&lt;li&gt;In contrast, GPT solves specific task types through supervised fine-tuning, meaning it does not require training from scratch for each task type but only needs some task-specific data to achieve results. It is important to understand that training from scratch and fine-tuning differ greatly in implementation cost.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;strong&gt;4) Application Domains&lt;/strong&gt;:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;The traditional Transformer framework can &lt;strong&gt;be applied to various sequence-to-sequence tasks&lt;/strong&gt;, such as machine translation, text summarization, and speech recognition. Since it includes both an encoder and a decoder, the Transformer can handle tasks with various input and output formats.&lt;/li&gt;&#xA;&lt;li&gt;GPT is &lt;strong&gt;primarily used for generation tasks&lt;/strong&gt;, such as text generation, dialogue systems, and question-answering systems. It excels at generating coherent and creative text.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;5-understanding-the-moe-architecture&#34;&gt;5. 
Understanding the MOE Architecture&#xA;&lt;/h2&gt;&lt;p&gt;In addition to the Transformer architecture, another popular architecture is the MOE architecture (Mixture of Experts), which dynamically selects and combines multiple sub-models (i.e., experts) to complete tasks. The key idea of MOE is to solve a series of complex tasks by combining multiple expert models rather than relying on a single large model for all tasks.&lt;/p&gt;&#xA;&lt;p&gt;The main advantage of the MOE architecture is its ability to maintain computational efficiency while handling large-scale data and model parameters, significantly reducing computational costs while retaining model capabilities.&lt;/p&gt;&#xA;&lt;p&gt;The Transformer and MOE can be used in conjunction, commonly referred to as MOE-Transformer or Sparse Mixture of Experts Transformer. In this architecture:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;The Transformer processes input data, leveraging its powerful self-attention mechanism to capture dependencies within the sequence.&lt;/li&gt;&#xA;&lt;li&gt;The MOE dynamically selects and combines different experts, enhancing the model&amp;rsquo;s computational efficiency and capabilities.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;lecture-2-differences-between-large-models-and-traditional-models&#34;&gt;Lecture 2: Differences Between Large Models and Traditional Models&#xA;&lt;/h2&gt;&lt;p&gt;When we talk about large models, we typically refer to LLMs (Large Language Models), or more specifically, models like GPT (generative pre-trained models based on the Transformer architecture). Firstly, it is a language model addressing natural language tasks rather than problems in images, videos, or speech. (Models that handle multiple modalities, including language, images, videos, and speech, are later referred to as multi-modal large models, which are not the same concept as LLMs.) 
Secondly, LLMs are generative models, meaning their primary ability is to generate rather than predict or make decisions.&lt;/p&gt;&#xA;&lt;p&gt;In contrast to traditional models, large models generally have the following characteristics:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Ability to Understand and Generate Natural Language&lt;/strong&gt;: Many traditional models we have encountered may not understand human natural language, let alone generate language that humans can comprehend.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Powerful and Versatile Capabilities&lt;/strong&gt;: Traditional models typically solve one or a few problems, with a strong specialization, while large models possess strong general capabilities and can address a wide range of issues.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Contextual Memory Capabilities&lt;/strong&gt;: Large models have memory capabilities, allowing them to relate to contextual dialogues rather than being forgetful robots, which is one of the key differences from many traditional models.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Training Method&lt;/strong&gt;: Based on large amounts of unlabeled text, pre-training is conducted through unsupervised methods. Unlike many traditional models that rely on large amounts of labeled data, the unsupervised approach significantly reduces the costs of data cleaning and preparation. Furthermore, pre-training requires a massive amount of training data; for example, GPT-3.5&amp;rsquo;s training corpus reached 45 terabytes.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Massive Parameter Scale&lt;/strong&gt;: Most large models have parameter scales in the hundreds of billions, such as GPT-3.5, which has 175 billion parameters, while GPT-4.0 is rumored to reach trillions of parameters. 
These parameters learn and adjust during the model training process to perform specific tasks better.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Training Requires Significant Computational Resources&lt;/strong&gt;: Due to their scale and complexity, these models require substantial computational resources for training and inference, typically needing specialized hardware like GPUs or TPUs. Research indicates that training generative AIs like ChatGPT requires support from at least 10,000 NVIDIA A100 accelerator cards, with training costs for models like GPT-3.5 reaching up to $9 million.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;lecture-3-evolution-of-large-models&#34;&gt;Lecture 3: Evolution of Large Models&#xA;&lt;/h2&gt;&lt;h3 id=&#34;1-evolution-of-large-model-generation-capabilities&#34;&gt;1. Evolution of Large Model Generation Capabilities&#xA;&lt;/h3&gt;&lt;p&gt;Understanding the evolution of LLMs helps everyone grasp how large models have gradually acquired their current capabilities and makes it easier to understand the relationship between LLMs and Transformers. The following outlines the evolution of large models:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;N-gram&lt;/strong&gt;: The earliest stage of large model generation capabilities, primarily addressing the ability to predict the next word. 
This forms the basis of text generation, but its limitations lie in understanding context and grammatical structures.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory)&lt;/strong&gt;: At this stage, these two models addressed the issue of context length, enabling relatively longer context windows, but they struggled to handle large amounts of data.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Transformer&lt;/strong&gt;: Combines the predictive capabilities of the previous two models with the ability to remember longer contexts while supporting training on large datasets, but lacks the ability to understand and generate natural language.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;LLM (Large Language Model)&lt;/strong&gt;: Adopts the GPT pre-training and supervised fine-tuning approach, enabling the model to understand and generate natural language, thus called a large language model. It can be said that the emergence of pre-training and supervised fine-tuning brought the Transformer into the development stage of large models.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 3&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;900px&#34; data-flex-grow=&#34;375&#34; height=&#34;288&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-0a893923c6.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-0a893923c6_hu_8e2bba945c13ec24.jpeg 800w, https://digitalxber.com/posts/note-05-2870aba5cd/img-0a893923c6.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;h3 id=&#34;2-development-history-from-gpt-1-to-gpt-4&#34;&gt;2. 
Development History from GPT-1 to GPT-4&#xA;&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;GPT-1&lt;/strong&gt;: Introduced the unsupervised training step for the first time, solving the problem of needing large amounts of labeled data for previous models. The unsupervised training method allows GPT to train on a vast amount of unlabeled data. However, its limitations stem from the small parameter scale (only 117 million parameters), making it unable to solve complex tasks without supervised fine-tuning, which can be cumbersome.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;GPT-2&lt;/strong&gt;: Increased the parameter scale to 1.5 billion and expanded the training text size fourfold to 40GB. By increasing parameter scale and training data size, the model&amp;rsquo;s capabilities improved, but it still faced limitations in addressing complex problems.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;GPT-3&lt;/strong&gt;: Expanded the parameter scale to 175 billion, achieving remarkable performance in text generation and language understanding, and eliminated the fine-tuning step, meaning it could solve complex problems without needing fine-tuning. However, GPT-3&amp;rsquo;s limitations arose from its training on a vast amount of internet data, which may include false and erroneous texts, leading to safety issues.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;InstructGPT&lt;/strong&gt;: To address the limitations of GPT-3, &lt;strong&gt;supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF)&lt;/strong&gt; were added after pre-training to optimize the model&amp;rsquo;s errors. This model became InstructGPT. The process involves providing the model with some real “standard answers” for supervised fine-tuning, constructing a scoring model for the generated results, and using the scoring model to adjust the model&amp;rsquo;s strategy for improvement. 
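The fine-tune, score, and adjust loop described here can be caricatured in a few lines of Python. This is a deliberately toy, bandit-style sketch: the candidate answers, the scoring rule, and the update rule are all invented for illustration, whereas real RLHF trains a neural reward model on human preference rankings and updates the policy with an algorithm such as PPO:

```python
import random

def reward_model(answer: str) -> float:
    """Stand-in scorer: in RLHF this is a model trained on human preference data."""
    return 1.0 if "paris" in answer.lower() else 0.0

def sample(prefs):
    """Sample an answer in proportion to the policy's current preferences."""
    answers = list(prefs)
    weights = [prefs[a] for a in answers]
    return random.choices(answers, weights=weights, k=1)[0]

# Policy starts indifferent between candidate answers (invented examples)
prefs = {"Paris": 1.0, "Lyon": 1.0, "I don't know": 1.0}
random.seed(0)
for _ in range(200):                       # RL step: sample, score, reinforce
    ans = sample(prefs)
    prefs[ans] += 0.1 * reward_model(ans)  # raise preference in proportion to reward

best = max(prefs, key=prefs.get)
print(best)  # the policy drifts toward whatever the reward model favors
```

The point of the sketch is the shape of the loop: sample an output, score it with the reward model, and shift the policy toward outputs that score well, which is also why the quality of the scoring data matters so much in practice.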
Thus, many large model vendors focus on the quality and quantity of data provided during the supervised fine-tuning stage to reduce hallucination rates.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;GPT-3.5&lt;/strong&gt;: In March 2022, OpenAI released an updated version of GPT-3 with a training data cutoff of June 2021 and training data expanded to 45TB; in November 2022 it became known as GPT-3.5.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;GPT-4.0&lt;/strong&gt;: Released in March 2023, with significantly improved overall reasoning capabilities and support for multi-modal input.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt;: Released in May 2024, with enhanced voice chat capabilities.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;O1&lt;/strong&gt;: In September 2024, OpenAI launched the O1 model, focusing on enhanced reasoning capabilities.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 4&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;215px&#34; data-flex-grow=&#34;89&#34; height=&#34;1202&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-593051da41.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-593051da41_hu_35d0cf1ba1345bd5.jpeg 800w, https://digitalxber.com/posts/note-05-2870aba5cd/img-593051da41.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;h3 id=&#34;3-principles-of-text-generation-in-large-models&#34;&gt;3. Principles of Text Generation in Large Models&#xA;&lt;/h3&gt;&lt;h3 id=&#34;1-how-does-gpt-generate-text&#34;&gt;1. 
How Does GPT Generate Text?&#xA;&lt;/h3&gt;&lt;p&gt;The process of a large model generating text can be summarized in five steps:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;After receiving a prompt, the model first tokenizes the input content, breaking it into multiple tokens.&lt;/li&gt;&#xA;&lt;li&gt;Based on the Transformer architecture, it understands the relationships between tokens to grasp the overall meaning of the prompt.&lt;/li&gt;&#xA;&lt;li&gt;It predicts the next token based on context, which may yield multiple results, each with corresponding probability values.&lt;/li&gt;&#xA;&lt;li&gt;The token with the highest probability is selected as the predicted result for the next word.&lt;/li&gt;&#xA;&lt;li&gt;Steps 3 and 4 are repeated until the entire output is generated.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 5&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;346px&#34; data-flex-grow=&#34;144&#34; height=&#34;748&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-dfaf4fbe5e.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-dfaf4fbe5e_hu_8400f82f8c477994.jpeg 800w, https://digitalxber.com/posts/note-05-2870aba5cd/img-dfaf4fbe5e.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;h3 id=&#34;2-classification-of-llms&#34;&gt;2. Classification of LLMs&#xA;&lt;/h3&gt;&lt;h4 id=&#34;1-classification-by-modality-type&#34;&gt;1. Classification by Modality Type&#xA;&lt;/h4&gt;&lt;p&gt;Currently, large models on the market can be categorized into text generation models (e.g., GPT-3.5), image generation models (e.g., DALL-E), video generation models (e.g., Sora, Kling (可灵)), speech generation models, and multi-modal models (e.g., GPT-4.0).&lt;/p&gt;&#xA;&lt;h4 id=&#34;2-classification-by-training-stage&#34;&gt;2. 
Classification by Training Stage&#xA;&lt;/h4&gt;&lt;p&gt;Models can be divided into basic language models and instruction fine-tuned models:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Basic Language Model&lt;/strong&gt;: Refers to models that have only undergone pre-training on large-scale text corpora without any instruction or downstream task fine-tuning, such as GPT-3, which is the publicly available basic language model from OpenAI.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Instruction Fine-tuned Model&lt;/strong&gt;: Refers to models that have undergone instruction fine-tuning, human feedback, and alignment optimizations based on natural language task descriptions. For example, GPT-3.5 was trained on top of GPT-3.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h4 id=&#34;3-classification-by-general-and-industry-models&#34;&gt;3. Classification by General and Industry Models&#xA;&lt;/h4&gt;&lt;p&gt;Large models on the market can also be classified into general large models and industry-specific models. General large models perform well across a wide range of tasks and fields, but may not fully understand and utilize domain-specific information, and thus may fail to solve problems specific to a given industry or scenario. Industry-specific models, on the other hand, are trained and adjusted based on general large models to achieve higher performance and accuracy in specific fields.&lt;/p&gt;&#xA;&lt;h3 id=&#34;3-core-technologies-of-llms&#34;&gt;3. Core Technologies of LLMs&#xA;&lt;/h3&gt;&lt;p&gt;This section may contain many technical terms that are relatively difficult to understand. However, for product managers, it is not necessary to delve into the technical details; understanding the key concepts is sufficient. It is essential for AI product managers to comprehend technical terms to facilitate communication with development and technical teams.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;1. 
Model Architecture&lt;/strong&gt;: The Transformer architecture has been described in detail earlier, so I will not repeat it here. It remains one of the foundational core technologies of large models.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;2. Pre-training and Fine-tuning&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Pre-training&lt;/strong&gt;: Conducted on large-scale unlabeled data, this is one of the key technologies for large language models. The emergence of pre-training technology means that models no longer need to rely on large amounts of labeled data, significantly reducing the costs of manual labeling.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt;: Fine-tuning adapts a pre-trained model to a particular application. Because a pre-trained model&amp;rsquo;s performance on any specific task is generally only average, further training on task-specific datasets is needed. Fine-tuning can significantly enhance model performance on specific tasks.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;strong&gt;3. Model Compression and Acceleration&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Model Pruning&lt;/strong&gt;: By removing unimportant parameters, the size of the model and computational complexity can be reduced.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Knowledge Distillation&lt;/strong&gt;: Training a smaller student model to mimic the behavior of a large teacher model, retaining most performance while reducing computational costs.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;lecture-7-six-steps-in-large-model-development&#34;&gt;Lecture 7: Six Steps in Large Model Development&#xA;&lt;/h2&gt;&lt;p&gt;According to information released by OpenAI, the development of large models typically goes through the following six steps. 
This process should be representative of the development process for most large models in the industry:&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 6&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;528px&#34; data-flex-grow=&#34;220&#34; height=&#34;490&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-8cbff34f2e.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-8cbff34f2e_hu_28c961c4c0ee9dd.jpeg 800w, https://digitalxber.com/posts/note-05-2870aba5cd/img-8cbff34f2e.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Data Collection and Processing&lt;/strong&gt;: This stage involves collecting a large amount of text data, which may include books, web pages, articles, etc. The data is then cleaned to remove irrelevant or low-quality content, followed by preprocessing such as tokenization and removal of sensitive information.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Model Design&lt;/strong&gt;: Determining the model architecture, such as the Transformer architecture used by GPT-4, and setting the model size, including the number of layers, hidden units, and total parameters.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Pre-training&lt;/strong&gt;: At this stage, the model acts like a student in school, learning language and knowledge by reading a large number of books (such as web pages and articles). Or, it can be likened to a &amp;ldquo;sponge&amp;rdquo; absorbing as much information as possible, learning basic language rules, such as how to form sentences and how words relate to each other. At this point, the model can understand basic language structures but lacks specialized knowledge for specific tasks. 
The pre-training phase typically requires a very large amount of data, consuming the most computational resources and time; for example, completing one pre-training session for GPT-3 requires roughly 3,640 petaflop/s-days of computation and nearly 1,000 GPUs.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Instruction Fine-tuning&lt;/strong&gt;: Also known as supervised fine-tuning, this process involves feeding the model some question-answer pairs with ideal outputs for further training, resulting in a supervised fine-tuned model. This stage is akin to “career training,” where the model learns how to adjust its responses based on specific instructions or tasks, improving its performance on specific types of questions or tasks. The instruction fine-tuning stage requires a relatively small amount of high-quality data, so training time and cost are much lower.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Reward&lt;/strong&gt;: This stage sets up an “incentive mechanism” for the model, teaching it what constitutes a good response or behavior through rewards. This approach helps the model better meet user needs and focus on providing valuable, accurate answers. This process requires the model&amp;rsquo;s trainers to extensively evaluate and provide feedback on the model&amp;rsquo;s responses, gradually adjusting the quality of its responses, which also requires relatively high-quality data and takes days to complete.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Reinforcement Learning&lt;/strong&gt;: In this final stage, the model undergoes “live drills,” learning how to improve through trial and error. During this phase, the model attempts various strategies in real-world complex situations, identifying the most effective methods. 
The model becomes smarter and more flexible during this stage, enabling it to make better judgments and responses in complex and uncertain situations.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;h2 id=&#34;lecture-8-understanding-large-model-training-and-fine-tuning&#34;&gt;Lecture 8: Understanding Large Model Training and Fine-tuning&#xA;&lt;/h2&gt;&lt;h3 id=&#34;1-understanding-large-model-training-content&#34;&gt;1. Understanding Large Model Training Content&#xA;&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;1) What Data Is Needed for Large Model Training?&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Text Data&lt;/strong&gt;: Mainly used for training language models, such as news articles, books, social media posts, Wikipedia, etc.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Structured Data&lt;/strong&gt;: Such as knowledge graphs, used to enhance the language model&amp;rsquo;s knowledge.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Semi-structured Data&lt;/strong&gt;: Such as XML and JSON formats, which facilitate information extraction.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;strong&gt;2) Sources of Training Data&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Public Datasets&lt;/strong&gt;: Such as Common Crawl, Wikipedia, OpenWebText, etc.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Proprietary Data&lt;/strong&gt;: Company internal data or paid proprietary data.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;User-Generated Content&lt;/strong&gt;: Content generated by users on social media, forums, comments, etc.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Synthetic Data&lt;/strong&gt;: Data generated through Generative Adversarial Networks (GANs) or other generative models.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;strong&gt;3) What Are the Costs of Large Model Training?&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Computational Resources&lt;/strong&gt;: The cost of using GPUs/TPUs depends mainly on the model&amp;rsquo;s scale and 
training time. Large models typically require thousands to tens of thousands of hours of GPU computing time.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Storage Costs&lt;/strong&gt;: For storing large-scale datasets and model weights. Datasets and model files can reach terabyte levels.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Data Acquisition Costs&lt;/strong&gt;: The cost of purchasing proprietary data or the labor costs for data cleaning and labeling.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Energy Costs&lt;/strong&gt;: Training large models consumes a significant amount of electricity, increasing operational costs.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;R&amp;amp;D Costs&lt;/strong&gt;: Including salaries for researchers and engineers, as well as the costs of developing and maintaining models.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h3 id=&#34;2-understanding-large-model-fine-tuning-content&#34;&gt;2. Understanding Large Model Fine-tuning Content&#xA;&lt;/h3&gt;&lt;ol&gt;&#xA;&lt;li&gt;Two Stages of Large Model Fine-tuning: &lt;strong&gt;Supervised Fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)&lt;/strong&gt;. The differences between the two stages are as follows:&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 7&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;197px&#34; data-flex-grow=&#34;82&#34; height=&#34;882&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-05-2870aba5cd/img-121d5f1cee.jpeg&#34; width=&#34;724&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;2) Two Methods of Large Model Fine-tuning: LoRA Fine-tuning and Full-Parameter SFT Fine-tuning&lt;/strong&gt;&lt;/p&gt;&#xA;&lt;p&gt;Currently, there are two common methods for fine-tuning models: LoRA fine-tuning and full-parameter SFT fine-tuning. 
The differences between these two methods are:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;LoRA fine-tuning updates only a small, added set of low-rank parameters rather than the entire model. It suits resource-limited scenarios or targeted adaptation to a single task.&lt;/li&gt;&#xA;&lt;li&gt;Full-parameter SFT fine-tuning updates all of the model&amp;rsquo;s parameters, adapting the entire model so it can handle more specific tasks.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;lecture-9-main-factors-affecting-large-model-performance&#34;&gt;Lecture 9: Main Factors Affecting Large Model Performance&#xA;&lt;/h2&gt;&lt;p&gt;As we know, although there are many large models available on the market, there are differences in performance among them. For example, OpenAI&amp;rsquo;s models hold a leading position in the industry. Why do performance differences exist among large models? The five most important factors affecting large model performance are as follows:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Model Architecture&lt;/strong&gt;: The design of the model, including the number of layers, the number of hidden units, and the total number of parameters, significantly impacts its ability to handle complex tasks.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Quality and Quantity of Training Data&lt;/strong&gt;: The performance of the model heavily relies on the coverage and diversity of its training data. High-quality and diverse datasets help the model understand and generate language more accurately. Currently, most models mainly use public data, and companies with richer, high-quality data resources will have a competitive advantage. 
In China, a disadvantage is that open-source datasets are primarily in English, with relatively fewer Chinese datasets.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Parameter Scale&lt;/strong&gt;: The more parameters a model has, the better it can learn and capture complex data patterns, but this also increases computational costs. Therefore, companies with strong computational resources hold a clear advantage. Computational capacity depends mainly on compute volume (the number of GPUs), network bandwidth, and storage.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Algorithm Efficiency&lt;/strong&gt;: The algorithms used for training and optimizing the model, such as optimizer selection and learning rate adjustments, significantly impact the model&amp;rsquo;s learning efficiency and final performance.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Training Frequency&lt;/strong&gt;: Ensuring that the model undergoes sufficient training iterations to achieve optimal performance while avoiding overfitting issues.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;h2 id=&#34;lecture-10-how-to-measure-the-quality-of-large-models&#34;&gt;Lecture 10: How to Measure the Quality of Large Models?&#xA;&lt;/h2&gt;&lt;p&gt;From an application perspective, it is essential to know how to measure the quality of a large model and which evaluation framework to use. Through this section, you can understand the dimensions from which evaluation institutions assess the capabilities of large models. Moreover, when choosing among large models, you should establish your own judgment criteria.&lt;/p&gt;&#xA;&lt;p&gt;After reading and referencing multiple documents on measuring large models, I have summarized the evaluation dimensions into three aspects:&lt;/p&gt;&#xA;&lt;h3 id=&#34;1-how-to-measure-the-product-performance-of-large-models&#34;&gt;1. 
How to Measure the Product Performance of Large Models&#xA;&lt;/h3&gt;&lt;p&gt;Typically, the product performance of a large model is evaluated based on the following dimensions:&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;1) Semantic Understanding Ability&lt;/strong&gt;: This includes the basic dimensions of semantics, grammar, and context, which essentially determine whether you can have a normal conversation with the model and whether the model&amp;rsquo;s responses are coherent, especially regarding Chinese semantic understanding. Furthermore, it also assesses whether the model supports multi-language understanding.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;2) Logical Reasoning&lt;/strong&gt;: This includes the model&amp;rsquo;s reasoning ability, numerical calculation ability, and contextual understanding ability, which is one of the core capabilities of large models, directly determining the model&amp;rsquo;s intelligence level.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;3) Accuracy of Generated Content&lt;/strong&gt;: This includes the rate of hallucinations and the ability to identify traps.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;4) Hallucination Rate&lt;/strong&gt;: This includes the accuracy of the model&amp;rsquo;s responses and results. 
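&lt;/p&gt;
&lt;p&gt;As a toy illustration of how a hallucination rate might be computed in practice (the answers and labels below are invented purely for the example), one can score a sample of model answers against reviewer judgments of whether each answer is supported by reliable sources:&lt;/p&gt;

```python
# Toy hallucination-rate calculation over a hand-labeled sample.
# The answers and "supported" labels are hypothetical illustration data.
answers = [
    {"prompt": "Capital of France?", "answer": "Paris", "supported": True},
    {"prompt": "Year GPT-3 was released?", "answer": "2018", "supported": False},
    {"prompt": "Author of Hamlet?", "answer": "Shakespeare", "supported": True},
]

# Hallucination rate: share of answers a reviewer marked as unsupported.
unsupported = sum(1 for a in answers if not a["supported"])
hallucination_rate = unsupported / len(answers)
print(f"hallucination rate: {hallucination_rate:.2%}")
```

&lt;p&gt;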
Sometimes, the model may generate nonsensical content that users may mistakenly believe to be true, which can be quite problematic.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;5) Trap Information Identification Rate&lt;/strong&gt;: This indirectly assesses the model&amp;rsquo;s ability to recognize and handle trap information, as models that poorly identify traps may generate responses based on incorrect information.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;6) Quality of Generated Content&lt;/strong&gt;: While ensuring the authenticity and accuracy of generated content, the dimensions for measuring content quality include:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Diversity of Generated Content&lt;/strong&gt;: Whether the model can support diverse and multi-faceted content outputs.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Professionalism&lt;/strong&gt;: Whether the model can produce professional content in vertical scenarios.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Creativity&lt;/strong&gt;: Whether the generated content is sufficiently creative.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Timeliness&lt;/strong&gt;: The recency of the generated results.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;strong&gt;7) Contextual Memory Ability&lt;/strong&gt;: This represents the model&amp;rsquo;s memory capabilities and the length of its contextual window.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;8) Model Performance&lt;/strong&gt;: This includes response speed, resource consumption, robustness, and stability (the ability to handle anomalies and unknown information reliably).&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;9) Human-likeness&lt;/strong&gt;: This dimension evaluates whether the model truly exhibits “human-like” qualities, reaching a level of intelligence, including emotional analysis capabilities.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;10) Multi-modal Capability&lt;/strong&gt;: Finally, it assesses the model&amp;rsquo;s ability to process and generate across modalities, including 
text, images, videos, and speech.&lt;/p&gt;&#xA;&lt;h3 id=&#34;2-how-to-measure-the-basic-capabilities-of-large-models&#34;&gt;2. How to Measure the Basic Capabilities of Large Models&#xA;&lt;/h3&gt;&lt;p&gt;It is well-known that the three most important elements for measuring the basic capabilities of large models are: algorithms, computational power, and data. More specifically, they include the following parts:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Parameter Scale&lt;/strong&gt;: This dimension measures the strength of the algorithm. A larger parameter scale indicates that the model can handle more complex problems and consider more dimensions; simply put, the larger the scale, the stronger the model.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Data Volume&lt;/strong&gt;: Models operate on data, and the larger the underlying data volume, the better the model&amp;rsquo;s performance.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Data Quality&lt;/strong&gt;: Data quality includes the intrinsic value of the data and the degree of cleaning performed on the data. Data quality can be hierarchical; for example, user consumption data is more valuable than ordinary social attribute information. The higher the data value, the better the model&amp;rsquo;s performance. Additionally, the quality of a business&amp;rsquo;s data cleaning is reflected in the precision of its data labeling.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Training Frequency&lt;/strong&gt;: The more training iterations a model undergoes, the richer its experience, leading to better performance.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h3 id=&#34;3-how-to-evaluate-model-safety&#34;&gt;3. How to Evaluate Model Safety&#xA;&lt;/h3&gt;&lt;p&gt;In addition to assessing the capabilities of large models, safety considerations are also crucial. Even a highly capable model cannot develop rapidly if safety issues are not adequately addressed. 
We primarily evaluate model safety based on the following dimensions:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Content Safety&lt;/strong&gt;: This includes whether the generated content complies with safety management standards, social norms, and legal regulations.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Ethical Standards&lt;/strong&gt;: This includes whether the generated content contains bias or discrimination and whether it aligns with social values and ethical standards.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Privacy and Copyright Protection&lt;/strong&gt;: This includes the protection of personal and corporate privacy and compliance with copyright protection laws.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;lecture-11-limitations-of-large-models&#34;&gt;Lecture 11: Limitations of Large Models&#xA;&lt;/h2&gt;&lt;h3 id=&#34;1-the-hallucination-problem&#34;&gt;1. The Hallucination Problem&#xA;&lt;/h3&gt;&lt;p&gt;The hallucination problem refers to the generation of information that appears reasonable but is actually incorrect or fabricated. In natural language processing, this may manifest as the model generating text or responses that seem coherent but lack truthfulness or accuracy. Currently, the hallucination problem is one of the main reasons users question the applicability of large models and why the results generated by large models are often difficult to use. 
It is also a challenging issue for AI applications.&lt;/p&gt;&#xA;&lt;p&gt;&lt;strong&gt;What Causes Hallucinations in Large Models?&lt;/strong&gt; The main sources are as follows:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Overfitting Training Data&lt;/strong&gt;: The model may have overfitted noise or erroneous information in the training data, leading to the generation of fabricated content.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Presence of False Information in Training Data&lt;/strong&gt;: If the training data contains false or erroneous text, or fails to cover enough real-world scenarios, the model may reproduce those errors or fabricate information in unfamiliar situations.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Insufficient Consideration of Information Credibility&lt;/strong&gt;: The model may not effectively assess the credibility of generated information, instead generating responses that seem reasonable but are actually fictitious.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;&lt;strong&gt;Are There Solutions to Mitigate Hallucination Issues?&lt;/strong&gt; Currently, potential ways to alleviate hallucination issues include:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Using More Diverse Training Data&lt;/strong&gt;: Introducing a more diverse and authentic training dataset to reduce the likelihood of the model overfitting erroneous information.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Modeling Information Credibility and Increasing Verification Mechanisms&lt;/strong&gt;: Incorporating components to estimate the credibility of generated information to filter or reduce the probability of generating fabricated content.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;External Verification Mechanisms&lt;/strong&gt;: Utilizing external verification mechanisms or information sources to validate the content generated by the model, ensuring it aligns with the real world.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h3 id=&#34;2-the-amnesia-problem&#34;&gt;2. 
The Amnesia Problem&#xA;&lt;/h3&gt;&lt;p&gt;The amnesia problem refers to the situation where the model may forget previously mentioned information during long dialogues or complex contexts, resulting in inconsistencies and a lack of contextual integrity in generated content. The main causes of amnesia include:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Model Context Memory Limitations&lt;/strong&gt;: The model may be limited by its contextual memory capabilities, unable to effectively retain and utilize long-term dependencies.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Missing Information in Training Data&lt;/strong&gt;: If the training data lacks examples of long dialogues or complex contexts, the model may not learn the correct methods for retaining and retrieving information.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;</description>
        </item><item>
            <title>Understanding Large Models in Artificial Intelligence</title>
            <link>https://digitalxber.com/posts/note-06-1c1376c74e/</link>
            <pubDate>Sun, 13 Oct 2024 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-06-1c1376c74e/</guid>
            <description>&lt;h2 id=&#34;introduction&#34;&gt;Introduction&#xA;&lt;/h2&gt;&lt;p&gt;In the grand tapestry of artificial intelligence (AI), large models shine like brilliant stars, illuminating the future of technology. They not only reshape our understanding of technology but also quietly trigger transformations across countless industries. However, these intelligent technologies are not without their risks and challenges. In this article, we will unveil the mysteries of large models, sharing their technologies and characteristics, analyzing their development and challenges, and offering a glimpse into the AI era.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 1&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;240px&#34; data-flex-grow=&#34;100&#34; height=&#34;1080&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-a0e7ddf8f8.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-a0e7ddf8f8_hu_cd69e27c0275c66.jpeg 800w, https://digitalxber.com/posts/note-06-1c1376c74e/img-a0e7ddf8f8.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;Large models, such as the Generative Pre-trained Transformer (GPT) series, have achieved remarkable success in the field of natural language processing (NLP), setting new performance benchmarks in various language processing tasks. Beyond language, large models also demonstrate significant advantages in image processing, audio processing, and physiological signals. They have rapidly found applications in fields like education, healthcare, and finance, particularly excelling in content generation. Today, numerous cutting-edge technologies related to large models are still in urgent need of development, while issues such as bias and privacy breaches require immediate attention. 
This article analyzes the past and present of large models, discusses pressing issues, and explores future directions, helping the public quickly understand large model technology and its development, integrating into the tide of the AI era.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 2&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;218px&#34; data-flex-grow=&#34;91&#34; height=&#34;45&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-6f3a983336.jpeg&#34; width=&#34;41&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;origins-of-large-models&#34;&gt;Origins of Large Models&#xA;&lt;/h2&gt;&lt;p&gt;In November 2022, the renowned AI research company OpenAI released ChatGPT, an AI chatbot program based on the large language model GPT-3.5. Its fluent language expression, powerful problem-solving abilities, and vast database garnered widespread attention worldwide. Within less than two months of its launch, ChatGPT surpassed 100 million monthly active users, becoming the fastest-growing consumer application in history. As a result, various industries began to feel the powerful impact of large models, sparking a research boom in large models both domestically and internationally.&lt;/p&gt;&#xA;&lt;p&gt;The origins of large models can be traced back to the early AI research in the 20th century, which primarily focused on logical reasoning and expert systems. However, these methods were limited by hard-coded knowledge and rules, making it difficult to handle the complexity and diversity of natural language. 
With the advent of machine learning and deep learning technologies, along with rapid advancements in hardware capabilities, the training of large-scale datasets and complex neural network models became possible, giving rise to the era of large models.&lt;/p&gt;&#xA;&lt;p&gt;In 2017, Google&amp;rsquo;s introduction of the Transformer model structure, which incorporated self-attention mechanisms, significantly enhanced the ability to model sequences, especially in terms of efficiency and accuracy when handling long-distance dependencies. Subsequently, the concept of pre-trained language models (PLMs) gradually became mainstream. PLMs are pre-trained on large-scale text datasets to capture general patterns of language and are then fine-tuned for specific downstream tasks.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 3&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;427px&#34; data-flex-grow=&#34;178&#34; height=&#34;606&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-e67186468c.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-e67186468c_hu_174270892709ab05.jpeg 800w, https://digitalxber.com/posts/note-06-1c1376c74e/img-e67186468c.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;evolution-of-large-models&#34;&gt;Evolution of Large Models&#xA;&lt;/h2&gt;&lt;p&gt;OpenAI&amp;rsquo;s GPT series models exemplify generative pre-trained models, representing the vanguard of this technology. From GPT-1 to GPT-3.5, each generation has seen significant improvements in scale, complexity, and performance. At the end of 2022, ChatGPT emerged as a chatbot capable of answering questions, writing articles, programming, and even mimicking human conversational styles. 
Its almost omnipotent answering ability has reshaped people&amp;rsquo;s understanding of the general capabilities of large language models, greatly advancing the development of the NLP field.&lt;/p&gt;&#xA;&lt;p&gt;However, the development of large models is not limited to text. With technological advancements, multimodal large models have begun to emerge, capable of simultaneously understanding and generating various types of data, including text, images, and audio. In March 2023, OpenAI announced the multimodal large model GPT-4, which added image functionality and improved language understanding capabilities, marking an important shift from unimodal to multimodal models. The inherent differences between cross-modal data present new and more complex requirements for the design and training of large models, along with unprecedented challenges.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 4&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;218px&#34; data-flex-grow=&#34;91&#34; height=&#34;45&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-6f3a983336.jpeg&#34; width=&#34;41&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;characteristics-of-large-models&#34;&gt;Characteristics of Large Models&#xA;&lt;/h2&gt;&lt;p&gt;Large models typically refer to machine learning models with vast parameter counts, especially in applications within NLP, computer vision (CV), and multimodal fields. 
These models understand and learn human language through pre-training, enabling them to perform tasks such as information retrieval, machine translation, text summarization, and code generation in a human-machine dialogue format.&lt;/p&gt;&#xA;&lt;h3 id=&#34;parameter-count-of-large-models&#34;&gt;Parameter Count of Large Models&#xA;&lt;/h3&gt;&lt;p&gt;The parameter count of large models usually exceeds 1 billion, meaning the model contains over 1 billion learnable weights. These parameters form the foundation for the model&amp;rsquo;s learning and understanding of data, continuously adjusted through training to better map input data to output results. The increase in parameter count is directly related to the model&amp;rsquo;s learning ability and complexity, enabling it to capture finer and deeper data features.&lt;/p&gt;&#xA;&lt;h3 id=&#34;types-of-large-models&#34;&gt;Types of Large Models&#xA;&lt;/h3&gt;&lt;p&gt;Large models can be classified based on their application areas and functions:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Large Language Models&lt;/strong&gt;: Focused on processing and understanding natural language text, commonly used for text generation, sentiment analysis, and question-answering systems.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Visual Large Models&lt;/strong&gt;: Specifically designed to process and understand visual information (such as images and videos), used for tasks like image recognition, video analysis, and image generation.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Multimodal Large Models&lt;/strong&gt;: Capable of processing and understanding two or more different types of input data (e.g., text, images, audio), performing more complex and comprehensive tasks by integrating information from different modalities.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Foundation Large Models&lt;/strong&gt;: Generally refer to models that can be broadly applied to various tasks without a specific application direction during the 
pre-training phase, learning a vast amount of general knowledge.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;h3 id=&#34;capabilities-of-large-models&#34;&gt;Capabilities of Large Models&#xA;&lt;/h3&gt;&lt;p&gt;The capabilities of large models lie in their ability to understand and process highly complex data patterns:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;Generalization Ability&lt;/strong&gt;: Through pre-training on large datasets, large models learn universal linguistic rules, demonstrating strong generalization abilities when faced with new tasks.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Deep Learning&lt;/strong&gt;: The vast parameter scale and deep network structure enable large models to establish complex abstract representations, understanding the deeper semantics and relationships behind the data.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Context Understanding&lt;/strong&gt;: In language models, large models can capture long-distance dependencies, enhancing their ability to understand context, which is crucial for grasping subtle nuances in language.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Knowledge Integration&lt;/strong&gt;: Large models can integrate and utilize knowledge acquired during pre-training, sometimes exhibiting a degree of common-sense reasoning and problem-solving abilities.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Adaptability&lt;/strong&gt;: Although large models learn general knowledge during pre-training, they can be fine-tuned to adapt to specific tasks, showcasing high flexibility and adaptability.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 5&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;218px&#34; data-flex-grow=&#34;91&#34; height=&#34;45&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-6f3a983336.jpeg&#34; width=&#34;41&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 
id=&#34;technologies-behind-large-models&#34;&gt;Technologies Behind Large Models&#xA;&lt;/h2&gt;&lt;p&gt;Current large models are integrated machine learning models capable of processing various types of data. The foundational technologies in these large models aim to understand and generate information across different sensory modalities, enabling tasks such as image description, visual question answering, or cross-modal translation. Here are several key foundational technologies of large models:&lt;/p&gt;&#xA;&lt;h3 id=&#34;transformer-architecture&#34;&gt;Transformer Architecture&#xA;&lt;/h3&gt;&lt;p&gt;Most existing large models are built on the Transformer model (or just the decoder of the Transformer), which captures global dependencies of input data through self-attention mechanisms and can also capture complex relationships between different modality elements. For example, a multimodal Transformer can simultaneously process image pixels and text words, learning their associations through self-attention layers. This allows large models to understand various modalities, such as text and images, and generate long text sequences while maintaining contextual coherence.&lt;/p&gt;&#xA;&lt;h3 id=&#34;supervised-fine-tuning&#34;&gt;Supervised Fine-Tuning&#xA;&lt;/h3&gt;&lt;p&gt;Supervised fine-tuning (SFT) is a traditional fine-tuning method that continues training the pre-trained large model using labeled datasets. Notably, during the training of large models, the SFT phase typically employs high-quality datasets. Additionally, SFT involves adjusting the model&amp;rsquo;s parameters to enhance its performance on specific tasks. For instance, to improve a model&amp;rsquo;s performance in legal consulting, a dataset containing legal questions and professional lawyer responses can be used for SFT. 
During SFT, the model typically attempts to minimize the difference between predicted outputs and true labels, often achieved through loss functions (like cross-entropy loss). This method is direct and straightforward, allowing for rapid adaptation to new tasks. However, it also has limitations, as it relies on high-quality labeled data and may lead to overfitting on the training data.&lt;/p&gt;&#xA;&lt;h3 id=&#34;reinforcement-learning-from-human-feedback&#34;&gt;Reinforcement Learning from Human Feedback&#xA;&lt;/h3&gt;&lt;p&gt;Reinforcement learning from human feedback (RLHF) is a more complex training method that combines elements of supervised learning and reinforcement learning. The model is first pre-trained on a large amount of unlabeled text, similar to the previous SFT step. Then, human evaluators interact with the model or assess its outputs, providing feedback on its performance, and using human feedback data to train a reward model that can predict scores that human evaluators might assign. Finally, the reward model is used as a signal for reinforcement learning to optimize the original model&amp;rsquo;s parameters. In this process, the model attempts to maximize the expected rewards it receives. The advantage of RLHF is that it helps the model learn more complex behaviors, especially when tasks are difficult to define through simple correct or incorrect labels. 
Additionally, RLHF can help the model better align with human preferences and values.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 6&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;218px&#34; data-flex-grow=&#34;91&#34; height=&#34;45&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-6f3a983336.jpeg&#34; width=&#34;41&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;applications-of-large-models&#34;&gt;Applications of Large Models&#xA;&lt;/h2&gt;&lt;p&gt;Large models, with their vast parameter counts, deep network structures, and extensive pre-training capabilities, can capture complex data patterns, demonstrating exceptional performance across multiple fields. They can not only understand and generate natural language but also handle complex visual and multimodal information, adapting to various dynamic application scenarios.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 7&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;362px&#34; data-flex-grow=&#34;151&#34; height=&#34;715&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-3936e08d07.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-3936e08d07_hu_f820f7cea42ad10d.jpeg 800w, https://digitalxber.com/posts/note-06-1c1376c74e/img-3936e08d07.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;h3 id=&#34;applications-in-nlp&#34;&gt;Applications in NLP&#xA;&lt;/h3&gt;&lt;p&gt;Large models have particularly widespread applications in the NLP field. For example, OpenAI&amp;rsquo;s GPT series models can generate coherent and natural text, used in chatbots, automated writing, and language translation, with ChatGPT being a well-known product. 
In the fintech sector, large models are often used for risk assessment, trading algorithms, and credit scoring. They can analyze vast amounts of financial data, predict market trends, and assist financial institutions in making better investment decisions. In the legal and compliance fields, they can be used for document review, contract analysis, and case studies. Through NLP technology, models can understand and analyze legal documents, enhancing the efficiency of legal professionals. Recommendation systems are another application area for large models. By serializing user behavior data into text, large models can predict user interests and recommend relevant products, movies, music, etc. In the gaming sector, large models can utilize their coding capabilities to generate complex game environments, driving non-player characters (NPCs) to produce different dialogues based on player settings, providing a more realistic gaming experience.&lt;/p&gt;&#xA;&lt;h3 id=&#34;applications-in-image-understanding-and-generation&#34;&gt;Applications in Image Understanding and Generation&#xA;&lt;/h3&gt;&lt;p&gt;Currently, large models possess not only text understanding capabilities but also multimodal understanding capabilities, laying the foundation for their applications in the image domain, such as automatic painting and video generation. These models can mimic artists&amp;rsquo; styles to create new artistic works, assisting human creativity. For instance, OpenAI&amp;rsquo;s Sora, released in February 2024, can generate a video segment that meets user input requirements, providing a more convenient tool for film production. In image processing, models like SegGPT are used for image recognition, classification, and generation. They learn from extensive image data paired with text to identify objects, faces, and scenes in images, playing roles in medical image analysis, autonomous vehicles, and video surveillance. 
Additionally, in the fields of medicine and biology, multimodal large models can be used for disease diagnosis, drug discovery, and gene editing, extracting useful information from complex biomedical data to assist doctors in making more accurate diagnoses or helping researchers design new drugs.&lt;/p&gt;&#xA;&lt;h3 id=&#34;applications-in-speech-recognition&#34;&gt;Applications in Speech Recognition&#xA;&lt;/h3&gt;&lt;p&gt;Large models also play a significant role in the field of speech recognition. Through deep learning technologies, models can convert speech into text, supporting applications such as voice assistants, real-time speech transcription, and automatic subtitle generation, with mobile voice assistants being a typical example. These models learn from a vast number of speech samples, enabling them to handle various accents, intonations, and noise interference.&lt;/p&gt;&#xA;&lt;p&gt;Moreover, large models can be applied across various industries, including education, healthcare, agriculture, and finance. For example, in the education sector, large models can be used for personalized learning, automatic grading, and intelligent tutoring, providing customized teaching content based on students&amp;rsquo; learning situations to help them learn more efficiently. In summary, large models demonstrate immense potential across various fields through their powerful data processing and learning capabilities. 
With continuous technological advancements, it is foreseeable that large models will play an increasingly important role in future developments.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 8&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;218px&#34; data-flex-grow=&#34;91&#34; height=&#34;45&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-2f73d74812.jpeg&#34; width=&#34;41&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;development-of-large-models&#34;&gt;Development of Large Models&#xA;&lt;/h2&gt;&lt;p&gt;In the current AI landscape, large models have become an undeniable trend. With the continuous advancement of deep learning technologies, particularly in NLP and CV fields, large models are driving breakthroughs in cutting-edge technologies with their powerful data processing and pattern recognition capabilities.&lt;/p&gt;&#xA;&lt;p&gt;The development of large models at the technical level benefits from several key factors. First is the innovation of algorithms, especially since the introduction of the Transformer architecture, which has rapidly propelled the development of subsequent models, including BERT, the GPT series, and T5. These models achieve leading performance in multiple NLP tasks through pre-training and fine-tuning strategies. Second is the enhancement of computational power, particularly advancements in graphics processing units (GPUs) and tensor processing units (TPUs), enabling the training of models with tens of billions or even hundreds of billions of parameters. Additionally, the rise of cloud computing platforms has provided the necessary computational resources for training large models. At the same time, large-scale datasets have provided ample &amp;ldquo;nutrition&amp;rdquo; for model training. 
These datasets typically contain rich linguistic expressions, scene information, and user interactions, enabling models to capture complex data distributions and linguistic patterns.&lt;/p&gt;&#xA;&lt;p&gt;The development of large models at the application level has two main directions: large language models and multimodal large models. In the case of large language models, GPT-3 serves as a milestone, reaching 175 billion parameters and showcasing astonishing language understanding and generation abilities. Following closely, Meta AI&amp;rsquo;s LLaMA series models have become favorites in academic research and industry due to their excellent performance and relatively smaller model sizes. These models not only excel in standard NLP tasks but also exhibit tremendous potential in few-shot learning and transfer learning.&lt;/p&gt;&#xA;&lt;p&gt;Multimodal large models extend upon this foundation, capable of processing and understanding various types of inputs, such as text, images, and audio. OpenAI&amp;rsquo;s DALL-E and CLIP are representative works in this direction, capable of understanding and generating images that correspond to text descriptions or understanding text content through images. Google&amp;rsquo;s SimCLR represents an important exploration in the CV field, effectively extracting image features through contrastive learning. Subsequently, Google&amp;rsquo;s Gemini has made significant strides in native multimodal capabilities, pre-training across different modalities and handling more complex inputs and outputs, such as images and audio. 
OpenAI&amp;rsquo;s Sora further expands the application range of large models, capable of automatically generating video content based on input text, simulating interactions between characters and environments in both the physical and digital worlds.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 9&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;450px&#34; data-flex-grow=&#34;187&#34; height=&#34;575&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-d6d8bb9077.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-d6d8bb9077_hu_6cfb01c9c401f152.jpeg 800w, https://digitalxber.com/posts/note-06-1c1376c74e/img-d6d8bb9077.jpeg 1080w&#34; width=&#34;1080&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;Domestic tech companies are also actively exploring large models. Models such as Baidu&amp;rsquo;s &amp;ldquo;Wenxin Yiyan&amp;rdquo;, Alibaba&amp;rsquo;s &amp;ldquo;Tongyi Qianwen&amp;rdquo;, Huawei&amp;rsquo;s &amp;ldquo;Pangu&amp;rdquo;, and iFLYTEK&amp;rsquo;s &amp;ldquo;iFLYTEK Spark&amp;rdquo; have emerged, demonstrating excellent performance in general language understanding and generation tasks, as well as specialized application capabilities in specific vertical fields like healthcare, law, and tourism. 
For example, Ctrip&amp;rsquo;s &amp;ldquo;Ctrip Wenda&amp;rdquo; focuses on tourism-related Q&amp;amp;A, NetEase Youdao&amp;rsquo;s &amp;ldquo;Ziyue&amp;rdquo; is applied in education, and JD Health&amp;rsquo;s &amp;ldquo;Jingyi Qianxun&amp;rdquo; aims to provide medical consultation services.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 10&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;218px&#34; data-flex-grow=&#34;91&#34; height=&#34;45&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-2f73d74812.jpeg&#34; width=&#34;41&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;challenges-of-large-models&#34;&gt;Challenges of Large Models&#xA;&lt;/h2&gt;&lt;p&gt;In the AI field, large models are becoming a hot topic in both academic research and industry due to their powerful processing capabilities and broad application prospects. However, as these models continue to expand, the issues faced at the research frontier are becoming increasingly complex.&lt;/p&gt;&#xA;&lt;h3 id=&#34;model-size&#34;&gt;Model Size&#xA;&lt;/h3&gt;&lt;p&gt;The trade-off between model size and data scale has become a significant challenge. Although model performance often improves with an increase in parameter count, this growth in scale brings substantial computational costs and high demands for data quality. Researchers are searching for optimal balances between model size and data scale under limited computational resources, exploring techniques like data augmentation, transfer learning, and model compression to reduce model size without sacrificing performance, striving to minimize the operational costs of large models.&lt;/p&gt;&#xA;&lt;h3 id=&#34;network-architecture&#34;&gt;Network Architecture&#xA;&lt;/h3&gt;&lt;p&gt;Innovation in network architecture is equally crucial. 
Almost all existing large models are based on the Transformer architecture. While the Transformer architecture excels at processing sequential data, its low computational efficiency and poor parameter utilization can lead to wasted computational resources. The limitations of the current Transformer architecture have prompted researchers to design new network architectures aimed at improving efficiency and generalization capabilities through enhanced attention mechanisms, introducing sparsity, and adaptive computation. For instance, the state-space-based model Mamba proposed in December 2023 introduces a selection mechanism that significantly addresses the computational efficiency issues of existing Transformer architectures, potentially becoming the next generation of foundational architectures for large models.&lt;/p&gt;&#xA;&lt;h3 id=&#34;prompt-engineering&#34;&gt;Prompt Engineering&#xA;&lt;/h3&gt;&lt;p&gt;In dealing with imbalanced datasets, prompt learning offers a new paradigm for addressing this issue. By embedding specific prompts in input data, prompt learning can improve model performance on minority classes. However, designing effective prompts and determining the robustness of these prompts (ensuring effectiveness across different types of large models) has become a discipline—prompt engineering. Further research is needed to combine well-designed prompts with other large model technologies.&lt;/p&gt;&#xA;&lt;h3 id=&#34;contextual-reasoning&#34;&gt;Contextual Reasoning&#xA;&lt;/h3&gt;&lt;p&gt;As model sizes grow, emergent capabilities such as contextual reasoning have surfaced, indicating that large models may have internalized cognitive and learning mechanisms closer to human understanding. 
The nature, triggering conditions, and controllability of these emergent capabilities are current research hotspots, requiring further exploration from cognitive science and neuroscience perspectives to provide more reasonable explanations and help people understand the principles behind the emergence of these abilities.&lt;/p&gt;&#xA;&lt;h3 id=&#34;knowledge-updating&#34;&gt;Knowledge Updating&#xA;&lt;/h3&gt;&lt;p&gt;The continuous updating of knowledge is another critical issue faced by large models. As knowledge progresses, the information within models can quickly become outdated. Researchers are exploring ways to enable models to learn continuously and integrate new knowledge while avoiding catastrophic forgetting, keeping the model&amp;rsquo;s knowledge base up to date.&lt;/p&gt;&#xA;&lt;h3 id=&#34;explainability&#34;&gt;Explainability&#xA;&lt;/h3&gt;&lt;p&gt;Despite their outstanding performance in various NLP and machine learning tasks, as the parameter count and network structure of models deepen, the decision-making processes of models become increasingly difficult to explain. The black-box nature of large models makes it challenging for users to understand how models process input data and generate output results. This leads to a passive understanding state, where people only know the model&amp;rsquo;s output but are unaware of why the model made such decisions.&lt;/p&gt;&#xA;&lt;h3 id=&#34;privacy-and-security&#34;&gt;Privacy and Security&#xA;&lt;/h3&gt;&lt;p&gt;The training data of large models may encompass personal identity information, sensitive data, or trade secrets. If these data are not adequately protected, the training process of the model may pose risks of privacy breaches or misuse. 
Additionally, large models themselves may contain sensitive information, such as memories gained from training on sensitive data, making the models inherently prone to privacy risks.&lt;/p&gt;&#xA;&lt;h3 id=&#34;data-bias-and-misleading-information&#34;&gt;Data Bias and Misleading Information&#xA;&lt;/h3&gt;&lt;p&gt;Large language models may output biased or misleading content, stemming from various factors such as data collection methods, annotators&amp;rsquo; subjective preferences, and social culture. When models are trained on biased data, they may incorrectly learn or amplify these biases, leading to unfair or discriminatory outcomes in practical applications.&lt;/p&gt;&#xA;&lt;p&gt;Addressing these issues is crucial for advancing large model technology and expanding its application scope. Solving each challenge could promote more effective applications of AI in the real world, bringing profound impacts to human society.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 11&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;218px&#34; data-flex-grow=&#34;91&#34; height=&#34;45&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-06-1c1376c74e/img-2f73d74812.jpeg&#34; width=&#34;41&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;future-of-large-models&#34;&gt;Future of Large Models&#xA;&lt;/h2&gt;&lt;p&gt;As AI technology continues to evolve and the application scenarios for large model technology expand, the future trends of large models are also presenting new characteristics and development directions.&lt;/p&gt;&#xA;&lt;h3 id=&#34;balancing-model-scale-and-efficiency&#34;&gt;Balancing Model Scale and Efficiency&#xA;&lt;/h3&gt;&lt;p&gt;Since large model technology often requires substantial computational resources and storage space, future development trends will focus on maintaining model scale while improving efficiency to meet 
practical application needs. Currently, sparse expert models are gaining attention as a novel architectural approach. Compared to traditional dense models, sparse expert models reduce computational demands by activating only the model parameters relevant to the input data, thereby enhancing computational efficiency. Google&amp;rsquo;s sparse expert model GlaM, developed in 2023, has seven times more parameters than GPT-3 but reduces energy consumption during training and the computational resources required for inference, outperforming traditional models in various NLP tasks.&lt;/p&gt;&#xA;&lt;h3 id=&#34;deep-integration-of-knowledge&#34;&gt;Deep Integration of Knowledge&#xA;&lt;/h3&gt;&lt;p&gt;Knowledge integration aims to enrich the model&amp;rsquo;s representational and decision-making capabilities by combining information from different data sources and knowledge domains. Currently, large models primarily train and apply to single-domain or single-modality data, such as BERT in the NLP domain and ViT in the CV domain. However, in the real world, text, images, audio, and other types of information are often interrelated, making it difficult for single-modality information to meet the demands of complex scenarios. Therefore, with the continuous development of CV, speech recognition, and other technologies, future large models will place greater emphasis on multimodal integration, processing data from different modalities to achieve the fusion and interaction of multimodal information. This capability for multimodal integration allows large models to better understand and process complex information. Moreover, it may be beneficial to combine large model technology with external knowledge bases to further enhance the model&amp;rsquo;s understanding and application breadth. 
This means that models can leverage not only their internal language patterns and statistical information but also integrate external structured knowledge for reasoning and decision-making, better addressing complex issues in the real world. Importantly, external knowledge can also enhance the generalization capabilities of large models.&lt;/p&gt;&#xA;&lt;h3 id=&#34;exploration-of-embodied-intelligence&#34;&gt;Exploration of Embodied Intelligence&#xA;&lt;/h3&gt;&lt;p&gt;Embodied intelligence refers to intelligent systems that perceive and act based on a physical body, acquiring information, understanding problems, making decisions, and executing actions through interactions with the environment. The proliferation of large models has significantly accelerated the research and implementation of embodied intelligence. Large language models are becoming key tools to help robots better understand and utilize advanced semantic knowledge. By automating task analysis and breaking them down into specific actions, large model technology makes interactions between robots and humans, as well as physical environments, more natural, enhancing the intelligent performance of robots. For instance, different tasks can be achieved through different large models. By using language large models for learning dialogue, visual large models for map recognition, and multimodal large models for executing physical actions, robots can learn concepts more efficiently and direct actions, while decomposing all instructions for execution, completing automated scheduling and collaboration through large model technology. 
This comprehensive utilization of different models presents new opportunities and challenges for the intelligent development of robots.&lt;/p&gt;&#xA;&lt;h3 id=&#34;explainability-and-trustworthiness&#34;&gt;Explainability and Trustworthiness&#xA;&lt;/h3&gt;&lt;p&gt;As model scales increase, their internal structures become increasingly complex, making the explainability and trustworthiness of models focal points of concern. First, to enhance model explainability, researchers will focus on developing new methods and technologies that enable large models to clearly explain their decision-making processes and the basis for their generated results. This may involve introducing more transparent model structures, such as transparent neural networks or interpretable attention mechanisms, and developing explanatory algorithms and tools to help users understand model outputs.&lt;/p&gt;&#xA;&lt;p&gt;Secondly, to enhance model trustworthiness, a series of measures will be taken to reduce the likelihood of models producing errors or misleading information. One important direction is to introduce external information sources and provide models with the capability to access and reference these sources. This way, models will be able to access the most accurate and up-to-date information, thereby improving the accuracy and trustworthiness of their output results.&lt;/p&gt;&#xA;&lt;p&gt;At the same time, to increase transparency and trust, models will also provide citations related to external information sources, allowing users to audit these sources to determine their reliability. Notably, while some large models with external information access and citation capabilities have already emerged, such as Google&amp;rsquo;s REALM and Facebook&amp;rsquo;s RAG, this is merely the beginning of development in this field. 
In the future, more innovations and advancements are expected, with new models like OpenAI&amp;rsquo;s WebGPT and DeepMind&amp;rsquo;s Sparrow further propelling development in this area, laying a more solid foundation for the future applications of large model technology. The future development of large model technology will increasingly emphasize explainability and trustworthiness, which is not only an inevitable trend in technological development but also a reasonable requirement from society for the application of technology. Only by continuously enhancing the explainability and trustworthiness of models can large model technology be better applied across various fields, bringing greater impetus to the development of human society.&lt;/p&gt;&#xA;</description>
        </item><item>
            <title>The Rise of Generative AI: Insights from Industry Experts</title>
            <link>https://digitalxber.com/posts/note-08-5dad711393/</link>
            <pubDate>Thu, 30 Nov 2023 00:00:00 +0000</pubDate>
            <guid>https://digitalxber.com/posts/note-08-5dad711393/</guid>
            <description>&lt;h2 id=&#34;introduction&#34;&gt;Introduction&#xA;&lt;/h2&gt;&lt;p&gt;On November 30, 2022, OpenAI launched ChatGPT, a chatbot that has been hailed as a pivotal moment in human history, likened to the steam engine and the iPhone. This revolutionary technology in generative AI has sparked a wave of innovation across the tech industry, prompting a re-evaluation of software and hardware.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-impact-of-generative-ai&#34;&gt;The Impact of Generative AI&#xA;&lt;/h2&gt;&lt;p&gt;The past year has witnessed a surge in the value of AI infrastructure providers, enabling advancements in various fields from healthcare to aerospace. However, this technological shift has also generated anxiety about AI&amp;rsquo;s potential threats to human existence and job security. OpenAI itself faced a crisis, nearly experiencing a collapse.&lt;/p&gt;&#xA;&lt;h2 id=&#34;questions-about-the-future&#34;&gt;Questions About the Future&#xA;&lt;/h2&gt;&lt;p&gt;As the industry evolves, several questions arise: What will be the next evolution of large language models? When will the AI chip shortage be resolved? Are we running out of training data? How will the competition among AI models in China unfold? Should the development of AI technologies accelerate or decelerate? Will AGI (Artificial General Intelligence) manifest in different forms? 
To address these questions, we invited industry experts to share their insights and pose their own questions.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img alt=&#34;Image 1&#34; class=&#34;gallery-image&#34; data-flex-basis=&#34;110px&#34; data-flex-grow=&#34;46&#34; height=&#34;2436&#34; loading=&#34;lazy&#34; sizes=&#34;(max-width: 767px) calc(100vw - 30px), (max-width: 1023px) 700px, (max-width: 1279px) 950px, 1232px&#34; src=&#34;https://digitalxber.com/posts/note-08-5dad711393/img-fa71994df5.jpeg&#34; srcset=&#34;https://digitalxber.com/posts/note-08-5dad711393/img-fa71994df5_hu_fa2433172555cf5d.jpeg 800w, https://digitalxber.com/posts/note-08-5dad711393/img-fa71994df5.jpeg 1125w&#34; width=&#34;1125&#34;&gt;&lt;/p&gt;&#xA;&lt;h2 id=&#34;openais-position-and-competitors&#34;&gt;OpenAI&amp;rsquo;s Position and Competitors&#xA;&lt;/h2&gt;&lt;p&gt;OpenAI, previously unknown to the public, has become one of the most recognized tech companies globally within a year, creating competitive pressure for giants like Google, Meta, and Amazon. Many are curious about the release date of GPT-5 and who might challenge OpenAI&amp;rsquo;s dominance.&lt;/p&gt;&#xA;&lt;p&gt;Zhang Peng, CEO of Zhiyuan AI, stated that while OpenAI is in the lead, its prominence should not blind observers to other contenders. He emphasized that true challengers must have a strong technical foundation and deep understanding. Professor Xiao Yanghua from Fudan University noted that once a model approaches AGI, its upgrade and evolution could be astonishingly rapid, raising concerns that the gap between leaders and followers could widen.&lt;/p&gt;&#xA;&lt;h2 id=&#34;user-growth-and-challenges&#34;&gt;User Growth and Challenges&#xA;&lt;/h2&gt;&lt;p&gt;After explosive early growth, OpenAI&amp;rsquo;s user growth has slowed, which is considered normal. Wang Xiaohang, VP of Ant Group, mentioned that the evolution of model capabilities is data-driven, and the availability of publicly accessible data is dwindling. 
He suggested two paths forward, emphasizing that AGI, as a centralized product, is not yet something the general public needs at high frequency. The industry trend is shifting towards creating a super ecosystem rather than a centralized super AI.&lt;/p&gt;&#xA;&lt;h2 id=&#34;future-directions-for-large-language-models&#34;&gt;Future Directions for Large Language Models&#xA;&lt;/h2&gt;&lt;p&gt;In interviews, industry leaders proposed several directions for the evolution of large language models. Liu Qingfeng, chairman of iFlytek, highlighted the need for larger parameter scales, the creation of AI personas, and deeper customization across various industry scenarios. Wang Fengyang from Baidu emphasized the importance of intelligent agents, while Zhou Bowen from Xiangyuan Technology discussed the potential for AI to effectively use tools, which he termed tool intelligence.&lt;/p&gt;&#xA;&lt;h2 id=&#34;the-competitive-landscape-in-china&#34;&gt;The Competitive Landscape in China&#xA;&lt;/h2&gt;&lt;p&gt;Following the launch of ChatGPT, Chinese tech companies entered a heated competition dubbed the &amp;ldquo;Hundred Model War,&amp;rdquo; involving both established firms and rapidly funded startups. Chen Lei from Xinyi Technology predicted a more rational market in the coming year, with a focus on practical applications and a reduction in the number of foundational models.&lt;/p&gt;&#xA;&lt;h2 id=&#34;open-source-vs-closed-source-models&#34;&gt;Open Source vs. Closed Source Models&#xA;&lt;/h2&gt;&lt;p&gt;As OpenAI becomes less open about its model parameters and training details, the question arises: Can open-source models surpass closed-source ones? 
Liang Jiaen, chairman of Yunzhisheng, estimated that while open-source models might have a broader impact in terms of application quantity, closed-source models may achieve higher performance levels.&lt;/p&gt;&#xA;&lt;h2 id=&#34;insights-from-industry-experts&#34;&gt;Insights from Industry Experts&#xA;&lt;/h2&gt;&lt;p&gt;The following are key insights from the interviews:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;strong&gt;Will GPT-5 be released?&lt;/strong&gt; Experts agree that further iterations like GPT-5, GPT-6, and GPT-7 are inevitable, but the timeline remains uncertain due to market conditions and safety evaluations.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;Who will challenge OpenAI?&lt;/strong&gt; Competitors include tech giants like Microsoft and Google, as well as startups like Anthropic and Inflection AI. The consensus is that achieving AGI is a common goal among these entities.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;How to address OpenAI&amp;rsquo;s slowing growth?&lt;/strong&gt; Experts suggest that the slowdown in user growth is normal and that the focus should shift to integrating AI into various industries to create real demand.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;conclusion&#34;&gt;Conclusion&#xA;&lt;/h2&gt;&lt;p&gt;The future of generative AI is filled with potential and challenges. As the industry continues to evolve, the focus will be on practical applications, the development of intelligent agents, and the integration of AI into everyday tasks. The competition will not only be about who can create the best models but also about who can effectively apply these technologies to solve real-world problems.&lt;/p&gt;&#xA;</description>
        </item></channel>
</rss>
