Reframing Code LLM Training through Scalable, Automated Data Pipelines
Code data plays a key role in training LLMs, benefiting not just coding tasks but also broader reasoning abilities. While many open-source models rely on manual filtering and expert-crafted rules to curate code datasets, these approaches are time-consuming, prone to bias, and hard to scale across languages. Proprietary models such as Claude 3.7 and OpenAI o3 excel at coding but disclose little about their data, and even open-source models like DeepSeek and Qwen2.5 still depend heavily on human-designed filters. This reliance limits progress, echoing "The Bitter Lesson": real breakthroughs come from scalable, data-driven methods, not handcrafted heuristics.
Seed-Coder’s Model-First Pipeline Minimizes Human Dependency in Pretraining
Researchers at ByteDance introduce Seed-Coder, a family of 8B open-source LLMs comprising base, instruction, and reasoning models, designed to minimize human involvement in code data curation. Instead of relying on manual rules, their model-centric pipeline uses LLMs to score and filter large-scale code data from sources such as GitHub and code-related websites, yielding a 6-trillion-token dataset. The instruction model is fine-tuned with synthetic data and preference optimization, while the reasoning model strengthens multi-step code logic via long-chain-of-thought (LongCoT) reinforcement learning. Seed-Coder achieves top performance for its size, often surpassing larger models, and is openly shared to encourage further research and development.
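To make the model-centric filtering concrete, here is a minimal sketch of the idea: a judge assigns each file a quality score, and only files above a cutoff survive into the corpus. Everything here is illustrative, not Seed-Coder's actual prompt or rubric; in particular, heuristic_score is a toy stand-in for the LLM judge so the sketch runs end to end, and the 0-10 scale and 6.0 threshold are assumptions.

```python
from typing import Callable, Iterable, Iterator

def heuristic_score(code: str) -> float:
    """Toy stand-in for an LLM quality judge, returning a 0-10 score.
    The real pipeline would prompt an LLM to grade readability,
    correctness, and educational value instead."""
    score = 5.0
    if "def " in code or "class " in code:
        score += 2.0   # reward structured, reusable code
    if len(code) < 20:
        score -= 3.0   # penalize trivial snippets
    return max(0.0, min(10.0, score))

def filter_corpus(
    files: Iterable[str],
    score_fn: Callable[[str], float] = heuristic_score,
    threshold: float = 6.0,  # assumed cutoff, not the paper's setting
) -> Iterator[str]:
    """Keep only files the judge scores at or above the threshold."""
    for code in files:
        if score_fn(code) >= threshold:
            yield code

corpus = ["def add(a, b):\n    return a + b", "x=1"]
print(list(filter_corpus(corpus)))  # the trivial snippet is dropped
```

The key design point is that score_fn is pluggable: swapping the heuristic for an LLM call scales quality control across languages without rewriting hand-crafted rules for each one.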
6-Trillion Token Corpus Built with LLM Quality Filters across GitHub and Web Data
Seed-Coder is trained with a model-driven approach that minimizes manual intervention. The pretraining corpus comprises approximately 6 trillion tokens drawn from GitHub code, commit histories, and code-related web data. Basic filtering first removes files with syntax errors or inappropriate content; large language models then evaluate and score the remaining code, ensuring high-quality data without hand-crafted rules. Pretraining proceeds in two stages: the first uses core code and web data, while the second adds more complex structures such as full repositories and long-context objectives like fill-in-the-middle, sketched below, to strengthen the model's coding capabilities.
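Fill-in-the-middle (FIM) training reorders each document so the model learns to predict a masked middle span from its surrounding context. Below is a minimal sketch in the common prefix-suffix-middle (PSM) layout from the FIM literature; the sentinel strings and the random split strategy are placeholders, as Seed-Coder's exact special tokens and data transform are tokenizer-specific and not reproduced here.

```python
import random

# Placeholder sentinels; real models define dedicated special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def to_fim_example(document: str, rng: random.Random) -> str:
    """Split a document into prefix/middle/suffix and emit the PSM layout:
    the model sees prefix and suffix first, then learns to generate the middle."""
    a, b = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

print(to_fim_example("def square(x):\n    return x * x\n", random.Random(0)))
```

Trained this way, the model can complete code given both what comes before and after the cursor, which is exactly the setting of IDE-style code completion.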
Post-Training via Instruction Tuning and LongCoT Enables Multi-Step Code Understanding
After pretraining, Seed-Coder undergoes two post-training stages. First, the instruction model is trained with supervised fine-tuning on a diverse set of synthetic instruction data generated and filtered by LLMs, helping it understand and follow human prompts. It is then refined with direct preference optimization (DPO), which aligns model responses more closely with human preferences. For complex reasoning tasks, the reasoning model is improved with LongCoT reinforcement learning, strengthening its ability to handle multi-step coding challenges. Together, these steps significantly boost Seed-Coder's performance across code generation and reasoning tasks.
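The DPO step can be summarized by its loss: the policy is pushed to widen the log-likelihood margin between a preferred and a rejected response, relative to a frozen reference model. Here is a minimal PyTorch sketch of the standard DPO objective; the beta value and the toy log-probabilities are illustrative, and Seed-Coder's actual preference-data construction is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # log p_policy(chosen | prompt)
    policy_rejected_logp: torch.Tensor,  # log p_policy(rejected | prompt)
    ref_chosen_logp: torch.Tensor,       # same quantities under the frozen reference
    ref_rejected_logp: torch.Tensor,
    beta: float = 0.1,                   # illustrative temperature, not the paper's setting
) -> torch.Tensor:
    """Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of sequence log-probabilities for two preference pairs.
loss = dpo_loss(
    torch.tensor([-4.0, -6.0]), torch.tensor([-7.0, -6.5]),
    torch.tensor([-5.0, -6.2]), torch.tensor([-5.5, -6.4]),
)
print(loss.item())
```

Because only log-probabilities are needed, no separate reward model is trained, which is DPO's main practical appeal over classic RLHF.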
Seed-Coder Excels in Code Generation, Editing, and Multi-Step Reasoning Benchmarks
The evaluation shows that all three Seed-Coder models (Base, Instruct, and Reasoning) perform exceptionally well across a range of coding tasks. The Base model outperforms other open-source models of similar size on code generation, achieving strong scores on benchmarks like HumanEval and MultiPL-E. The Instruct model excels at code editing and instruction following, leading evaluations such as CodeEditorBench and FullStack Bench. The Reasoning model, trained with LongCoT techniques, demonstrates outstanding multi-step problem solving, particularly on challenging benchmarks like LiveCodeBench and Codeforces, even surpassing models several times its size.
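Benchmarks like HumanEval typically report pass@k, the probability that at least one of k sampled solutions passes all unit tests. The unbiased estimator from the original HumanEval paper (Chen et al., 2021) is simple to compute; the sample counts below are invented for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn per problem, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 37 pass the tests.
print(f"pass@1  = {pass_at_k(200, 37, 1):.3f}")   # 0.185
print(f"pass@10 = {pass_at_k(200, 37, 10):.3f}")
```

Averaging this estimate over all problems in the benchmark gives the headline score that model comparisons like those above are based on.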

In conclusion, Seed-Coder is a family of efficient, high-performing open-source language models designed specifically for coding tasks. The models stand out by relying largely on LLMs rather than humans to filter and curate training data, significantly reducing manual effort. Despite being trained on fewer tokens than some larger models, Seed-Coder delivers exceptional performance in code generation, completion, editing, and reasoning. Its general language understanding remains limited, however, due to the absence of broad web data and mathematical content. Future updates aim to expand the model family and improve its capabilities across different model sizes.
Check out the Paper, Model Series, GitHub Page and Project Page. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.