Large language models have driven progress in machine translation, leveraging massive training corpora to translate dozens of languages and dialects while capturing subtle linguistic nuances. Yet fine-tuning these models for translation accuracy often erodes their instruction-following and conversational skills, while broad general-purpose versions struggle to meet professional fidelity standards. Balancing precise, culturally aware translation with the ability to handle code generation, problem-solving, and user-specific formatting remains challenging. Models must also preserve terminological consistency and adhere to formatting guidelines across varied audiences, and stakeholders expect systems that adapt dynamically to domain requirements and user preferences without sacrificing fluency. Benchmarks such as WMT24++, which spans 55 languages and dialects, and IFEval, with its 541 instruction-focused prompts, highlight the gap between specialized translation quality and general-purpose versatility, a gap that remains a critical bottleneck for enterprise deployment.
Current Approaches to Tailoring Language Models for Translation Accuracy
Multiple approaches have been explored to tailor language models for translation. Fine-tuning pre-trained large language models on parallel corpora improves the adequacy and fluency of translated text, while continued pretraining on a mix of monolingual and parallel data enhances multilingual fluency. Some research teams have supplemented training with reinforcement learning from human feedback to align outputs with quality preferences. Proprietary systems such as GPT-4o and Claude 3.7 have demonstrated leading translation quality, and open-weight adaptations such as the TOWER V2 and GEMMA 2 models have reached parity with, or surpassed, closed-source models in certain language scenarios. These strategies reflect ongoing efforts to satisfy the dual demands of translation accuracy and broad language capability.
Introducing TOWER+: Unified Training for Translation and General Language Tasks
Researchers from Unbabel, Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa (Lisbon ELLIS Unit), and MICS, CentraleSupélec, Université Paris-Saclay, introduced TOWER+, a suite of translation-focused models. The research team designed variants at three parameter scales, 2 billion, 9 billion, and 72 billion, to explore the trade-off between translation specialization and general-purpose utility. Through a unified training pipeline, the researchers aimed to position TOWER+ models on the Pareto frontier, achieving both high translation performance and robust general capabilities without sacrificing one for the other. The approach balances the specific demands of machine translation with the flexibility required by conversational and instructional tasks, supporting a range of application scenarios.
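To make the idea of a single model serving both translation and chat concrete, the sketch below prompts an open-weight TOWER+ checkpoint with an instruction-style translation request via Hugging Face transformers. The checkpoint identifier, the chat template, and the prompt wording are illustrative assumptions, not details confirmed by the article.

```python
# Minimal sketch: instruction-style translation with an open-weight TOWER+ checkpoint.
# The model id "Unbabel/Tower-Plus-9B" and the prompt format are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Unbabel/Tower-Plus-9B"  # assumed Hugging Face identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Translation framed as an ordinary chat instruction, since the unified
# pipeline is meant to handle both translation and conversational requests.
messages = [{
    "role": "user",
    "content": "Translate the following text from English into German.\n"
               "English: The contract must be signed by both parties.\n"
               "German:",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```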
TOWER+ Training Pipeline: Pretraining, Supervised Tuning, Preferences, and RL
The training pipeline begins with continued pretraining on carefully curated data that includes monolingual content, filtered parallel sentences formatted as translation instructions, and a small fraction of instruction-like examples. Next, supervised fine-tuning refines the model on a combination of translation tasks and diverse instruction-following scenarios, including code generation, mathematical problem-solving, and question answering. A preference optimization stage follows, employing weighted preference optimization and group-relative policy updates trained on off-policy signals and human-edited translation variants. Finally, reinforcement learning with verifiable rewards reinforces precise compliance with translation guidelines, using regex-based checks and preference annotations to sharpen the model's ability to follow explicit instructions during translation. Together, pretraining, supervised alignment, and reward-driven updates yield a robust balance between specialized translation accuracy and versatile language proficiency.
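The article does not spell out the reward implementation, but a regex-based verifiable reward of the kind described could look like the minimal sketch below: it checks whether a candidate output obeys simple, machine-checkable formatting guidelines and returns a binary reward. The specific constraints (a wrapper tag and a term that must stay untranslated) are hypothetical examples, not the checks used in the TOWER+ pipeline.

```python
import re

def verifiable_reward(output: str, must_keep: str, tag: str = "translation") -> float:
    """Toy verifiable reward: 1.0 only if the output passes every regex check.

    Illustrative sketch only; the wrapper tag and protected-term constraints
    are assumptions, not the actual guidelines used in the TOWER+ paper.
    """
    checks = [
        # The translation must be wrapped in the requested tag.
        re.search(rf"<{tag}>.*?</{tag}>", output, flags=re.DOTALL) is not None,
        # A protected term (e.g., a brand name) must appear verbatim, untranslated.
        re.search(re.escape(must_keep), output) is not None,
    ]
    return 1.0 if all(checks) else 0.0

# Example: this candidate satisfies both guidelines, so the reward is 1.0.
candidate = "<translation>Der AcmeCloud-Dienst ist jetzt verfügbar.</translation>"
print(verifiable_reward(candidate, must_keep="AcmeCloud"))
```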
Benchmark Results: TOWER+ Achieves State-of-the-Art Translation and Instruction Following
The TOWER+ 9B model achieved a 33.47% win rate on M-ArenaHard, a benchmark of multilingual general chat prompts, while earning an XCOMET-XXL score of 84.38 across 24 language pairs, outperforming similarly sized open-weight counterparts. The flagship 72-billion-parameter variant secured a 54.52% win rate on M-ArenaHard, recorded an IFEval instruction-following score of 89.02, and reached an XCOMET-XXL score of 83.29 on the full WMT24++ benchmark. On IF-MT, the combined translation and instruction-following benchmark, it scored 5.55 for instruction adherence and 88.95 for translation fidelity, establishing state-of-the-art results among open-weight models. These outcomes confirm that the researchers' integrative pipeline effectively bridges the gap between specialized translation performance and broad language capability, demonstrating its viability for both enterprise and research applications.
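For readers who want to compute translation-quality numbers of this kind themselves, the sketch below shows how segment-level XCOMET-style scoring is typically run with Unbabel's open-source COMET package. The checkpoint name, example sentences, and hardware settings are placeholders, and the exact evaluation configuration used in the paper may differ.

```python
# Sketch: scoring translations with an XCOMET checkpoint via the unbabel-comet package.
# Assumes `pip install unbabel-comet` and access to the Unbabel/XCOMET-XL checkpoint;
# the paper reports XCOMET-XXL, which uses the same interface but a larger model.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/XCOMET-XL")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "The invoice must be paid within 30 days.",
        "mt": "Die Rechnung muss innerhalb von 30 Tagen bezahlt werden.",
        "ref": "Die Rechnung ist innerhalb von 30 Tagen zu begleichen.",
    }
]

# Returns segment-level scores plus a corpus-level average on a 0-1 scale,
# often reported multiplied by 100 as in the benchmark figures above.
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```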
Key Technical Highlights of the TOWER+ Models
- TOWER+ models, developed by Unbabel and academic partners, span 2B, 9B, and 72B parameters to explore the performance frontier between translation specialization and general-purpose utility.
- The post-training pipeline integrates four stages: continued pretraining (66% monolingual, 33% parallel, and 1% instruction), supervised fine-tuning (22.3% translation), Weighted Preference Optimization, and verifiable reinforcement learning, to preserve chat skills while enhancing translation accuracy.
- Continued pretraining covers 27 languages and dialects, as well as 47 language pairs, over 32 billion tokens, merging specialized and general checkpoints to maintain balance (a minimal weight-merging sketch follows this list).
- The 9B variant achieved a 33.47% win rate on M-ArenaHard, 83.84% on IFEval, and an XCOMET-XXL score of 84.38 across 24 language pairs, with IF-MT scores of 4.85 (instruction) and 88.51 (translation).
- The 72B model recorded a 54.52% win rate on M-ArenaHard, 89.02% on IFEval, an XCOMET-XXL score of 83.29, and IF-MT scores of 5.55 (instruction) and 88.95 (translation), setting a new open-weight standard.
- Even the 2B model held its own against larger baselines, posting a 6.33% win rate on M-ArenaHard and an IF-MT translation-quality score of 87.65.
- Benchmarked against GPT-4o-1120, Claude-Sonnet-3.7, ALMA-R, GEMMA-2, and LLAMA-3.3, the TOWER+ suite consistently matches or outperforms them on both specialized and general tasks.
- The research provides a reproducible recipe for building LLMs that serve translation and conversational needs concurrently, reducing model proliferation and operational overhead.
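The checkpoint merging mentioned in the list above is described only at a high level; one common way to realize it is linear interpolation of parameter tensors between a translation-specialized checkpoint and a general-purpose one, as in the hedged sketch below. The 50/50 mixing ratio, the placeholder model identifiers, and the plain state-dict averaging are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch: merging a translation-specialized checkpoint with a general-purpose one
# by linear interpolation of their weights. The mixing ratio and state-dict
# averaging are assumptions; TOWER+'s exact merging procedure may differ.
import torch
from transformers import AutoModelForCausalLM

def merge_checkpoints(general_id: str, specialized_id: str, alpha: float = 0.5):
    """Return a model whose weights are alpha * specialized + (1 - alpha) * general."""
    general = AutoModelForCausalLM.from_pretrained(general_id, torch_dtype=torch.bfloat16)
    specialized = AutoModelForCausalLM.from_pretrained(specialized_id, torch_dtype=torch.bfloat16)

    gen_state, spec_state = general.state_dict(), specialized.state_dict()
    merged_state = {}
    for name, gen_tensor in gen_state.items():
        # Both checkpoints must share the same architecture for this to be valid.
        merged_state[name] = alpha * spec_state[name] + (1.0 - alpha) * gen_tensor

    general.load_state_dict(merged_state)
    return general

# Usage (identifiers are placeholders, not released model names):
# merged = merge_checkpoints("org/general-9b", "org/translation-9b", alpha=0.5)
# merged.save_pretrained("tower-style-merged-9b")
```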
Conclusion: A Pareto-Optimal Framework for Future Translation-Focused LLMs
In conclusion, by unifying large-scale pretraining with specialized alignment stages, TOWER+ demonstrates that translation excellence and conversational versatility can coexist within a single open-weight suite. The models achieve a Pareto-optimal balance across translation fidelity, instruction-following, and general chat capabilities, offering a scalable blueprint for future domain-specific LLM development.
Check out the Paper and Models. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.