Tx-LLM: A Large Language Model (LLM) Fine-Tuned from PaLM-2 to Predict Properties of Many Entities that are Relevant to Therapeutic Development


Developing therapeutics is costly and time-consuming: it often takes 10-15 years and up to $2 billion, and most drug candidates fail during clinical trials. A successful therapeutic must meet several criteria, such as interacting with its target, being non-toxic, and having suitable pharmacokinetics. Current AI models focus on specialized tasks within this pipeline, and their narrow scope can limit performance. The Therapeutics Data Commons (TDC) offers datasets to help AI models predict drug properties, yet the models built on them work independently of one another. LLMs, which excel at multi-tasking, have the potential to improve therapeutic development by learning across diverse tasks with a unified approach.

Transformer-based LLMs have advanced natural language processing through self-supervised learning on large datasets. Recent studies show that LLMs can handle diverse tasks, including regression, using textual representations of parameters. In therapeutics, specialized models such as graph neural networks (GNNs) represent molecules as graphs for tasks such as drug discovery. Protein and nucleic acid sequences are also encoded to predict properties like binding and structure. LLMs are increasingly applied in biology and chemistry, with models like LlaSMol and protein-specific models achieving promising results in drug synthesis and protein engineering tasks.

Researchers from Google Research and Google DeepMind introduced Tx-LLM, a generalist large language model fine-tuned from PaLM-2 and designed to handle diverse therapeutic tasks. Trained on 709 datasets covering 66 tasks across the drug discovery pipeline, Tx-LLM uses a single set of weights to process various chemical and biological entities, such as small molecules, proteins, and nucleic acids. It performs at or near state of the art on 43 of those tasks and surpasses it on 22 of them. Tx-LLM excels in tasks that combine molecular representations with text and shows positive transfer between different drug types. These results position the model as a potential tool for end-to-end drug development.

The researchers compiled a dataset collection called TxT, containing 709 drug discovery datasets from the TDC repository and spanning 66 tasks. Each dataset was formatted for instruction tuning with four components: instructions, context, question, and answer. The tasks included binary classification, regression, and generation, with representations like SMILES strings for molecules and amino acid sequences for proteins. Tx-LLM was fine-tuned from PaLM-2 using this data. The researchers evaluated the model's performance using metrics such as AUROC, Spearman correlation, and set accuracy. Statistical tests and data contamination analyses were performed to ensure robust results.
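To make the four-component format concrete, here is a minimal Python sketch of how one instruction-tuning example might be assembled. The class name `TxTExample`, the function `format_prompt`, the field labels, and the example wording are all assumptions for illustration; the article does not show the actual TxT templates.

```python
from dataclasses import dataclass


@dataclass
class TxTExample:
    """One instruction-tuning example with the four components named above."""
    instructions: str
    context: str
    question: str
    answer: str


def format_prompt(example: TxTExample) -> str:
    """Join instructions, context, and question into a single prompt string.

    The labels and layout here are hypothetical, not the real TxT template.
    """
    return (
        f"Instructions: {example.instructions}\n"
        f"Context: {example.context}\n"
        f"Question: {example.question}\n"
        "Answer:"
    )


# Hypothetical binary-classification example. The SMILES string is aspirin;
# the wording and the answer label are placeholders, not real TDC data.
example = TxTExample(
    instructions="Answer the following question about drug properties.",
    context="Some small molecules are toxic to humans.",
    question="Is the molecule CC(=O)OC1=CC=CC=C1C(=O)O toxic? (A) No (B) Yes",
    answer="(A)",
)

prompt = format_prompt(example)  # model input
target = example.answer          # text the model is trained to produce
print(prompt)
```

In a setup like this, a regression or generation task would reuse the same structure and change only the question and the form of the answer.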

Tx-LLM demonstrated strong performance on TDC datasets, performing at or near state-of-the-art (SOTA) levels on 43 of 66 tasks: it outperformed SOTA on 22 and came close on the other 21. Notably, Tx-LLM excelled on datasets that combine SMILES molecular strings with text features like disease or cell line descriptions, likely because of knowledge acquired from text during pretraining. However, it struggled on datasets that relied solely on SMILES strings, where graph-based models were more effective. Overall, the results highlight the strengths of fine-tuned language models for tasks involving drugs and text-based features.
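Comparisons like these depend on the evaluation metrics mentioned earlier. As a rough illustration, the following pure-Python sketch computes AUROC (for binary classification) and Spearman correlation (for regression) on toy data. The helper names and the numbers are invented for this example and are not results from the paper; in practice a library such as scikit-learn or SciPy would normally be used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy data, invented for illustration only.
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
toy_spearman = spearman([1.0, 2.0, 3.0, 4.0], [1.2, 0.9, 3.1, 4.5])
print(toy_auroc)     # 0.75
print(toy_spearman)  # 0.8
```

Both metrics depend only on the ordering of predictions, not their scale, which makes them convenient for comparing an LLM's text-derived outputs against specialized models.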

Tx-LLM is the first LLM trained on diverse TDC datasets covering molecules, proteins, cells, and diseases. Interestingly, training with non-small-molecule datasets, such as those for proteins, improved performance on small molecule tasks. While general LLMs have struggled with specialized chemistry tasks, Tx-LLM excelled in regression, outperforming state-of-the-art models in several cases. The model shows potential for end-to-end drug development, from gene identification to clinical trials. However, Tx-LLM is still in the research stage, with limitations in following natural language instructions and in prediction accuracy, and it requires further improvement and validation before broader application.


Check out the Paper and Details. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.




