This Ultra-Light Mistral Devstral tutorial is a Colab-friendly guide designed specifically for users facing disk space constraints. Running large language models like Mistral can be a challenge in environments with limited storage and memory, but this tutorial shows how to deploy the powerful devstral-small model anyway. With aggressive 4-bit quantization via BitsAndBytes, proactive cache management, and efficient token generation, it walks you through building a lightweight assistant that’s fast, interactive, and disk-conscious. Whether you’re debugging code, writing small tools, or prototyping on the go, this setup delivers maximum performance with a minimal footprint.
!pip install -q kagglehub mistral-common bitsandbytes transformers --no-cache-dir
!pip install -q accelerate torch --no-cache-dir
import shutil
import os
import gc
The tutorial begins by installing the essential lightweight packages kagglehub, mistral-common, bitsandbytes, and transformers with the --no-cache-dir flag, so pip keeps no wheel cache on disk. It also installs accelerate and torch for efficient model loading and inference. To further optimize space, the shutil, os, and gc modules are imported so that pre-existing cache and temporary directories can be cleared in the next step.
def cleanup_cache():
    """Clean up unnecessary files to save disk space"""
    cache_dirs = ['/root/.cache', '/tmp/kagglehub']
    for cache_dir in cache_dirs:
        if os.path.exists(cache_dir):
            shutil.rmtree(cache_dir, ignore_errors=True)
    gc.collect()

cleanup_cache()
print("🧹 Disk space optimized!")
To maintain a minimal disk footprint throughout execution, the cleanup_cache() function is defined to remove redundant cache directories like /root/.cache and /tmp/kagglehub. This proactive cleanup helps free up space before and after key operations. Once invoked, the function confirms that disk space has been optimized, reinforcing the tutorial’s focus on resource efficiency.
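If you want to verify how much space each cleanup actually reclaims, a quick before-and-after check with Python’s built-in shutil.disk_usage can be wrapped around the call (an illustrative sketch, not part of the original notebook):

# Optional sanity check: measure free disk space before and after cleanup_cache()
import shutil

free_before = shutil.disk_usage("/").free
cleanup_cache()
free_after = shutil.disk_usage("/").free
print(f"🧹 Freed roughly {(free_after - free_before) / 1e6:.1f} MB")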
import warnings
warnings.filterwarnings("ignore")
import torch
import kagglehub
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
To ensure smooth execution without distracting warning messages, we suppress all runtime warnings using Python’s warnings module. We then import the essential libraries for model interaction: torch for tensor computations, kagglehub for streaming the model, and transformers for loading the quantized LLM. Mistral-specific classes such as UserMessage, ChatCompletionRequest, and MistralTokenizer are also imported to handle tokenization and request formatting tailored to Devstral’s architecture.
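As a quick illustration of the request format these Mistral-specific classes produce, the following minimal sketch (with a made-up prompt) builds the same kind of chat request that the assistant’s generate method encodes later on:

# Illustrative only: construct the chat request that the tokenizer will encode
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

request = ChatCompletionRequest(
    messages=[UserMessage(content="Explain list comprehensions in one sentence.")]
)
# Later, tokenizer.encode_chat_completion(request).tokens yields the input token ids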
class LightweightDevstral:
    def __init__(self):
        print("📦 Downloading model (streaming mode)...")

        self.model_path = kagglehub.model_download(
            'mistral-ai/devstral-small-2505/Transformers/devstral-small-2505/1',
            force_download=False
        )

        quantization_config = BitsAndBytesConfig(
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_storage=torch.uint8,
            load_in_4bit=True
        )

        print("⚡ Loading ultra-compressed model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            torch_dtype=torch.float16,
            device_map="auto",
            quantization_config=quantization_config,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        self.tokenizer = MistralTokenizer.from_file(f'{self.model_path}/tekken.json')

        cleanup_cache()
        print("✅ Lightweight assistant ready! (~2GB disk usage)")

    def generate(self, prompt, max_tokens=400):
        """Memory-efficient generation"""
        tokenized = self.tokenizer.encode_chat_completion(
            ChatCompletionRequest(messages=[UserMessage(content=prompt)])
        )

        input_ids = torch.tensor([tokenized.tokens])
        if torch.cuda.is_available():
            input_ids = input_ids.to(self.model.device)

        with torch.inference_mode():
            output = self.model.generate(
                input_ids=input_ids,
                max_new_tokens=max_tokens,
                temperature=0.6,
                top_p=0.85,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
                use_cache=True
            )[0]

        del input_ids
        torch.cuda.empty_cache() if torch.cuda.is_available() else None

        return self.tokenizer.decode(output[len(tokenized.tokens):])

print("🚀 Initializing lightweight AI assistant...")
assistant = LightweightDevstral()
We define the LightweightDevstral class, the core component of the tutorial, which handles model loading and text generation in a resource-efficient manner. It begins by streaming the devstral-small-2505 model using kagglehub, avoiding redundant downloads. The model is then loaded with aggressive 4-bit quantization via BitsAndBytesConfig, significantly reducing memory and disk usage while still enabling performant inference. A custom tokenizer is initialized from a local JSON file, and the cache is cleared immediately afterward. The generate method employs memory-safe practices, such as torch.inference_mode() and empty_cache(), to generate responses efficiently, making this assistant suitable even for environments with tight hardware constraints.
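To get a rough sense of how much memory the 4-bit weights actually occupy once the assistant is loaded, you can query two standard transformers utilities, get_memory_footprint() and hf_device_map (an optional sketch, assuming the cell above has already run):

# Optional inspection of the quantized model (run after creating the assistant)
footprint_gb = assistant.model.get_memory_footprint() / 1e9
print(f"Approximate model memory footprint: {footprint_gb:.2f} GB")
print("Device placement:", assistant.model.hf_device_map)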
def run_demo(title, prompt, emoji="🎯"):
    """Run a single demo with cleanup"""
    print(f"\n{emoji} {title}")
    print("-" * 50)

    result = assistant.generate(prompt, max_tokens=350)
    print(result)

    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

run_demo(
    "Quick Prime Finder",
    "Write a fast prime checker function `is_prime(n)` with explanation and test cases.",
    "🔢"
)

run_demo(
    "Debug This Code",
    """Fix this buggy function and explain the issues:
```python
def avg_positive(numbers):
    total = sum([n for n in numbers if n > 0])
    return total / len([n for n in numbers if n > 0])
```""",
    "🐛"
)

run_demo(
    "Text Tool Creator",
    "Create a simple `TextAnalyzer` class with word count, char count, and palindrome check methods.",
    "🛠️"
)
Here we showcase the model’s coding abilities through a compact demo suite using the run_demo() function. Each demo sends a prompt to the Devstral assistant and prints the generated response, immediately followed by memory cleanup to prevent buildup over multiple runs. The examples include writing an efficient prime-checking function, debugging a Python snippet with logical flaws, and building a mini TextAnalyzer class. These demonstrations highlight the model’s utility as a lightweight, disk-conscious coding assistant capable of real-time code generation and explanation.
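Extending the suite with your own prompts follows the same pattern; for example, a hypothetical extra demo (not part of the original set) can be queued up like this:

# Hypothetical extra demo reusing the same run_demo() helper
extra_demos = [
    ("FizzBuzz Refactor", "Refactor a classic fizzbuzz loop into a clean, testable function.", "🔁"),
]
for title, prompt, emoji in extra_demos:
    run_demo(title, prompt, emoji)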
def quick_coding():
    """Lightweight interactive session"""
    print("\n🎮 QUICK CODING MODE")
    print("=" * 40)
    print("Enter short coding prompts (type 'exit' to quit)")

    session_count = 0
    max_sessions = 5

    while session_count < max_sessions:
        prompt = input(f"\n[{session_count+1}/{max_sessions}] Your prompt: ")

        if prompt.lower() in ['exit', 'quit', '']:
            break

        try:
            result = assistant.generate(prompt, max_tokens=300)
            print("💡 Solution:")
            print(result[:500])

            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except Exception as e:
            print(f"❌ Error: {str(e)[:100]}...")

        session_count += 1

    print(f"\n✅ Session complete! Memory cleaned.")
We introduce Quick Coding Mode, a lightweight interactive interface that allows users to submit short coding prompts directly to the Devstral assistant. Designed to limit memory usage, the session caps interaction to five prompts, each followed by aggressive memory cleanup to ensure continued responsiveness in low-resource environments. The assistant responds with concise, truncated code suggestions, making this mode ideal for rapid prototyping, debugging, or exploring coding concepts on the fly, all without overwhelming the notebook’s disk or memory capacity.
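Note that quick_coding() is only defined in the snippet above, not invoked; to try it, call it in its own cell (it reads from stdin, so it needs an interactive notebook session):

# Start the interactive session in a separate cell
quick_coding()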
def check_disk_usage():
    """Monitor disk usage"""
    import subprocess
    try:
        result = subprocess.run(['df', '-h', '/'], capture_output=True, text=True)
        lines = result.stdout.split('\n')
        if len(lines) > 1:
            usage_line = lines[1].split()
            used = usage_line[2]
            available = usage_line[3]
            print(f"💾 Disk: {used} used, {available} available")
    except:
        print("💾 Disk usage check unavailable")

print("\n🎉 Tutorial Complete!")
cleanup_cache()
check_disk_usage()

print("\n💡 Space-Saving Tips:")
print("• Model uses ~2GB vs original ~7GB+")
print("• Automatic cache cleanup after each use")
print("• Limited token generation to save memory")
print("• Use 'del assistant' when done to free ~2GB")
print("• Restart runtime if memory issues persist")
Finally, we add a last cleanup pass and a helpful disk usage monitor. The check_disk_usage() function runs the df -h command via Python’s subprocess module and prints how much disk space is used and available, confirming the model’s lightweight footprint. After re-invoking cleanup_cache() to ensure minimal residue, the script concludes with a set of practical space-saving tips.
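Following the tip about freeing memory when you are finished, a short teardown cell along these lines (a sketch based on the tips above) releases the roughly 2GB held by the assistant:

# Optional teardown, following the space-saving tips above
del assistant                      # drop the reference to the quantized model
gc.collect()                       # let Python reclaim the freed objects
if torch.cuda.is_available():
    torch.cuda.empty_cache()       # return cached GPU memory to the driver
cleanup_cache()                    # remove any remaining cache directories on disk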
In conclusion, we can now leverage the capabilities of Mistral’s Devstral model in space-constrained environments like Google Colab, without compromising usability or speed. The model loads in a highly compressed format, performs efficient text generation, and ensures memory is promptly cleared after use. With the interactive coding mode and demo suite included, users can test their ideas quickly and seamlessly.