# Performance Tuning Guide
This guide covers tunable parameters across model inference, parallel processing, memory management, and caching. Use it to optimize File Organizer for your hardware and workload.
## Overview

Performance-sensitive components:

| Component | Purpose | Source Module |
|---|---|---|
| `ModelConfig` | Inference parameters | `src/file_organizer/models/base.py` |
| `ParallelConfig` | Worker pool and timeouts | `src/file_organizer/parallel/config.py` |
| `AdaptiveBatchSizer` | Memory-aware batch sizing | `src/file_organizer/optimization/batch_sizer.py` |
| `ModelCache` | LRU model caching with TTL | `src/file_organizer/optimization/model_cache.py` |
| `ModelWarmup` | Background model pre-loading | `src/file_organizer/optimization/warmup.py` |
| `ResourceMonitor` | Memory and GPU monitoring | `src/file_organizer/optimization/resource_monitor.py` |
| `MemoryLimiter` | Hard memory caps | `src/file_organizer/optimization/memory_limiter.py` |
## Model Configuration

`ModelConfig` controls inference behavior for all AI models (text, vision, audio).

| Parameter | Default | Description |
|---|---|---|
| `name` | (required) | Model identifier (e.g., `qwen2.5:3b-instruct-q4_K_M`) |
| `model_type` | (required) | `TEXT`, `VISION`, `AUDIO`, or `VIDEO` |
| `quantization` | `q4_k_m` | Quantization level (lower = faster, less accurate) |
| `device` | `AUTO` | Inference device: `AUTO`, `CPU`, `CUDA`, `MPS`, `METAL` |
| `temperature` | 0.5 | Sampling temperature (lower = more deterministic) |
| `max_tokens` | 3000 | Maximum tokens in the generated response |
| `top_k` | 3 | Top-k sampling (fewer candidates = faster) |
| `top_p` | 0.3 | Nucleus sampling threshold |
| `context_window` | 4096 | Maximum context length in tokens |
| `batch_size` | 1 | Batch size for inference |
| `framework` | `ollama` | Backend framework: `ollama`, `llama_cpp`, `mlx` |
### Device Selection

```python
from file_organizer.models.base import DeviceType

# Automatic (recommended) - detects the best available device
DeviceType.AUTO

# Force CPU (universal, slower)
DeviceType.CPU

# NVIDIA GPU (fastest for supported models)
DeviceType.CUDA

# Apple Silicon GPU
DeviceType.MPS

# Apple Metal via MLX
DeviceType.METAL
```
### Tuning Tips

- Lower `max_tokens` to 200-500 for classification tasks that only need short responses (folder names, filenames)
- Reduce `temperature` to 0.1-0.3 for more consistent naming
- Reduce `context_window` if processing only small files to save memory
- Use `q4_k_m` quantization for the best speed/quality tradeoff
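As a sketch, these tips combine into a profile for short classification responses. This assumes `ModelConfig` accepts the table's parameters as keyword arguments and that the model-type enum is exported as `ModelType`; verify both names against `src/file_organizer/models/base.py`:

```python
from file_organizer.models.base import DeviceType, ModelConfig, ModelType  # ModelType name assumed

# Hypothetical profile for folder/filename classification:
# short, deterministic outputs with a small context window.
classify_config = ModelConfig(
    name="qwen2.5:3b-instruct-q4_K_M",
    model_type=ModelType.TEXT,   # enum name is an assumption
    device=DeviceType.AUTO,
    quantization="q4_k_m",       # best speed/quality tradeoff
    temperature=0.2,             # more consistent naming
    max_tokens=300,              # classification needs only short responses
    context_window=2048,         # small inputs only
)
```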
## Parallel Processing

`ParallelConfig` controls how files are processed concurrently.

| Parameter | Default | Description |
|---|---|---|
| `max_workers` | None (CPU count) | Maximum worker threads or processes |
| `executor_type` | `THREAD` | `THREAD` (I/O-bound) or `PROCESS` (CPU-bound) |
| `chunk_size` | 10 | Files submitted per scheduling round |
| `timeout_per_file` | 60.0 | Seconds before a file's processing times out |
| `retry_count` | 2 | Retry attempts for failed files |
### Tuning Tips

- For Ollama-based inference (I/O-bound), use the `THREAD` executor
- For local model inference with a GPU, `max_workers=1` prevents GPU contention
- Increase `timeout_per_file` for large PDFs or videos (120-300s)
- Increase `chunk_size` for many small files (50-100)
- Set `retry_count=0` to fail fast during bulk operations
```python
from file_organizer.parallel.config import ParallelConfig, ExecutorType

config = ParallelConfig(
    max_workers=4,
    executor_type=ExecutorType.THREAD,
    chunk_size=20,
    timeout_per_file=120.0,
    retry_count=1,
)
```
## Adaptive Batch Sizing

`AdaptiveBatchSizer` calculates how many files to process per batch based on available system memory.

| Parameter | Default | Description |
|---|---|---|
| `target_memory_percent` | 70.0 | Target percentage of available memory to use |
| `min_batch_size` | 1 | Minimum files per batch |
| `max_batch_size` | 1000 | Maximum files per batch |
### How It Works

- Queries available system memory (Linux `/proc/meminfo`, macOS `sysctl`)
- Calculates a memory budget from `target_memory_percent`
- Estimates per-file cost from average file size plus overhead
- Returns the number of files that fit in the budget
- Accepts runtime feedback via `adjust_from_feedback()` to refine estimates
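The steps above amount to simple budget arithmetic, sketched here in plain Python (an illustration of the idea, not the module's actual implementation):

```python
def estimate_batch_size(available_bytes, file_sizes, overhead_per_file,
                        target_percent=70.0, min_size=1, max_size=1000):
    """Illustrative memory-aware batch sizing."""
    budget = available_bytes * (target_percent / 100.0)  # memory budget
    avg_file = sum(file_sizes) / len(file_sizes) if file_sizes else 0
    per_file = avg_file + overhead_per_file              # estimated cost per file
    if per_file <= 0:
        return max_size                                  # nothing to constrain
    fits = int(budget // per_file)                       # files that fit in budget
    return max(min_size, min(max_size, fits))            # clamp to bounds
```

The clamp at the end is what `min_batch_size` / `max_batch_size` (or `set_bounds()`) control.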
### Tuning Tips

- Lower `target_memory_percent` (50-60%) on systems running other services
- Use `set_bounds(min_size=5, max_size=50)` to constrain batch sizes
- Call `adjust_from_feedback()` after each batch to let the sizer learn
```python
from file_organizer.optimization.batch_sizer import AdaptiveBatchSizer

sizer = AdaptiveBatchSizer(target_memory_percent=60.0)
sizer.set_bounds(min_size=5, max_size=100)
batch_size = sizer.calculate_batch_size(file_sizes, overhead_per_file=1024)
```
## Model Cache

`ModelCache` keeps loaded models in memory using an LRU eviction policy with TTL expiration.

| Parameter | Default | Description |
|---|---|---|
| `max_models` | 3 | Maximum models kept in cache simultaneously |
| `ttl_seconds` | 300.0 | Time-to-live before a cached model expires (seconds) |
### How It Works

- On `get_or_load()`: returns the cached model if present and not expired
- Expired models are evicted on the next access
- When the cache is full, the least-recently-used model is evicted
- Thread-safe via an internal lock (safe for parallel processing)
- Calls `cleanup()` on evicted models to release resources
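A miniature illustration of the LRU-plus-TTL policy (not the real `ModelCache` code) can be built on an ordered dict:

```python
import threading
import time
from collections import OrderedDict

class TinyModelCache:
    """Illustrative LRU cache with TTL expiration."""

    def __init__(self, max_models=3, ttl_seconds=300.0):
        self.max_models = max_models
        self.ttl_seconds = ttl_seconds
        self._entries = OrderedDict()  # name -> (model, load_time)
        self._lock = threading.Lock()

    def get_or_load(self, name, loader_fn):
        with self._lock:
            entry = self._entries.get(name)
            if entry is not None:
                model, loaded_at = entry
                if time.monotonic() - loaded_at < self.ttl_seconds:
                    self._entries.move_to_end(name)  # mark most recently used
                    return model
                del self._entries[name]  # expired: evict on access
            model = loader_fn(name)
            self._entries[name] = (model, time.monotonic())
            if len(self._entries) > self.max_models:
                self._entries.popitem(last=False)  # drop least recently used
            return model
```

The real cache additionally calls `cleanup()` on each evicted model; that step is omitted here for brevity.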
### Tuning Tips

- Increase `max_models` if you frequently switch between text, vision, and audio models (set to 3-5)
- Increase `ttl_seconds` for long-running batch jobs (600-3600s)
- Decrease `max_models` to 1 on memory-constrained systems
- Use `cache.stats()` to monitor hit/miss ratios
```python
from file_organizer.optimization.model_cache import ModelCache

cache = ModelCache(max_models=3, ttl_seconds=600)
model = cache.get_or_load("qwen2.5:3b", loader_fn)
stats = cache.stats()
```
## Model Warmup

`ModelWarmup` pre-loads models in background threads to eliminate cold-start latency on first use.

| Parameter | Default | Description |
|---|---|---|
| `max_workers` | 2 | Maximum parallel model-loading threads |
### How It Works

- Accepts a list of model names to pre-load
- Skips models already present in the cache
- Loads models in parallel using a thread pool
- Supports both synchronous (`warmup()`) and async (`warmup_async()`) modes
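The flow above can be sketched with a standard-library thread pool. This is an illustration of the technique, not the `ModelWarmup` API; a plain dict stands in for the cache:

```python
from concurrent.futures import ThreadPoolExecutor

def warmup_models(cache, loader_fn, names, max_workers=2):
    """Illustrative warmup: pre-load uncached models in parallel."""
    pending = [n for n in names if n not in cache]  # skip cached models
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so names and models stay aligned
        for name, model in zip(pending, pool.map(loader_fn, pending)):
            cache[name] = model
    return pending  # names that were actually loaded
```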
### Tuning Tips

- Pre-warm only models you will actually use in the session
- Set `max_workers=1` if model loading is GPU-bound (prevents contention)
- Use `warmup_async()` to load models while the application starts up
```python
from file_organizer.optimization.warmup import ModelWarmup

warmup = ModelWarmup(cache, loader_factory, max_workers=2)
result = warmup.warmup(["qwen2.5:3b", "qwen2.5vl:7b"])
```
## Resource Monitor

`ResourceMonitor` provides real-time memory and GPU usage to inform cache eviction and model loading decisions.

### Memory Monitoring

- Uses `psutil` if available; falls back to `/proc/meminfo` (Linux) or `sysctl` (macOS) and the `resource` module
- Returns `MemoryInfo` with RSS, VMS, and percent of total memory
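The Linux fallback path can be illustrated with a small `/proc/meminfo` parser (a sketch of the approach, not the module's actual code):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of byte counts."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key] = int(parts[0]) * 1024  # values are reported in kB
    return info

def available_bytes(text):
    """Prefer MemAvailable; fall back to MemFree on older kernels."""
    info = parse_meminfo(text)
    return info.get("MemAvailable", info.get("MemFree", 0))
```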
### GPU Monitoring

- Queries NVIDIA GPUs via `nvidia-smi`
- Returns `GpuMemoryInfo` with total, used, and free bytes plus the device name
- Returns `None` if no NVIDIA GPU is detected
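The same query can be reproduced by hand. The `--query-gpu`/`--format` flags below are standard `nvidia-smi` options; the helper functions are illustrative rather than the module's implementation:

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=name,memory.total,memory.used,memory.free",
         "--format=csv,noheader,nounits"]

def parse_gpu_line(line):
    """Parse one CSV line from the query above into (name, total, used, free),
    converting the reported MiB values to bytes."""
    name, total, used, free = (field.strip() for field in line.split(","))
    mib = 1024 * 1024
    return name, int(total) * mib, int(used) * mib, int(free) * mib

def query_gpus():
    """Return parsed info for each GPU, or None if nvidia-smi is unavailable."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None  # no NVIDIA GPU / driver detected
    return [parse_gpu_line(line) for line in out.stdout.splitlines() if line]
```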
### Eviction Threshold

| Parameter | Default | Description |
|---|---|---|
| `threshold_percent` | 85.0 | Memory usage percentage that triggers eviction |

```python
from file_organizer.optimization.resource_monitor import ResourceMonitor

monitor = ResourceMonitor()
mem = monitor.get_memory_usage()
if monitor.should_evict(threshold_percent=80.0):
    cache.clear()
```
## Memory Limiter

`MemoryLimiter` enforces hard memory caps on the process, taking configurable actions when limits are exceeded.

| Parameter | Default | Description |
|---|---|---|
| `max_memory_mb` | (required) | Maximum allowed RSS in megabytes |
| `action` | `WARN` | Enforcement action when the limit is exceeded |
### Enforcement Actions

| Action | Behavior |
|---|---|
| `WARN` | Logs a warning, continues execution |
| `BLOCK` | Logs a warning; the caller should call `check()` before proceeding |
| `EVICT_CACHE` | Calls the registered eviction callback to free memory |
| `RAISE` | Raises a `MemoryLimitError` exception |
### Usage

```python
from file_organizer.optimization.memory_limiter import MemoryLimiter, LimitAction

limiter = MemoryLimiter(max_memory_mb=4096, action=LimitAction.EVICT_CACHE)
limiter.set_evict_callback(cache.clear)

# Check before heavy operations
if limiter.check():
    process_large_file()

# Or use as a context manager
with limiter.guarded():
    process_batch()
```
## Hardware Recommendations

### Small Workloads (< 100 files)
- RAM: 8 GB minimum
- CPU: Any modern multi-core
- GPU: Optional (CPU inference works)
- Recommended config: default settings
### Medium Workloads (100-1,000 files)
- RAM: 16 GB recommended
- CPU: 4+ cores
- GPU: Recommended for vision/audio models
- Recommended config:
  - `max_workers=4`, `chunk_size=20`
  - `target_memory_percent=60.0`
  - `max_models=2`, `ttl_seconds=600`
### Large Workloads (1,000+ files)
- RAM: 32 GB recommended
- CPU: 8+ cores
- GPU: NVIDIA with 8+ GB VRAM or Apple Silicon
- Recommended config:
  - `max_workers=8`, `chunk_size=50`
  - `target_memory_percent=50.0`
  - `max_memory_mb=8192`, `action=EVICT_CACHE`
  - `max_models=3`, `ttl_seconds=1800`
  - `timeout_per_file=180.0`
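As a sketch, the large-workload profile can be assembled from the components shown earlier. The constructor calls follow the examples in this guide; verify them against the source modules before relying on this exact shape:

```python
from file_organizer.parallel.config import ParallelConfig, ExecutorType
from file_organizer.optimization.batch_sizer import AdaptiveBatchSizer
from file_organizer.optimization.model_cache import ModelCache
from file_organizer.optimization.memory_limiter import MemoryLimiter, LimitAction

# Large-workload profile (1,000+ files)
parallel = ParallelConfig(
    max_workers=8,
    executor_type=ExecutorType.THREAD,  # Ollama-based inference is I/O-bound
    chunk_size=50,
    timeout_per_file=180.0,
)
sizer = AdaptiveBatchSizer(target_memory_percent=50.0)
cache = ModelCache(max_models=3, ttl_seconds=1800)
limiter = MemoryLimiter(max_memory_mb=8192, action=LimitAction.EVICT_CACHE)
limiter.set_evict_callback(cache.clear)  # free models when the cap is hit
```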
## Benchmarking

Use the built-in startup benchmark to measure initialization time. For per-file processing benchmarks, use the CLI with `--dry-run`.
## Environment Variables

For environment-variable-based configuration (e.g., `FO_CONFIG_DIR`, `FO_API_HOST`, `FO_API_PORT`), see the Configuration Guide.