# Performance Tuning Guide
This guide covers tunable parameters across model inference, parallel processing, memory management, and caching. Use it to optimize File Organizer for your hardware and workload.
## Overview

Performance-sensitive components:

| Component | Purpose | Source Module |
|---|---|---|
| `ModelConfig` | Inference parameters | `src/file_organizer/models/base.py` |
| `ParallelConfig` | Worker pool and timeouts | `src/file_organizer/parallel/config.py` |
| `AdaptiveBatchSizer` | Memory-aware batch sizing | `src/file_organizer/optimization/batch_sizer.py` |
| `ModelCache` | LRU model caching with TTL | `src/file_organizer/optimization/model_cache.py` |
| `ModelWarmup` | Background model pre-loading | `src/file_organizer/optimization/warmup.py` |
| `ResourceMonitor` | Memory and GPU monitoring | `src/file_organizer/optimization/resource_monitor.py` |
| `MemoryLimiter` | Hard memory caps | `src/file_organizer/optimization/memory_limiter.py` |
## Model Configuration

`ModelConfig` controls inference behavior for all AI models (text, vision, audio).

| Parameter | Default | Description |
|---|---|---|
| `name` | (required) | Model identifier (e.g., `qwen2.5:3b-instruct-q4_K_M`) |
| `model_type` | (required) | `TEXT`, `VISION`, `AUDIO`, or `VIDEO` |
| `quantization` | `q4_k_m` | Quantization level (lower = faster, less accurate) |
| `device` | `AUTO` | Inference device: `AUTO`, `CPU`, `CUDA`, `MPS`, `METAL` |
| `temperature` | 0.5 | Sampling temperature (lower = more deterministic) |
| `max_tokens` | 3000 | Maximum tokens in the generated response |
| `top_k` | 3 | Top-k sampling (fewer candidates = faster) |
| `top_p` | 0.3 | Nucleus sampling threshold |
| `context_window` | 4096 | Maximum context length in tokens |
| `batch_size` | 1 | Batch size for inference |
| `framework` | `ollama` | Backend framework: `ollama`, `llama_cpp`, `mlx` |
### Device Selection

```python
from file_organizer.models.base import DeviceType

# Automatic (recommended) - detects the best available device
DeviceType.AUTO

# Force CPU (universal, slower)
DeviceType.CPU

# NVIDIA GPU (fastest for supported models)
DeviceType.CUDA

# Apple Silicon GPU
DeviceType.MPS

# Apple Metal via MLX
DeviceType.METAL
```
### Tuning Tips

- Lower `max_tokens` to 200-500 for classification tasks that only need short responses (folder names, filenames)
- Reduce `temperature` to 0.1-0.3 for more consistent naming
- Reduce `context_window` if processing only small files to save memory
- Use `q4_k_m` quantization for the best speed/quality tradeoff
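As a sketch, these tips combine into a profile for short classification responses. This assumes `ModelConfig` accepts the table's parameters as keyword arguments and that the model-type enum is exported as `ModelType`; verify both names against `src/file_organizer/models/base.py`:

```python
from file_organizer.models.base import DeviceType, ModelConfig, ModelType  # ModelType name assumed

# Hypothetical profile for folder/filename classification:
# short, deterministic outputs with a small context window.
classify_config = ModelConfig(
    name="qwen2.5:3b-instruct-q4_K_M",
    model_type=ModelType.TEXT,   # enum name is an assumption
    device=DeviceType.AUTO,
    quantization="q4_k_m",       # best speed/quality tradeoff
    temperature=0.2,             # more consistent naming
    max_tokens=300,              # classification needs only short responses
    context_window=2048,         # small inputs only
)
```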
## Parallel Processing

`ParallelConfig` controls how files are processed concurrently.

| Parameter | Default | Description |
|---|---|---|
| `max_workers` | None (CPU count) | Maximum worker threads or processes |
| `executor_type` | `THREAD` | `THREAD` (I/O-bound) or `PROCESS` (CPU-bound) |
| `chunk_size` | 10 | Files submitted per scheduling round |
| `timeout_per_file` | 60.0 | Seconds before a file's processing times out |
| `retry_count` | 2 | Retry attempts for failed files |
### Tuning Tips

- For Ollama-based inference (I/O-bound), use the `THREAD` executor
- For local model inference with a GPU, `max_workers=1` prevents GPU contention
- Increase `timeout_per_file` for large PDFs or videos (120-300s)
- Increase `chunk_size` for many small files (50-100)
- Set `retry_count=0` to fail fast during bulk operations
```python
from file_organizer.parallel.config import ParallelConfig, ExecutorType

config = ParallelConfig(
    max_workers=4,
    executor_type=ExecutorType.THREAD,
    chunk_size=20,
    timeout_per_file=120.0,
    retry_count=1,
)
```
## Adaptive Batch Sizing

`AdaptiveBatchSizer` calculates how many files to process per batch based on available system memory.

| Parameter | Default | Description |
|---|---|---|
| `target_memory_percent` | 70.0 | Target percentage of available memory to use |
| `min_batch_size` | 1 | Minimum files per batch |
| `max_batch_size` | 1000 | Maximum files per batch |
### How It Works

- Queries available system memory (Linux `/proc/meminfo`, macOS `sysctl`)
- Calculates a memory budget from `target_memory_percent`
- Estimates per-file cost from average file size plus overhead
- Returns the number of files that fit in the budget
- Accepts runtime feedback via `adjust_from_feedback()` to refine estimates
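The steps above amount to simple budget arithmetic, sketched here in plain Python (an illustration of the idea, not the module's actual implementation):

```python
def estimate_batch_size(available_bytes, file_sizes, overhead_per_file,
                        target_percent=70.0, min_size=1, max_size=1000):
    """Illustrative memory-aware batch sizing."""
    budget = available_bytes * (target_percent / 100.0)  # memory budget
    avg_file = sum(file_sizes) / len(file_sizes) if file_sizes else 0
    per_file = avg_file + overhead_per_file              # estimated cost per file
    if per_file <= 0:
        return max_size                                  # nothing to constrain
    fits = int(budget // per_file)                       # files that fit in budget
    return max(min_size, min(max_size, fits))            # clamp to bounds
```

The clamp at the end is what `min_batch_size` / `max_batch_size` (or `set_bounds()`) control.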
### Tuning Tips

- Lower `target_memory_percent` (50-60%) on systems running other services
- Use `set_bounds(min_size=5, max_size=50)` to constrain batch sizes
- Call `adjust_from_feedback()` after each batch to let the sizer learn
```python
from file_organizer.optimization.batch_sizer import AdaptiveBatchSizer

sizer = AdaptiveBatchSizer(target_memory_percent=60.0)
sizer.set_bounds(min_size=5, max_size=100)
batch_size = sizer.calculate_batch_size(file_sizes, overhead_per_file=1024)
```
## Model Cache

`ModelCache` keeps loaded models in memory using an LRU eviction policy with TTL expiration.

| Parameter | Default | Description |
|---|---|---|
| `max_models` | 3 | Maximum models kept in cache simultaneously |
| `ttl_seconds` | 300.0 | Time-to-live before a cached model expires (seconds) |
### How It Works

- On `get_or_load()`: returns the cached model if present and not expired
- Expired models are evicted on the next access
- When the cache is full, the least-recently-used model is evicted
- Thread-safe via an internal lock (safe for parallel processing)
- Calls `cleanup()` on evicted models to release resources
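A miniature illustration of the LRU-plus-TTL policy (not the real `ModelCache` code) can be built on an ordered dict:

```python
import threading
import time
from collections import OrderedDict

class TinyModelCache:
    """Illustrative LRU cache with TTL expiration."""

    def __init__(self, max_models=3, ttl_seconds=300.0):
        self.max_models = max_models
        self.ttl_seconds = ttl_seconds
        self._entries = OrderedDict()  # name -> (model, load_time)
        self._lock = threading.Lock()

    def get_or_load(self, name, loader_fn):
        with self._lock:
            entry = self._entries.get(name)
            if entry is not None:
                model, loaded_at = entry
                if time.monotonic() - loaded_at < self.ttl_seconds:
                    self._entries.move_to_end(name)  # mark most recently used
                    return model
                del self._entries[name]  # expired: evict on access
            model = loader_fn(name)
            self._entries[name] = (model, time.monotonic())
            if len(self._entries) > self.max_models:
                self._entries.popitem(last=False)  # drop least recently used
            return model
```

The real cache additionally calls `cleanup()` on each evicted model; that step is omitted here for brevity.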
### Tuning Tips

- Increase `max_models` if you frequently switch between text, vision, and audio models (set to 3-5)
- Increase `ttl_seconds` for long-running batch jobs (600-3600s)
- Decrease `max_models` to 1 on memory-constrained systems
- Use `cache.stats()` to monitor hit/miss ratios
```python
from file_organizer.optimization.model_cache import ModelCache

cache = ModelCache(max_models=3, ttl_seconds=600)
model = cache.get_or_load("qwen2.5:3b", loader_fn)
stats = cache.stats()
```
## Model Warmup

`ModelWarmup` pre-loads models in background threads to eliminate cold-start latency on first use.

| Parameter | Default | Description |
|---|---|---|
| `max_workers` | 2 | Maximum parallel model-loading threads |
### How It Works

- Accepts a list of model names to pre-load
- Skips models already present in the cache
- Loads models in parallel using a thread pool
- Supports both synchronous (`warmup()`) and async (`warmup_async()`) modes
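The flow above can be sketched with a standard-library thread pool. This is an illustration of the technique, not the `ModelWarmup` API; a plain dict stands in for the cache:

```python
from concurrent.futures import ThreadPoolExecutor

def warmup_models(cache, loader_fn, names, max_workers=2):
    """Illustrative warmup: pre-load uncached models in parallel."""
    pending = [n for n in names if n not in cache]  # skip cached models
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so names and models stay aligned
        for name, model in zip(pending, pool.map(loader_fn, pending)):
            cache[name] = model
    return pending  # names that were actually loaded
```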
### Tuning Tips

- Pre-warm only models you will actually use in the session
- Set `max_workers=1` if model loading is GPU-bound (prevents contention)
- Use `warmup_async()` to load models while the application starts up
```python
from file_organizer.optimization.warmup import ModelWarmup

warmup = ModelWarmup(cache, loader_factory, max_workers=2)
result = warmup.warmup(["qwen2.5:3b", "qwen2.5vl:7b"])
```
## Resource Monitor

`ResourceMonitor` provides real-time memory and GPU usage to inform cache eviction and model loading decisions.

### Memory Monitoring

- Uses `psutil` if available; falls back to `/proc/meminfo` (Linux) or `sysctl` (macOS) and the `resource` module
- Returns `MemoryInfo` with RSS, VMS, and percent of total memory
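The Linux fallback path can be illustrated with a small `/proc/meminfo` parser (a sketch of the approach, not the module's actual code):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of byte counts."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key] = int(parts[0]) * 1024  # values are reported in kB
    return info

def available_bytes(text):
    """Prefer MemAvailable; fall back to MemFree on older kernels."""
    info = parse_meminfo(text)
    return info.get("MemAvailable", info.get("MemFree", 0))
```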
### GPU Monitoring

- Queries NVIDIA GPUs via `nvidia-smi`
- Returns `GpuMemoryInfo` with total, used, and free bytes plus the device name
- Returns `None` if no NVIDIA GPU is detected
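The same query can be reproduced by hand. The `--query-gpu`/`--format` flags below are standard `nvidia-smi` options; the helper functions are illustrative rather than the module's implementation:

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=name,memory.total,memory.used,memory.free",
         "--format=csv,noheader,nounits"]

def parse_gpu_line(line):
    """Parse one CSV line from the query above into (name, total, used, free),
    converting the reported MiB values to bytes."""
    name, total, used, free = (field.strip() for field in line.split(","))
    mib = 1024 * 1024
    return name, int(total) * mib, int(used) * mib, int(free) * mib

def query_gpus():
    """Return parsed info for each GPU, or None if nvidia-smi is unavailable."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None  # no NVIDIA GPU / driver detected
    return [parse_gpu_line(line) for line in out.stdout.splitlines() if line]
```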
### Eviction Threshold

| Parameter | Default | Description |
|---|---|---|
| `threshold_percent` | 85.0 | Memory usage percentage that triggers eviction |

```python
from file_organizer.optimization.resource_monitor import ResourceMonitor

monitor = ResourceMonitor()
mem = monitor.get_memory_usage()
if monitor.should_evict(threshold_percent=80.0):
    cache.clear()
```
## Memory Limiter

`MemoryLimiter` enforces hard memory caps on the process, taking configurable actions when limits are exceeded.

| Parameter | Default | Description |
|---|---|---|
| `max_memory_mb` | (required) | Maximum allowed RSS in megabytes |
| `action` | `WARN` | Enforcement action when the limit is exceeded |
### Enforcement Actions

| Action | Behavior |
|---|---|
| `WARN` | Logs a warning, continues execution |
| `BLOCK` | Logs a warning; the caller should call `check()` before proceeding |
| `EVICT_CACHE` | Calls the registered eviction callback to free memory |
| `RAISE` | Raises a `MemoryLimitError` exception |
### Usage

```python
from file_organizer.optimization.memory_limiter import MemoryLimiter, LimitAction

limiter = MemoryLimiter(max_memory_mb=4096, action=LimitAction.EVICT_CACHE)
limiter.set_evict_callback(cache.clear)

# Check before heavy operations
if limiter.check():
    process_large_file()

# Or use as a context manager
with limiter.guarded():
    process_batch()
```
## Hardware Recommendations

### Small Workloads (< 100 files)
- RAM: 8 GB minimum
- CPU: Any modern multi-core
- GPU: Optional (CPU inference works)
- Recommended config: default settings
### Medium Workloads (100-1,000 files)
- RAM: 16 GB recommended
- CPU: 4+ cores
- GPU: Recommended for vision/audio models
- Recommended config:
  - `max_workers=4`, `chunk_size=20`
  - `target_memory_percent=60.0`
  - `max_models=2`, `ttl_seconds=600`
### Large Workloads (1,000+ files)
- RAM: 32 GB recommended
- CPU: 8+ cores
- GPU: NVIDIA with 8+ GB VRAM or Apple Silicon
- Recommended config:
  - `max_workers=8`, `chunk_size=50`
  - `target_memory_percent=50.0`
  - `max_memory_mb=8192`, `action=EVICT_CACHE`
  - `max_models=3`, `ttl_seconds=1800`
  - `timeout_per_file=180.0`
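As a sketch, the large-workload profile can be assembled from the components shown earlier. The constructor calls follow the examples in this guide; verify them against the source modules before relying on this exact shape:

```python
from file_organizer.parallel.config import ParallelConfig, ExecutorType
from file_organizer.optimization.batch_sizer import AdaptiveBatchSizer
from file_organizer.optimization.model_cache import ModelCache
from file_organizer.optimization.memory_limiter import MemoryLimiter, LimitAction

# Large-workload profile (1,000+ files)
parallel = ParallelConfig(
    max_workers=8,
    executor_type=ExecutorType.THREAD,  # Ollama-based inference is I/O-bound
    chunk_size=50,
    timeout_per_file=180.0,
)
sizer = AdaptiveBatchSizer(target_memory_percent=50.0)
cache = ModelCache(max_models=3, ttl_seconds=1800)
limiter = MemoryLimiter(max_memory_mb=8192, action=LimitAction.EVICT_CACHE)
limiter.set_evict_callback(cache.clear)  # free models when the cap is hit
```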
## Benchmarking

Use the built-in startup benchmark to measure initialization time. For per-file processing benchmarks, use the CLI with `--dry-run`.
## Environment Variables

For environment-variable-based configuration (e.g., `FO_CONFIG_DIR`, `FO_API_HOST`, `FO_API_PORT`), see the Configuration Guide.