Performance Tuning Guide¶
This guide covers tunable parameters across model inference, parallel processing, memory management, and caching. Use it to optimize File Organizer for your hardware and workload.
Overview¶
Performance-sensitive components:
| Component | Purpose | Source Module |
|---|---|---|
| ModelConfig | Inference parameters | src/file_organizer/models/base.py |
| ParallelConfig | Worker pool and timeouts | src/file_organizer/parallel/config.py |
| AdaptiveBatchSizer | Memory-aware batch sizing | src/file_organizer/optimization/batch_sizer.py |
| ModelCache | LRU model caching with TTL | src/file_organizer/optimization/model_cache.py |
| ModelWarmup | Background model pre-loading | src/file_organizer/optimization/warmup.py |
| ResourceMonitor | Memory and GPU monitoring | src/file_organizer/optimization/resource_monitor.py |
| MemoryLimiter | Hard memory caps | src/file_organizer/optimization/memory_limiter.py |
Model Configuration¶
ModelConfig controls inference behavior for all AI models (text, vision, audio).
| Parameter | Default | Description |
|---|---|---|
name | (required) | Model identifier (e.g., qwen2.5:3b-instruct-q4_K_M) |
model_type | (required) | TEXT, VISION, AUDIO, or VIDEO |
quantization | q4_k_m | Quantization level (lower = faster, less accurate) |
device | AUTO | Inference device: AUTO, CPU, CUDA, MPS, METAL |
temperature | 0.5 | Sampling temperature (lower = more deterministic) |
max_tokens | 3000 | Maximum tokens in generated response |
top_k | 3 | Top-K sampling (fewer candidates = faster) |
top_p | 0.3 | Nucleus sampling threshold |
context_window | 4096 | Maximum context length in tokens |
batch_size | 1 | Batch size for inference |
framework | ollama | Backend framework: ollama, llama_cpp, mlx |
Device Selection¶
from file_organizer.models.base import DeviceType
# Automatic (recommended) - detects best available device
DeviceType.AUTO
# Force CPU (universal, slower)
DeviceType.CPU
# NVIDIA GPU (fastest for supported models)
DeviceType.CUDA
# Apple Silicon GPU
DeviceType.MPS
# Apple Metal via MLX
DeviceType.METAL
Tuning Tips¶
- Lower max_tokens to 200-500 for classification tasks that only need short responses (folder names, filenames)
- Reduce temperature to 0.1-0.3 for more consistent naming
- Reduce context_window if processing only small files to save memory
- Use q4_k_m quantization for the best speed/quality tradeoff
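Putting these tips together, the following is a minimal sketch of a classification-oriented text-model configuration. The field names come from the table above; ModelType and the exact ModelConfig constructor signature are assumptions.
from file_organizer.models.base import ModelConfig, ModelType, DeviceType
# Sketch only: assumes ModelConfig accepts the table's parameters as keyword arguments.
config = ModelConfig(
    name="qwen2.5:3b-instruct-q4_K_M",
    model_type=ModelType.TEXT,
    device=DeviceType.AUTO,      # let the library pick the best device
    temperature=0.2,             # more deterministic naming
    max_tokens=300,              # folder and file names need only short output
    context_window=2048,         # smaller context when files are small
    quantization="q4_k_m",       # best speed/quality tradeoff
)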
Parallel Processing¶
ParallelConfig controls how files are processed concurrently.
| Parameter | Default | Description |
|---|---|---|
max_workers | None (CPU count) | Maximum worker threads or processes |
executor_type | THREAD | THREAD (I/O-bound) or PROCESS (CPU-bound) |
prefetch_depth | 2 | Queue-ahead depth per worker for task scheduling (0 = no prefetch queueing). |
chunk_size | 10 | Files submitted per scheduling round |
timeout_per_file | 60.0 | Seconds before a file processing times out |
retry_count | 2 | Retry attempts for failed files |
Tuning Tips¶
- For Ollama-based inference (I/O-bound), use the THREAD executor
- For local model inference with GPU, max_workers=1 prevents GPU contention
- Lower prefetch_depth (or use 0) to reduce queued work and memory pressure
- Increase timeout_per_file for large PDFs or videos (120-300s)
- Increase chunk_size for many small files (50-100)
- Set retry_count=0 to fail fast during bulk operations
from file_organizer.parallel.config import ParallelConfig, ExecutorType
config = ParallelConfig(
    max_workers=4,
    executor_type=ExecutorType.THREAD,
    prefetch_depth=2,
    chunk_size=20,
    timeout_per_file=120.0,
    retry_count=1,
)
Pipeline Prefetch (I/O-Compute Overlap)¶
PipelineOrchestrator supports double-buffered processing: I/O stages (e.g. PreprocessorStage) run ahead in a thread pool while the compute stage (e.g. AnalyzerStage / LLM inference) runs on the current file. This hides file-read latency behind LLM inference time.
| Parameter | Default | Description |
|---|---|---|
prefetch_depth | 2 | Files to pre-process ahead of current file. 0 disables prefetch. |
prefetch_stages | 1 | Requested number of leading stages to prefetch. The current implementation only supports the first stage; values greater than 1 log a warning and are effectively treated as 1. |
memory_limiter | None | Optional MemoryLimiter; gates whether a new prefetch slot opens. |
Tuning Tips¶
- Increase prefetch_depth (3–5) when files are large and disk I/O is the bottleneck
- Keep prefetch_depth=2 (default) for typical SSD + Ollama workloads
- Set prefetch_depth=0 to disable overlap and process files sequentially (useful for debugging or on memory-constrained systems)
- Keep prefetch_stages=1; higher values are currently capped to the first stage for thread-safety
- Pass a MemoryLimiter to automatically back off prefetch when RSS approaches a configured ceiling
- Note: on file-organizer organize, --no-prefetch is a backward-compatible alias for --prefetch-depth 0 in the legacy ParallelProcessor path. For stage-based pipeline overlap, set prefetch_depth=0 on PipelineOrchestrator.
from file_organizer.optimization.memory_limiter import MemoryLimiter, LimitAction
from file_organizer.pipeline.orchestrator import PipelineOrchestrator
from file_organizer.pipeline.stages import PreprocessorStage, AnalyzerStage
# Prefetch depth=3 with a 2 GB memory ceiling
limiter = MemoryLimiter(max_memory_mb=2048, action=LimitAction.WARN)
pipeline = PipelineOrchestrator(
    stages=[PreprocessorStage(), AnalyzerStage()],
    prefetch_depth=3,
    memory_limiter=limiter,
)
results = pipeline.process_batch(files)
PipelineOrchestrator uses MemoryLimiter.check() only to decide whether to open another prefetch slot, so the LimitAction configured on MemoryLimiter in this example does not affect prefetch gating; it only changes what happens when you call limiter.enforce() or use the guarded context manager.
Adaptive Batch Sizing¶
AdaptiveBatchSizer calculates how many files to process per batch based on available system memory.
| Parameter | Default | Description |
|---|---|---|
target_memory_percent | 70.0 | Target percentage of available memory to use |
min_batch_size | 1 | Minimum files per batch |
max_batch_size | 1000 | Maximum files per batch |
How It Works¶
- Queries available system memory (Linux /proc/meminfo, macOS sysctl)
- Calculates a memory budget from target_memory_percent
- Estimates per-file cost from average file size plus overhead
- Returns the number of files that fit in the budget
- Accepts runtime feedback via adjust_from_feedback() to refine estimates
Tuning Tips¶
- Lower target_memory_percent (50-60%) on systems running other services
- Use set_bounds(min_size=5, max_size=50) to constrain batch sizes
- Call adjust_from_feedback() after each batch to let the sizer learn
from file_organizer.optimization.batch_sizer import AdaptiveBatchSizer
sizer = AdaptiveBatchSizer(target_memory_percent=60.0)
sizer.set_bounds(min_size=5, max_size=100)
batch_size = sizer.calculate_batch_size(file_sizes, overhead_per_file=1024)
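To let the sizer learn over time, feed results back after each batch. This is a sketch only: iter_batches() and process_batch() stand in for your own code, and the adjust_from_feedback() keyword arguments are assumptions (the guide only documents that the method exists).
batch_size = sizer.calculate_batch_size(file_sizes, overhead_per_file=1024)
for batch in iter_batches(files, batch_size):        # iter_batches: your own batching helper
    memory_used = process_batch(batch)               # your processing code, returning bytes used
    sizer.adjust_from_feedback(batch_size=len(batch), memory_used=memory_used)  # assumed signature
    batch_size = sizer.calculate_batch_size(file_sizes, overhead_per_file=1024)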
Model Cache¶
ModelCache keeps loaded models in memory using an LRU eviction policy with TTL expiration.
| Parameter | Default | Description |
|---|---|---|
max_models | 3 | Maximum models kept in cache simultaneously |
ttl_seconds | 300.0 | Time-to-live before a cached model expires (seconds) |
How It Works¶
- On get_or_load(): returns the cached model if present and not expired
- Expired models are evicted on next access
- When cache is full, the least-recently-used model is evicted
- Thread-safe via internal lock (safe for parallel processing)
- Calls cleanup() on evicted models to release resources
Tuning Tips¶
- Increase max_models if you frequently switch between text, vision, and audio models (set to 3-5)
- Increase ttl_seconds for long-running batch jobs (600-3600s)
- Decrease max_models to 1 on memory-constrained systems
- Use cache.stats() to monitor hit/miss ratios
from file_organizer.optimization.model_cache import ModelCache
cache = ModelCache(max_models=3, ttl_seconds=600)
model = cache.get_or_load("qwen2.5:3b", loader_fn)
stats = cache.stats()
Model Warmup¶
ModelWarmup pre-loads models in background threads to eliminate cold-start latency on first use.
| Parameter | Default | Description |
|---|---|---|
max_workers | 2 | Maximum parallel model loading threads |
How It Works¶
- Accepts a list of model names to pre-load
- Skips models already present in the cache
- Loads models in parallel using a thread pool
- Supports both synchronous (warmup()) and async (warmup_async()) modes
Tuning Tips¶
- Pre-warm only models you will actually use in the session
- Set max_workers=1 if loading models is GPU-bound (prevents contention)
- Use warmup_async() to load models while the application starts up
from file_organizer.optimization.warmup import ModelWarmup
warmup = ModelWarmup(cache, loader_factory, max_workers=2)
result = warmup.warmup(["qwen2.5:3b", "qwen2.5vl:7b"])
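To hide model loading behind application startup, warmup_async() can be started first. This is a sketch only: it assumes warmup_async() returns immediately and that later get_or_load() calls hit the cache once loading finishes.
from file_organizer.optimization.warmup import ModelWarmup
warmup = ModelWarmup(cache, loader_factory, max_workers=1)   # single worker avoids GPU contention
warmup.warmup_async(["qwen2.5:3b"])                          # assumed to return immediately
initialize_application()                                     # placeholder for your startup work
model = cache.get_or_load("qwen2.5:3b", loader_fn)           # cache hit if warmup already finished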
Resource Monitor¶
ResourceMonitor provides real-time memory and GPU usage to inform cache eviction and model loading decisions.
Memory Monitoring¶
- Uses psutil if available; falls back to /proc/meminfo (Linux) or sysctl (macOS) and the resource module
- Returns MemoryInfo with RSS, VMS, and percent of total memory
GPU Monitoring¶
- Queries NVIDIA GPUs via nvidia-smi
- Returns GpuMemoryInfo with total, used, and free bytes plus the device name
- Returns None if no NVIDIA GPU is detected
Eviction Threshold¶
| Parameter | Default | Description |
|---|---|---|
threshold_percent | 85.0 | Memory usage percentage that triggers eviction |
from file_organizer.optimization.resource_monitor import ResourceMonitor
monitor = ResourceMonitor()
mem = monitor.get_memory_usage()
if monitor.should_evict(threshold_percent=80.0):
    cache.clear()
Memory Limiter¶
MemoryLimiter enforces hard memory caps on the process, taking configurable actions when limits are exceeded.
| Parameter | Default | Description |
|---|---|---|
max_memory_mb | (required) | Maximum allowed RSS in megabytes |
action | WARN | Enforcement action when limit is exceeded |
Enforcement Actions¶
| Action | Behavior |
|---|---|
WARN | Logs a warning, continues execution |
BLOCK | Logs a warning; caller should check check() before proceeding |
EVICT_CACHE | Calls the registered eviction callback to free memory |
RAISE | Raises MemoryLimitError exception |
Usage¶
from file_organizer.optimization.memory_limiter import MemoryLimiter, LimitAction
limiter = MemoryLimiter(max_memory_mb=4096, action=LimitAction.EVICT_CACHE)
limiter.set_evict_callback(cache.clear)
# Check before heavy operations
if limiter.check():
    process_large_file()
# Or use as context manager
with limiter.guarded():
    process_batch()
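For strict enforcement, RAISE turns an exceeded limit into an exception you can handle. A minimal sketch, assuming MemoryLimitError is importable from the same module as MemoryLimiter:
from file_organizer.optimization.memory_limiter import MemoryLimiter, MemoryLimitError, LimitAction
strict_limiter = MemoryLimiter(max_memory_mb=6144, action=LimitAction.RAISE)
try:
    with strict_limiter.guarded():
        process_batch()
except MemoryLimitError:
    cache.clear()   # free cached models, then retry with a smaller batch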
Hardware Recommendations¶
Small Workloads (< 100 files)¶
- RAM: 8 GB minimum
- CPU: Any modern multi-core
- GPU: Optional (CPU inference works)
- Recommended config: default settings
Medium Workloads (100-1,000 files)¶
- RAM: 16 GB recommended
- CPU: 4+ cores
- GPU: Recommended for vision/audio models
- Recommended config:
  - max_workers=4, chunk_size=20
  - target_memory_percent=60.0
  - max_models=2, ttl_seconds=600
Large Workloads (1,000+ files)¶
- RAM: 32 GB recommended
- CPU: 8+ cores
- GPU: NVIDIA with 8+ GB VRAM or Apple Silicon
- Recommended config:
  - max_workers=8, chunk_size=50
  - target_memory_percent=50.0
  - max_memory_mb=8192, action=EVICT_CACHE
  - max_models=3, ttl_seconds=1800
  - timeout_per_file=180.0
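A combined sketch of the large-workload profile, using only classes and parameters shown earlier in this guide; treat it as a starting point rather than a prescription.
from file_organizer.parallel.config import ParallelConfig, ExecutorType
from file_organizer.optimization.batch_sizer import AdaptiveBatchSizer
from file_organizer.optimization.model_cache import ModelCache
from file_organizer.optimization.memory_limiter import MemoryLimiter, LimitAction
parallel = ParallelConfig(
    max_workers=8,
    executor_type=ExecutorType.THREAD,   # Ollama-style I/O-bound inference
    chunk_size=50,
    timeout_per_file=180.0,
)
sizer = AdaptiveBatchSizer(target_memory_percent=50.0)
cache = ModelCache(max_models=3, ttl_seconds=1800)
limiter = MemoryLimiter(max_memory_mb=8192, action=LimitAction.EVICT_CACHE)
limiter.set_evict_callback(cache.clear)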
Benchmarking¶
Use the benchmark CLI to profile specific processing surfaces:
# Baseline file-system overhead
file-organizer benchmark run ~/test-files --suite io --json
# Text and vision processor paths
file-organizer benchmark run ~/test-files --suite text --json
file-organizer benchmark run ~/test-files --suite vision --json
# Pipeline and full organizer passes
file-organizer benchmark run ~/test-files --suite pipeline --json
file-organizer benchmark run ~/test-files --suite e2e --json
Why this matters:
- io gives a floor for disk-only overhead.
- text/vision/audio isolate processor-stack latency without requiring live model backends.
- pipeline and e2e capture orchestration overhead and write-path behavior.
- Baseline JSON now includes runner_profile_version; treat baseline comparisons as valid only when this version matches.
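As a quick guard for the runner_profile_version rule, the sketch below refuses to compare two benchmark JSON files whose versions differ. Only runner_profile_version is documented; the rest of the JSON layout and the top-level placement of the key are assumptions.
import json
# Load two benchmark outputs and bail out if their runner profiles differ.
with open("baseline.json") as old, open("current.json") as new:
    baseline, current = json.load(old), json.load(new)
if baseline.get("runner_profile_version") != current.get("runner_profile_version"):
    raise SystemExit("runner_profile_version mismatch: baseline comparison is not valid")
# ...proceed with suite-specific metric comparisons...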
Environment Variables¶
For environment-variable-based configuration (e.g., FO_CONFIG_DIR, FO_API_HOST, FO_API_PORT), see the Configuration Guide.