File Format Reference¶

This reference documents every file format supported by File Organizer, including extraction behavior, optional dependencies, and tunable parameters.

Overview¶

File Organizer supports 48+ file formats across 8 categories. Each format has a dedicated reader that extracts text, metadata, or structural information for AI-based classification and organization.

Category	Formats	Optional Dependencies	Install Group
Documents	`.txt`, `.md`, `.pdf`, `.docx`, `.csv`, `.xlsx`, `.pptx`	PyMuPDF, python-docx, pandas, openpyxl, python-pptx	Core / none
Ebooks	`.epub`	ebooklib	Core
Archives	`.zip`, `.7z`, `.tar`, `.tar.gz`, `.tgz`, `.tar.bz2`, `.tbz2`, `.tar.xz`, `.rar`	py7zr (`.7z`), rarfile (`.rar`)	`[archive]` (`.7z` and `.rar` only)
Scientific	`.hdf5`, `.h5`, `.hdf`, `.nc`, `.nc4`, `.netcdf`, `.mat`	h5py, netCDF4, scipy	`[scientific]`
CAD	`.dxf`, `.dwg`, `.step`, `.stp`, `.iges`, `.igs`	ezdxf (`.dxf`, `.dwg`)	`[cad]` (`.dxf` and `.dwg` only)
Images	`.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.tif`	None (VisionProcessor)	Core
Audio	`.mp3`, `.wav`, `.flac`, `.m4a`, `.ogg`	faster-whisper, torch	`[audio]`
Video	`.mp4`, `.avi`, `.mkv`, `.mov`, `.wmv`	opencv-python, scenedetect	`[video]`

Global Limits¶

All readers share a maximum file size check before processing:

Parameter	Default	Location
`MAX_FILE_SIZE_BYTES`	500 MB	`src/file_organizer/utils/readers/_base.py`

When a file exceeds this limit, the reader raises FileTooLargeError and the file is skipped.
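
A minimal sketch of what such a pre-read size check might look like. Only MAX_FILE_SIZE_BYTES and FileTooLargeError are named in this reference; the function name `check_file_size`, its signature, and the reading of "500 MB" as 500 MiB are assumptions for illustration, and the real helper in `_base.py` may differ.

```python
from pathlib import Path

# Assumption: "500 MB" is interpreted here as 500 MiB.
MAX_FILE_SIZE_BYTES = 500 * 1024 * 1024


class FileTooLargeError(Exception):
    """Stand-in for the project's exception of the same name."""


def check_file_size(path: Path, limit: int = MAX_FILE_SIZE_BYTES) -> int:
    """Return the file size in bytes, or raise if it exceeds the limit."""
    size = path.stat().st_size
    if size > limit:
        raise FileTooLargeError(f"{path} is {size} bytes (limit {limit})")
    return size
```

A caller that wants the "skip" behavior would catch FileTooLargeError and move on to the next file.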

Installing Optional Dependencies¶

# Individual groups
pip install -e ".[archive]"      # 7z and RAR support
pip install -e ".[scientific]"   # HDF5, NetCDF, MATLAB
pip install -e ".[cad]"          # DXF and DWG (STEP and IGES need no extra packages)
pip install -e ".[audio]"        # Audio transcription
pip install -e ".[video]"        # Video scene detection

# Everything
pip install -e ".[all]"

Documents¶

Source module: src/file_organizer/utils/readers/documents.py

Plain Text (`.txt`, `.md`)¶

Parameter	Default	Description
`max_chars`	5000	Maximum characters read from file

Reads with UTF-8 encoding, ignoring decode errors
No optional dependencies required
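
The behavior above can be sketched with the standard library alone. The function name `read_text_file` is hypothetical; only the `max_chars` parameter and its default come from this reference.

```python
from pathlib import Path


def read_text_file(path: Path, max_chars: int = 5000) -> str:
    """Read at most max_chars characters, dropping undecodable bytes."""
    # errors="ignore" silently skips bytes that are not valid UTF-8.
    with open(path, "r", encoding="utf-8", errors="ignore") as handle:
        return handle.read(max_chars)
```

Because the file is opened in text mode, `read(max_chars)` counts decoded characters rather than bytes.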

PDF (`.pdf`)¶

Parameter	Default	Description
`max_pages`	5	Maximum number of pages to extract text from

Uses PyMuPDF (fitz) for text extraction
Extracts plain text from each page sequentially
Requires: PyMuPDF (included in core dependencies)

Word Documents (`.docx`)¶

Extracts text from all non-empty paragraphs
Joins paragraphs with newlines
Requires: python-docx (included in core dependencies)

Note

Only .docx (Office Open XML) is supported. Legacy .doc (binary format) files are not supported and will return None from the reader dispatcher.

Spreadsheets (`.csv`, `.xlsx`)¶

Parameter	Default	Description
`max_rows`	100	Maximum rows read from the spreadsheet

CSV files read with pandas.read_csv
Excel files read with pandas.read_excel (requires openpyxl for .xlsx)
Returns string representation of the DataFrame
Requires: pandas, openpyxl (included in core dependencies)

Note

Only .xlsx (Office Open XML) is supported for Excel files. Legacy .xls (binary format) files are registered in the reader dispatch table but will fail at runtime because the required xlrd package is not included in project dependencies.

Presentations (`.pptx`)¶

Extracts text from all shapes on each slide
Formats output as Slide N: text1 | text2 | ...
Requires: python-pptx (included in core dependencies)
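
To make the output format concrete, here is a small sketch of the formatting step only. It takes already-extracted shape texts (one list per slide) rather than calling python-pptx; the function name `format_slides` and the choice to drop blank shape texts are assumptions.

```python
def format_slides(slides: list[list[str]]) -> str:
    """Render one 'Slide N: text1 | text2' line per slide."""
    lines = []
    for number, texts in enumerate(slides, start=1):
        # Assumption: shapes with only whitespace are left out.
        cleaned = [text.strip() for text in texts if text.strip()]
        lines.append(f"Slide {number}: {' | '.join(cleaned)}")
    return "\n".join(lines)
```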

Legacy Format Limitation

Only .pptx (Office Open XML) is fully supported. Legacy .ppt (binary format) files are detected by the reader dispatch table but will fail at runtime because python-pptx only supports .pptx. If you need to process .ppt files, convert them to .pptx first using Microsoft Office or LibreOffice.

Ebooks¶

Source module: src/file_organizer/utils/readers/ebook.py

EPUB (`.epub`)¶

Parameter	Default	Description
`max_chars`	10000	Maximum characters extracted from the ebook

Iterates over document items in the EPUB container
Strips HTML tags using regex
Stops extraction once max_chars is reached
Requires: ebooklib (included in core dependencies)
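
The tag stripping and early stop can be sketched as follows. In the real reader the HTML would come from the EPUB's document items via ebooklib; here plain strings stand in for them, and the function name `extract_text` and the exact regular expression are assumptions.

```python
import re
from typing import Iterable

# Deliberately simple pattern: anything between "<" and ">" counts as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def extract_text(html_documents: Iterable[str], max_chars: int = 10000) -> str:
    """Strip tags from each document and stop once max_chars is reached."""
    parts: list[str] = []
    total = 0
    for html in html_documents:
        # Replace tags with spaces, then collapse runs of whitespace.
        text = " ".join(TAG_RE.sub(" ", html).split())
        if not text:
            continue
        parts.append(text)
        total += len(text) + 1  # +1 for the joining newline
        if total >= max_chars:
            break  # later documents are never touched
    return "\n".join(parts)[:max_chars]
```

A regex is adequate for classification input but is not a full HTML parser; script or style content, for example, would pass through as text.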

Note

Only .epub format is supported. Other ebook formats (.mobi, .azw) are not currently supported. The reader dispatcher returns None for unrecognized extensions, causing the file to be skipped.

Archives¶

Source module: src/file_organizer/utils/readers/archives.py

Archive readers extract metadata and file listings, not file contents. This provides enough information for AI classification without decompressing large archives.

ZIP (`.zip`)¶

Parameter	Default	Description
`max_files`	50	Maximum number of entries to list

Uses Python standard library zipfile
Lists file names, sizes (original and compressed), and compression ratio
No optional dependencies required
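
A minimal sketch of a metadata-only ZIP listing using `zipfile`. The function name `summarize_zip` and the output layout are assumptions; the point is that nothing is decompressed.

```python
import zipfile
from pathlib import Path


def summarize_zip(path: Path, max_files: int = 50) -> str:
    """List entries with original/compressed sizes and the overall ratio."""
    lines: list[str] = []
    with zipfile.ZipFile(path) as archive:
        entries = archive.infolist()  # reads the central directory only
    for info in entries[:max_files]:
        lines.append(
            f"{info.filename}: {info.file_size} bytes "
            f"({info.compress_size} compressed)"
        )
    if len(entries) > max_files:
        lines.append(f"... and {len(entries) - max_files} more entries")
    original = sum(info.file_size for info in entries)
    compressed = sum(info.compress_size for info in entries)
    if original:
        lines.append(f"Compression ratio: {compressed / original:.1%}")
    return "\n".join(lines)
```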

7z (`.7z`)¶

Parameter	Default	Description
`max_files`	50	Maximum number of entries to list

Lists file names and sizes from the 7z archive
Requires: py7zr (install group: [archive])

TAR Archives (`.tar`, `.tar.gz`, `.tgz`, `.tar.bz2`, `.tbz2`, `.tar.xz`)¶

Parameter	Default	Description
`max_files`	50	Maximum number of entries to list

Uses Python standard library tarfile
Supports gzip, bzip2, and xz compression
Lists member names, sizes, and types (file/directory/link)
No optional dependencies required
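
A comparable sketch for TAR archives, assuming a hypothetical function name `summarize_tar`. Opening with mode `"r:*"` lets `tarfile` detect the compression itself, which is one way to cover all the listed extensions with a single code path.

```python
import tarfile
from pathlib import Path


def summarize_tar(path: Path, max_files: int = 50) -> list[str]:
    """List member names, sizes, and types without extracting anything."""
    lines: list[str] = []
    # "r:*" transparently handles uncompressed, gzip, bzip2, and xz archives.
    with tarfile.open(path, "r:*") as archive:
        for index, member in enumerate(archive):
            if index >= max_files:
                break
            if member.isdir():
                kind = "directory"
            elif member.issym() or member.islnk():
                kind = "link"
            else:
                kind = "file"
            lines.append(f"{member.name} ({kind}, {member.size} bytes)")
    return lines
```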

RAR (`.rar`)¶

Parameter	Default	Description
`max_files`	50	Maximum number of entries to list

Lists file names, sizes (original and compressed), and modification dates
Requires: rarfile (install group: [archive])
Also requires the unrar command-line tool to be installed on the system

Scientific Data¶

Source module: src/file_organizer/utils/readers/scientific.py

Scientific readers extract structure and metadata rather than raw data arrays.

HDF5 (`.hdf5`, `.h5`, `.hdf`)¶

Parameter	Default	Description
`max_datasets`	20	Maximum number of datasets to list

Traverses HDF5 group hierarchy with visititems
For each dataset: reports name, dtype, shape, size in KB
Lists up to 3 attributes per dataset
Reports total number of top-level groups
Requires: h5py (install group: [scientific])

NetCDF (`.nc`, `.nc4`, `.netcdf`)¶

Reports file format (e.g., NETCDF4)
Lists all dimensions with sizes (marks unlimited dimensions)
Lists first 20 variables with dtype and shape
Shows units and long_name attributes when present
Lists first 10 global attributes
Requires: netCDF4 (install group: [scientific])

MATLAB (`.mat`)¶

Loads .mat file structure (not full data arrays)
Lists first 30 variables with type and shape information
Filters out internal metadata variables (names starting with __)
Requires: scipy (install group: [scientific])

CAD¶

Source module: src/file_organizer/utils/readers/cad.py

DXF (`.dxf`)¶

Parameter	Default	Description
`max_layers`	20	Maximum number of layers to list

Parses DXF structure using ezdxf
Reports DXF version and number of entities
Lists layers with entity counts
Lists named blocks
Extracts header variables (units, limits, extents)
Requires: ezdxf (install group: [cad])

DWG (`.dwg`)¶

Limited support via ezdxf (not all DWG versions supported)
Falls back to basic file information (size, modification date) on failure
Requires: ezdxf (install group: [cad])

STEP (`.step`, `.stp`)¶

Parameter	Default	Description
`max_lines`	100	Maximum header/data lines to parse

Plain text parser for ISO 10303 STEP files
Extracts header information: file description, name, schema
Counts data entities by type
No optional dependencies required (plain text parsing)
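
A plain-text STEP parse along these lines might look like the sketch below. The function name `summarize_step` and the returned dictionary layout are assumptions, and the sketch assumes one statement per line, which real STEP files do not guarantee.

```python
import re
from collections import Counter
from itertools import islice
from pathlib import Path

HEADER_KEYS = ("FILE_DESCRIPTION", "FILE_NAME", "FILE_SCHEMA")
# Data lines look like "#12=CARTESIAN_POINT(...);"
ENTITY_RE = re.compile(r"^#\d+\s*=\s*([A-Z0-9_]+)")


def summarize_step(path: Path, max_lines: int = 100) -> dict:
    """Collect header records and count data entities by type."""
    header: dict[str, str] = {}
    entities: Counter = Counter()
    with open(path, "r", encoding="utf-8", errors="ignore") as handle:
        for raw in islice(handle, max_lines):  # read at most max_lines lines
            line = raw.strip()
            for key in HEADER_KEYS:
                if line.startswith(key):
                    header[key] = line[len(key):].rstrip(";")
            match = ENTITY_RE.match(line)
            if match:
                entities[match.group(1)] += 1
    return {"header": header, "entities": dict(entities)}
```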

IGES (`.iges`, `.igs`)¶

Parameter	Default	Description
`max_lines`	50	Maximum lines to parse from each section

Plain text parser for IGES section-marked format
Counts entities by section marker (S, G, D, P)
Extracts global section parameters
No optional dependencies required (plain text parsing)
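
IGES files use fixed 80-column records with the section letter in column 73, so section counting needs no external library. The sketch below is an assumption-laden illustration: the function name `summarize_iges` is hypothetical, and it applies `max_lines` only to the global section text, which may not match how the real reader uses that parameter.

```python
from collections import Counter
from pathlib import Path

SECTIONS = {"S", "G", "D", "P"}


def summarize_iges(path: Path, max_lines: int = 50) -> dict:
    """Count records per section marker and collect global section text."""
    counts: Counter = Counter()
    global_text: list[str] = []
    with open(path, "r", encoding="ascii", errors="ignore") as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if len(line) < 73:
                continue  # not a well-formed fixed-width record
            section = line[72]  # column 73, zero-based index 72
            if section not in SECTIONS:
                continue
            counts[section] += 1
            if section == "G" and counts["G"] <= max_lines:
                global_text.append(line[:72].rstrip())
    return {"sections": dict(counts), "global": "".join(global_text)}
```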

Images¶

Images are processed by VisionProcessor using the configured vision-language model (Qwen 2.5-VL 7B by default). The model generates descriptions, folder names, and filenames from the image content.

Extension	Format
`.jpg`, `.jpeg`	JPEG
`.png`	PNG
`.gif`	GIF
`.bmp`	BMP
`.tiff`, `.tif`	TIFF

No optional dependencies required for image processing.

Audio¶

Audio files are processed by AudioModel / AudioTranscriber using faster-whisper for local transcription.

Extension	Format
`.mp3`	MPEG Audio Layer 3
`.wav`	Waveform Audio
`.flac`	Free Lossless Audio Codec
`.m4a`	MPEG-4 Audio
`.ogg`	Ogg Vorbis

Install dependencies:

pip install -e ".[audio]"

This installs: faster-whisper, torch, mutagen, tinytag, pydub, ffmpeg-python.

Video¶

Video files are processed by VisionProcessor with frame extraction using OpenCV and optional scene detection.

Extension	Format
`.mp4`	MPEG-4
`.avi`	AVI
`.mkv`	Matroska
`.mov`	QuickTime
`.wmv`	Windows Media Video

Install dependencies:

pip install -e ".[video]"

This installs: opencv-python, scenedetect[opencv].

Reader Dispatch¶

The read_file() function in src/file_organizer/utils/readers/__init__.py dispatches to the correct reader based on file extension. The dispatch logic:

Checks file size against MAX_FILE_SIZE_BYTES (500 MB)
Handles compound extensions (.tar.gz, .tar.bz2, .tar.xz)
Maps the extension to the appropriate reader function
Returns None for unsupported extensions (the file is skipped)

If an optional dependency is missing, the reader raises ImportError with installation instructions.
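
The dispatch steps can be sketched roughly as below. Only `read_file` and `READERS` are names from this reference; `get_extension`, `require`, `read_plain`, and the exact error message are assumptions, and the size check is left out for brevity.

```python
import importlib
from pathlib import Path
from types import ModuleType
from typing import Callable, Optional

COMPOUND_EXTENSIONS = (".tar.gz", ".tar.bz2", ".tar.xz")


def get_extension(path: Path) -> str:
    """Return the lowercase extension, keeping compound ones intact."""
    name = path.name.lower()
    for ext in COMPOUND_EXTENSIONS:
        if name.endswith(ext):
            return ext
    return path.suffix.lower()


def require(module: str, group: str) -> ModuleType:
    """Import an optional dependency or explain how to install it."""
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise ImportError(
            f'{module} is not installed; run: pip install -e ".[{group}]"'
        ) from exc


def read_plain(path: Path) -> str:
    """Tiny example reader so the dispatch table has something to call."""
    return path.read_text(encoding="utf-8", errors="ignore")[:5000]


# Illustrative subset of a dispatch table.
READERS: dict[str, Callable[[Path], str]] = {
    ".txt": read_plain,
    ".md": read_plain,
}


def read_file(path: Path) -> Optional[str]:
    """Dispatch on extension; return None for unsupported formats."""
    reader = READERS.get(get_extension(path))
    return reader(path) if reader else None
```

A reader for an optional format would call something like `require("py7zr", "archive")` at the top of its body, so the import cost and the error only occur when that format is actually encountered.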

Adding Support for New Formats¶

To add a new file format:

Create or extend a reader function in src/file_organizer/utils/readers/
Register the extension in the READERS dict in __init__.py
If the format requires an optional dependency, add it to pyproject.toml under the appropriate install group
Use the _check_file_size() helper for size validation
Raise FileReadError on read failures
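
As an illustration of these steps, the sketch below adds a reader for a hypothetical `.log` format. `FileReadError`, `_check_file_size`, and `READERS` are named in this reference, but the stand-in definitions here (signatures, exception hierarchy, the 500 MiB limit) are assumptions so that the example runs on its own; `read_log_file` and `max_lines` are invented for the example.

```python
from itertools import islice
from pathlib import Path
from typing import Callable


class FileReadError(Exception):
    """Stand-in for the project's exception of the same name."""


class FileTooLargeError(Exception):
    """Stand-in for the project's exception of the same name."""


def _check_file_size(path: Path, limit: int = 500 * 1024 * 1024) -> None:
    """Stand-in for the project's helper; the real signature may differ."""
    if path.stat().st_size > limit:
        raise FileTooLargeError(f"{path} exceeds {limit} bytes")


def read_log_file(path: Path, max_lines: int = 200) -> str:
    """Hypothetical reader: return the first max_lines lines of a log."""
    try:
        _check_file_size(path)
        with open(path, "r", encoding="utf-8", errors="ignore") as handle:
            lines = [line.rstrip("\n") for line in islice(handle, max_lines)]
    except OSError as exc:
        # Missing files, permission problems, and similar I/O failures.
        raise FileReadError(f"Could not read {path}: {exc}") from exc
    return "\n".join(lines)


# Registration step: map the extension to the new reader.
READERS: dict[str, Callable[[Path], str]] = {}
READERS[".log"] = read_log_file
```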
