Getting started

wdfkit reads Renishaw WiRE .wdf spectra into xarray.DataArray objects and provides preprocessing tools: cosmic-ray removal, Nd:YAG laser-harmonic notches, normalization, and PCA-based denoising.

WDFReader loads data eagerly by default; pass chunks=True (or an integer target chunk size in megabytes) for lazy, Dask-backed arrays when working with large maps.

The project is inspired by spectrapy by Dejan Skrelic—an earlier tool that shaped how spectroscopy users treat this kind of data.

Installation

The quickest way to install is via pip:

pip install wdfkit

For a conda-based development setup, see the README.

Reading a `.wdf` file

WDFReader loads the file and returns an (xr.DataArray, image_or_None) pair. It can be unpacked directly:

from wdfkit import WDFReader

data, image = WDFReader("measurement.wdf")

# or keep as an object
reader = WDFReader("measurement.wdf")
data = reader.data    # xr.DataArray
image = reader.image  # white-light image, or None

Optional keyword arguments include spectral_dim (override the automatic spectral-axis name), chunks (lazy/Dask chunking), verbose, and time_coord — see the class docstring for details.

The DataArray rank and dimension names depend on how WiRE acquired the data:

Single spectrum → 1D (spectral_dim,) (only the spectral axis).
Series (e.g. depth or stage stacks) → 2D (series_axis, spectral_dim). The first dimension is named from ORGN metadata when possible (for example "SpatialZ").
Line scans, XY lines, random-point maps → 2D ("point", spectral_dim) with spatial coordinates on point when present.
Raster maps → 3D ("y", "x", spectral_dim).

The spectral coordinate name is inferred automatically from the file’s XlistDataUnits (e.g. wavelength-related units → "nm", Raman shift → "raman_shift"). Override with spectral_dim, including legacy notebooks that expect "shifts":

data, _ = WDFReader("measurement.wdf", spectral_dim="shifts")

read() loads the same cube or spectrum but returns only the DataArray (no white-light image). classify() returns a small summary dict (scan kind, counts, flags) without reading the spectral payload — useful for scripting over folders of .wdf files.

Cosmic-ray removal

Use CosmicRayRemover for maps, line scans (2D stacks along point or similar), and single spectra expressed as 2D (n_row, spectral_dim) with n_row == 1. The default pipeline runs laser-harmonic notch first, then spike removal:

from wdfkit import WDFReader, CosmicRayRemover

data, _ = WDFReader("map.wdf")

remover = CosmicRayRemover()       # all defaults
data_clean = remover.remove(data)  # harmonics + cosmic rays

If WDFReader returned a 1D single spectrum, add a singleton axis so the spectral dimension stays last (shape (1, n_spectral)) before calling the remover:

data, _ = WDFReader("single.wdf")
if data.ndim == 1:
    data = data.expand_dims("spectrum")
data_clean = CosmicRayRemover().remove(data)

For fine-grained control, call the steps separately:

remover = CosmicRayRemover(
    sensitivity=0.02,    # more aggressive detection
    threshold=4.0,       # lower threshold for single-spectrum path
    single_spectrum_method="interpolate",
)

data_no_harmonics = remover.harmonic_check(data)
data_clean = remover.remove_cosmic_rays(data_no_harmonics)

To inspect detections and masks, use the diagnostics method (keys depend on dimensionality — 3D maps expose core_mask / repair_mask; 2D single-row spectra expose cosmic_mask, etc.):

data_clean, diagnostics = remover.remove_with_diagnostics(data)
# e.g. diagnostics["core_mask"] — boolean array of detected spikes (maps)
# diagnostics["repair_mask"] — dilated mask that was interpolated (maps)

Key parameters for single-spectrum removal:

Parameter	Description
`single_spectrum_method`	`"median"` (default), `"interpolate"`, or `"derivative"`
`kernel_size`	Odd integer ≥ 3; median filter window
`threshold`	Multiplier on robust noise (higher → fewer detections)
`max_passes`	Single-spectrum path: number of detect–repair iterations (default `3`)

Key parameters for map removal:

Parameter	Description
`sensitivity`	Scales detection aggressiveness (higher → more hits)
`width`	Spectral mask dilation as a fraction of the spectral length (map path)
`disk_radius`	Spatial disk radius for the median reference filter
`map_mad_multiplier`	Multiplier on per-channel MAD noise (larger → fewer false positives)
`map_require_spatial_local_max`	If True (default), keep only spatial local maxima at fixed wavelength, reducing bright extended features misclassified as cosmic rays

See CosmicRayRemover for additional map tuning parameters (map_noisy_channel_relax_min, map_spectral_dilate_cap, …).

Normalization

normalize() scales spectra along the spectral axis:

from wdfkit import normalize

data_norm = normalize(data, method="area")

Available methods: "l1", "l2", "max", "min_max", "robust_scale" (default), "area", "wave_number". Pass spectral_dim if the spectral axis is not the last dimension. Per- spectrum methods can run chunk-wise on Dask-backed arrays; "robust_scale" and "wave_number" need the full dataset in memory first (see the function docstring).

PCA denoising

SpectraCleaner removes noise from a population of spectra using PCA reconstruction. Typical inputs are 3D map cubes (y, x, spectral_dim) or 2D stacks (n_spectra, spectral_dim). It requires more than one spectrum — for a single spectrum use a 1D smoother instead.

from wdfkit import WDFReader, SpectraCleaner

data, _ = WDFReader("map.wdf")

cleaner = SpectraCleaner(n_components="mle")  # Minka's MLE picks component count
data_clean = cleaner.clean(data)

To also retrieve the PCA decomposition (components, per-spectrum scores, explained variance arrays):

cleaner = SpectraCleaner(n_components=0.95)   # keep 95 % of variance
data_clean, decomp = cleaner.clean_with_decomposition(data)

components = decomp["components"]              # shape (n_components, n_spectral)
coeffs = decomp["coeffs"]                      # same spatial layout as input + components axis
ratio = decomp["explained_variance_ratio"]     # length n_components

Overall variance explained by the retained components is summarized on the cleaned array under attrs["treatments"]["spectra_cleaning"] (explained_variance_ratio_total, n_components_used, and related fields). Large arrays live only in decomp (components, coeffs, …).

Typical workflow

A common end-to-end pipeline for a Raman map:

from wdfkit import WDFReader, CosmicRayRemover, normalize, SpectraCleaner

# 1. Load
data, image = WDFReader("raman_map.wdf")

# 2. Remove cosmic rays (and Nd:YAG harmonics if applicable)
data = CosmicRayRemover(sensitivity=0.015).remove(data)

# 3. Normalize
data = normalize(data, method="area")

# 4. PCA denoise (any multi-spectrum cube or stack)
data = SpectraCleaner(n_components="mle").clean(data)