wdfkit package

Python package for WDF data treatment.

class wdfkit.CosmicRayRemover(sensitivity: float = 0.01, width: float = 0.02, disk_radius: int = 3, single_spectrum_method: Literal['median', 'interpolate', 'derivative'] = 'median', kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3, spectral_dim: str | None = None, map_mad_multiplier: float = 7.0, map_noisy_channel_relax_min: float = 0.82, map_spectral_dilate_cap: int = 5, map_max_spectral_repair_extent: int | None = 12, map_min_residual_over_cutoff: float = 1.05, map_require_spatial_local_max: bool = True)[source]

Bases: object

Cosmic-ray removal: spatial median for maps; robust 1D for singles.

Optionally removes broad Nd:YAG laser harmonics on ~355 nm excitation before narrow spike removal (harmonic_check(), remove()).

Maps (3D): spatial disk median on a min/median-normalized cube; per-λ scaled MAD cutoffs and noisy-band relax_λ; repair by spectral interpolation (not copying the full median surface).

Single spectrum / line scan (2D or 1×1 map): see remove_cosmic_rays_1d() — up to max_passes iterations of scipy.signal.medfilt-based MAD detection, mask dilation by 1 channel, and linear-interpolation repair from the original signal.

Parameters:
  • sensitivity (float) – Map path: scales aggressiveness. The cutoff includes (0.01 / sensitivity) times the per-channel MAD level (0.01 is the legacy default reference). Larger sensitivity → more hits.

  • width (float) – Map path: spectral dilation of the CR mask (fraction of length).

  • disk_radius (int) – Map path: spatial disk radius for the reference median filter.

  • map_mad_multiplier (float) – Map path: multiplier on noise_λ × relax_λ (like 1D threshold; larger → fewer false positives).

  • map_noisy_channel_relax_min (float) – Map path: floor on relax_λ in noisy channels. Higher → weaker boost in noisy bands → fewer false positives.

  • map_spectral_dilate_cap (int) – Map path: max footprint length (in spectral channels) when dilating hits along λ. Caps width × N so repair stays narrow and interp stays accurate.

  • map_require_spatial_local_max (bool) – Map path: if True (default), keep only voxels that are strict maxima in their (y, x) slice at fixed λ (8-neighbour), which reduces the risk of extended bright features being treated as CRs.

  • map_max_spectral_repair_extent (int | None) – Map path: after spectral dilation, each contiguous repair segment along λ is clipped to at most this many channels (centered on max residual in the segment). None disables (not recommended for noisy maps).

  • map_min_residual_over_cutoff (float) – Map path: require residual > cutoff * this factor (> 1 stricter, fewer false positives). Use 1.0 for the legacy strict inequality.

  • single_spectrum_method (Literal['median', 'interpolate', 'derivative']) – "median" or "interpolate" (both use medfilt detection and linear-interpolation repair — equivalent in practice), or "derivative" (neighbour-difference peak test).

  • kernel_size (int) – Odd, >= 3. Passed to medfilt for single-spectrum median-based methods.

  • threshold (float) – Single-spectrum only: spike cutoff is threshold * MAD_noise. Lower → more aggressive (try 3.5–4.0 for noisy spectra).

  • max_passes (int) – Single-spectrum only: number of detection–repair iterations (default 3). Each pass runs on the already-repaired signal so that large spikes no longer mask smaller ones. 1 replicates old single-pass behaviour.

  • spectral_dim (str | None) – Name of the spectral axis (default: last dimension). Used for harmonic cleanup and when the spectral dimension is not last.
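
A minimal usage sketch (the .wdf path is hypothetical and parameter values are illustrative, not tuned):

    import wdfkit

    spectra = wdfkit.read("map_scan.wdf")      # e.g. a 3D map cube (y, x, spectral)

    remover = wdfkit.CosmicRayRemover(
        threshold=4.0,   # single-spectrum path: lower = more aggressive
        max_passes=3,
    )
    cleaned = remover.remove(spectra)          # harmonic notch, then spike removal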

disk_radius: int = 3
harmonic_check(spectrum: DataArray) DataArray[source]

Notch broad harmonics when LaserWaveLength is ~355 nm (Nd:YAG).

If spectrum.attrs['LaserWaveLength'] is outside 354–356 nm, returns spectrum unchanged.

Searches 1064 / 532 / 355 / 266 nm (±2.5 nm); replaces ~1 nm around each found peak with linear interpolation. Prints one line per removal.

kernel_size: int = 5
map_mad_multiplier: float = 7.0
map_max_spectral_repair_extent: int | None = 12
map_min_residual_over_cutoff: float = 1.05
map_noisy_channel_relax_min: float = 0.82
map_require_spatial_local_max: bool = True
map_spectral_dilate_cap: int = 5
max_passes: int = 3
remove(spectrum: DataArray) DataArray[source]

Harmonic cleanup first, then cosmic-ray removal.

remove_cosmic_rays(spectrum: DataArray) DataArray[source]

Spike removal only (no harmonic notch).

remove_cosmic_rays_with_diagnostics(spectrum: DataArray) tuple[DataArray, dict[str, Any]][source]

Like remove_cosmic_rays(), but returns a diagnostics dict for visualization / QC (not written to DataArray.attrs).

For 3D maps, diagnostics includes boolean core_mask, repair_mask, and float arrays residual, preprocessed, spatial_median_reference, cutoff, per_spectrum_median, etc. Use matplotlib to overlay masks or compare spectra at selected (y, x).

For 2D single-spectrum input, the diagnostics dict contains cosmic_mask and corrected_1d (the 1D corrected intensity).
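
A sketch of one QC pattern for a 3D map, using the diagnostics keys listed above (assumes repair_mask shares the cube's (y, x, λ) layout):

    import matplotlib.pyplot as plt

    cleaned, diag = remover.remove_cosmic_rays_with_diagnostics(spectra)

    # Repaired channels per pixel: bright spots show where CRs were found.
    hits_per_pixel = diag["repair_mask"].sum(axis=-1)
    plt.imshow(hits_per_pixel)
    plt.colorbar(label="repaired channels")
    plt.show()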

remove_with_diagnostics(spectrum: DataArray) tuple[DataArray, dict[str, Any]][source]

Harmonics, then remove_cosmic_rays_with_diagnostics().

sensitivity: float = 0.01
single_spectrum_method: Literal['median', 'interpolate', 'derivative'] = 'median'
spectral_dim: str | None = None
threshold: float = 5.0
transform(spectrum: DataArray) DataArray[source]

Alias of remove() (harmonics then cosmic rays).

width: float = 0.02
class wdfkit.SpectraCleaner(method: Literal['pca'] = 'pca', n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, spectral_dim: str | None = None, pca_kwargs: dict[str, Any] = <factory>)[source]

Bases: object

Denoise a population of spectra by low-rank reconstruction.

Designed for 3D map cubes (ny, nx, n_spectral) and 2D stacks (n_spectra, n_spectral). PCA reconstruction needs more than one spectrum to separate shared signal from per-channel noise — a single spectrum is rejected with a clear error (use a 1D smoother instead).

Parameters:
  • method (Literal['pca']) – Denoising method. Currently only "pca" is implemented; the switch is kept for forward compatibility.

  • n_components (int | float | str | None) – Forwarded to sklearn.decomposition.PCA. "mle" (default), a float in (0, 1) for variance-explained, an int count, or None for min(n_spectra, n_spectral).

  • subtract_min (bool) – Subtract per-spectrum min before the fit (legacy default True). PCA also mean-centers internally, so this only changes the baseline offset fed to the fit.

  • restore_min (bool) – Add the saved per-spectrum min back after reconstruction. Off by default (legacy behavior); enable to preserve absolute intensities.

  • spectral_dim (str | None) – Name of the spectral axis in DataArray inputs. Defaults to the last dimension; pass when spectra are not last (e.g. "raman_shift" with leading spectral axis).

  • pca_kwargs (dict[str, Any]) – Extra kwargs forwarded to sklearn.decomposition.PCA (e.g. {"svd_solver": "full"}).
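
A minimal usage sketch (the 0.99 variance-explained target is illustrative):

    import wdfkit

    cleaner = wdfkit.SpectraCleaner(n_components=0.99)  # keep 99 % of variance
    denoised = cleaner.clean(spectra)   # spectra: map cube or 2D stack DataArray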

clean(spectra: DataArray) DataArray[source]

Return a denoised copy of spectra (no decomposition payload).

clean_with_decomposition(spectra: DataArray) tuple[DataArray, dict[str, Any]][source]

Like clean(), but also returns the PCA decomposition.

The returned decomposition dict has keys components (shape (n_components, n_spectral)), coeffs (per-spectrum scores reshaped to the input’s spatial layout + components axis), mean, explained_variance, explained_variance_ratio, and noise_variance. These arrays can be large — they’re returned separately rather than written to DataArray.attrs.
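
For example, a quick check of how much variance the retained components capture (key names as listed above):

    denoised, decomp = cleaner.clean_with_decomposition(spectra)

    print(decomp["components"].shape)                # (n_components, n_spectral)
    print(decomp["explained_variance_ratio"].sum())  # fraction of variance kept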

method: Literal['pca'] = 'pca'
n_components: int | float | str | None = 'mle'
pca_kwargs: dict[str, Any]
restore_min: bool = False
spectral_dim: str | None = None
subtract_min: bool = True
transform(spectra: DataArray) DataArray[source]

Alias of clean().

class wdfkit.SpectralAxisSpec(dim_name: str, units: str)[source]

Bases: object

Resolved spectral coordinate used as xarray dimension name + coord attrs.

dim_name: str
units: str
class wdfkit.WDFReader(path: str | PathLike[str], *, verbose: bool = False, time_coord: str = 'seconds_elapsed', spectral_dim: str | None = None, chunks: bool | int = False)[source]

Bases: object

Load spectra and metadata from a Renishaw WiRE .wdf binary file.

Typical usage:

data_array, white_light_image = WDFReader(path)

After construction, .data and .image hold the same objects as the unpacked tuple.

Parameters:
  • spectral_dim – Name for the spectral axis coordinate (default None / "auto"). WiRE XLST XlistDataUnits selects the default (e.g. RamanShift → dimension "raman_shift"). Set to "shifts" for legacy notebooks.

  • chunks – Enable lazy Dask-backed reading. False (default) = eager; True = auto-chunk at ~128 MB per chunk; int = target MB.
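
A sketch of eager vs. lazy construction (the path is hypothetical):

    from wdfkit import WDFReader

    # Eager read: unpack spectra and the white-light image.
    data_array, white_light_image = WDFReader("big_map.wdf")

    # Lazy Dask-backed read, targeting ~64 MB chunks.
    reader = WDFReader("big_map.wdf", chunks=64)
    data_array = reader.data     # same object the unpacked tuple would give
    image = reader.image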

wdfkit.classify(path: str | PathLike[str]) dict[source]

Return scan classification for a WiRE .wdf file without loading the spectral data.

Returns:

Keys: kind, measurement_type, scan_type, wmap_flag, nspectra, npoints, nsteps.

Return type:

dict
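
For example (key names as listed above; values depend on the file):

    info = wdfkit.classify("map_scan.wdf")   # hypothetical path
    print(info["kind"], info["nspectra"], info["npoints"])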

wdfkit.normalize(input_spectra: DataArray | ndarray, method: str = 'robust_scale', *, spectral_dim: str | None = None, **kwargs) DataArray | ndarray[source]

Scale spectra along the spectral axis.

For xarray.DataArray input, the spectral axis defaults to the last dimension (e.g. nm, raman_shift, shifts, …). Pass spectral_dim to select another dimension when spectra are not last.

Dask-backed DataArrays are handled transparently:

  • Per-spectrum methods ("l1", "l2", "max", "min_max", "area"): processed chunk-by-chunk via xr.apply_ufunc — no data is loaded into RAM beyond the current chunk.

  • Global methods ("robust_scale", "wave_number"): require statistics across all spectra; the full array is computed first. A UserWarning is emitted so you know RAM is being used.

Parameters:
  • input_spectra – DataArray or 2D ndarray of shape (n_spectra, n_points).

  • method – One of "l1", "l2", "max", "min_max", "wave_number", "robust_scale", "area".

  • spectral_dim – Spectral dimension name when input_spectra is a DataArray.

  • x_values – Spectral abscissa for ndarray input (default arange(n_points)).

Returns:

Same type as input_spectra, with updated attrs["treatments"] for DataArray output.
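
A sketch covering both input types (method names from the list above; the ndarray is synthetic):

    import numpy as np
    import wdfkit

    # DataArray input: spectral axis defaults to the last dimension.
    scaled = wdfkit.normalize(spectra, method="l2")

    # ndarray input: shape (n_spectra, n_points); x_values is the abscissa.
    arr = np.random.rand(10, 512)
    scaled_arr = wdfkit.normalize(arr, method="area", x_values=np.arange(512))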

wdfkit.read(path: str | PathLike[str], *, verbose: bool = False, spectral_dim: str | None = None, chunks: bool | int = False) DataArray[source]

Read a WiRE .wdf file and return a xarray.DataArray.

Parameters:
  • path – Path to the .wdf file.

  • spectral_dim – Override for the spectral-axis dimension name.

  • chunks – Dask chunking: False (eager), True (auto), or int (target MB).

Returns:

Shape and dims depend on scan kind; spectral axis is always last.

Return type:

xarray.DataArray

wdfkit.remove_cosmic_rays_1d(y: ndarray, method: Literal['median', 'interpolate', 'derivative'], *, kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3) tuple[ndarray, ndarray][source]

Remove sharp positive spikes from one 1D spectrum (PL-style).

Operates on the raw counts / intensity array (only masked indices change). Cosmic rays: positive excursions vs a robust noise model.

The algorithm runs up to max_passes iterations. Each pass:

  1. Detects new spikes on the current (already-repaired) signal.

  2. Dilates the new spike mask by 1 channel on each side to catch sub-threshold spike edges.

  3. Accumulates into a single cumulative mask across all passes.

  4. Repairs by linear interpolation from the original signal at all cumulative masked positions — avoids chaining interpolation errors.

Early termination when a pass finds no new spikes.

Parameters:
  • y – One spectral trace (any numeric dtype; cast to float).

  • method"median" or "interpolate"scipy.signal.medfilt reference signal, MAD on residual for detection (both methods now repair identically via linear interpolation). "derivative" — neighbour-difference test on diff(y) MAD; interior points only.

  • kernel_size – Odd length >= 3 for medfilt (median / interpolate methods).

  • threshold – Multiplier on MAD-derived noise (larger → fewer detections).

  • max_passes – Maximum number of detection–repair iterations (default 3). Use 1 for the old single-pass behaviour.

Returns:

  • corrected_y – Same shape as y; unchanged if no spikes are found or if noise is degenerate.

  • cosmic_mask – Boolean mask, same shape as y; True at all channels that were corrected (including dilation neighbours). All False when nothing was found or when the mask would cover the entire spectrum.
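
A self-contained sketch on a synthetic spectrum with one injected spike:

    import numpy as np
    from wdfkit import remove_cosmic_rays_1d

    rng = np.random.default_rng(0)
    y = 100 + 5 * rng.standard_normal(500)   # flat baseline plus noise
    y[250] += 400                            # one sharp positive spike

    corrected, mask = remove_cosmic_rays_1d(y, "median", kernel_size=5)
    print(int(mask.sum()))   # spike channel plus its dilated neighbours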

wdfkit.resolve_spectral_axis(xlist_data_units: str, spectral_dim: str | None) SpectralAxisSpec[source]

Choose spectral coordinate dimension name and coord.attrs["units"].

Parameters:
  • xlist_data_units – DATA_UNITS label resolved from raw XLST (e.g. "Nanometre").

  • spectral_dim – None or "auto": derive the dim name from xlist_data_units. Any other string: force this dimension name (units still come from the table when known; unknown WiRE enums fall back to units="unknown").

Returns:

dim_name is safe as an xarray dimension identifier (ASCII tokens).

Return type:

SpectralAxisSpec
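
For example ("RamanShift" is the label shown in the reader docs; the derived name is indicative):

    spec = wdfkit.resolve_spectral_axis("RamanShift", None)      # derive name
    print(spec.dim_name, spec.units)                             # e.g. raman_shift plus its units

    spec = wdfkit.resolve_spectral_axis("RamanShift", "shifts")  # force legacy name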

Submodules

wdfkit.reader

Public WDFReader API plus module-level read() and classify().

class wdfkit.reader.WDFReader(path: str | PathLike[str], *, verbose: bool = False, time_coord: str = 'seconds_elapsed', spectral_dim: str | None = None, chunks: bool | int = False)[source]

Bases: object

Load spectra and metadata from a Renishaw WiRE .wdf binary file.

Typical usage:

data_array, white_light_image = WDFReader(path)

After construction, .data and .image hold the same objects as the unpacked tuple.

Parameters:
  • spectral_dim – Name for the spectral axis coordinate (default None / "auto"). WiRE XLST XlistDataUnits selects the default (e.g. RamanShift → dimension "raman_shift"). Set to "shifts" for legacy notebooks.

  • chunks – Enable lazy Dask-backed reading. False (default) = eager; True = auto-chunk at ~128 MB per chunk; int = target MB.

wdfkit.reader.classify(path: str | PathLike[str]) dict[source]

Return scan classification for a WiRE .wdf file without loading the spectral data.

Returns:

Keys: kind, measurement_type, scan_type, wmap_flag, nspectra, npoints, nsteps.

Return type:

dict

wdfkit.reader.read(path: str | PathLike[str], *, verbose: bool = False, spectral_dim: str | None = None, chunks: bool | int = False) DataArray[source]

Read a WiRE .wdf file and return a xarray.DataArray.

Parameters:
  • path – Path to the .wdf file.

  • spectral_dim – Override for the spectral-axis dimension name.

  • chunks – Dask chunking: False (eager), True (auto), or int (target MB).

Returns:

Shape and dims depend on scan kind; spectral axis is always last.

Return type:

xarray.DataArray

wdfkit.cosmic_ray

High-level cosmic-ray removal: CosmicRayRemover for maps and singles.

class wdfkit.cosmic_ray.CosmicRayRemover(sensitivity: float = 0.01, width: float = 0.02, disk_radius: int = 3, single_spectrum_method: Literal['median', 'interpolate', 'derivative'] = 'median', kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3, spectral_dim: str | None = None, map_mad_multiplier: float = 7.0, map_noisy_channel_relax_min: float = 0.82, map_spectral_dilate_cap: int = 5, map_max_spectral_repair_extent: int | None = 12, map_min_residual_over_cutoff: float = 1.05, map_require_spatial_local_max: bool = True)[source]

Bases: object

Cosmic-ray removal: spatial median for maps; robust 1D for singles.

Optionally removes broad Nd:YAG laser harmonics on ~355 nm excitation before narrow spike removal (harmonic_check(), remove()).

Maps (3D): spatial disk median on a min/median-normalized cube; per-λ scaled MAD cutoffs and noisy-band relax_λ; repair by spectral interpolation (not copying the full median surface).

Single spectrum / line scan (2D or 1×1 map): see remove_cosmic_rays_1d() — up to max_passes iterations of scipy.signal.medfilt-based MAD detection, mask dilation by 1 channel, and linear-interpolation repair from the original signal.

Parameters:
  • sensitivity (float) – Map path: scales aggressiveness. The cutoff includes (0.01 / sensitivity) times the per-channel MAD level (0.01 is the legacy default reference). Larger sensitivity → more hits.

  • width (float) – Map path: spectral dilation of the CR mask (fraction of length).

  • disk_radius (int) – Map path: spatial disk radius for the reference median filter.

  • map_mad_multiplier (float) – Map path: multiplier on noise_λ × relax_λ (like 1D threshold; larger → fewer false positives).

  • map_noisy_channel_relax_min (float) – Map path: floor on relax_λ in noisy channels. Higher → weaker boost in noisy bands → fewer false positives.

  • map_spectral_dilate_cap (int) – Map path: max footprint length (in spectral channels) when dilating hits along λ. Caps width × N so repair stays narrow and interp stays accurate.

  • map_require_spatial_local_max (bool) – Map path: if True (default), keep only voxels that are strict maxima in their (y, x) slice at fixed λ (8-neighbour), which reduces the risk of extended bright features being treated as CRs.

  • map_max_spectral_repair_extent (int | None) – Map path: after spectral dilation, each contiguous repair segment along λ is clipped to at most this many channels (centered on max residual in the segment). None disables (not recommended for noisy maps).

  • map_min_residual_over_cutoff (float) – Map path: require residual > cutoff * this factor (> 1 stricter, fewer false positives). Use 1.0 for the legacy strict inequality.

  • single_spectrum_method (Literal['median', 'interpolate', 'derivative']) – "median" or "interpolate" (both use medfilt detection and linear-interpolation repair — equivalent in practice), or "derivative" (neighbour-difference peak test).

  • kernel_size (int) – Odd, >= 3. Passed to medfilt for single-spectrum median-based methods.

  • threshold (float) – Single-spectrum only: spike cutoff is threshold * MAD_noise. Lower → more aggressive (try 3.5–4.0 for noisy spectra).

  • max_passes (int) – Single-spectrum only: number of detection–repair iterations (default 3). Each pass runs on the already-repaired signal so that large spikes no longer mask smaller ones. 1 replicates old single-pass behaviour.

  • spectral_dim (str | None) – Name of the spectral axis (default: last dimension). Used for harmonic cleanup and when the spectral dimension is not last.

disk_radius: int = 3
harmonic_check(spectrum: DataArray) DataArray[source]

Notch broad harmonics when LaserWaveLength is ~355 nm (Nd:YAG).

If spectrum.attrs['LaserWaveLength'] is outside 354–356 nm, returns spectrum unchanged.

Searches 1064 / 532 / 355 / 266 nm (±2.5 nm); replaces ~1 nm around each found peak with linear interpolation. Prints one line per removal.

kernel_size: int = 5
map_mad_multiplier: float = 7.0
map_max_spectral_repair_extent: int | None = 12
map_min_residual_over_cutoff: float = 1.05
map_noisy_channel_relax_min: float = 0.82
map_require_spatial_local_max: bool = True
map_spectral_dilate_cap: int = 5
max_passes: int = 3
remove(spectrum: DataArray) DataArray[source]

Harmonic cleanup first, then cosmic-ray removal.

remove_cosmic_rays(spectrum: DataArray) DataArray[source]

Spike removal only (no harmonic notch).

remove_cosmic_rays_with_diagnostics(spectrum: DataArray) tuple[DataArray, dict[str, Any]][source]

Like remove_cosmic_rays(), but returns a diagnostics dict for visualization / QC (not written to DataArray.attrs).

For 3D maps, diagnostics includes boolean core_mask, repair_mask, and float arrays residual, preprocessed, spatial_median_reference, cutoff, per_spectrum_median, etc. Use matplotlib to overlay masks or compare spectra at selected (y, x).

For 2D single-spectrum input, the diagnostics dict contains cosmic_mask and corrected_1d (the 1D corrected intensity).

remove_with_diagnostics(spectrum: DataArray) tuple[DataArray, dict[str, Any]][source]

Harmonics, then remove_cosmic_rays_with_diagnostics().

sensitivity: float = 0.01
single_spectrum_method: Literal['median', 'interpolate', 'derivative'] = 'median'
spectral_dim: str | None = None
threshold: float = 5.0
transform(spectrum: DataArray) DataArray[source]

Alias of remove() (harmonics then cosmic rays).

width: float = 0.02
wdfkit.cosmic_ray.remove_cosmic_rays_1d(y: ndarray, method: Literal['median', 'interpolate', 'derivative'], *, kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3) tuple[ndarray, ndarray][source]

Remove sharp positive spikes from one 1D spectrum (PL-style).

Operates on the raw counts / intensity array (only masked indices change). Cosmic rays: positive excursions vs a robust noise model.

The algorithm runs up to max_passes iterations. Each pass:

  1. Detects new spikes on the current (already-repaired) signal.

  2. Dilates the new spike mask by 1 channel on each side to catch sub-threshold spike edges.

  3. Accumulates into a single cumulative mask across all passes.

  4. Repairs by linear interpolation from the original signal at all cumulative masked positions — avoids chaining interpolation errors.

Early termination when a pass finds no new spikes.

Parameters:
  • y – One spectral trace (any numeric dtype; cast to float).

  • method"median" or "interpolate"scipy.signal.medfilt reference signal, MAD on residual for detection (both methods now repair identically via linear interpolation). "derivative" — neighbour-difference test on diff(y) MAD; interior points only.

  • kernel_size – Odd length >= 3 for medfilt (median / interpolate methods).

  • threshold – Multiplier on MAD-derived noise (larger → fewer detections).

  • max_passes – Maximum number of detection–repair iterations (default 3). Use 1 for the old single-pass behaviour.

Returns:

  • corrected_y – Same shape as y; unchanged if no spikes are found or if noise is degenerate.

  • cosmic_mask – Boolean mask, same shape as y; True at all channels that were corrected (including dilation neighbours). All False when nothing was found or when the mask would cover the entire spectrum.

wdfkit.spectra_cleaner

High-level spectral denoising: SpectraCleaner.

Currently implements PCA-based reconstruction (legacy pca_clean); the method switch is kept so other denoisers can be added (e.g. Savitzky-Golay or wavelet) without breaking callers.

class wdfkit.spectra_cleaner.SpectraCleaner(method: Literal['pca'] = 'pca', n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, spectral_dim: str | None = None, pca_kwargs: dict[str, Any] = <factory>)[source]

Bases: object

Denoise a population of spectra by low-rank reconstruction.

Designed for 3D map cubes (ny, nx, n_spectral) and 2D stacks (n_spectra, n_spectral). PCA reconstruction needs more than one spectrum to separate shared signal from per-channel noise — a single spectrum is rejected with a clear error (use a 1D smoother instead).

Parameters:
  • method (Literal['pca']) – Denoising method. Currently only "pca" is implemented; the switch is kept for forward compatibility.

  • n_components (int | float | str | None) – Forwarded to sklearn.decomposition.PCA. "mle" (default), a float in (0, 1) for variance-explained, an int count, or None for min(n_spectra, n_spectral).

  • subtract_min (bool) – Subtract per-spectrum min before the fit (legacy default True). PCA also mean-centers internally, so this only changes the baseline offset fed to the fit.

  • restore_min (bool) – Add the saved per-spectrum min back after reconstruction. Off by default (legacy behavior); enable to preserve absolute intensities.

  • spectral_dim (str | None) – Name of the spectral axis in DataArray inputs. Defaults to the last dimension; pass when spectra are not last (e.g. "raman_shift" with leading spectral axis).

  • pca_kwargs (dict[str, Any]) – Extra kwargs forwarded to sklearn.decomposition.PCA (e.g. {"svd_solver": "full"}).

clean(spectra: DataArray) DataArray[source]

Return a denoised copy of spectra (no decomposition payload).

clean_with_decomposition(spectra: DataArray) tuple[DataArray, dict[str, Any]][source]

Like clean(), but also returns the PCA decomposition.

The returned decomposition dict has keys components (shape (n_components, n_spectral)), coeffs (per-spectrum scores reshaped to the input’s spatial layout + components axis), mean, explained_variance, explained_variance_ratio, and noise_variance. These arrays can be large — they’re returned separately rather than written to DataArray.attrs.

method: Literal['pca'] = 'pca'
n_components: int | float | str | None = 'mle'
pca_kwargs: dict[str, Any]
restore_min: bool = False
spectral_dim: str | None = None
subtract_min: bool = True
transform(spectra: DataArray) DataArray[source]

Alias of clean().

wdfkit.spectral

Spectral-axis naming from WiRE XLST unit enums.

class wdfkit.spectral.SpectralAxisSpec(dim_name: str, units: str)[source]

Bases: object

Resolved spectral coordinate used as xarray dimension name + coord attrs.

dim_name: str
units: str
wdfkit.spectral.resolve_spectral_axis(xlist_data_units: str, spectral_dim: str | None) SpectralAxisSpec[source]

Choose spectral coordinate dimension name and coord.attrs["units"].

Parameters:
  • xlist_data_units – DATA_UNITS label resolved from raw XLST (e.g. "Nanometre").

  • spectral_dim – None or "auto": derive the dim name from xlist_data_units. Any other string: force this dimension name (units still come from the table when known; unknown WiRE enums fall back to units="unknown").

Returns:

dim_name is safe as an xarray dimension identifier (ASCII tokens).

Return type:

SpectralAxisSpec

wdfkit.preprocessing

Spectral preprocessing (normalization).

Cosmic-ray removal: see wdfkit.cosmic_ray. PCA-based denoising: see wdfkit.spectra_cleaner.

wdfkit.preprocessing.denoise_spectra_pca(values: ndarray, *, n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, pca_kwargs: dict[str, Any] | None = None, return_decomposition: bool = False) tuple[ndarray, dict[str, Any]] | tuple[ndarray, dict[str, Any], dict[str, Any]][source]

Denoise a stack / cube of spectra by PCA reconstruction.

The input is reshaped to (n_spectra, n_spectral) for the fit, then reshaped back to the original spatial layout on return. PCA itself mean-centers internally; the optional per-spectrum min subtraction below only changes the baseline offset fed to the decomposition.

Parameters:
  • values – Array of shape (..., n_spectral). Typical inputs: (ny, nx, n_spectral) map cube, or (n_spectra, n_spectral) stack. Needs more than one spectrum (PCA on a single spectrum is degenerate).

  • n_components – Forwarded to sklearn.decomposition.PCA. "mle" (default) picks the number with Minka’s MLE; a float in (0, 1) keeps the components that explain that fraction of variance; an int fixes the count; None uses min(n_spectra, n_spectral).

  • subtract_min – If True (default, matches legacy pca_clean), subtract the per-spectrum minimum before the fit so PCA models the spectral shape rather than offsets.

  • restore_min – If True, add the saved per-spectrum minimum back to the cleaned output. Off by default to match legacy pca_clean; turn on to preserve absolute intensities.

  • pca_kwargs – Extra kwargs passed straight to sklearn.decomposition.PCA (e.g. {"svd_solver": "full"}).

  • return_decomposition – If True, also return a third dict with the components, per-spectrum coefficients, mean, and explained-variance arrays (large; not suitable for DataArray.attrs).

Returns:

  • cleaned – Same shape and dtype-flavor (float) as values.

  • meta – Small dict with the parameters actually used and summary stats — safe to attach to DataArray.attrs.

  • decomposition_payload – Only when return_decomposition=True. Has keys components, coeffs, mean, explained_variance, explained_variance_ratio, noise_variance.
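
A minimal ndarray-level sketch (synthetic cube; real data comes from wdfkit.read()):

    import numpy as np
    from wdfkit.preprocessing import denoise_spectra_pca

    cube = np.random.rand(32, 32, 512)   # (ny, nx, n_spectral) map cube
    cleaned, meta = denoise_spectra_pca(cube, n_components=0.99)
    assert cleaned.shape == cube.shape   # spatial layout is preserved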

wdfkit.preprocessing.normalize(input_spectra: DataArray | ndarray, method: str = 'robust_scale', *, spectral_dim: str | None = None, **kwargs) DataArray | ndarray[source]

Scale spectra along the spectral axis.

For xarray.DataArray input, the spectral axis defaults to the last dimension (e.g. nm, raman_shift, shifts, …). Pass spectral_dim to select another dimension when spectra are not last.

Dask-backed DataArrays are handled transparently:

  • Per-spectrum methods ("l1", "l2", "max", "min_max", "area"): processed chunk-by-chunk via xr.apply_ufunc — no data is loaded into RAM beyond the current chunk.

  • Global methods ("robust_scale", "wave_number"): require statistics across all spectra; the full array is computed first. A UserWarning is emitted so you know RAM is being used.

Parameters:
  • input_spectra – DataArray or 2D ndarray of shape (n_spectra, n_points).

  • method – One of "l1", "l2", "max", "min_max", "wave_number", "robust_scale", "area".

  • spectral_dim – Spectral dimension name when input_spectra is a DataArray.

  • x_values – Spectral abscissa for ndarray input (default arange(n_points)).

Returns:

Same type as input_spectra, with updated attrs["treatments"] for DataArray output.

wdfkit.preprocessing.normalize

Per-spectrum normalization (dynamic spectral coordinate).

wdfkit.preprocessing.normalize.normalize(input_spectra: DataArray | ndarray, method: str = 'robust_scale', *, spectral_dim: str | None = None, **kwargs) DataArray | ndarray[source]

Scale spectra along the spectral axis.

For xarray.DataArray input, the spectral axis defaults to the last dimension (e.g. nm, raman_shift, shifts, …). Pass spectral_dim to select another dimension when spectra are not last.

Dask-backed DataArrays are handled transparently:

  • Per-spectrum methods ("l1", "l2", "max", "min_max", "area"): processed chunk-by-chunk via xr.apply_ufunc — no data is loaded into RAM beyond the current chunk.

  • Global methods ("robust_scale", "wave_number"): require statistics across all spectra; the full array is computed first. A UserWarning is emitted so you know RAM is being used.

Parameters:
  • input_spectra – DataArray or 2D ndarray of shape (n_spectra, n_points).

  • method – One of "l1", "l2", "max", "min_max", "wave_number", "robust_scale", "area".

  • spectral_dim – Spectral dimension name when input_spectra is a DataArray.

  • x_values – Spectral abscissa for ndarray input (default arange(n_points)).

Returns:

Same type as input_spectra, with updated attrs["treatments"] for DataArray output.

wdfkit.preprocessing.pca_clean

PCA-based spectral denoising for stacks of spectra and 3D map cubes.

PCA decomposes a population of spectra into orthogonal components and reconstructs each spectrum from the leading ones. Components dominated by uncorrelated per-channel noise are dropped, so the reconstruction is a denoised version of the input. This requires more than one spectrum — see wdfkit.spectra_cleaner.SpectraCleaner for the user-facing API.

wdfkit.preprocessing.pca_clean.denoise_spectra_pca(values: ndarray, *, n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, pca_kwargs: dict[str, Any] | None = None, return_decomposition: bool = False) tuple[ndarray, dict[str, Any]] | tuple[ndarray, dict[str, Any], dict[str, Any]][source]

Denoise a stack / cube of spectra by PCA reconstruction.

The input is reshaped to (n_spectra, n_spectral) for the fit, then reshaped back to the original spatial layout on return. PCA itself mean-centers internally; the optional per-spectrum min subtraction below only changes the baseline offset fed to the decomposition.

Parameters:
  • values – Array of shape (..., n_spectral). Typical inputs: (ny, nx, n_spectral) map cube, or (n_spectra, n_spectral) stack. Needs more than one spectrum (PCA on a single spectrum is degenerate).

  • n_components – Forwarded to sklearn.decomposition.PCA. "mle" (default) picks the number with Minka’s MLE; a float in (0, 1) keeps the components that explain that fraction of variance; an int fixes the count; None uses min(n_spectra, n_spectral).

  • subtract_min – If True (default, matches legacy pca_clean), subtract the per-spectrum minimum before the fit so PCA models the spectral shape rather than offsets.

  • restore_min – If True, add the saved per-spectrum minimum back to the cleaned output. Off by default to match legacy pca_clean; turn on to preserve absolute intensities.

  • pca_kwargs – Extra kwargs passed straight to sklearn.decomposition.PCA (e.g. {"svd_solver": "full"}).

  • return_decomposition – If True, also return a third dict with the components, per-spectrum coefficients, mean, and explained-variance arrays (large; not suitable for DataArray.attrs).

Returns:

  • cleaned – Same shape and dtype-flavor (float) as values.

  • meta – Small dict with the parameters actually used and summary stats — safe to attach to DataArray.attrs.

  • decomposition_payload – Only when return_decomposition=True. Has keys components, coeffs, mean, explained_variance, explained_variance_ratio, noise_variance.

wdfkit.preprocessing.cosmic_ray_1d

1D spectrum cosmic-ray (positive spike) removal.

wdfkit.preprocessing.cosmic_ray_1d.linear_interpolate_masked_channels_1d(y: ndarray, bad_channel_mask: ndarray) ndarray[source]

Fill masked channels by linear interpolation from good ones.
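
For example:

    import numpy as np
    from wdfkit.preprocessing.cosmic_ray_1d import linear_interpolate_masked_channels_1d

    y = np.array([1.0, 2.0, 99.0, 4.0, 5.0])
    bad = np.array([False, False, True, False, False])
    print(linear_interpolate_masked_channels_1d(y, bad))   # 99.0 becomes 3.0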

wdfkit.preprocessing.cosmic_ray_1d.positive_spike_mask_from_derivative_peaks(y: ndarray, threshold_multiplier: float) ndarray[source]

Interior i where y[i] is above both neighbors by threshold_multiplier * noise.

noise is scaled MAD of diff(y).

wdfkit.preprocessing.cosmic_ray_1d.positive_spike_mask_vs_median_smooth(y: ndarray, median_smoothed_y: ndarray, threshold_multiplier: float) tuple[ndarray, float][source]

Mask where positive residual exceeds threshold_multiplier * noise.

Residual is y - median_smoothed_y; noise is scaled MAD of residual.

wdfkit.preprocessing.cosmic_ray_1d.remove_cosmic_rays_1d(y: ndarray, method: Literal['median', 'interpolate', 'derivative'], *, kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3) tuple[ndarray, ndarray][source]

Remove sharp positive spikes from one 1D spectrum (PL-style).

Operates on the raw counts / intensity array (only masked indices change). Cosmic rays: positive excursions vs a robust noise model.

The algorithm runs up to max_passes iterations. Each pass:

  1. Detects new spikes on the current (already-repaired) signal.

  2. Dilates the new spike mask by 1 channel on each side to catch sub-threshold spike edges.

  3. Accumulates into a single cumulative mask across all passes.

  4. Repairs by linear interpolation from the original signal at all cumulative masked positions — avoids chaining interpolation errors.

Early termination when a pass finds no new spikes.

Parameters:
  • y – One spectral trace (any numeric dtype; cast to float).

  • method"median" or "interpolate"scipy.signal.medfilt reference signal, MAD on residual for detection (both methods now repair identically via linear interpolation). "derivative" — neighbour-difference test on diff(y) MAD; interior points only.

  • kernel_size – Odd length >= 3 for medfilt (median / interpolate methods).

  • threshold – Multiplier on MAD-derived noise (larger → fewer detections).

  • max_passes – Maximum number of detection–repair iterations (default 3). Use 1 for the old single-pass behaviour.

Returns:

  • corrected_y – Same shape as y; unchanged if no spikes are found or if noise is degenerate.

  • cosmic_mask – Boolean mask, same shape as y; True at all channels that were corrected (including dilation neighbours). All False when nothing was found or when the mask would cover the entire spectrum.

wdfkit.preprocessing.cosmic_ray_map

Spatial (3D map) cosmic-ray detection and replacement.

wdfkit.preprocessing.cosmic_ray_map.correct_cosmic_rays_on_map_cube(values: ndarray, *, sensitivity: float, spectral_width_fraction: float, disk_radius: int, map_mad_multiplier: float = 7.0, map_noisy_channel_relax_min: float = 0.82, map_spectral_dilate_cap: int = 5, map_max_spectral_repair_extent: int | None = 12, map_min_residual_over_cutoff: float = 1.05, map_require_spatial_local_max: bool = True, return_diagnostic_masks: bool = False) tuple[ndarray, dict[str, Any]] | tuple[ndarray, dict[str, Any], dict[str, Any]][source]

Spatial disk median on a per-spectrum normalized cube; robust positive residual test per wavelength.

Per channel λ, the cutoff is map_mad_multiplier * (0.01/sensitivity) * relax_λ * noise_λ, where noise_λ is scaled MAD of (preprocessed - spatial_median_reference) in the (y, x) plane, and relax_λ comes from map_noisy_channel_relax_min (noisy bands more sensitive).

Spectral dilation length is min(width×N, map_spectral_dilate_cap). After dilation, each contiguous True segment along λ at fixed (y, x) is clipped to at most map_max_spectral_repair_extent channels (None disables) so repair stays localized.

Detection uses residual > map_min_residual_over_cutoff * cutoff.

If map_require_spatial_local_max, a voxel must be a strict spatial maximum in its λ slice among 8 neighbours (reduces false positives).

Repair: dilate core hits along λ, then for each (y, x) interpolate masked samples along λ from spatial_median_reference[y, x, :]; unmasked λ keep preprocessed.

If return_diagnostic_masks is True, returns a third dict (large numpy arrays — do not put them in DataArray.attrs).
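
A numpy sketch of the per-channel cutoff described above. The 1.4826 normal-consistency factor for the scaled MAD and the constant relax_lambda are assumptions; the real relax_λ is derived per channel from map_noisy_channel_relax_min:

    import numpy as np

    def per_channel_cutoff(residual, sensitivity, map_mad_multiplier,
                           relax_lambda=1.0):
        # residual: (ny, nx, n_spectral) = preprocessed - spatial_median_reference
        med = np.median(residual, axis=(0, 1), keepdims=True)
        mad = np.median(np.abs(residual - med), axis=(0, 1))
        noise_lambda = 1.4826 * mad   # assumed scaled-MAD convention
        return map_mad_multiplier * (0.01 / sensitivity) * relax_lambda * noise_lambda

    # Core detection then follows: residual > map_min_residual_over_cutoff * cutoff.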

wdfkit.preprocessing.cosmic_ray_map.interpolate_cosmic_ray_regions_spectrally(preprocessed: ndarray, spatial_median_reference: ndarray, repair_mask: ndarray) ndarray[source]

Inpaint repair_mask points by interpolation along λ.

Reference curve is spatial_median_reference[y, x, :]; other channels keep original preprocessed values.

wdfkit.preprocessing.cosmic_ray_map.min_subtract_median_normalize_map_cube(values: ndarray) tuple[ndarray, ndarray][source]

Per spectrum: subtract min along λ, divide by median intensity.
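
A plausible numpy equivalent of this step (sketch only; the zero-median guard and the exact second return value are assumptions):

    import numpy as np

    def min_median_normalize(values):
        shifted = values - values.min(axis=-1, keepdims=True)  # subtract per-spectrum min
        scale = np.median(shifted, axis=-1, keepdims=True)     # per-spectrum median intensity
        scale = np.where(scale == 0, 1.0, scale)               # assumed zero guard
        return shifted / scale, scale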

wdfkit.preprocessing.cosmic_ray_map.unique_spatial_indices_from_nonzero(nonzero_axes: tuple[ndarray, ...], spatial_ndim: int) list[tuple[int, ...]][source]

Unique (y, x, …) from np.nonzero-style sparse index arrays.