wdfkit package

Python package for WDF data treatment.

class wdfkit.CosmicRayRemover(sensitivity: float = 0.01, width: float = 0.02, disk_radius: int = 3, single_spectrum_method: Literal['median', 'interpolate', 'derivative'] = 'median', kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3, spectral_dim: str | None = None, map_mad_multiplier: float = 7.0, map_noisy_channel_relax_min: float = 0.82, map_spectral_dilate_cap: int = 5, map_max_spectral_repair_extent: int | None = 12, map_min_residual_over_cutoff: float = 1.05, map_require_spatial_local_max: bool = True)[source]

Bases: object

Cosmic-ray removal: spatial median for maps; robust 1D for singles.

Optionally removes broad Nd:YAG laser harmonics on ~355 nm excitation before narrow spike removal (harmonic_check(), remove()).

Maps (3D): spatial disk median on a min/median-normalized cube; per-λ scaled MAD cutoffs and noisy-band relax_λ; repair by spectral interpolation (not copying the full median surface).

Single spectrum / line scan (2D or 1×1 map): see remove_cosmic_rays_1d() — up to max_passes iterations of scipy.signal.medfilt-based MAD detection, mask dilation by 1 channel, and linear-interpolation repair from the original signal.

Parameters:
  • sensitivity (float) – Map path: scales aggressiveness. The cutoff includes (0.01 / sensitivity) times the per-channel MAD level (0.01 is the legacy default reference). Larger sensitivity → more hits.

  • width (float) – Map path: spectral dilation of the CR mask (fraction of length).

  • disk_radius (int) – Map path: spatial disk radius for the reference median filter.

  • map_mad_multiplier (float) – Map path: multiplier on noise_λ × relax_λ (like 1D threshold; larger → fewer false positives).

  • map_noisy_channel_relax_min (float) – Map path: floor on relax_λ in noisy channels. Higher → weaker boost in noisy bands → fewer false positives.

  • map_spectral_dilate_cap (int) – Map path: max footprint length (in spectral channels) when dilating hits along λ. Caps width × N so repair stays narrow and interp stays accurate.

  • map_require_spatial_local_max (bool) – Map path: if True (default), keep only voxels that are strict maxima in their (y, x) slice at fixed λ (8-neighbour), which reduces the risk of extended bright features being treated as CRs.

  • map_max_spectral_repair_extent (int | None) – Map path: after spectral dilation, each contiguous repair segment along λ is clipped to at most this many channels (centered on max residual in the segment). None disables (not recommended for noisy maps).

  • map_min_residual_over_cutoff (float) – Map path: require residual > cutoff * this factor (> 1 stricter, fewer false positives). Use 1.0 for the legacy strict inequality.

  • single_spectrum_method (Literal['median', 'interpolate', 'derivative']) – "median" or "interpolate" (both use medfilt detection and linear-interpolation repair — equivalent in practice), or "derivative" (neighbour-difference peak test).

  • kernel_size (int) – Odd, >= 3. Passed to medfilt for single-spectrum median-based methods.

  • threshold (float) – Single-spectrum only: spike cutoff is threshold * MAD_noise. Lower → more aggressive (try 3.5–4.0 for noisy spectra).

  • max_passes (int) – Single-spectrum only: number of detection–repair iterations (default 3). Each pass runs on the already-repaired signal so that large spikes no longer mask smaller ones. 1 replicates old single-pass behaviour.

  • spectral_dim (str | None) – Name of the spectral axis (default: last dimension). Used for harmonic cleanup and when the spectral dimension is not last.
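
A minimal usage sketch (the .wdf path is hypothetical and parameter values are illustrative, not tuned):

    import wdfkit

    spectra = wdfkit.read("map_scan.wdf")      # e.g. a 3D map cube (y, x, spectral)

    remover = wdfkit.CosmicRayRemover(
        threshold=4.0,   # single-spectrum path: lower = more aggressive
        max_passes=3,
    )
    cleaned = remover.remove(spectra)          # harmonic notch, then spike removal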

disk_radius: int = 3
harmonic_check(spectrum: DataArray) DataArray[source]

Notch broad harmonics when LaserWaveLength is ~355 nm (Nd:YAG).

If spectrum.attrs['LaserWaveLength'] is outside 354–356 nm, returns spectrum unchanged.

Searches 1064 / 532 / 355 / 266 nm (±2.5 nm); replaces ~1 nm around each found peak with linear interpolation. Prints one line per removal.

kernel_size: int = 5
map_mad_multiplier: float = 7.0
map_max_spectral_repair_extent: int | None = 12
map_min_residual_over_cutoff: float = 1.05
map_noisy_channel_relax_min: float = 0.82
map_require_spatial_local_max: bool = True
map_spectral_dilate_cap: int = 5
max_passes: int = 3
remove(spectrum: DataArray) DataArray[source]

Harmonic cleanup first, then cosmic-ray removal.

remove_cosmic_rays(spectrum: DataArray) DataArray[source]

Spike removal only (no harmonic notch).

remove_cosmic_rays_with_diagnostics(spectrum: DataArray) tuple[DataArray, dict[str, Any]][source]

Like remove_cosmic_rays(), but returns a diagnostics dict for visualization / QC (not written to DataArray.attrs).

For 3D maps, diagnostics includes boolean core_mask, repair_mask, and float arrays residual, preprocessed, spatial_median_reference, cutoff, per_spectrum_median, etc. Use matplotlib to overlay masks or compare spectra at selected (y, x).

For 2D single-spectrum input, the diagnostics dict contains cosmic_mask and corrected_1d (the 1D corrected intensity).
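
A sketch of one QC pattern for a 3D map, using the diagnostics keys listed above (assumes repair_mask shares the cube's (y, x, λ) layout):

    import matplotlib.pyplot as plt

    cleaned, diag = remover.remove_cosmic_rays_with_diagnostics(spectra)

    # Repaired channels per pixel: bright spots show where CRs were found.
    hits_per_pixel = diag["repair_mask"].sum(axis=-1)
    plt.imshow(hits_per_pixel)
    plt.colorbar(label="repaired channels")
    plt.show()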

remove_with_diagnostics(spectrum: DataArray) tuple[DataArray, dict[str, Any]][source]

Harmonics, then remove_cosmic_rays_with_diagnostics().

sensitivity: float = 0.01
single_spectrum_method: Literal['median', 'interpolate', 'derivative'] = 'median'
spectral_dim: str | None = None
threshold: float = 5.0
transform(spectrum: DataArray) DataArray[source]

Alias of remove() (harmonics then cosmic rays).

width: float = 0.02
class wdfkit.SpectraCleaner(method: Literal['pca'] = 'pca', n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, spectral_dim: str | None = None, pca_kwargs: dict[str, Any] = <factory>)[source]

Bases: object

Denoise a population of spectra by low-rank reconstruction.

Designed for 3D map cubes (ny, nx, n_spectral) and 2D stacks (n_spectra, n_spectral). PCA reconstruction needs more than one spectrum to separate shared signal from per-channel noise — a single spectrum is rejected with a clear error (use a 1D smoother instead).

Parameters:
  • method (Literal['pca']) – Denoising method. Currently only "pca" is implemented; the switch is kept for forward compatibility.

  • n_components (int | float | str | None) – Forwarded to sklearn.decomposition.PCA. "mle" (default), a float in (0, 1) for variance-explained, an int count, or None for min(n_spectra, n_spectral).

  • subtract_min (bool) – Subtract per-spectrum min before the fit (legacy default True). PCA also mean-centers internally, so this only changes the baseline offset fed to the fit.

  • restore_min (bool) – Add the saved per-spectrum min back after reconstruction. Off by default (legacy behavior); enable to preserve absolute intensities.

  • spectral_dim (str | None) – Name of the spectral axis in DataArray inputs. Defaults to the last dimension; pass when spectra are not last (e.g. "raman_shift" with leading spectral axis).

  • pca_kwargs (dict[str, Any]) – Extra kwargs forwarded to sklearn.decomposition.PCA (e.g. {"svd_solver": "full"}).
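
A minimal usage sketch (the 0.99 variance-explained target is illustrative):

    import wdfkit

    cleaner = wdfkit.SpectraCleaner(n_components=0.99)  # keep 99 % of variance
    denoised = cleaner.clean(spectra)   # spectra: map cube or 2D stack DataArray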

clean(spectra: DataArray) DataArray[source]

Return a denoised copy of spectra (no decomposition payload).

clean_with_decomposition(spectra: DataArray) tuple[DataArray, dict[str, Any]][source]

Like clean(), but also returns the PCA decomposition.

The returned decomposition dict has keys components (shape (n_components, n_spectral)), coeffs (per-spectrum scores reshaped to the input’s spatial layout + components axis), mean, explained_variance, explained_variance_ratio, and noise_variance. These arrays can be large — they’re returned separately rather than written to DataArray.attrs.
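
For example, a quick check of how much variance the retained components capture (key names as listed above):

    denoised, decomp = cleaner.clean_with_decomposition(spectra)

    print(decomp["components"].shape)                # (n_components, n_spectral)
    print(decomp["explained_variance_ratio"].sum())  # fraction of variance kept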

method: Literal['pca'] = 'pca'
n_components: int | float | str | None = 'mle'
pca_kwargs: dict[str, Any]
restore_min: bool = False
spectral_dim: str | None = None
subtract_min: bool = True
transform(spectra: DataArray) DataArray[source]

Alias of clean().

class wdfkit.SpectralAxisSpec(dim_name: str, units: str)[source]

Bases: object

Resolved spectral coordinate used as xarray dimension name + coord attrs.

dim_name: str
units: str
class wdfkit.WDFReader(path: str | PathLike[str], *, verbose: bool = False, time_coord: str = 'seconds_elapsed', spectral_dim: str | None = None, chunks: bool | int = False)[source]

Bases: object

Load spectra and metadata from a Renishaw WiRE .wdf binary file.

Typical usage:

data_array, white_light_image = WDFReader(path)

After construction, .data and .image hold the same objects as the unpacked tuple.

Parameters:
  • spectral_dim – Name for the spectral axis coordinate (default None / "auto"). WiRE XLST XlistDataUnits selects the default (e.g. RamanShift → dimension "raman_shift"). Set to "shifts" for legacy notebooks.

  • chunks – Enable lazy Dask-backed reading. False (default) = eager; True = auto-chunk at ~128 MB per chunk; int = target MB.
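
A sketch of eager vs. lazy construction (the path is hypothetical):

    from wdfkit import WDFReader

    # Eager read: unpack spectra and the white-light image.
    data_array, white_light_image = WDFReader("big_map.wdf")

    # Lazy Dask-backed read, targeting ~64 MB chunks.
    reader = WDFReader("big_map.wdf", chunks=64)
    data_array = reader.data     # same object the unpacked tuple would give
    image = reader.image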

wdfkit.classify(path: str | PathLike[str]) dict[source]

Return scan classification for a WiRE .wdf file without loading the spectral data.

Returns:

Keys: kind, measurement_type, scan_type, wmap_flag, nspectra, npoints, nsteps.

Return type:

dict
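
For example (key names as listed above; values depend on the file):

    info = wdfkit.classify("map_scan.wdf")   # hypothetical path
    print(info["kind"], info["nspectra"], info["npoints"])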

wdfkit.normalize(input_spectra: DataArray | ndarray, method: str = 'robust_scale', *, spectral_dim: str | None = None, **kwargs) DataArray | ndarray[source]

Scale spectra along the spectral axis.

For xarray.DataArray input, the spectral axis defaults to the last dimension (e.g. nm, raman_shift, shifts, …). Pass spectral_dim to select another dimension when spectra are not last.

Dask-backed DataArrays are handled transparently:

  • Per-spectrum methods ("l1", "l2", "max", "min_max", "area"): processed chunk-by-chunk via xr.apply_ufunc — no data is loaded into RAM beyond the current chunk.

  • Global methods ("robust_scale", "wave_number"): require statistics across all spectra; the full array is computed first. A UserWarning is emitted so you know RAM is being used.

Parameters:
  • input_spectra – DataArray or 2D ndarray of shape (n_spectra, n_points).

  • method – One of "l1", "l2", "max", "min_max", "wave_number", "robust_scale", "area".

  • spectral_dim – Spectral dimension name when input_spectra is a DataArray.

  • x_values – Spectral abscissa for ndarray input (default arange(n_points)).

Returns:

Same type as input_spectra, with updated attrs["treatments"] for DataArray output.
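
A sketch covering both input types (method names from the list above; the ndarray is synthetic):

    import numpy as np
    import wdfkit

    # DataArray input: spectral axis defaults to the last dimension.
    scaled = wdfkit.normalize(spectra, method="l2")

    # ndarray input: shape (n_spectra, n_points); x_values is the abscissa.
    arr = np.random.rand(10, 512)
    scaled_arr = wdfkit.normalize(arr, method="area", x_values=np.arange(512))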

wdfkit.read(path: str | PathLike[str], *, verbose: bool = False, spectral_dim: str | None = None, chunks: bool | int = False) DataArray[source]

Read a WiRE .wdf file and return a xarray.DataArray.

Parameters:
  • path – Path to the .wdf file.

  • spectral_dim – Override for the spectral-axis dimension name.

  • chunks – Dask chunking: False (eager), True (auto), or int (target MB).

Returns:

Shape and dims depend on scan kind; spectral axis is always last.

Return type:

xarray.DataArray

wdfkit.remove_cosmic_rays_1d(y: ndarray, method: Literal['median', 'interpolate', 'derivative'], *, kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3) tuple[ndarray, ndarray][source]

Remove sharp positive spikes from one 1D spectrum (PL-style).

Operates on the raw counts / intensity array (only masked indices change). Cosmic rays: positive excursions vs a robust noise model.

The algorithm runs up to max_passes iterations. Each pass:

  1. Detects new spikes on the current (already-repaired) signal.

  2. Dilates the new spike mask by 1 channel on each side to catch sub-threshold spike edges.

  3. Accumulates into a single cumulative mask across all passes.

  4. Repairs by linear interpolation from the original signal at all cumulative masked positions — avoids chaining interpolation errors.

Early termination when a pass finds no new spikes.

Parameters:
  • y – One spectral trace (any numeric dtype; cast to float).

  • method"median" or "interpolate"scipy.signal.medfilt reference signal, MAD on residual for detection (both methods now repair identically via linear interpolation). "derivative" — neighbour-difference test on diff(y) MAD; interior points only.

  • kernel_size – Odd length >= 3 for medfilt (median / interpolate methods).

  • threshold – Multiplier on MAD-derived noise (larger → fewer detections).

  • max_passes – Maximum number of detection–repair iterations (default 3). Use 1 for the old single-pass behaviour.

Returns:

  • corrected_y – Same shape as y; unchanged if no spikes are found or if noise is degenerate.

  • cosmic_mask – Boolean mask, same shape as y; True at all channels that were corrected (including dilation neighbours). All False when nothing was found or when the mask would cover the entire spectrum.
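
A self-contained sketch on a synthetic spectrum with one injected spike:

    import numpy as np
    from wdfkit import remove_cosmic_rays_1d

    rng = np.random.default_rng(0)
    y = 100 + 5 * rng.standard_normal(500)   # flat baseline plus noise
    y[250] += 400                            # one sharp positive spike

    corrected, mask = remove_cosmic_rays_1d(y, "median", kernel_size=5)
    print(int(mask.sum()))   # spike channel plus its dilated neighbours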

wdfkit.resolve_spectral_axis(xlist_data_units: str, spectral_dim: str | None) SpectralAxisSpec[source]

Choose spectral coordinate dimension name and coord.attrs["units"].

Parameters:
  • xlist_data_units – DATA_UNITS label resolved from raw XLST (e.g. "Nanometre").

  • spectral_dim – None or "auto": derive the dim name from xlist_data_units. Any other string: force this dimension name (units still come from the table when known; unknown WiRE enums fall back to units="unknown").

Returns:

dim_name is safe as an xarray dimension identifier (ASCII tokens).

Return type:

SpectralAxisSpec
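
For example ("RamanShift" is the label shown in the reader docs; the derived name is indicative):

    spec = wdfkit.resolve_spectral_axis("RamanShift", None)      # derive name
    print(spec.dim_name, spec.units)                             # e.g. raman_shift plus its units

    spec = wdfkit.resolve_spectral_axis("RamanShift", "shifts")  # force legacy name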

Submodules

wdfkit.reader

Public WDFReader API plus module-level read() and classify().

class wdfkit.reader.WDFReader(path: str | PathLike[str], *, verbose: bool = False, time_coord: str = 'seconds_elapsed', spectral_dim: str | None = None, chunks: bool | int = False)[source]

Bases: object

Load spectra and metadata from a Renishaw WiRE .wdf binary file.

Typical usage:

data_array, white_light_image = WDFReader(path)

After construction, .data and .image hold the same objects as the unpacked tuple.

Parameters:
  • spectral_dim – Name for the spectral axis coordinate (default None / "auto"). WiRE XLST XlistDataUnits selects the default (e.g. RamanShift → dimension "raman_shift"). Set to "shifts" for legacy notebooks.

  • chunks – Enable lazy Dask-backed reading. False (default) = eager; True = auto-chunk at ~128 MB per chunk; int = target MB.

wdfkit.reader.classify(path: str | PathLike[str]) dict[source]

Return scan classification for a WiRE .wdf file without loading the spectral data.

Returns:

Keys: kind, measurement_type, scan_type, wmap_flag, nspectra, npoints, nsteps.

Return type:

dict

wdfkit.reader.read(path: str | PathLike[str], *, verbose: bool = False, spectral_dim: str | None = None, chunks: bool | int = False) DataArray[source]

Read a WiRE .wdf file and return a xarray.DataArray.

Parameters:
  • path – Path to the .wdf file.

  • spectral_dim – Override for the spectral-axis dimension name.

  • chunks – Dask chunking: False (eager), True (auto), or int (target MB).

Returns:

Shape and dims depend on scan kind; spectral axis is always last.

Return type:

xarray.DataArray

wdfkit.cosmic_ray

High-level cosmic-ray removal: CosmicRayRemover for maps and singles.

class wdfkit.cosmic_ray.CosmicRayRemover(sensitivity: float = 0.01, width: float = 0.02, disk_radius: int = 3, single_spectrum_method: Literal['median', 'interpolate', 'derivative'] = 'median', kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3, spectral_dim: str | None = None, map_mad_multiplier: float = 7.0, map_noisy_channel_relax_min: float = 0.82, map_spectral_dilate_cap: int = 5, map_max_spectral_repair_extent: int | None = 12, map_min_residual_over_cutoff: float = 1.05, map_require_spatial_local_max: bool = True)[source]

Bases: object

Cosmic-ray removal: spatial median for maps; robust 1D for singles.

Optionally removes broad Nd:YAG laser harmonics on ~355 nm excitation before narrow spike removal (harmonic_check(), remove()).

Maps (3D): spatial disk median on a min/median-normalized cube; per-λ scaled MAD cutoffs and noisy-band relax_λ; repair by spectral interpolation (not copying the full median surface).

Single spectrum / line scan (2D or 1×1 map): see remove_cosmic_rays_1d() — up to max_passes iterations of scipy.signal.medfilt-based MAD detection, mask dilation by 1 channel, and linear-interpolation repair from the original signal.

Parameters:
  • sensitivity (float) – Map path: scales aggressiveness. The cutoff includes (0.01 / sensitivity) times the per-channel MAD level (0.01 is the legacy default reference). Larger sensitivity → more hits.

  • width (float) – Map path: spectral dilation of the CR mask (fraction of length).

  • disk_radius (int) – Map path: spatial disk radius for the reference median filter.

  • map_mad_multiplier (float) – Map path: multiplier on noise_λ × relax_λ (like 1D threshold; larger → fewer false positives).

  • map_noisy_channel_relax_min (float) – Map path: floor on relax_λ in noisy channels. Higher → weaker boost in noisy bands → fewer false positives.

  • map_spectral_dilate_cap (int) – Map path: max footprint length (in spectral channels) when dilating hits along λ. Caps width × N so repair stays narrow and interp stays accurate.

  • map_require_spatial_local_max (bool) – Map path: if True (default), keep only voxels that are strict maxima in their (y, x) slice at fixed λ (8-neighbour), which reduces the risk of extended bright features being treated as CRs.

  • map_max_spectral_repair_extent (int | None) – Map path: after spectral dilation, each contiguous repair segment along λ is clipped to at most this many channels (centered on max residual in the segment). None disables (not recommended for noisy maps).

  • map_min_residual_over_cutoff (float) – Map path: require residual > cutoff * this factor (> 1 stricter, fewer false positives). Use 1.0 for the legacy strict inequality.

  • single_spectrum_method (Literal['median', 'interpolate', 'derivative']) – "median" or "interpolate" (both use medfilt detection and linear-interpolation repair — equivalent in practice), or "derivative" (neighbour-difference peak test).

  • kernel_size (int) – Odd, >= 3. Passed to medfilt for single-spectrum median-based methods.

  • threshold (float) – Single-spectrum only: spike cutoff is threshold * MAD_noise. Lower → more aggressive (try 3.5–4.0 for noisy spectra).

  • max_passes (int) – Single-spectrum only: number of detection–repair iterations (default 3). Each pass runs on the already-repaired signal so that large spikes no longer mask smaller ones. 1 replicates old single-pass behaviour.

  • spectral_dim (str | None) – Name of the spectral axis (default: last dimension). Used for harmonic cleanup and when the spectral dimension is not last.

disk_radius: int = 3
harmonic_check(spectrum: DataArray) DataArray[source]

Notch broad harmonics when LaserWaveLength is ~355 nm (Nd:YAG).

If spectrum.attrs['LaserWaveLength'] is outside 354–356 nm, returns spectrum unchanged.

Searches 1064 / 532 / 355 / 266 nm (±2.5 nm); replaces ~1 nm around each found peak with linear interpolation. Prints one line per removal.

kernel_size: int = 5
map_mad_multiplier: float = 7.0
map_max_spectral_repair_extent: int | None = 12
map_min_residual_over_cutoff: float = 1.05
map_noisy_channel_relax_min: float = 0.82
map_require_spatial_local_max: bool = True
map_spectral_dilate_cap: int = 5
max_passes: int = 3
remove(spectrum: DataArray) DataArray[source]

Harmonic cleanup first, then cosmic-ray removal.

remove_cosmic_rays(spectrum: DataArray) DataArray[source]

Spike removal only (no harmonic notch).

remove_cosmic_rays_with_diagnostics(spectrum: DataArray) tuple[DataArray, dict[str, Any]][source]

Like remove_cosmic_rays(), but returns a diagnostics dict for visualization / QC (not written to DataArray.attrs).

For 3D maps, diagnostics includes boolean core_mask, repair_mask, and float arrays residual, preprocessed, spatial_median_reference, cutoff, per_spectrum_median, etc. Use matplotlib to overlay masks or compare spectra at selected (y, x).

For 2D single-spectrum input, the diagnostics dict contains cosmic_mask and corrected_1d (the 1D corrected intensity).

remove_with_diagnostics(spectrum: DataArray) tuple[DataArray, dict[str, Any]][source]

Harmonics, then remove_cosmic_rays_with_diagnostics().

sensitivity: float = 0.01
single_spectrum_method: Literal['median', 'interpolate', 'derivative'] = 'median'
spectral_dim: str | None = None
threshold: float = 5.0
transform(spectrum: DataArray) DataArray[source]

Alias of remove() (harmonics then cosmic rays).

width: float = 0.02
wdfkit.cosmic_ray.remove_cosmic_rays_1d(y: ndarray, method: Literal['median', 'interpolate', 'derivative'], *, kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3) tuple[ndarray, ndarray][source]

Remove sharp positive spikes from one 1D spectrum (PL-style).

Operates on the raw counts / intensity array (only masked indices change). Cosmic rays: positive excursions vs a robust noise model.

The algorithm runs up to max_passes iterations. Each pass:

  1. Detects new spikes on the current (already-repaired) signal.

  2. Dilates the new spike mask by 1 channel on each side to catch sub-threshold spike edges.

  3. Accumulates into a single cumulative mask across all passes.

  4. Repairs by linear interpolation from the original signal at all cumulative masked positions — avoids chaining interpolation errors.

Early termination when a pass finds no new spikes.

Parameters:
  • y – One spectral trace (any numeric dtype; cast to float).

  • method"median" or "interpolate"scipy.signal.medfilt reference signal, MAD on residual for detection (both methods now repair identically via linear interpolation). "derivative" — neighbour-difference test on diff(y) MAD; interior points only.

  • kernel_size – Odd length >= 3 for medfilt (median / interpolate methods).

  • threshold – Multiplier on MAD-derived noise (larger → fewer detections).

  • max_passes – Maximum number of detection–repair iterations (default 3). Use 1 for the old single-pass behaviour.

Returns:

  • corrected_y – Same shape as y; unchanged if no spikes are found or if noise is degenerate.

  • cosmic_mask – Boolean mask, same shape as y; True at all channels that were corrected (including dilation neighbours). All False when nothing was found or when the mask would cover the entire spectrum.

wdfkit.spectra_cleaner

High-level spectral denoising: SpectraCleaner.

Currently implements PCA-based reconstruction (legacy pca_clean); the method switch is kept so other denoisers can be added (e.g. Savitzky-Golay or wavelet) without breaking callers.

class wdfkit.spectra_cleaner.SpectraCleaner(method: Literal['pca'] = 'pca', n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, spectral_dim: str | None = None, pca_kwargs: dict[str, Any] = <factory>)[source]

Bases: object

Denoise a population of spectra by low-rank reconstruction.

Designed for 3D map cubes (ny, nx, n_spectral) and 2D stacks (n_spectra, n_spectral). PCA reconstruction needs more than one spectrum to separate shared signal from per-channel noise — a single spectrum is rejected with a clear error (use a 1D smoother instead).

Parameters:
  • method (Literal['pca']) – Denoising method. Currently only "pca" is implemented; the switch is kept for forward compatibility.

  • n_components (int | float | str | None) – Forwarded to sklearn.decomposition.PCA. "mle" (default), a float in (0, 1) for variance-explained, an int count, or None for min(n_spectra, n_spectral).

  • subtract_min (bool) – Subtract per-spectrum min before the fit (legacy default True). PCA also mean-centers internally, so this only changes the baseline offset fed to the fit.

  • restore_min (bool) – Add the saved per-spectrum min back after reconstruction. Off by default (legacy behavior); enable to preserve absolute intensities.

  • spectral_dim (str | None) – Name of the spectral axis in DataArray inputs. Defaults to the last dimension; pass when spectra are not last (e.g. "raman_shift" with leading spectral axis).

  • pca_kwargs (dict[str, Any]) – Extra kwargs forwarded to sklearn.decomposition.PCA (e.g. {"svd_solver": "full"}).

clean(spectra: DataArray) DataArray[source]

Return a denoised copy of spectra (no decomposition payload).

clean_with_decomposition(spectra: DataArray) tuple[DataArray, dict[str, Any]][source]

Like clean(), but also returns the PCA decomposition.

The returned decomposition dict has keys components (shape (n_components, n_spectral)), coeffs (per-spectrum scores reshaped to the input’s spatial layout + components axis), mean, explained_variance, explained_variance_ratio, and noise_variance. These arrays can be large — they’re returned separately rather than written to DataArray.attrs.

method: Literal['pca'] = 'pca'
n_components: int | float | str | None = 'mle'
pca_kwargs: dict[str, Any]
restore_min: bool = False
spectral_dim: str | None = None
subtract_min: bool = True
transform(spectra: DataArray) DataArray[source]

Alias of clean().

wdfkit.spectral

Spectral-axis naming from WiRE XLST unit enums.

class wdfkit.spectral.SpectralAxisSpec(dim_name: str, units: str)[source]

Bases: object

Resolved spectral coordinate used as xarray dimension name + coord attrs.

dim_name: str
units: str
wdfkit.spectral.resolve_spectral_axis(xlist_data_units: str, spectral_dim: str | None) SpectralAxisSpec[source]

Choose spectral coordinate dimension name and coord.attrs["units"].

Parameters:
  • xlist_data_units – DATA_UNITS label resolved from raw XLST (e.g. "Nanometre").

  • spectral_dim – None or "auto": derive the dim name from xlist_data_units. Any other string: force this dimension name (units still come from the table when known; unknown WiRE enums fall back to units="unknown").

Returns:

dim_name is safe as an xarray dimension identifier (ASCII tokens).

Return type:

SpectralAxisSpec

wdfkit.preprocessing

Spectral preprocessing (normalization).

Cosmic-ray removal: see wdfkit.cosmic_ray. PCA-based denoising: see wdfkit.spectra_cleaner.

wdfkit.preprocessing.denoise_spectra_pca(values: ndarray, *, n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, pca_kwargs: dict[str, Any] | None = None, return_decomposition: bool = False) tuple[ndarray, dict[str, Any]] | tuple[ndarray, dict[str, Any], dict[str, Any]][source]

Denoise a stack / cube of spectra by PCA reconstruction.

The input is reshaped to (n_spectra, n_spectral) for the fit, then reshaped back to the original spatial layout on return. PCA itself mean-centers internally; the optional per-spectrum min subtraction below only changes the baseline offset fed to the decomposition.

Parameters:
  • values – Array of shape (..., n_spectral). Typical inputs: (ny, nx, n_spectral) map cube, or (n_spectra, n_spectral) stack. Needs more than one spectrum (PCA on a single spectrum is degenerate).

  • n_components – Forwarded to sklearn.decomposition.PCA. "mle" (default) picks the number with Minka’s MLE; a float in (0, 1) keeps the components that explain that fraction of variance; an int fixes the count; None uses min(n_spectra, n_spectral).

  • subtract_min – If True (default, matches legacy pca_clean), subtract the per-spectrum minimum before the fit so PCA models the spectral shape rather than offsets.

  • restore_min – If True, add the saved per-spectrum minimum back to the cleaned output. Off by default to match legacy pca_clean; turn on to preserve absolute intensities.

  • pca_kwargs – Extra kwargs passed straight to sklearn.decomposition.PCA (e.g. {"svd_solver": "full"}).

  • return_decomposition – If True, also return a third dict with the components, per-spectrum coefficients, mean, and explained-variance arrays (large; not suitable for DataArray.attrs).

Returns:

  • cleaned – Same shape and dtype-flavor (float) as values.

  • meta – Small dict with the parameters actually used and summary stats — safe to attach to DataArray.attrs.

  • decomposition_payload – Only when return_decomposition=True. Has keys components, coeffs, mean, explained_variance, explained_variance_ratio, noise_variance.
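
A minimal ndarray-level sketch (synthetic cube; real data comes from wdfkit.read()):

    import numpy as np
    from wdfkit.preprocessing import denoise_spectra_pca

    cube = np.random.rand(32, 32, 512)   # (ny, nx, n_spectral) map cube
    cleaned, meta = denoise_spectra_pca(cube, n_components=0.99)
    assert cleaned.shape == cube.shape   # spatial layout is preserved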

wdfkit.preprocessing.normalize(input_spectra: DataArray | ndarray, method: str = 'robust_scale', *, spectral_dim: str | None = None, **kwargs) DataArray | ndarray[source]

Scale spectra along the spectral axis.

For xarray.DataArray input, the spectral axis defaults to the last dimension (e.g. nm, raman_shift, shifts, …). Pass spectral_dim to select another dimension when spectra are not last.

Dask-backed DataArrays are handled transparently:

  • Per-spectrum methods ("l1", "l2", "max", "min_max", "area"): processed chunk-by-chunk via xr.apply_ufunc — no data is loaded into RAM beyond the current chunk.

  • Global methods ("robust_scale", "wave_number"): require statistics across all spectra; the full array is computed first. A UserWarning is emitted so you know RAM is being used.

Parameters:
  • input_spectra – DataArray or 2D ndarray of shape (n_spectra, n_points).

  • method – One of "l1", "l2", "max", "min_max", "wave_number", "robust_scale", "area".

  • spectral_dim – Spectral dimension name when input_spectra is a DataArray.

  • x_values – Spectral abscissa for ndarray input (default arange(n_points)).

Returns:

Same type as input_spectra, with updated attrs["treatments"] for DataArray output.

wdfkit.preprocessing.normalize

Per-spectrum normalization (dynamic spectral coordinate).

wdfkit.preprocessing.normalize.normalize(input_spectra: DataArray | ndarray, method: str = 'robust_scale', *, spectral_dim: str | None = None, **kwargs) DataArray | ndarray[source]

Scale spectra along the spectral axis.

For xarray.DataArray input, the spectral axis defaults to the last dimension (e.g. nm, raman_shift, shifts, …). Pass spectral_dim to select another dimension when spectra are not last.

Dask-backed DataArrays are handled transparently:

  • Per-spectrum methods ("l1", "l2", "max", "min_max", "area"): processed chunk-by-chunk via xr.apply_ufunc — no data is loaded into RAM beyond the current chunk.

  • Global methods ("robust_scale", "wave_number"): require statistics across all spectra; the full array is computed first. A UserWarning is emitted so you know RAM is being used.

Parameters:
  • input_spectra – DataArray or 2D ndarray of shape (n_spectra, n_points).

  • method – One of "l1", "l2", "max", "min_max", "wave_number", "robust_scale", "area".

  • spectral_dim – Spectral dimension name when input_spectra is a DataArray.

  • x_values – Spectral abscissa for ndarray input (default arange(n_points)).

Returns:

Same type as input_spectra, with updated attrs["treatments"] for DataArray output.

wdfkit.preprocessing.pca_clean

PCA-based spectral denoising for stacks of spectra and 3D map cubes.

PCA decomposes a population of spectra into orthogonal components and reconstructs each spectrum from the leading ones. Components dominated by uncorrelated per-channel noise are dropped, so the reconstruction is a denoised version of the input. This requires more than one spectrum — see wdfkit.spectra_cleaner.SpectraCleaner for the user-facing API.

wdfkit.preprocessing.pca_clean.denoise_spectra_pca(values: ndarray, *, n_components: int | float | str | None = 'mle', subtract_min: bool = True, restore_min: bool = False, pca_kwargs: dict[str, Any] | None = None, return_decomposition: bool = False) tuple[ndarray, dict[str, Any]] | tuple[ndarray, dict[str, Any], dict[str, Any]][source]

Denoise a stack / cube of spectra by PCA reconstruction.

The input is reshaped to (n_spectra, n_spectral) for the fit, then reshaped back to the original spatial layout on return. PCA itself mean-centers internally; the optional per-spectrum min subtraction below only changes the baseline offset fed to the decomposition.

Parameters:
  • values – Array of shape (..., n_spectral). Typical inputs: (ny, nx, n_spectral) map cube, or (n_spectra, n_spectral) stack. Needs more than one spectrum (PCA on a single spectrum is degenerate).

  • n_components – Forwarded to sklearn.decomposition.PCA. "mle" (default) picks the number with Minka’s MLE; a float in (0, 1) keeps the components that explain that fraction of variance; an int fixes the count; None uses min(n_spectra, n_spectral).

  • subtract_min – If True (default, matches legacy pca_clean), subtract the per-spectrum minimum before the fit so PCA models the spectral shape rather than offsets.

  • restore_min – If True, add the saved per-spectrum minimum back to the cleaned output. Off by default to match legacy pca_clean; turn on to preserve absolute intensities.

  • pca_kwargs – Extra kwargs passed straight to sklearn.decomposition.PCA (e.g. {"svd_solver": "full"}).

  • return_decomposition – If True, also return a third dict with the components, per-spectrum coefficients, mean, and explained-variance arrays (large; not suitable for DataArray.attrs).

Returns:

  • cleaned – Same shape and dtype-flavor (float) as values.

  • meta – Small dict with the parameters actually used and summary stats — safe to attach to DataArray.attrs.

  • decomposition_payload – Only when return_decomposition=True. Has keys components, coeffs, mean, explained_variance, explained_variance_ratio, noise_variance.

wdfkit.preprocessing.cosmic_ray_1d

1D spectrum cosmic-ray (positive spike) removal.

wdfkit.preprocessing.cosmic_ray_1d.linear_interpolate_masked_channels_1d(y: ndarray, bad_channel_mask: ndarray) ndarray[source]

Fill masked channels by linear interpolation from good ones.
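
For example:

    import numpy as np
    from wdfkit.preprocessing.cosmic_ray_1d import linear_interpolate_masked_channels_1d

    y = np.array([1.0, 2.0, 99.0, 4.0, 5.0])
    bad = np.array([False, False, True, False, False])
    print(linear_interpolate_masked_channels_1d(y, bad))   # 99.0 becomes 3.0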

wdfkit.preprocessing.cosmic_ray_1d.positive_spike_mask_from_derivative_peaks(y: ndarray, threshold_multiplier: float) ndarray[source]

Interior i where y[i] is above both neighbors by threshold_multiplier * noise.

noise is scaled MAD of diff(y).

wdfkit.preprocessing.cosmic_ray_1d.positive_spike_mask_vs_median_smooth(y: ndarray, median_smoothed_y: ndarray, threshold_multiplier: float) tuple[ndarray, float][source]

Mask where positive residual exceeds threshold_multiplier * noise.

Residual is y - median_smoothed_y; noise is scaled MAD of residual.

wdfkit.preprocessing.cosmic_ray_1d.remove_cosmic_rays_1d(y: ndarray, method: Literal['median', 'interpolate', 'derivative'], *, kernel_size: int = 5, threshold: float = 5.0, max_passes: int = 3) tuple[ndarray, ndarray][source]

Remove sharp positive spikes from one 1D spectrum (PL-style).

Operates on the raw counts / intensity array (only masked indices change). Cosmic rays: positive excursions vs a robust noise model.

The algorithm runs up to max_passes iterations. Each pass:

  1. Detects new spikes on the current (already-repaired) signal.

  2. Dilates the new spike mask by 1 channel on each side to catch sub-threshold spike edges.

  3. Accumulates into a single cumulative mask across all passes.

  4. Repairs by linear interpolation from the original signal at all cumulative masked positions — avoids chaining interpolation errors.

Early termination when a pass finds no new spikes.

Parameters:
  • y – One spectral trace (any numeric dtype; cast to float).

  • method"median" or "interpolate"scipy.signal.medfilt reference signal, MAD on residual for detection (both methods now repair identically via linear interpolation). "derivative" — neighbour-difference test on diff(y) MAD; interior points only.

  • kernel_size – Odd length >= 3 for medfilt (median / interpolate methods).

  • threshold – Multiplier on MAD-derived noise (larger → fewer detections).

  • max_passes – Maximum number of detection–repair iterations (default 3). Use 1 for the old single-pass behaviour.

Returns:

  • corrected_y – Same shape as y; unchanged if no spikes are found or if noise is degenerate.

  • cosmic_mask – Boolean mask, same shape as y; True at all channels that were corrected (including dilation neighbours). All False when nothing was found or when the mask would cover the entire spectrum.

wdfkit.preprocessing.cosmic_ray_map

Spatial (3D map) cosmic-ray detection and replacement.

wdfkit.preprocessing.cosmic_ray_map.correct_cosmic_rays_on_map_cube(values: ndarray, *, sensitivity: float, spectral_width_fraction: float, disk_radius: int, map_mad_multiplier: float = 7.0, map_noisy_channel_relax_min: float = 0.82, map_spectral_dilate_cap: int = 5, map_max_spectral_repair_extent: int | None = 12, map_min_residual_over_cutoff: float = 1.05, map_require_spatial_local_max: bool = True, return_diagnostic_masks: bool = False) tuple[ndarray, dict[str, Any]] | tuple[ndarray, dict[str, Any], dict[str, Any]][source]

Spatial disk median on a per-spectrum normalized cube; robust positive residual test per wavelength.

Per channel λ, the cutoff is map_mad_multiplier * (0.01/sensitivity) * relax_λ * noise_λ, where noise_λ is scaled MAD of (preprocessed - spatial_median_reference) in the (y, x) plane, and relax_λ comes from map_noisy_channel_relax_min (noisy bands more sensitive).

Spectral dilation length is min(width×N, map_spectral_dilate_cap). After dilation, each contiguous True segment along λ at fixed (y, x) is clipped to at most map_max_spectral_repair_extent channels (None disables) so repair stays localized.

Detection uses residual > map_min_residual_over_cutoff * cutoff.

If map_require_spatial_local_max, a voxel must be a strict spatial maximum in its λ slice among 8 neighbours (reduces false positives).

Repair: dilate core hits along λ, then for each (y, x) interpolate masked samples along λ from spatial_median_reference[y, x, :]; unmasked λ keep preprocessed.

If return_diagnostic_masks is True, returns a third dict (large numpy arrays — do not put them in DataArray.attrs).
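
A numpy sketch of the per-channel cutoff described above. The 1.4826 normal-consistency factor for the scaled MAD and the constant relax_lambda are assumptions; the real relax_λ is derived per channel from map_noisy_channel_relax_min:

    import numpy as np

    def per_channel_cutoff(residual, sensitivity, map_mad_multiplier,
                           relax_lambda=1.0):
        # residual: (ny, nx, n_spectral) = preprocessed - spatial_median_reference
        med = np.median(residual, axis=(0, 1), keepdims=True)
        mad = np.median(np.abs(residual - med), axis=(0, 1))
        noise_lambda = 1.4826 * mad   # assumed scaled-MAD convention
        return map_mad_multiplier * (0.01 / sensitivity) * relax_lambda * noise_lambda

    # Core detection then follows: residual > map_min_residual_over_cutoff * cutoff.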

wdfkit.preprocessing.cosmic_ray_map.interpolate_cosmic_ray_regions_spectrally(preprocessed: ndarray, spatial_median_reference: ndarray, repair_mask: ndarray) ndarray[source]

Inpaint repair_mask points by interpolation along λ.

Reference curve is spatial_median_reference[y, x, :]; other channels keep original preprocessed values.

wdfkit.preprocessing.cosmic_ray_map.min_subtract_median_normalize_map_cube(values: ndarray) tuple[ndarray, ndarray][source]

Per spectrum: subtract min along λ, divide by median intensity.
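
A plausible numpy equivalent of this step (sketch only; the zero-median guard and the exact second return value are assumptions):

    import numpy as np

    def min_median_normalize(values):
        shifted = values - values.min(axis=-1, keepdims=True)  # subtract per-spectrum min
        scale = np.median(shifted, axis=-1, keepdims=True)     # per-spectrum median intensity
        scale = np.where(scale == 0, 1.0, scale)               # assumed zero guard
        return shifted / scale, scale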

wdfkit.preprocessing.cosmic_ray_map.unique_spatial_indices_from_nonzero(nonzero_axes: tuple[ndarray, ...], spatial_ndim: int) list[tuple[int, ...]][source]

Unique (y, x, …) from np.nonzero-style sparse index arrays.