A Gentle Introduction to xCDAT (Xarray Climate Data Analysis Tools)

“A Python package for simple and robust climate data analysis.”

Core Developers: Tom Vo, Stephen Po-Chedley, Jason Boutte, Jill Zhang, Jiwoo Lee

With thanks to Peter Gleckler, Paul Durack, Karl Taylor, and Chris Golaz

Updated: 03/18/25 [v0.8.0]

This work is performed under the auspices of the U.S. DOE by Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344.

Presentation Overview#

Intended audience: Users with little or no familiarity with xarray and/or xcdat

Driving force behind xCDAT
Goals and milestones of CDAT’s successor
Introducing xCDAT
Understanding the basics of Xarray
How xCDAT extends Xarray for climate data analysis
Technical design philosophy and APIs
Demo of capabilities
How to get involved

The data used in this example can be found in the xcdat-data repository.

Notebook Kernel Setup#

Users can install their own instance of xcdat and follow these examples in their own environment (e.g., with VS Code, Jupyter, Spyder, IPython) or enable xcdat in an existing JupyterHub instance.

First, create the conda environment:

conda create -n xcdat_notebook -c conda-forge xcdat xesmf matplotlib ipython ipykernel cartopy nc-time-axis gsw-xarray jupyter pooch

Then install the kernel from the xcdat_notebook environment using ipykernel and name the kernel with the display name (e.g., xcdat_notebook):

python -m ipykernel install --user --name xcdat_notebook --display-name xcdat_notebook

Then select the xcdat_notebook kernel in Jupyter.
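Once the kernel is selected, a quick sanity check can confirm that it sees the packages from the conda command above. The following is a minimal sketch that uses only the standard library; `check_packages` is a hypothetical helper name, and the list uses import names, which can differ from conda package names.

```python
import importlib.util

# Import names for some of the packages installed by the conda command above.
# Note: the import name can differ from the conda package name
# (e.g., nc-time-axis is imported as nc_time_axis).
PACKAGES = ["xcdat", "xesmf", "matplotlib", "cartopy", "nc_time_axis", "pooch"]


def check_packages(names):
    """Return a dict mapping each import name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_packages(PACKAGES).items():
    print(f"{name:15s} {'ok' if found else 'MISSING'}")
```

Any package reported as MISSING suggests the notebook is running on a different kernel than the xcdat_notebook environment.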

The Driving Force Behind xCDAT#

The CDAT (Community Data Analysis Tools) library has provided a suite of robust and comprehensive open-source climate data analysis and visualization packages for over 20 years
A driving need for a modern successor
- Focus on a maintainable and extensible library
- Serve the needs of the climate community in the long-term

Goals and Milestones for CDAT’s Successor#

Offer similar core capabilities
1. For example, geospatial averaging, temporal averaging, and regridding
Use modern technologies in the library’s stack
1. Support parallelism and lazy operations
Be maintainable, extensible, and easy-to-use
1. Python Enhancement Proposals (PEPs)
2. Automate DevOps processes (unit testing, code coverage)
3. Actively maintain documentation
Cultivate an open-source community that can sustain the project
1. Encourage GitHub contributions
2. Community engagement efforts (e.g., Pangeo, ESGF)

Introducing xCDAT#

xCDAT is an extension of xarray for climate data analysis on structured grids
Goal of providing features and utilities for simple and robust analysis of climate data
Jointly developed by scientists and developers from:
- E3SM Project (Energy Exascale Earth System Model Project)
- PCMDI (Program for Climate Model Diagnosis and Intercomparison)
- SEATS Project (Simplifying ESM Analysis Through Standards Project)
- Users around the world via GitHub

Before We Dive Deeper, Let’s Talk About Xarray#

Xarray is an evolution of an internal tool developed at The Climate Corporation
Released as open source in May 2014
NumFOCUS fiscally sponsored project since August 2018

Key Features and Capabilities in Xarray#

“N-D labeled arrays and datasets in Python”
- Built upon and extends NumPy and pandas
Interoperable with scientific Python ecosystem including NumPy, Dask, Pandas, and Matplotlib
Supports file I/O, indexing and selecting, interpolating, grouping, aggregating, parallelism (Dask), plotting (matplotlib wrapper)
- Supported formats include: netCDF, Iris, OPeNDAP, Zarr, and GRIB

Source: https://xarray.dev/#features

Why use Xarray?#

“Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.”

—https://xarray.pydata.org/en/v2022.10.0/getting-started-guide/why-xarray.html

Apply operations over dimensions by name
- x.sum('time')
Select values by label (or logical location) instead of integer location
- x.loc['2014-01-01'] or x.sel(time='2014-01-01')
Mathematical operations vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape
- x - y
Easily use the split-apply-combine paradigm with groupby
- x.groupby('time.dayofyear').mean()
Database-like alignment based on coordinate labels that smoothly handles missing values
- x, y = xr.align(x, y, join='outer')
Keep track of arbitrary metadata in the form of a Python dictionary
- x.attrs

Source: https://docs.xarray.dev/en/v2022.10.0/getting-started-guide/why-xarray.html#what-labels-enable
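The bullet points above can be tried on a small synthetic array. This is a minimal sketch: the data values, dates, and variable names are made up for illustration, and only the xarray calls already named in the list are used.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily data: 10 days x 3 latitudes (values are made up).
time = pd.date_range("2014-01-01", periods=10, freq="D")
x = xr.DataArray(
    np.arange(30.0).reshape(10, 3),
    dims=("time", "lat"),
    coords={"time": time, "lat": [-30.0, 0.0, 30.0]},
    attrs={"units": "K"},
)

# Reduce over a dimension by name rather than by axis number.
total = x.sum("time")

# Select by label instead of integer position.
first_day = x.sel(time="2014-01-01")

# Broadcasting is driven by dimension names, not array shape.
y = xr.DataArray([1.0, 2.0, 3.0], dims="lat", coords={"lat": [-30.0, 0.0, 30.0]})
diff = x - y

# Split-apply-combine with groupby.
by_day = x.groupby("time.dayofyear").mean()

# Outer alignment pads labels missing from one input with NaN.
a, b = xr.align(x.isel(time=slice(0, 6)), x.isel(time=slice(4, 10)), join="outer")

# Arbitrary metadata lives in a plain dictionary.
print(total.dims, diff.dims, a.sizes["time"], x.attrs)
```

Note that `y` has no time dimension, yet `x - y` works because xarray matches the two arrays on the shared `lat` dimension.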

Resources for Learning Xarray#

Here are some highly recommended resources:

xCDAT Extends Xarray for Climate Data Analysis#

Some key xCDAT features are inspired by or ported from the core CDAT library
- e.g., spatial averaging, temporal averaging, regrid2 for horizontal regridding
Other features leverage powerful libraries in the xarray ecosystem
- xESMF for horizontal regridding
- xgcm for vertical interpolation
- CF-xarray for CF convention metadata interpretation
xCDAT strives to support CF-compliant datasets as well as datasets with common non-CF-compliant metadata (e.g., time units in “months since …” or “years since …”)
Inherent support for lazy operations and parallelism through xarray + dask
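To illustrate why “months since …” units need special handling, here is a minimal standard-library sketch of decoding whole-month offsets by stepping through calendar months. It is not xCDAT's implementation: `decode_months_since` is a hypothetical helper that ignores calendars, fractional offsets, and time of day, and it assumes the reference day exists in every month (e.g., day 1).

```python
from datetime import datetime


def decode_months_since(offsets, units):
    """Decode integer month offsets for units like 'months since 2000-01-01'.

    Hypothetical helper for illustration only. Assumes whole-month offsets,
    the standard calendar, and a reference day that is valid in every month.
    """
    prefix = "months since "
    if not units.startswith(prefix):
        raise ValueError(f"unsupported units: {units!r}")
    ref = datetime.strptime(units[len(prefix):].strip(), "%Y-%m-%d")

    decoded = []
    for n in offsets:
        # Count months from year 0 so that divmod handles the year rollover.
        total = ref.year * 12 + (ref.month - 1) + int(n)
        year, month0 = divmod(total, 12)
        decoded.append(datetime(year, month0 + 1, ref.day))
    return decoded


times = decode_months_since([0, 1, 12, 13], "months since 2000-01-01")
print([t.strftime("%Y-%m-%d") for t in times])
# ['2000-01-01', '2000-02-01', '2001-01-01', '2001-02-01']
```

A month is not a fixed number of days, so the offsets cannot be converted with a constant time delta; stepping through the calendar avoids that problem.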

The Technical Design Philosophy#

Streamline the user experience of developing code to analyze climate data
Reduce the complexity and overhead for implementing certain features with xarray (e.g., temporal averaging, spatial averaging)
Encourage reusable functionalities through a single library

Leveraging the APIs#

xCDAT provides public APIs in two ways:

Top-level API functions
- e.g., xcdat.open_dataset(), xcdat.center_times()
- Usually for opening datasets and performing dataset level operations
Accessor classes
- xcdat provides Dataset accessors, which are implicit namespaces for custom functionality.
- Accessor namespaces clearly separate xcdat functionality from built-in xarray methods.
- Operate on variables within the xr.Dataset
- e.g., ds.spatial, ds.temporal, ds.regridder

Figure: the xcdat accessor. Spatial functionality is exposed by chaining the .spatial accessor attribute to the xr.Dataset object.

Source: https://xcdat.readthedocs.io/en/latest/api.html
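For readers new to accessors, the mechanism can be sketched with xarray's public `register_dataset_accessor` decorator. The `demo` namespace, the `DemoAccessor` class, and its `global_mean` method are hypothetical names invented for this sketch; they are not part of xarray or xCDAT.

```python
import numpy as np
import xarray as xr


# "demo" is a hypothetical namespace chosen for this sketch; it is not part of
# xarray or xCDAT.
@xr.register_dataset_accessor("demo")
class DemoAccessor:
    def __init__(self, dataset: xr.Dataset):
        # xarray passes the Dataset in when `ds.demo` is first accessed.
        self._dataset = dataset

    def global_mean(self, data_var: str) -> xr.DataArray:
        """Unweighted mean of one variable over all of its dimensions."""
        return self._dataset[data_var].mean()


ds = xr.Dataset({"tas": (("lat", "lon"), np.array([[280.0, 282.0], [284.0, 286.0]]))})

# Custom functionality is reached through the namespace, not a built-in method.
print(float(ds.demo.global_mean("tas")))  # 283.0
```

The namespace keeps add-on methods visibly separate from xarray's own, which is the same pattern as `ds.spatial`, `ds.temporal`, and `ds.regridder`.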

Key Features in xCDAT#

Feature	API	Description
Extend `xr.open_dataset()` and `xr.open_mfdataset()`	`open_dataset()`, `open_mfdataset()`	Bounds generation; time decoding (CF and select non-CF time units); centering of time coordinates; conversion of longitudinal axis orientation
Temporal averaging	`ds.temporal.average()`, `ds.temporal.group_average()`, `ds.temporal.climatology()`, `ds.temporal.departures()`	Single snapshot and group average; climatology and departure; weighted or unweighted; optional seasonal configuration (e.g., custom seasons)
Geospatial averaging	`ds.spatial.average()`	Rectilinear grids; weighted; optional specification of region domain
Horizontal regridding	`ds.regridder.horizontal()`	Rectilinear and curvilinear grids; extends xESMF horizontal regridding; Python implementation of regrid2
Vertical regridding	`ds.regridder.vertical()`	Transform vertical coordinates; extends xgcm vertical interpolation; linear, logarithmic, and conservative interpolation; decode parametric vertical coordinates if required
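As a reference point for the climatology and departure entries in the table, the sketch below computes an unweighted monthly climatology and its departures with plain xarray `groupby`. It is only an illustration of the concepts on made-up data; it does not call the xCDAT temporal APIs and omits the weighting and seasonal options they provide.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of synthetic monthly data: a repeating annual cycle plus a +1 K
# offset in the second year (values are made up).
time = pd.date_range("2000-01-01", periods=24, freq="MS")
cycle = np.tile(np.arange(12.0), 2)
offset = np.repeat([0.0, 1.0], 12)
tas = xr.DataArray(cycle + offset, dims="time", coords={"time": time}, name="tas")

# Climatology: the mean of each calendar month across all years.
climatology = tas.groupby("time.month").mean("time")

# Departures: each time step minus the climatological value for its month.
departures = tas.groupby("time.month") - climatology

print(climatology.values[:3])      # January to March climatology
print(departures.values[[0, 12]])  # January of year 1 and year 2
```

With this input, every month in the first year sits 0.5 K below its climatological value and every month in the second year sits 0.5 K above it.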

Parallelism with Dask#

Nearly all existing xarray methods have been extended to work automatically with Dask arrays for parallelism

—https://docs.xarray.dev/en/stable/user-guide/dask.html#using-dask-with-xarray

Parallelized xarray methods include indexing, computation, concatenating and grouped operations
xCDAT APIs that build upon xarray methods inherently support Dask parallelism
- Dask arrays are loaded into memory only when absolutely required (e.g., decoding time, handling bounds)

High-level Overview of Dask Mechanics#

Dask divides arrays into many small pieces, called “chunks” (each presumed to be small enough to fit into memory)
Dask array operations are lazy
- Operations queue up a series of tasks mapped over blocks
- No computation is performed until values need to be computed (lazy)
- Data is loaded into memory and computation is performed in streaming fashion, block-by-block
Computation is controlled by a multiprocessing or thread pool

Source: https://docs.xarray.dev/en/stable/user-guide/dask.html
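The mechanics above can be mimicked with a toy model built from the standard library. This is not how Dask is implemented: `make_chunks` and `LazySum` are hypothetical names, and the sketch only shows the three ideas of chunking, deferring work as a list of tasks, and running those tasks on a thread pool when a result is requested.

```python
from concurrent.futures import ThreadPoolExecutor


def make_chunks(values, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


class LazySum:
    """Toy stand-in for a lazy, chunked reduction (not Dask's actual design)."""

    def __init__(self, chunks):
        # Building the object only records one task per chunk; nothing runs yet.
        self.tasks = [lambda c=c: sum(c) for c in chunks]
        self.computed = False

    def compute(self, max_workers=2):
        # Tasks run only now, spread over a thread pool, then get combined.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            partials = list(pool.map(lambda task: task(), self.tasks))
        self.computed = True
        return sum(partials)


data = list(range(1, 101))
lazy = LazySum(make_chunks(data, chunk_size=25))
print(len(lazy.tasks), lazy.computed)  # 4 False
print(lazy.compute())                  # 5050
```

Each task touches only its own chunk, which is why the full input never has to be processed in one piece.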

How do I activate Dask with Xarray/xCDAT?#

The usual way to create a Dataset filled with Dask arrays is to load the data from a netCDF file or files
You can do this by supplying a chunks argument to open_dataset() or using the open_mfdataset() function
- By default, open_mfdataset() will chunk each netCDF file into a single Dask array
- Supply the chunks argument to control the size of the resulting Dask arrays
- Xarray keeps the data as Dask arrays wherever it can; when an operation cannot be done lazily, it raises an exception instead of silently loading the data into memory

Source: https://docs.xarray.dev/en/stable/user-guide/dask.html#reading-and-writing-data
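When no netCDF file is at hand, the same Dask-backed state can be reached by calling `Dataset.chunk()` on an in-memory dataset. The sketch below assumes Dask is installed and uses made-up data; the chunk size of 12 time steps is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic in-memory dataset (values are made up): 24 months on a 4 x 8 grid.
time = pd.date_range("2000-01-01", periods=24, freq="MS")
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.ones((24, 4, 8)))},
    coords={"time": time},
)

# Convert the NumPy-backed variables to Dask arrays, 12 time steps per chunk.
ds_chunked = ds.chunk({"time": 12})
print(ds_chunked["tas"].chunks)  # ((12, 12), (4,), (8,))

# The reduction is lazy: this builds a task graph without computing values.
lazy_mean = ds_chunked["tas"].mean("time")
print(lazy_mean.chunks is not None)  # True, still backed by a Dask array

# .compute() runs the graph and returns a NumPy-backed result.
result = lazy_mean.compute()
print(result.chunks is None, float(result.max()))  # True 1.0
```

Checking `.chunks` is a quick way to tell whether a variable is still lazy: it is `None` for NumPy-backed data.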

[15]:

import xcdat as xc

# Pass the `chunks` argument to open the data as Dask arrays.
# NOTE: `open_mfdataset()` automatically chunks by the number of files, which
# might not be optimal.
ds = xc.tutorial.open_dataset("tas_amon_access", chunks={"time": "auto"})

[16]:

ds

[16]:

<xarray.Dataset> Size: 7MB
Dimensions:    (time: 60, bnds: 2, lat: 145, lon: 192)
Coordinates:
  * lat        (lat) float64 1kB -90.0 -88.75 -87.5 -86.25 ... 87.5 88.75 90.0
  * lon        (lon) float64 2kB 0.0 1.875 3.75 5.625 ... 354.4 356.2 358.1
    height     float64 8B ...
  * time       (time) object 480B 1870-01-16 12:00:00 ... 1874-12-16 12:00:00
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object 960B dask.array<chunksize=(60, 2), meta=np.ndarray>
    lat_bnds   (lat, bnds) float64 2kB dask.array<chunksize=(145, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 3kB dask.array<chunksize=(192, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 7MB dask.array<chunksize=(60, 145, 192), meta=np.ndarray>
Attributes: (12/48)
    Conventions:                     CF-1.7 CMIP-6.2
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            0.0
    branch_time_in_parent:           87658.0
    creation_date:                   2020-06-05T04:06:11Z
    ...                              ...
    variant_label:                   r10i1p1f1
    version:                         v20200605
    license:                         CMIP6 model data produced by CSIRO is li...
    cmor_version:                    3.4.0
    tracking_id:                     hdl:21.14100/af78ae5e-f3a6-4e99-8cfe-5f2...
    DODS_EXTRA.Unlimited_Dimension:  time

Example of Parallelism in xCDAT’s Spatial Averager#

This is a demonstration that chunked Dataset objects work with xCDAT APIs.

The generation of weights is serial
The weighted average operation should be parallelized (uses xarray’s .weighted() API)
We intend to collect performance metrics and provide guidance on when to chunk
- For now, visit xCDAT/xcdat#376 for best practices

Queue up the spatial average operation in the Dask task graph

[17]:

tas_global = ds.spatial.average("tas", axis=["X", "Y"], weights="generate")["tas"]
tas_global

[17]:

<xarray.DataArray 'tas' (time: 60)> Size: 480B
dask.array<truediv, shape=(60,), dtype=float64, chunksize=(60,), chunktype=numpy.ndarray>
Coordinates:
    height   float64 8B ...
  * time     (time) object 480B 1870-01-16 12:00:00 ... 1874-12-16 12:00:00
Attributes:
    standard_name:  air_temperature
    long_name:      Near-Surface Air Temperature
    comment:        near-surface (usually, 2 meter) air temperature
    units:          K
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2020-06-05T04:06:10Z altered by CMOR: Treated scalar dime...
    _ChunkSizes:    [  1 145 192]

Trigger the computation by calling .compute() (or .load())

[18]:

tas_global.compute()
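To make the weighting step more concrete, here is a minimal sketch of an area-weighted mean using xarray's `.weighted()` API on a small synthetic grid. It assumes cos(latitude) weights, a common approximation; it is not xCDAT's weight generation, which this presentation does not detail, and the grid and values are made up.

```python
import numpy as np
import xarray as xr

lat = np.array([-60.0, 0.0, 60.0])
lon = np.array([0.0, 90.0, 180.0, 270.0])

# Synthetic field whose value depends only on latitude (values are made up).
values = np.array([1.0, 2.0, 6.0])[:, None] * np.ones((3, 4))
tas = xr.DataArray(values, dims=("lat", "lon"), coords={"lat": lat, "lon": lon})

# Assumption: cos(latitude) approximates the relative area of each grid cell.
weights = np.cos(np.deg2rad(tas["lat"]))

weighted_mean = tas.weighted(weights).mean(("lat", "lon"))
unweighted_mean = tas.mean(("lat", "lon"))

# The weighted mean (about 2.75) is lower than the unweighted mean (3.0)
# because the high-latitude rows count for less area.
print(float(weighted_mean), float(unweighted_mean))
```

If `tas` were backed by Dask arrays, the same `.weighted(...).mean(...)` call would stay lazy until `.compute()` is called.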

Further Dask Guidance#

Visit these pages for more guidance (e.g., when to parallelize):

Parallel computing with Dask (xCDAT): https://xcdat.readthedocs.io/en/latest/examples/parallel-computing-with-dask.html
Parallel computing with Dask (Xarray): https://docs.xarray.dev/en/stable/user-guide/dask.html
Xarray with Dask Arrays: https://examples.dask.org/xarray.html

Key Takeaways#

A driving need for a modern successor to CDAT
Serves the climate community in the long-term
xCDAT is an extension of xarray for climate data analysis on structured grids
Goal of providing features and utilities for simple and robust analysis of climate data

Get Involved on GitHub!#

Code contributions are welcome and appreciated
- GitHub Repository: xCDAT/xcdat
- Contributing Guide: https://xcdat.readthedocs.io/en/latest/contributing.html
Submit and/or address tickets for feature suggestions, bugs, and documentation updates
- GitHub Issues: xCDAT/xcdat#issues
Participate in forum discussions on version releases, architecture, feature suggestions, etc.
- GitHub Discussions: xCDAT/xcdat#discussions
