LLNL Climate and Weather Seminar Series (01/25/2023) - A Gentle Introduction to xCDAT#
“A Python package for simple and robust climate data analysis.”
Core Developers: Tom Vo, Stephen Po-Chedley, Jason Boutte, Jill Zhang, Jiwoo Lee
With thanks to Peter Gleckler, Paul Durack, Karl Taylor, and Chris Golaz
This work is performed under the auspices of the U. S. DOE by Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344.
Presentation Overview#
Intended audience: Some or no familiarity with xarray
and/or xcdat
Driving force behind xCDAT
Goals and milestones of CDAT’s successor
Introducing xCDAT
Understanding the basics of Xarray
How xCDAT extends Xarray for climate data analysis
Technical design philosophy and APIs
Demo of capabilities
How to get involved
Notebook Setup#
Create an Anaconda environment for this notebook using the command below:
conda create -n xcdat -c conda-forge xarray xcdat xesmf matplotlib nc-time-axis jupyter
xesmf
is required for horizontal regridding with xESMFmatplotlib
is an optional dependency required for plotting with xarraync-time-axis
is an optional dependency required formatplotlib
to plotcftime
coordinates
The Driving Force Behind xCDAT#
The CDAT (Community Data Analysis Tools) library has provided a suite of robust and comprehensive open-source climate data analysis and visualization packages for over 20 years
A driving need for a modern successor
Focus on a maintainable and extensible library
Serve the needs of the climate community in the long-term
Goals and Milestones for CDAT’s Successor#
Offer similar core capabilities
For example geospatial averaging, temporal averaging, and regridding
Use modern technologies in the library’s stack
Support parallelism and lazy operations
Be maintainable, extensible, and easy-to-use
Python Enhancement Proposals (PEPs)
Automate DevOps processes (unit testing, code coverage)
Actively maintain documentation
Cultivate an open-source community that can sustain the project
Encourage GitHub contributions
Community engagement efforts (e.g., Pangeo, ESGF)
Introducing xCDAT#
xCDAT is an extension of xarray for climate data analysis on structured grids
Goal of providing features and utilities for simple and robust analysis of climate data
Jointly developed by scientists and developers from:
E3SM Project (Energy Exascale Earth System Model Project)
PCMDI (Program for Climate Model Diagnosis and Intercomparison)
SEATS Project (Simplifying ESM Analysis Through Standards Project)
Users around the world via GitHub
Before We Dive Deeper, Let’s Talk About Xarray#
Xarray is an evolution of an internal tool developed at The Climate Corporation
Released as open source in May 2014
NumFocus fiscally sponsored project since August 2018
Key Features and Capabilities in Xarray#
“N-D labeled arrays and datasets in Python”
Built upon and extends NumPy and pandas
Interoperable with scientific Python ecosystem including NumPy, Dask, Pandas, and Matplotlib
Supports file I/O, indexing and selecting, interpolating, grouping, aggregating, parallelism (Dask), plotting (matplotlib wrapper)
Supported formats include: netCDF, Iris, OPeNDAP, Zarr, and GRIB
Source: https://xarray.dev/#features
Why use Xarray?#
“Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.”
—https://xarray.pydata.org/en/v2022.10.0/getting-started-guide/why-xarray.html
Apply operations over dimensions by name
x.sum('time')
Select values by label (or logical location) instead of integer location
x.loc['2014-01-01']
orx.sel(time='2014-01-01')
Mathematical operations vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape
x - y
Easily use the split-apply-combine paradigm with groupby
x.groupby('time.dayofyear').mean()
.
Database-like alignment based on coordinate labels that smoothly handles missing values
x, y = xr.align(x, y, join='outer')
Keep track of arbitrary metadata in the form of a Python dictionary
x.attrs
Source: https://docs.xarray.dev/en/v2022.10.0/getting-started-guide/why-xarray.html#what-labels-enable
The Xarray Data Models#
“Xarray data models are borrowed from netCDF file format, which provides xarray with a natural and portable serialization format.”
—https://docs.xarray.dev/en/v2022.10.0/getting-started-guide/why-xarray.html
``xarray.Dataset``
A dictionary-like container of DataArray objects with aligned dimensions
DataArray objects are classified as “coordinate variables” or “data variables”
All data variables have a shared union of coordinates
Serves a similar purpose to a
pandas.DataFrame
``xarray.DataArray``
A class that attaches dimension names, coordinates, and attributes to multi-dimensional arrays (aka “labeled arrays”)
An N-D generalization of a
pandas.Series
Exploring the Xarray Data Models#
Example dataset: tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc
Open up this real dataset from ESGF using xarray’s OPeNDAP support.
Contains the
tas
variable (near-surface air temperature) recorded on a monthly frequency
It is not downloaded until calculations/computations are performed on the
Dataset
objectExample of an xarray lazy operation
[1]:
# This style import is necessary to properly render Xarray's HTML output with
# the Jupyer RISE extension.
# GitHub Issue: https://github.com/damianavila/RISE/issues/594
# Source: https://github.com/smartass101/xarray-pydata-prague-2020/blob/main/rise.css
from IPython.core.display import HTML
style = """
<style>
.reveal pre.xr-text-repr-fallback {
display: none;
}
.reveal ul.xr-sections {
display: grid
}
.reveal ul ul.xr-var-list {
display: contents
}
</style>
"""
HTML(style)
[1]:
[2]:
import xarray as xr
filepath = "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"
ds = xr.open_dataset(filepath)
The Dataset
Model#
[3]:
ds
[3]:
<xarray.Dataset> Dimensions: (time: 1980, bnds: 2, lat: 145, lon: 192) Coordinates: * time (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00 * lat (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0 * lon (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1 height float64 ... Dimensions without coordinates: bnds Data variables: time_bnds (time, bnds) datetime64[ns] ... lat_bnds (lat, bnds) float64 ... lon_bnds (lon, bnds) float64 ... tas (time, lat, lon) float32 ... Attributes: (12/48) Conventions: CF-1.7 CMIP-6.2 activity_id: CMIP branch_method: standard branch_time_in_child: 0.0 branch_time_in_parent: 87658.0 creation_date: 2020-06-05T04:06:11Z ... ... variant_label: r10i1p1f1 version: v20200605 license: CMIP6 model data produced by CSIRO is li... cmor_version: 3.4.0 tracking_id: hdl:21.14100/af78ae5e-f3a6-4e99-8cfe-5f2... DODS_EXTRA.Unlimited_Dimension: time
A dictionary-like container of labeled arrays (DataArray objects) with aligned dimensions.
Key properties:
dims
: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {‘x’: 6, ‘y’: 6, ‘time’: 8})coords
: a dict-like container of DataArrays intended to label points used in ``data_vars`` (e.g., arrays of numbers, datetime objects or strings)data_vars
: a dict-like container of DataArrays corresponding to variablesattrs
: dict to hold arbitrary metadata
Source: https://docs.xarray.dev/en/stable/user-guide/data-structures.html#dataset
The DataArray
Model#
[4]:
ds.tas
[4]:
<xarray.DataArray 'tas' (time: 1980, lat: 145, lon: 192)> [55123200 values with dtype=float32] Coordinates: * time (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00 * lat (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0 * lon (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1 height float64 ... Attributes: standard_name: air_temperature long_name: Near-Surface Air Temperature comment: near-surface (usually, 2 meter) air temperature units: K cell_methods: area: time: mean cell_measures: area: areacella history: 2020-06-05T04:06:10Z altered by CMOR: Treated scalar dime... _ChunkSizes: [ 1 145 192]
A class that attaches dimension names, coordinates, and attributes to multi-dimensional arrays (aka “labeled arrays”)
Key properties:
values
: a numpy.ndarray holding the array’s valuesdims
: dimension names for each axis (e.g., (‘x’, ‘y’, ‘z’))coords
: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)attrs
: dict to hold arbitrary metadata (attributes)
Source: https://docs.xarray.dev/en/stable/user-guide/data-structures.html#dataarray
Resources for Learning Xarray#
Here are some highly recommended resources:
xCDAT Extends Xarray for Climate Data Analysis#
Some key xCDAT features are inspired by or ported from the core CDAT library
e.g., spatial averaging, temporal averaging, regrid2 for horizontal regridding
Other features leverage powerful libraries in the xarray ecosystem
xESMF for horizontal regridding
xgcm for vertical interpolation
CF-xarray for CF convention metadata interpretation
xCDAT strives to support datasets CF compliant and common non-CF compliant metadata (time units in “months since …” or “years since …”)
Inherent support for lazy operations and parallelism through xarray + dask
The Technical Design Philosophy#
Streamline the user experience of developing code to analyze climate data
Reduce the complexity and overhead for implementing certain features with xarray (e.g., temporal averaging, spatial averaging)
Encourage reusable functionalities through a single library
Leveraging the APIs#
xCDAT provides public APIs in two ways:
Top-level APIs functions
e.g.,
xcdat.open_dataset()
,xcdat.center_times()
Usually for opening datasets and performing dataset level operations
Accessor classes
xcdat provides
Dataset
accessors, which are implicit namespaces for custom functionality.Accessor namespaces clearly identifies separation from built-in xarray methods.
Operate on variables within the
xr.Dataset
e.g.,
ds.spatial
,ds.temporal
,ds.regridder
xcdat spatial functionality is exposed by chaining the .spatial accessor attribute to the xr.Dataset object.
Key Features in xCDAT#
Feature | API | Description |
Extend |
|
|
Temporal averaging |
|
|
Geospatial averaging |
|
|
Horizontal regridding |
|
|
Vertical regridding |
|
|
A Demo of xCDAT Capabilities#
Prerequisites
Installing
xcdat
Import
xcdat
Open a dataset and apply postprocessing operations
Scenario 1 - Calculate the spatial averages over the tropical region
Scenario 2 - Calculate temporal average
Scenario 3 - Horizontal regridding (bilinear, gaussian grid)
Installing xcdat
#
xCDAT is available on Anaconda under the conda-forge
channel (https://anaconda.org/conda-forge/xcdat)
Two ways to install xcdat
with recommended dependencies (xesmf
):
Create a conda environment from scratch (
conda create
)conda create -n <ENV_NAME> -c conda-forge xcdat xesmf conda activate <ENV_NAME>
Install
xcdat
in an existing conda environment (conda install
)conda activate <ENV_NAME> conda install -c conda-forge xcdat xesmf
Source: https://xcdat.readthedocs.io/en/latest/getting-started.html
Opening a dataset#
Example dataset: tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc
(same as before)
[5]:
# This gives access to all xcdat public top-level APIs and accessor classes.
import xcdat as xc
# We import these packages specifically for plotting. It is not required to use xcdat.
import matplotlib.pyplot as plt
import pandas as pd
filepath = "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"
ds = xc.open_dataset(
filepath,
add_bounds=True,
decode_times=True,
center_times=True
)
# Unit adjustment from Kelvin to Celcius.
ds["tas"] = ds.tas - 273.15
[6]:
ds
[6]:
<xarray.Dataset> Dimensions: (time: 1980, bnds: 2, lat: 145, lon: 192) Coordinates: * time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00 * lat (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0 * lon (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1 height float64 2.0 Dimensions without coordinates: bnds Data variables: time_bnds (time, bnds) object ... lat_bnds (lat, bnds) float64 ... lon_bnds (lon, bnds) float64 ... tas (time, lat, lon) float32 -27.19 -27.19 -27.19 ... -25.29 -25.29 Attributes: (12/48) Conventions: CF-1.7 CMIP-6.2 activity_id: CMIP branch_method: standard branch_time_in_child: 0.0 branch_time_in_parent: 87658.0 creation_date: 2020-06-05T04:06:11Z ... ... variant_label: r10i1p1f1 version: v20200605 license: CMIP6 model data produced by CSIRO is li... cmor_version: 3.4.0 tracking_id: hdl:21.14100/af78ae5e-f3a6-4e99-8cfe-5f2... DODS_EXTRA.Unlimited_Dimension: time
Scenario 1: Spatial Averaging#
Related accessor: ds.spatial
In this example, we calculate the spatial average of tas
over the tropical region and plot the first 100 time steps.
[7]:
ds_trop_avg = ds.spatial.average("tas", axis=["X","Y"], lat_bounds=(-25,25))
ds_trop_avg.tas
[7]:
<xarray.DataArray 'tas' (time: 1980)> array([25.24722608, 25.61795924, 25.96516235, ..., 26.79536823, 26.67771602, 26.27182383]) Coordinates: * time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00 height float64 2.0
[8]:
ds_trop_avg.tas.isel(time=slice(1, 100)).plot()
[8]:
[<matplotlib.lines.Line2D at 0x1441e60b0>]

Scenario 2: Calculate temporal average#
Related accessor: ds.temporal
In this example, we calculate the temporal average of ``tas`` as a single snapshot (time dimension is collapsed).
[9]:
ds_avg = ds.temporal.average("tas", weighted=True)
ds_avg.tas
[9]:
<xarray.DataArray 'tas' (lat: 145, lon: 192)> array([[-48.01481628, -48.01481628, -48.01481628, ..., -48.01481628, -48.01481628, -48.01481628], [-44.94085363, -44.97948214, -45.01815398, ..., -44.82408252, -44.86273067, -44.9009281 ], [-44.11875274, -44.23060624, -44.33960158, ..., -43.76766492, -43.88593717, -44.00303006], ..., [-18.21076615, -18.17513373, -18.13957458, ..., -18.32720478, -18.28428828, -18.2486193 ], [-18.50778243, -18.49301854, -18.47902819, ..., -18.55410851, -18.5406963 , -18.52413098], [-19.07366375, -19.07366375, -19.07366375, ..., -19.07366375, -19.07366375, -19.07366375]]) Coordinates: * lat (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0 * lon (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1 height float64 2.0 Attributes: operation: temporal_avg mode: average freq: month weighted: True
[10]:
ds_avg.tas.plot(label="weighted")
[10]:
<matplotlib.collections.QuadMesh at 0x1443d9f30>

Scenario 3: Horizontal Regridding#
Related accessor: ds.regridder
In this example, we will generate a gaussian grid with 32 latitudes to regrid our input data to.
Create the output grid#
[11]:
output_grid = xc.create_gaussian_grid(32)
output_grid
[11]:
<xarray.Dataset> Dimensions: (lat: 32, bnds: 2, lon: 65) Coordinates: * lat (lat) float64 85.76 80.27 74.74 69.21 ... -74.74 -80.27 -85.76 * lon (lon) float64 0.0 5.625 11.25 16.88 ... 343.1 348.8 354.4 360.0 Dimensions without coordinates: bnds Data variables: lat_bnds (lat, bnds) float64 90.0 83.21 83.21 77.61 ... -83.21 -83.21 -90.0 lon_bnds (lon, bnds) float64 -2.812 2.812 2.812 8.438 ... 357.2 357.2 362.8
Plot the Input vs. Output Grid#
[12]:
fig, axes = plt.subplots(ncols=2, figsize=(16, 6))
input_grid = ds.regridder.grid
input_grid.plot.scatter(x='lon', y='lat', s=5, ax=axes[0], add_colorbar=False, cmap=plt.cm.RdBu)
axes[0].set_title('Input Grid')
output_grid.plot.scatter(x='lon', y='lat', s=5, ax=axes[1], add_colorbar=False, cmap=plt.cm.RdBu)
axes[1].set_title('Output Grid')
plt.tight_layout()

Regrid the data#
xCDAT offers horizontal regridding with xESMF
(default) and a Python port of regrid2
. We will be using xESMF
to regrid.
[13]:
# xesmf supports "bilinear", "conservative", "nearest_s2d", "nearest_d2s", and "patch"
output = ds.regridder.horizontal('tas', output_grid, tool='xesmf', method='bilinear')
[14]:
fig, axes = plt.subplots(ncols=2, figsize=(16, 4))
ds.tas.isel(time=0).plot(ax=axes[0])
axes[0].set_title('Input data')
output.tas.isel(time=0).plot(ax=axes[1])
axes[1].set_title('Output data')
plt.tight_layout()

Parallelism with Dask#
Nearly all existing xarray methods have been extended to work automatically with Dask arrays for parallelism
—https://docs.xarray.dev/en/stable/user-guide/dask.html#using-dask-with-xarray
Parallelized xarray methods include indexing, computation, concatenating and grouped operations
xCDAT APIs that build upon xarray methods inherently support Dask parallelism
Dask arrays are loaded into memory only when absolutely required (e.g., generating weights for averaging)
[15]:
filepath = "http://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"
# Use .chunk() to activate Dask arrays
# NOTE: `open_mfdataset()` automatically chunks by the number of files, which
# might not be optimal.
ds = xc.open_dataset(
filepath,
chunks={"time": "auto"}
)
ds
[15]:
<xarray.Dataset> Dimensions: (time: 1980, bnds: 2, lat: 145, lon: 192) Coordinates: * time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00 * lat (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0 * lon (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1 height float64 ... Dimensions without coordinates: bnds Data variables: time_bnds (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray> lat_bnds (lat, bnds) float64 dask.array<chunksize=(145, 2), meta=np.ndarray> lon_bnds (lon, bnds) float64 dask.array<chunksize=(192, 2), meta=np.ndarray> tas (time, lat, lon) float32 dask.array<chunksize=(1205, 145, 192), meta=np.ndarray> Attributes: (12/49) Conventions: CF-1.7 CMIP-6.2 activity_id: CMIP branch_method: standard branch_time_in_child: 0.0 branch_time_in_parent: 87658.0 creation_date: 2020-06-05T04:06:11Z ... ... version: v20200605 license: CMIP6 model data produced by CSIRO is li... cmor_version: 3.4.0 _NCProperties: version=2,netcdf=4.6.2,hdf5=1.10.5 tracking_id: hdl:21.14100/af78ae5e-f3a6-4e99-8cfe-5f2... DODS_EXTRA.Unlimited_Dimension: time
Further Dask Guidance#
Visit these pages for more guidance (e.g., when to parallelize):
Parallel computing with Dask (xCDAT): https://xcdat.readthedocs.io/en/latest/examples/parallel-computing-with-dask.html
Parallel computing with Dask (Xarray): https://docs.xarray.dev/en/stable/user-guide/dask.html
Xarray with Dask Arrays: https://examples.dask.org/xarray.html
Key Takeaways#
A driving need for a modern successor to CDAT
Serves the climate community in the long-term
xCDAT is an extension of xarray for climate data analysis on structured grids
Goal of providing features and utilities for simple and robust analysis of climate data
Where to Find xCDAT#
xCDAT is available for installation through Anaconda
Install command: ``conda install -c conda-forge xcdat xesmf``
Check out xCDAT’s Read the Docs, which we strive to keep up-to-date
Get Involved on GitHub!#
Code contributions are welcome and appreciated
GitHub Repository: xCDAT/xcdat
Contributing Guide: https://xcdat.readthedocs.io/en/latest/contributing.html
Submit and/or address tickets for feature suggestions, bugs, and documentation updates
GitHub Issues: xCDAT/xcdat#issues
Participate in forum discussions on version releases, architecture, feature suggestions, etc.
GitHub Discussions: xCDAT/xcdat#discussions