xCDAT logo

LLNL Climate and Weather Seminar Series (01/25/2023) - A Gentle Introduction to xCDAT#

“A Python package for simple and robust climate data analysis.”

Core Developers: Tom Vo, Stephen Po-Chedley, Jason Boutte, Jill Zhang, Jiwoo Lee

With thanks to Peter Gleckler, Paul Durack, Karl Taylor, and Chris Golaz


This work is performed under the auspices of the U. S. DOE by Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344.

Presentation Overview#

Intended audience: Some or no familiarity with xarray and/or xcdat

  1. Driving force behind xCDAT

  2. Goals and milestones of CDAT’s successor

  3. Introducing xCDAT

  4. Understanding the basics of Xarray

  5. How xCDAT extends Xarray for climate data analysis

  6. Technical design philosophy and APIs

  7. Demo of capabilities

  8. How to get involved

Notebook Setup#

Create an Anaconda environment for this notebook using the command below:

conda create -n xcdat -c conda-forge xarray xcdat xesmf matplotlib nc-time-axis jupyter
  • xesmf is required for horizontal regridding with xESMF

  • matplotlib is an optional dependency required for plotting with xarray

  • nc-time-axis is an optional dependency required for matplotlib to plot cftime coordinates

The Driving Force Behind xCDAT#

  • The CDAT (Community Data Analysis Tools) library has provided a suite of robust and comprehensive open-source climate data analysis and visualization packages for over 20 years

  • A driving need for a modern successor

    • Focus on a maintainable and extensible library

    • Serve the needs of the climate community in the long-term

CMIP6 logo CMIP6 logo

Goals and Milestones for CDAT’s Successor#

  1. Offer similar core capabilities

    1. For example geospatial averaging, temporal averaging, and regridding

  2. Use modern technologies in the library’s stack

    1. Support parallelism and lazy operations

  3. Be maintainable, extensible, and easy-to-use

    1. Python Enhancement Proposals (PEPs)

    2. Automate DevOps processes (unit testing, code coverage)

    3. Actively maintain documentation

  4. Cultivate an open-source community that can sustain the project

    1. Encourage GitHub contributions

    2. Community engagement efforts (e.g., Pangeo, ESGF)

Introducing xCDAT#

  • xCDAT is an extension of xarray for climate data analysis on structured grids

  • Goal of providing features and utilities for simple and robust analysis of climate data

  • Jointly developed by scientists and developers from:

    • E3SM Project (Energy Exascale Earth System Model Project)

    • PCMDI (Program for Climate Model Diagnosis and Intercomparison)

    • SEATS Project (Simplifying ESM Analysis Through Standards Project)

    • Users around the world via GitHub

E3SM logo PCMDI logo SEATS logo

Before We Dive Deeper, Let’s Talk About Xarray#

  • Xarray is an evolution of an internal tool developed at The Climate Corporation

  • Released as open source in May 2014

  • NumFocus fiscally sponsored project since August 2018

xarray logo NumFOCUS logo

Key Features and Capabilities in Xarray#

  • “N-D labeled arrays and datasets in Python”

    • Built upon and extends NumPy and pandas

  • Interoperable with scientific Python ecosystem including NumPy, Dask, Pandas, and Matplotlib

  • Supports file I/O, indexing and selecting, interpolating, grouping, aggregating, parallelism (Dask), plotting (matplotlib wrapper)

    • Supported formats include: netCDF, Iris, OPeNDAP, Zarr, and GRIB

Source: https://xarray.dev/#features

Why use Xarray?#

“Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.”

https://xarray.pydata.org/en/v2022.10.0/getting-started-guide/why-xarray.html

  • Apply operations over dimensions by name

    • x.sum('time')

  • Select values by label (or logical location) instead of integer location

    • x.loc['2014-01-01'] or x.sel(time='2014-01-01')

  • Mathematical operations vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape

    • x - y

  • Easily use the split-apply-combine paradigm with groupby

    • x.groupby('time.dayofyear').mean().

  • Database-like alignment based on coordinate labels that smoothly handles missing values

    • x, y = xr.align(x, y, join='outer')

  • Keep track of arbitrary metadata in the form of a Python dictionary

    • x.attrs

Source: https://docs.xarray.dev/en/v2022.10.0/getting-started-guide/why-xarray.html#what-labels-enable

The Xarray Data Models#

“Xarray data models are borrowed from netCDF file format, which provides xarray with a natural and portable serialization format.”

https://docs.xarray.dev/en/v2022.10.0/getting-started-guide/why-xarray.html

  1. ``xarray.Dataset``

    • A dictionary-like container of DataArray objects with aligned dimensions

      • DataArray objects are classified as “coordinate variables” or “data variables”

      • All data variables have a shared union of coordinates

    • Serves a similar purpose to a pandas.DataFrame

  2. ``xarray.DataArray``

    • A class that attaches dimension names, coordinates, and attributes to multi-dimensional arrays (aka “labeled arrays”)

    • An N-D generalization of a pandas.Series

xarray logo

Exploring the Xarray Data Models#

Example dataset: tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc

  • Open up this real dataset from ESGF using xarray’s OPeNDAP support.

    • Contains the tas variable (near-surface air temperature) recorded on a monthly frequency

  • It is not downloaded until calculations/computations are performed on the Dataset object

    • Example of an xarray lazy operation

[1]:
# This style import is necessary to properly render Xarray's HTML output with
# the Jupyer RISE extension.
# GitHub Issue: https://github.com/damianavila/RISE/issues/594
# Source: https://github.com/smartass101/xarray-pydata-prague-2020/blob/main/rise.css

from IPython.core.display import HTML

style = """
<style>
.reveal pre.xr-text-repr-fallback {
    display: none;
}
.reveal ul.xr-sections {
    display: grid
}

.reveal ul ul.xr-var-list {
    display: contents
}
</style>
"""


HTML(style)
[1]:
[2]:
import xarray as xr

filepath = "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

ds = xr.open_dataset(filepath)

The Dataset Model#

[3]:
ds
[3]:
<xarray.Dataset>
Dimensions:    (time: 1980, bnds: 2, lat: 145, lon: 192)
Coordinates:
  * time       (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] ...
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    tas        (time, lat, lon) float32 ...
Attributes: (12/48)
    Conventions:                     CF-1.7 CMIP-6.2
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            0.0
    branch_time_in_parent:           87658.0
    creation_date:                   2020-06-05T04:06:11Z
    ...                              ...
    variant_label:                   r10i1p1f1
    version:                         v20200605
    license:                         CMIP6 model data produced by CSIRO is li...
    cmor_version:                    3.4.0
    tracking_id:                     hdl:21.14100/af78ae5e-f3a6-4e99-8cfe-5f2...
    DODS_EXTRA.Unlimited_Dimension:  time

A dictionary-like container of labeled arrays (DataArray objects) with aligned dimensions.

Key properties:

  • dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {‘x’: 6, ‘y’: 6, ‘time’: 8})

  • coords: a dict-like container of DataArrays intended to label points used in ``data_vars`` (e.g., arrays of numbers, datetime objects or strings)

  • data_vars: a dict-like container of DataArrays corresponding to variables

  • attrs: dict to hold arbitrary metadata

Source: https://docs.xarray.dev/en/stable/user-guide/data-structures.html#dataset

The DataArray Model#

[4]:
ds.tas
[4]:
<xarray.DataArray 'tas' (time: 1980, lat: 145, lon: 192)>
[55123200 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    height   float64 ...
Attributes:
    standard_name:  air_temperature
    long_name:      Near-Surface Air Temperature
    comment:        near-surface (usually, 2 meter) air temperature
    units:          K
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2020-06-05T04:06:10Z altered by CMOR: Treated scalar dime...
    _ChunkSizes:    [  1 145 192]

A class that attaches dimension names, coordinates, and attributes to multi-dimensional arrays (aka “labeled arrays”)

Key properties:

  • values: a numpy.ndarray holding the array’s values

  • dims: dimension names for each axis (e.g., (‘x’, ‘y’, ‘z’))

  • coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)

  • attrs: dict to hold arbitrary metadata (attributes)

Source: https://docs.xarray.dev/en/stable/user-guide/data-structures.html#dataarray

Resources for Learning Xarray#

xCDAT Extends Xarray for Climate Data Analysis#

  • Some key xCDAT features are inspired by or ported from the core CDAT library

    • e.g., spatial averaging, temporal averaging, regrid2 for horizontal regridding

  • Other features leverage powerful libraries in the xarray ecosystem

    • xESMF for horizontal regridding

    • xgcm for vertical interpolation

    • CF-xarray for CF convention metadata interpretation

  • xCDAT strives to support datasets CF compliant and common non-CF compliant metadata (time units in “months since …” or “years since …”)

  • Inherent support for lazy operations and parallelism through xarray + dask

esmf logo cf-xarray logo dask logo

The Technical Design Philosophy#

  • Streamline the user experience of developing code to analyze climate data

  • Reduce the complexity and overhead for implementing certain features with xarray (e.g., temporal averaging, spatial averaging)

  • Encourage reusable functionalities through a single library

Leveraging the APIs#

xCDAT provides public APIs in two ways:

  1. Top-level APIs functions

    • e.g., xcdat.open_dataset(), xcdat.center_times()

    • Usually for opening datasets and performing dataset level operations

  2. Accessor classes

    • xcdat provides Dataset accessors, which are implicit namespaces for custom functionality.

    • Accessor namespaces clearly identifies separation from built-in xarray methods.

    • Operate on variables within the xr.Dataset

    • e.g., ds.spatial, ds.temporal, ds.regridder

xcdat accessor

xcdat spatial functionality is exposed by chaining the .spatial accessor attribute to the xr.Dataset object.

Source: https://xcdat.readthedocs.io/en/latest/api.html

Key Features in xCDAT#

Feature

API

Description

Extend xr.open_dataset() and xr.open_mfdataset()

  • open_dataset()

  • open_mfdataset()

  • Bounds generation

  • Time decoding (CF and select non-CF time units)

  • Centering of time coordinates

  • Conversion of longitudinal axis orientation

Temporal averaging

  • ds.temporal.average()

  • ds.temporal.group_average()

  • ds.temporal.climatology()

  • ds.temporal.departures()

  • Single snapshot and group average

  • Climatology and departure

  • Weighted or unweighted

  • Optional seasonal configuration< (e.g., custom seasons)

Geospatial averaging

ds.spatial.average()

  • Rectilinear grids

  • Weighted

  • Optional specification of region domain

Horizontal regridding

ds.regridder.horizontal()

  • Rectilinear and curvilinear grids

  • Extends xESMF horizontal regridding

  • Python implementation of regrid2

Vertical regridding

ds.regridder.vertical()

  • Transform vertical coordinates

  • Extends xgcm vertical interpolation

  • Linear, logarithmic, and conservative interpolation

  • Decode parametric vertical coordinates if required

A Demo of xCDAT Capabilities#

  • Prerequisites

    • Installing xcdat

    • Import xcdat

    • Open a dataset and apply postprocessing operations

  • Scenario 1 - Calculate the spatial averages over the tropical region

  • Scenario 2 - Calculate temporal average

  • Scenario 3 - Horizontal regridding (bilinear, gaussian grid)

Installing xcdat#

xCDAT is available on Anaconda under the conda-forge channel (https://anaconda.org/conda-forge/xcdat)

Two ways to install xcdat with recommended dependencies (xesmf):

  1. Create a conda environment from scratch (conda create)

    conda create -n <ENV_NAME> -c conda-forge xcdat xesmf
    conda activate <ENV_NAME>
    
  2. Install xcdat in an existing conda environment (conda install)

    conda activate <ENV_NAME>
    conda install -c conda-forge xcdat xesmf
    

Source: https://xcdat.readthedocs.io/en/latest/getting-started.html

Opening a dataset#

Example dataset: tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc (same as before)

[5]:
# This gives access to all xcdat public top-level APIs and accessor classes.
import xcdat as xc

# We import these packages specifically for plotting. It is not required to use xcdat.
import matplotlib.pyplot as plt
import pandas as pd

filepath = "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

ds = xc.open_dataset(
    filepath,
    add_bounds=True,
    decode_times=True,
    center_times=True
)

# Unit adjustment from Kelvin to Celcius.
ds["tas"] = ds.tas - 273.15
[6]:
ds
[6]:
<xarray.Dataset>
Dimensions:    (time: 1980, bnds: 2, lat: 145, lon: 192)
Coordinates:
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 2.0
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object ...
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    tas        (time, lat, lon) float32 -27.19 -27.19 -27.19 ... -25.29 -25.29
Attributes: (12/48)
    Conventions:                     CF-1.7 CMIP-6.2
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            0.0
    branch_time_in_parent:           87658.0
    creation_date:                   2020-06-05T04:06:11Z
    ...                              ...
    variant_label:                   r10i1p1f1
    version:                         v20200605
    license:                         CMIP6 model data produced by CSIRO is li...
    cmor_version:                    3.4.0
    tracking_id:                     hdl:21.14100/af78ae5e-f3a6-4e99-8cfe-5f2...
    DODS_EXTRA.Unlimited_Dimension:  time

Scenario 1: Spatial Averaging#

Related accessor: ds.spatial

In this example, we calculate the spatial average of tas over the tropical region and plot the first 100 time steps.

[7]:
ds_trop_avg = ds.spatial.average("tas", axis=["X","Y"], lat_bounds=(-25,25))
ds_trop_avg.tas
[7]:
<xarray.DataArray 'tas' (time: 1980)>
array([25.24722608, 25.61795924, 25.96516235, ..., 26.79536823,
       26.67771602, 26.27182383])
Coordinates:
  * time     (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    height   float64 2.0
[8]:
ds_trop_avg.tas.isel(time=slice(1, 100)).plot()
[8]:
[<matplotlib.lines.Line2D at 0x1441e60b0>]
../../_images/demos_1-25-23-cwss-seminar_introduction-to-xcdat_34_1.png

Scenario 2: Calculate temporal average#

Related accessor: ds.temporal

In this example, we calculate the temporal average of ``tas`` as a single snapshot (time dimension is collapsed).

[9]:
ds_avg = ds.temporal.average("tas", weighted=True)
ds_avg.tas
[9]:
<xarray.DataArray 'tas' (lat: 145, lon: 192)>
array([[-48.01481628, -48.01481628, -48.01481628, ..., -48.01481628,
        -48.01481628, -48.01481628],
       [-44.94085363, -44.97948214, -45.01815398, ..., -44.82408252,
        -44.86273067, -44.9009281 ],
       [-44.11875274, -44.23060624, -44.33960158, ..., -43.76766492,
        -43.88593717, -44.00303006],
       ...,
       [-18.21076615, -18.17513373, -18.13957458, ..., -18.32720478,
        -18.28428828, -18.2486193 ],
       [-18.50778243, -18.49301854, -18.47902819, ..., -18.55410851,
        -18.5406963 , -18.52413098],
       [-19.07366375, -19.07366375, -19.07366375, ..., -19.07366375,
        -19.07366375, -19.07366375]])
Coordinates:
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    height   float64 2.0
Attributes:
    operation:  temporal_avg
    mode:       average
    freq:       month
    weighted:   True
[10]:
ds_avg.tas.plot(label="weighted")
[10]:
<matplotlib.collections.QuadMesh at 0x1443d9f30>
../../_images/demos_1-25-23-cwss-seminar_introduction-to-xcdat_37_1.png

Scenario 3: Horizontal Regridding#

Related accessor: ds.regridder

In this example, we will generate a gaussian grid with 32 latitudes to regrid our input data to.

Create the output grid#

[11]:
output_grid = xc.create_gaussian_grid(32)
output_grid
[11]:
<xarray.Dataset>
Dimensions:   (lat: 32, bnds: 2, lon: 65)
Coordinates:
  * lat       (lat) float64 85.76 80.27 74.74 69.21 ... -74.74 -80.27 -85.76
  * lon       (lon) float64 0.0 5.625 11.25 16.88 ... 343.1 348.8 354.4 360.0
Dimensions without coordinates: bnds
Data variables:
    lat_bnds  (lat, bnds) float64 90.0 83.21 83.21 77.61 ... -83.21 -83.21 -90.0
    lon_bnds  (lon, bnds) float64 -2.812 2.812 2.812 8.438 ... 357.2 357.2 362.8

Plot the Input vs. Output Grid#

[12]:
fig, axes = plt.subplots(ncols=2, figsize=(16, 6))

input_grid = ds.regridder.grid
input_grid.plot.scatter(x='lon', y='lat', s=5, ax=axes[0], add_colorbar=False, cmap=plt.cm.RdBu)
axes[0].set_title('Input Grid')

output_grid.plot.scatter(x='lon', y='lat', s=5, ax=axes[1], add_colorbar=False, cmap=plt.cm.RdBu)
axes[1].set_title('Output Grid')

plt.tight_layout()
../../_images/demos_1-25-23-cwss-seminar_introduction-to-xcdat_42_0.png

Regrid the data#

xCDAT offers horizontal regridding with xESMF (default) and a Python port of regrid2. We will be using xESMF to regrid.

[13]:
# xesmf supports "bilinear", "conservative", "nearest_s2d", "nearest_d2s", and "patch"
output = ds.regridder.horizontal('tas', output_grid, tool='xesmf', method='bilinear')
[14]:
fig, axes = plt.subplots(ncols=2, figsize=(16, 4))

ds.tas.isel(time=0).plot(ax=axes[0])
axes[0].set_title('Input data')

output.tas.isel(time=0).plot(ax=axes[1])
axes[1].set_title('Output data')

plt.tight_layout()
../../_images/demos_1-25-23-cwss-seminar_introduction-to-xcdat_45_0.png

Parallelism with Dask#

Dask logo

Nearly all existing xarray methods have been extended to work automatically with Dask arrays for parallelism

https://docs.xarray.dev/en/stable/user-guide/dask.html#using-dask-with-xarray

  • Parallelized xarray methods include indexing, computation, concatenating and grouped operations

  • xCDAT APIs that build upon xarray methods inherently support Dask parallelism

    • Dask arrays are loaded into memory only when absolutely required (e.g., generating weights for averaging)

[15]:
filepath = "http://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

# Use .chunk() to activate Dask arrays
# NOTE: `open_mfdataset()` automatically chunks by the number of files, which
# might not be optimal.
ds = xc.open_dataset(
    filepath,
    chunks={"time": "auto"}
)
ds
[15]:
<xarray.Dataset>
Dimensions:    (time: 1980, bnds: 2, lat: 145, lon: 192)
Coordinates:
  * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(145, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(192, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 dask.array<chunksize=(1205, 145, 192), meta=np.ndarray>
Attributes: (12/49)
    Conventions:                     CF-1.7 CMIP-6.2
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            0.0
    branch_time_in_parent:           87658.0
    creation_date:                   2020-06-05T04:06:11Z
    ...                              ...
    version:                         v20200605
    license:                         CMIP6 model data produced by CSIRO is li...
    cmor_version:                    3.4.0
    _NCProperties:                   version=2,netcdf=4.6.2,hdf5=1.10.5
    tracking_id:                     hdl:21.14100/af78ae5e-f3a6-4e99-8cfe-5f2...
    DODS_EXTRA.Unlimited_Dimension:  time

Further Dask Guidance#

Visit these pages for more guidance (e.g., when to parallelize):

Key Takeaways#

  • A driving need for a modern successor to CDAT

  • Serves the climate community in the long-term

  • xCDAT is an extension of xarray for climate data analysis on structured grids

  • Goal of providing features and utilities for simple and robust analysis of climate data

xcdat logo

Where to Find xCDAT#

  • xCDAT is available for installation through Anaconda

    • Install command: ``conda install -c conda-forge xcdat xesmf``

  • Check out xCDAT’s Read the Docs, which we strive to keep up-to-date

Anaconda logo conda-forge logo

RTD logo RTD screenshot

Get Involved on GitHub!#