xCDAT logo

A Gentle Introduction to xCDAT (Xarray Climate Data Analysis Tools)#

“A Python package for simple and robust climate data analysis.”

Core Developers: Tom Vo, Stephen Po-Chedley, Jason Boutte, Jill Zhang, Jiwoo Lee

With thanks to Peter Gleckler, Paul Durack, Karl Taylor, and Chris Golaz

This work is performed under the auspices of the U. S. DOE by Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344.

Presentation Overview#

Intended audience: Some or no familiarity with xarray and xcdat

  • What is xCDAT?

  • An overview of Xarray

  • How does xCDAT fit in the Xarray ecosystem?

  • The API design of xCDAT

  • Demo of xCDAT capabilities and Dask parallelism

  • Wrap up and resources

What is xCDAT?#

  • xCDAT is an extension of xarray for climate data analysis on structured grids

  • The goal is to provide generalizable features and utilities for simple and robust analysis of climate data

  • Jointly developed by scientists and developers from E3SM and PCMDI at Lawrence Livermore National Lab

    • In collaboration with external users and organizations through GitHub

  • Performed for the E3SM and SEATS (Simplifying ESM Analysis Through Standards) projects

E3SM logo PCMDI logo SEATS logo

  • Some key xCDAT features are inspired by or ported from the core CDAT library

    • Examples: spatial/temporal averaging, regrid2 for horizontal regridding

  • Other features leverage powerful libraries in the xarray ecosystem

    • Examples: xESMF and CF-xarray

  • xCDAT strives to support CF compliant datasets and datasets with common non-CF compliant metadata

    • Example common non-CF metadata: time units in “months since …” or “years since …”

CMIP6 logo cf-xarray logo

First, Let’s Dive into Xarray#

  • Xarray is an evolution of an internal tool developed at The Climate Corporation

  • Released as open source ina May 2014

  • NumFocus fiscally sponsored project since August 2018

xarray logo NumFOCUS logo

Key Features and Capabilities in Xarray#

  • “N-D labeled arrays and datasets in Python”

    • Built upon and extends NumPy and pandas

  • Interoperable with scientific Python ecosystem including NumPy, Dask, Pandas, and Matplotlib

  • Supports file I/O, indexing and selecting, interpolating, grouping, aggregating, parallelism (Dask), plotting (matplotlib wrapper)

    • Supports I/O for netCDF, Iris, OPeNDAP, Zarr, and GRIB

NumPy logo Pandas logo Dask logo

Source: https://xarray.dev/#features

Why Xarray?#

“Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.”

https://xarray.pydata.org/en/v2022.10.0/getting-started-guide/why-xarray.html

Xarray uses labels on arrays to provide a powerful and concise interface

  • Apply operations over dimensions by name

    • x.sum('time')

  • Select values by label (or logical location) instead of integer location

    • x.loc['2014-01-01'] or x.sel(time='2014-01-01')

  • Mathematical operations vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape

    • x - y

  • Easily use the split-apply-combine paradigm with groupby

    • x.groupby('time.dayofyear').mean().

  • Database-like alignment based on coordinate labels that smoothly handles missing values

    • x, y = xr.align(x, y, join='outer')

  • Keep track of arbitrary metadata in the form of a Python dictionary

    • x.attrs

Source: https://docs.xarray.dev/en/v2022.10.0/getting-started-guide/why-xarray.html#what-labels-enable

The Xarray Core Data Structures#

“Xarray data models are borrowed from netCDF file format, which provides xarray with a natural and portable serialization format.”

https://docs.xarray.dev/en/v2022.10.0/getting-started-guide/why-xarray.html

Xarray has two core data structures:

  1. xarray.DataArray

    • A class that attaches dimension names, coordinates, and attributes to multi-dimensional arrays (aka “labeled arrays”)

    • An N-D generalization of a pandas.Series

  2. xarray.Dataset

    • A dictionary-like container of DataArray objects with aligned dimensions

      • DataArray objects are classified as “coordinate variables” or “data variables”

      • All data variables have a shared union of coordinates

    • Serves a similar purpose to a pandas.DataFrame

xarray logo

Dissecting Xarray Data Structures in a Real-World Dataset#

This example netCDF4 dataset is opened directly from ESGF using xarray’s OPeNDAP support.

It contains the tas variable, which represents near-surface air temperature. tas is recorded on a monthly frequency.

[1]:
# This style import is necessary to properly render Xarray's HTML output with
# the Jupyer RISE extension.
# GitHub Issue: https://github.com/damianavila/RISE/issues/594
# Source: https://github.com/smartass101/xarray-pydata-prague-2020/blob/main/rise.css

from IPython.core.display import HTML

style = """
<style>
.reveal pre.xr-text-repr-fallback {
    display: none;
}

.reveal ul.xr-sections {
    display: grid
}

.reveal ul ul.xr-var-list {
    display: contents
}
</style>
"""


HTML(style)
[1]:
[2]:
import xarray as xr

filepath = "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

ds = xr.open_dataset(filepath)

The Dataset Model#

A dictionary-like container of labeled arrays (DataArray objects) with aligned dimensions.

Key properties:

  • dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {‘x’: 6, ‘y’: 6, ‘time’: 8})

  • coords: another dict-like container of DataArrays intended to label points used in data_vars (e.g., arrays of numbers, datetime objects or strings)

  • data_vars: a dict-like container of DataArrays corresponding to variables

  • attrs: dict to hold arbitrary metadata

Source: https://docs.xarray.dev/en/stable/user-guide/data-structures.html#dataset

[3]:
ds
[3]:
<xarray.Dataset>
Dimensions:    (time: 1980, bnds: 2, lat: 145, lon: 192)
Coordinates:
  * time       (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 2.0
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] 1850-01-01 1850-02-01 ... 2015-01-01
    lat_bnds   (lat, bnds) float64 -90.0 -89.38 -89.38 ... 89.38 89.38 90.0
    lon_bnds   (lon, bnds) float64 -0.9375 0.9375 0.9375 ... 357.2 357.2 359.1
    tas        (time, lat, lon) float32 ...
Attributes: (12/48)
    Conventions:                     CF-1.7 CMIP-6.2
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            0.0
    branch_time_in_parent:           87658.0
    creation_date:                   2020-06-05T04:06:11Z
    ...                              ...
    variant_label:                   r10i1p1f1
    version:                         v20200605
    license:                         CMIP6 model data produced by CSIRO is li...
    cmor_version:                    3.4.0
    tracking_id:                     hdl:21.14100/af78ae5e-f3a6-4e99-8cfe-5f2...
    DODS_EXTRA.Unlimited_Dimension:  time

The DataArray Model#

A class that attaches dimension names, coordinates, and attributes to multi-dimensional arrays (aka “labeled arrays”)

Key properties:

  • values: a numpy.ndarray holding the array’s values

  • dims: dimension names for each axis (e.g., (‘x’, ‘y’, ‘z’))

  • coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)

  • attrs: dict to hold arbitrary metadata (attributes)

Source: https://docs.xarray.dev/en/stable/user-guide/data-structures.html#dataarray

[4]:
ds.tas
[4]:
<xarray.DataArray 'tas' (time: 1980, lat: 145, lon: 192)>
[55123200 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    height   float64 2.0
Attributes:
    standard_name:  air_temperature
    long_name:      Near-Surface Air Temperature
    comment:        near-surface (usually, 2 meter) air temperature
    units:          K
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2020-06-05T04:06:10Z altered by CMOR: Treated scalar dime...
    _ChunkSizes:    [  1 145 192]

Resources for Learning Xarray#

Jumping Forward to xCDAT, an Extension of Xarray#

“Xarray is designed as a general purpose library, and hence tries to avoid including overly domain specific functionality. But inevitably, the need for more domain specific logic arises.”

https://docs.xarray.dev/en/v2022.10.0/internals/extending-xarray.html#extending-xarray

  • xCDAT aims to provide generalizable features and utilities for simple and robust analysis of climate data.

  • xCDAT’s design philosophy is focused on reducing the overhead required to accomplish certain tasks in xarray.

Available xCDAT Features#

  • Extension of xarray’s open_dataset() and open_mfdataset() with post-processing options

    • Generate bounds that don’t exist

    • Keep a single data variable in the Dataset

    • Optional decoding of time coordinates, centering of time coordinates, swapping longitudinal axis orientation between [0, 360) and [-180, 180)

  • Temporal averaging

    • Time series averages (single snapshot and grouped), climatologies, and departures

    • Weighted or unweighted

    • Optional seasonal configuration (e.g., DJF vs. JFD, custom seasons)

  • Geospatial weighted averaging (rectilinear grid)

    • Optional specification of regional domain

  • Horizontal structured regridding (rectilinear and curvilinear grids)

    • Python implementation of regrid2_ for handling cartesian latitude longitude grids

    • API that wraps xESMF

xCDAT’s API Design#

xCDAT provides public APIs in two ways:

  1. Top-level APIs functions

    • Example: xcdat.open_dataset(), xcdat.center_times()

  2. Accessor classes

    • xcdat provides Dataset accessors, which are implicit namespaces for custom functionality.

    • Accessor namespaces clearly identifies separation from built-in xarray methods.

    • Example: ds.spatial, ds.temporal, ds.regridder

xcdat accessor

xcdat spatial functionality is exposed by chaining the .spatial accessor attribute to the xr.Dataset object.

Source: https://xcdat.readthedocs.io/en/latest/api.html

A Demo of xCDAT Capabilities#

  • Prerequisites

    • Installing xcdat

    • Import xcdat

    • Open a dataset and apply postprocessing operations

  • Scenario 1 - Calculate the spatial averages over the tropical region

  • Scenario 2 - Calculate the annual anomalies

  • Scenario 3 - Horizontal regridding (bilinear, gaussian grid)

Installing xcdat#

xCDAT is available on Anaconda under the conda-forge channel (https://anaconda.org/conda-forge/xcdat)

Two ways to install xcdat with recommended dependencies (xesmf):

  1. Create a conda environment from scratch (conda create)

    conda create -n <ENV_NAME> -c conda-forge xcdat xesmf
    conda activate <ENV_NAME>
    
  2. Install xcdat in an existing conda environment (conda install)

    conda activate <ENV_NAME>
    conda install -c conda-forge xcdat xesmf
    

Source: https://xcdat.readthedocs.io/en/latest/getting-started.html

Opening a dataset#

This example netCDF4 dataset is opened directly from ESGF using xarray’s OPeNDAP support.

It contains the tas variable, which represents near-surface air temperature. tas is recorded on a monthly frequency.

[5]:
# This gives access to all xcdat public top-level APIs and accessor classes.
import xcdat as xc

# We import these packages specifically for plotting. It is not required to use xcdat.
import matplotlib.pyplot as plt
import pandas as pd
[6]:
filepath = "https://esgf-data1.llnl.gov/thredds/dodsC/css03_data/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

ds = xc.open_dataset(
    filepath,
    add_bounds=True,
    decode_times=True,
    center_times=True
)

# Unit adjustment from Kelvin to Celcius.
ds["tas"] = ds.tas - 273.15
[7]:
ds
[7]:
<xarray.Dataset>
Dimensions:    (time: 1980, bnds: 2, lat: 145, lon: 192)
Coordinates:
  * time       (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 2.0
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] 1850-01-01 1850-02-01 ... 2015-01-01
    lat_bnds   (lat, bnds) float64 -90.0 -89.38 -89.38 ... 89.38 89.38 90.0
    lon_bnds   (lon, bnds) float64 -0.9375 0.9375 0.9375 ... 357.2 357.2 359.1
    tas        (time, lat, lon) float32 -27.19 -27.19 -27.19 ... -25.29 -25.29
Attributes: (12/48)
    Conventions:                     CF-1.7 CMIP-6.2
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            0.0
    branch_time_in_parent:           87658.0
    creation_date:                   2020-06-05T04:06:11Z
    ...                              ...
    variant_label:                   r10i1p1f1
    version:                         v20200605
    license:                         CMIP6 model data produced by CSIRO is li...
    cmor_version:                    3.4.0
    tracking_id:                     hdl:21.14100/af78ae5e-f3a6-4e99-8cfe-5f2...
    DODS_EXTRA.Unlimited_Dimension:  time

Scenario 1: Spatial Averaging#

Related accessor: ds.spatial

In this example, we calculate the spatial average of tas over the tropical region.

[8]:
ds_trop_avg = ds.spatial.average("tas", axis=["X","Y"], lat_bounds=(-25,25))
[9]:
ds_trop_avg.tas
[9]:
<xarray.DataArray 'tas' (time: 1980)>
array([25.24722608, 25.61795924, 25.96516235, ..., 26.79536823,
       26.67771602, 26.27182383])
Coordinates:
  * time     (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00
    height   float64 2.0

Plot the first 100 time steps#

[10]:
ds_trop_avg.tas.isel(time=slice(0, 100)).plot()
[10]:
[<matplotlib.lines.Line2D at 0x7f86aac814f0>]
../_images/examples_introduction-to-xcdat_31_1.png

Scenario 2: Calculate temporal average#

Related accessor: ds.temporal

In this example, we calculate the temporal average of tas as a single snapshot. The time dimension is removed after averaging.

[11]:
ds_avg = ds.temporal.average("tas", weighted=True)
ds_avg.tas
[11]:
<xarray.DataArray 'tas' (lat: 145, lon: 192)>
array([[-48.01481628, -48.01481628, -48.01481628, ..., -48.01481628,
        -48.01481628, -48.01481628],
       [-44.94085363, -44.97948214, -45.01815398, ..., -44.82408252,
        -44.86273067, -44.9009281 ],
       [-44.11875274, -44.23060624, -44.33960158, ..., -43.76766492,
        -43.88593717, -44.00303006],
       ...,
       [-18.21076615, -18.17513373, -18.13957458, ..., -18.32720478,
        -18.28428828, -18.2486193 ],
       [-18.50778243, -18.49301854, -18.47902819, ..., -18.55410851,
        -18.5406963 , -18.52413098],
       [-19.07366375, -19.07366375, -19.07366375, ..., -19.07366375,
        -19.07366375, -19.07366375]])
Coordinates:
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    height   float64 2.0
Attributes:
    operation:  temporal_avg
    mode:       average
    freq:       month
    weighted:   True

Plot the temporal average#

[12]:
ds_avg.tas.plot(label="weighted")
[12]:
<matplotlib.collections.QuadMesh at 0x7f8688694520>
../_images/examples_introduction-to-xcdat_35_1.png

Scenario 3: Horizontal Regridding#

Related accessor: ds.regridder

In this example, we will generate a gaussian grid with 32 latitudes to regrid our input data to.

Create the output grid#

[13]:
output_grid = xc.create_gaussian_grid(32)
[14]:
fig, axes = plt.subplots(ncols=2, figsize=(16, 4))

ds.regridder.grid.plot.scatter('lon', 'lat', s=0.01, ax=axes[0])
axes[0].set_title('Input Grid')

output_grid.plot.scatter('lon', 'lat', s=0.1, ax=axes[1])
axes[1].set_title('Output Grid')

plt.tight_layout()
../_images/examples_introduction-to-xcdat_39_0.png

Regrid the data#

xCDAT offers horizontal regridding with xESMF (default) and a Python port of regrid2. We will be using xESMF to regrid.

[15]:
# xesmf supports "bilinear", "conservative", "nearest_s2d", "nearest_d2s", and "patch"
output = ds.regridder.horizontal('tas', output_grid, tool='xesmf', method='bilinear')
[16]:
fig, axes = plt.subplots(ncols=2, figsize=(16, 4))

ds.tas.isel(time=0).plot(ax=axes[0])
axes[0].set_title('Input data')

output.tas.isel(time=0).plot(ax=axes[1])
axes[1].set_title('Output data')

plt.tight_layout()
../_images/examples_introduction-to-xcdat_42_0.png

Parallelism with Dask#

Nearly all existing xarray methods have been extended to work automatically with Dask arrays for parallelism — https://docs.xarray.dev/en/stable/user-guide/dask.html#using-dask-with-xarray

  • Parallelized xarray methods include indexing, computation, concatenating and grouped operations

  • xCDAT APIs that build upon xarray methods inherently support Dask parallelism

    • Dask arrays are loaded into memory only when absolutely required (e.g., decoding time, handling bounds)

Dask logo

High-level Overview of Dask Mechanics#

  • Dask divides arrays into many small pieces, called “chunks” (each presumed to be small enough to fit into memory)

  • Dask arrays operations are lazy

    • Operations queue up a series of tasks mapped over blocks

    • No computation is performed until values need to be computed (lazy)

    • Data is loaded into memory and computation is performed in streaming fashion, block-by-block

  • Computation is controlled by multi-processing or thread pool

Source: https://docs.xarray.dev/en/stable/user-guide/dask.html

How do I activate Dask with Xarray/xCDAT?#

  • The usual way to create a Dataset filled with Dask arrays is to load the data from a netCDF file or files

  • You can do this by supplying a chunks argument to open_dataset() or using the open_mfdataset() function

    • By default, open_mfdataset() will chunk each netCDF file into a single Dask array

    • Supply the chunks argument to control the size of the resulting Dask arrays

    • Xarray maintains a Dask array until it is not possible (raises an exception instead of loading into memory)

Source: https://docs.xarray.dev/en/stable/user-guide/dask.html#reading-and-writing-data

[17]:
filepath = "http://esgf.nci.org.au/thredds/dodsC/master/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/historical/r10i1p1f1/Amon/tas/gn/v20200605/tas_Amon_ACCESS-ESM1-5_historical_r10i1p1f1_gn_185001-201412.nc"

# Use .chunk() to activate Dask arrays
# NOTE: `open_mfdataset()` automatically chunks by the number of files, which
# might not be optimal.
ds = xc.open_dataset(
    filepath,
    chunks={"time": "auto"}
)
[18]:
ds
[18]:
<xarray.Dataset>
Dimensions:    (time: 1980, bnds: 2, lat: 145, lon: 192)
Coordinates:
  * time       (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(1980, 2), meta=np.ndarray>
    lat_bnds   (lat, bnds) float64 dask.array<chunksize=(145, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 dask.array<chunksize=(192, 2), meta=np.ndarray>
    tas        (time, lat, lon) float32 dask.array<chunksize=(990, 145, 192), meta=np.ndarray>
Attributes: (12/49)
    Conventions:                     CF-1.7 CMIP-6.2
    activity_id:                     CMIP
    branch_method:                   standard
    branch_time_in_child:            0.0
    branch_time_in_parent:           87658.0
    creation_date:                   2020-06-05T04:06:11Z
    ...                              ...
    version:                         v20200605
    license:                         CMIP6 model data produced by CSIRO is li...
    cmor_version:                    3.4.0
    _NCProperties:                   version=2,netcdf=4.6.2,hdf5=1.10.5
    tracking_id:                     hdl:21.14100/af78ae5e-f3a6-4e99-8cfe-5f2...
    DODS_EXTRA.Unlimited_Dimension:  time

Example of Parallelism in xCDAT#

[19]:
tas_global = ds.spatial.average("tas", axis=["X", "Y"], weights="generate")["tas"]
tas_global
[19]:
<xarray.DataArray 'tas' (time: 1980)>
dask.array<truediv, shape=(1980,), dtype=float64, chunksize=(990,), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 1850-01-16T12:00:00 ... 2014-12-16T12:00:00
    height   float64 ...
Attributes:
    standard_name:  air_temperature
    long_name:      Near-Surface Air Temperature
    comment:        near-surface (usually, 2 meter) air temperature
    units:          K
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2020-06-05T04:06:10Z altered by CMOR: Treated scalar dime...
    _ChunkSizes:    [  1 145 192]

Further Dask Guidance#

Visit these pages for more guidance (e.g., when to parallelize):

Wrapping Things Up#

  • “Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience”

  • xCDAT is an extension of xarray for climate data analysis on structured grids.

  • xCDAT aims to make analyzing climate data simple and robust with xarray

We’d love your support!#

  • The xCDAT core team’s mission is to provide a maintainable and extensible package that serves the needs of the climate community in the long-term.

  • Please check out the repository and give it a star to increase xCDAT’s visibility!

  • We’re always open to contributions, whether through GitHub issues, pull requests, discussions, etc.

Repository: https://github.com/xCDAT/xcdat

Resources#

If you comments or questions, reach out to us on the GitHub discussions forum!