Ocean Carbon States — Database and Toolbox

The “Ocean Carbon States” represent the regimes of variability of the ocean carbon cycle, as expressed by the partial pressure of CO2 (pCO2) in the ocean and sea surface temperature (SST). These regimes are obtained using cluster analysis, a data-mining technique that organizes multivariate data into groups of similar behavior. By identifying and comparing the spatial and temporal patterns extracted from both observational and model datasets, we gain insight into the physical and biogeochemical processes that control the ocean carbon cycle in nature, as well as the skill with which models simulate these processes.

A detailed description of the technique and its application to two ocean basins in observations and models of the ocean carbon cycle is provided in Latto and Romanou (2018), published in Earth System Science Data.

In this study, the self-organized patterns of pCO2 and SST are calculated. These two variables are chosen because they represent the main pathways through which the ocean carbon concentration changes: air-sea flux and ocean biogeochemistry on the one hand, and ocean circulation on the other. In future work, other key variables of the ocean carbon cycle will be investigated.

The toolbox presented here comprises the methodology and the scripts used to extract the Ocean Carbon States. The database includes results from applying the method to the simple case of a climatological dataset, in order to test its applicability. In future work, both the toolbox and the database will be extended to include more functionality and datasets.

The Ocean Carbon States Toolbox

Screenshot of toolbox MATLAB GUI

The statistical method used here to determine the pCO2-SST regimes in the North Atlantic is k-means clustering, which partitions the spatially and temporally defined 2D histograms of pCO2-SST into groups, called clusters or regimes. The algorithm iteratively searches for a predefined number of clusters (k) and converges when the squared error between the mean of each cluster and the 2D histograms assigned to that cluster is minimized. To ensure that the predetermined number of clusters is representative of the system, we developed an objective method in which k is determined by a sensitivity test. The test compares the average distance of all monthly 2D histograms to the centroid of their assigned cluster with their distance to the centroids of the other clusters, using a score that quantifies that distance. The objective method is meant to complement, not replace, visual inspection of the datasets.
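The clustering step described above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the toolbox's MATLAB code: the function names (`histogram2d`, `kmeans`), the bin edges, and the sample values are all hypothetical, and the initialization shown here (random sampling or a user-supplied `init`) is an assumption rather than the published procedure.

```python
import random


def bin_index(value, edges):
    """Return the index of the bin holding value, or None if it is out of range."""
    if value < edges[0] or value > edges[-1]:
        return None
    for k in range(len(edges) - 1):
        if value < edges[k + 1]:
            return k
    return len(edges) - 2  # value sits exactly on the last edge


def histogram2d(pco2, sst, pco2_edges, sst_edges):
    """Flattened, normalized 2D histogram of paired (pCO2, SST) samples."""
    n_sst = len(sst_edges) - 1
    counts = [0.0] * ((len(pco2_edges) - 1) * n_sst)
    for p, s in zip(pco2, sst):
        i, j = bin_index(p, pco2_edges), bin_index(s, sst_edges)
        if i is not None and j is not None:
            counts[i * n_sst + j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def sq_dist(a, b):
    """Squared Euclidean distance between two flattened histograms."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10, init=None, seed=0):
    """Plain k-means (Lloyd's algorithm). Returns (labels, centroids)."""
    if init is None:
        # Assumption: start from k randomly chosen histograms.
        init = random.Random(seed).sample(points, k)
    centroids = [list(c) for c in init]
    labels = None
    for _ in range(iterations):
        # Assignment step: each histogram joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the result has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    pco2_edges = [300.0, 350.0, 400.0]  # illustrative bin edges (uatm)
    sst_edges = [0.0, 15.0, 30.0]       # illustrative bin edges (deg C)
    cold = histogram2d([320.0, 330.0, 360.0], [5.0, 8.0, 10.0], pco2_edges, sst_edges)
    warm = histogram2d([370.0, 380.0, 390.0], [20.0, 25.0, 22.0], pco2_edges, sst_edges)
    labels, centroids = kmeans([cold, cold, warm, warm], k=2, init=[cold, warm])
    print(labels)
```

Flattening each 2D histogram into a vector lets the same distance function serve both the assignment step and any later attribution. Passing `init` explicitly makes a run reproducible, which is useful when comparing results across basins or datasets.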

The Ocean Carbon States toolbox is a collection of MATLAB scripts that allows users to prepare the datasets and implement the cluster analysis, as described in detail in Latto and Romanou (2018). All functions are located in the OCS_toolbox together with the GUI script to allow for customization. The scripts have been tested in MATLAB R2015a and R2016a. The procedure is split into three steps: pre-clustering, clustering, and post-clustering.

  • For the pre-clustering analysis, the user may choose a basin (North Atlantic or Southern Ocean), a data type (observations or model), and a function to run, in order to produce any of the corresponding paper figures. By editing the appropriate scripts, the user may focus on a different basin or ingest another dataset.
  • For the next step, the user may run the k-means clustering routine after specifying the optimal number of clusters and the number of iterations for the chosen basin and data type, as determined by the previous analysis. In Latto and Romanou (2018), k=3 was obtained as the optimal number of clusters; however, the method may be applied for any k.
  • Lastly, in the post-clustering step, the user may perform spatial and/or temporal attribution of the clusters obtained.
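To illustrate how a number of clusters might be compared before the clustering step, the sketch below scores a partition by contrasting each histogram's distance to its own centroid with its distance to the nearest other centroid. The ratio used here is a hypothetical stand-in, not the score defined in Latto and Romanou (2018), and the names `separation_score` and `best_k` are assumptions made for this example. It is written in Python with the standard library only and takes already-computed labels and centroids as input.

```python
import math


def euclid(a, b):
    """Euclidean distance between two flattened histograms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def separation_score(points, labels, centroids):
    """Hypothetical score: mean distance to the nearest other centroid
    divided by mean distance to the assigned centroid (needs k >= 2).
    Larger values mean tighter, better separated clusters."""
    own = 0.0
    other = 0.0
    for p, lab in zip(points, labels):
        own += euclid(p, centroids[lab])
        other += min(euclid(p, c) for idx, c in enumerate(centroids) if idx != lab)
    n = len(points)
    if own == 0.0:
        return math.inf  # every point sits exactly on its centroid
    return (other / n) / (own / n)


def best_k(points, partitions):
    """Pick the k with the highest score.

    partitions maps each candidate k to a (labels, centroids) pair
    produced by a separate clustering run.
    """
    scores = {k: separation_score(points, labels, centroids)
              for k, (labels, centroids) in partitions.items()}
    return max(scores, key=scores.get), scores
```

A score of this kind only ranks candidate values of k against each other, so it is best read alongside a visual check of the resulting regimes.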

The North Atlantic observed Ocean Carbon States

For example, by selecting the North Atlantic basin and Observations, and then running k-Means with 3 clusters and 10 iterations, the user can generate the ocean carbon states for the observations shown in Fig. 2. Selecting Temporal in the Post-Clustering Analysis then produces the temporal attribution shown in Fig. 3.

Figure 2: North Atlantic ocean carbon states (regimes) in the observations (Takahashi 2009).

Figure 3: Monthly attribution of each ocean carbon state (regimes) in the observations. Temporal attribution is based on the distance of each monthly 2D histogram to the centroid of each cluster.
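The temporal attribution described in the caption above amounts to a nearest-centroid lookup for each month. The following is a minimal Python sketch under that reading. The function name `attribute_months` and the toy two-bin histograms are hypothetical, and the toolbox's own distance measure may differ from the Euclidean distance assumed here.

```python
import math


def euclid(a, b):
    """Euclidean distance between two flattened histograms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def attribute_months(monthly_histograms, centroids):
    """For each monthly histogram, return (nearest_cluster, distances).

    distances[c] is the distance from that month's histogram to the
    centroid of cluster c; nearest_cluster is the index of the smallest.
    """
    result = []
    for hist in monthly_histograms:
        distances = [euclid(hist, c) for c in centroids]
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        result.append((nearest, distances))
    return result


if __name__ == "__main__":
    # Toy example: three "months" and two cluster centroids.
    months = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    centroids = [[1.0, 0.0], [0.0, 1.0]]
    for name, (cluster, dists) in zip(["Jan", "Feb", "Mar"],
                                      attribute_months(months, centroids)):
        print(name, cluster, [round(d, 3) for d in dists])
```

Keeping the full list of distances, rather than only the winning cluster, makes it possible to see how strongly a given month belongs to its regime.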

All functions in the OCS toolbox are described in the README_toolbox file.

Accessing the Database

All data used as input to the clustering method have been published in the open literature. While the OCS Toolbox already has the data loaded in the OCS_data.mat file, both the initial and the pre-processed data can be found online, as described in README_data.txt.

Datasets presently used (Latto and Romanou 2018) include:

  • Takahashi2009 Products: pCO2sw, SST, salinity, wind speed, ice_percent, CO2 flux
  • WOA2013: Nitrate climatology; 12 monthly fields
  • Numerical Simulations:
    Model output includes the ensemble-mean, annual-mean climatology from a five-member ensemble of historical climate runs using the NASA-GISS modelE2.1 with CMIP5 forcings. The ensemble mean has been re-gridded to the Takahashi2009 climatology grid and land-mask.

What's New

We will update the portal with new scripts and data as our research develops. These developments will be listed here.

  • 2018-03-27: Latto and Romanou (2018) published in Earth System Science Data.
  • 2017-09-25: First upload of the clustering method, scripts and input datasets. (An archive of this version is available at doi:10.5281/zenodo.996892.)


The authors wish to thank:

  • William B. Rossow, George Tselioudis and Yuan Zhang for their thoughtful discussions about the method,
  • the researchers who developed the input datasets, including Taro Takahashi at Lamont-Doherty Earth Observatory (LDEO), the National Oceanographic Data Center (NODC) and the NASA-GISS modeling group,
  • David Carlson, editor of Earth System Science Data, for publication of this dataset.

Model runs, data analysis and scripting resources were provided by the NASA High-End Computing (HEC) Program through the NASA Center for Climate Simulation (NCCS) at Goddard Space Flight Center. Clustering analysis was performed using the MATLAB ver 2015 computing environment.

Funding for Anastasia Romanou and Rebecca Latto for this work was provided by NASA-ROSES Modeling, Analysis and Prediction 2013 NNX14AB99A-MAP for GISS Model-E development and NNX15AJ05A NASA Cooperative Agreement 2015-2018.


For questions, suggestions, or help with the datasets and the methods, please contact Dr. Anastasia Romanou.