ESG-CET Tiered Architecture: A Design for the Future
ESG-CET is preparing to accommodate ever-larger climate archives that are expected to come online within the next four years. The specific focus is on preparing for the IPCC's Fifth Assessment Report archive, scheduled for early 2013; readying tools for publishing and processing the massive amounts of data produced by the Climate Science Computational End Station at ORNL; and supporting a wide range of improved climate model evaluation activities. To date, the ESG story has been told in terabytes (trillions of bytes), but scientists must soon manage, analyze, and visualize datasets counted in petabytes (quadrillions of bytes).
Today, ESG stores data at three primary repositories in the United States, but the climate community is moving toward petabyte-scale datasets, physically located at many sites throughout the world. For example, "core" data for the IPCC Fifth Assessment are expected to be one petabyte stored at PCMDI and another 25 to 50 petabytes housed at 20 or more ESG satellite nodes.
Source: D. N. Williams, LLNL. Illustration: A. Tovey.
Figure 12. The ESG Phase 2 tiered architecture showing the three levels of data services represented as Tier 1-Global; Tier 2-Gateway; and Tier 3-Data Node. Three ESG gateways are planned initially, at PCMDI, ORNL, and NCAR. The picture also shows where data users and data providers gain access.
The extremely large sizes of the individual collections mean that much more of the analysis will need to be performed where the data are stored; there will not be the resources or bandwidth to constantly transfer data to and from analysis locations. In addition, many of the analyses require access to multiple data repositories for comparisons between models or datasets. Hence, by 2011, ESG will be a full data-sharing environment and will provide users with open access to a broad range of data assets, spanning models, satellites, field data, biogeochemistry, ecosystems, and more.
This comprehensive infrastructure will federate many data centers into a virtual data repository, provide a full suite of analysis and data manipulation tools, integrate model and observational data, and provide model inter-comparison metrics, user support, and life-cycle maintenance.
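The bandwidth argument above can be made concrete with a rough calculation. The sketch below is illustrative only: the 10 Gb/s sustained link rate is an assumption, not an ESG-CET figure, and `transfer_days` is a hypothetical helper. The archive sizes are the ones quoted earlier for the IPCC Fifth Assessment.

```python
# Rough transfer-time estimate for petabyte-scale archives.
# Assumption: a dedicated link sustaining 10 Gb/s end to end (hypothetical).

PETABYTE_BYTES = 10**15  # one quadrillion bytes, as defined in the text


def transfer_days(size_bytes: float, link_gbps: float) -> float:
    """Days needed to move size_bytes over a link of link_gbps gigabits/s."""
    seconds = (size_bytes * 8) / (link_gbps * 1e9)
    return seconds / 86_400  # seconds per day


for petabytes in (1, 25, 50):
    days = transfer_days(petabytes * PETABYTE_BYTES, link_gbps=10)
    print(f"{petabytes:>2} PB at 10 Gb/s: {days:.1f} days")
```

Under that assumption, one petabyte takes a little over nine days to move and 25 petabytes takes more than seven months, which is why analysis is expected to run close to the data.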
The new architecture is based on three tiers of data services (figure 12):
  • Tier 1 Global Metadata Services for Search and Discovery: Comprises a set of services providing shared functionality across the worldwide ESG-CET federation. The exact specifications of the global services have not been finalized but will likely include services for sharing search metadata, exchanging user attributes and resource policies, and overall monitoring of the ESG-CET system. An overall single sign-on authentication and authorization scheme will allow a registered user to access resources across the whole system and to find data throughout the federation, independent of the site at which a search is launched.
  • Tier 2 Data Gateways as Data-Request Brokers: Comprises a limited number of ESG data gateways that act as brokers, handling data requests to serve specific user communities. Services deployed on a gateway include the user interface for searching and browsing metadata, requesting data products (including analysis and visualization products), and orchestrating complex workflows. Gateways will be operated directly by ESG-CET engineering staff.
  • Tier 3 ESG Nodes With Actual Data Holdings and Metadata Accessing Services: Comprises the actual data holdings, which reside on a (potentially large) number of federated ESG nodes. These nodes host the data and metadata services needed to publish data onto ESG and to execute data-product requests received through an ESG gateway. Personnel at local institutions will operate ESG nodes. A single ESG gateway serves data requests to many associated ESG nodes. For example, more than 20 institutions are expected to set up ESG data nodes as part of the IPCC Fifth Assessment.
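The division of labor among the three tiers can be sketched in a few lines of code. This is a minimal illustration, not the ESG-CET implementation: the class names, method names, and dataset identifier are all hypothetical, and authentication, policies, and monitoring are left out.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalIndex:
    """Tier 1 (hypothetical): shared search metadata for the federation."""
    locations: dict = field(default_factory=dict)  # dataset id -> node name

    def register(self, dataset_id: str, node_name: str) -> None:
        self.locations[dataset_id] = node_name

    def search(self, term: str) -> list:
        return sorted(d for d in self.locations if term in d)


@dataclass
class DataNode:
    """Tier 3 (hypothetical): holds the data and publishes its metadata."""
    name: str
    holdings: dict = field(default_factory=dict)  # dataset id -> payload

    def publish(self, dataset_id: str, payload: bytes, index: GlobalIndex) -> None:
        self.holdings[dataset_id] = payload
        index.register(dataset_id, self.name)

    def fetch(self, dataset_id: str) -> bytes:
        return self.holdings[dataset_id]


@dataclass
class Gateway:
    """Tier 2 (hypothetical): brokers user requests to associated nodes."""
    name: str
    index: GlobalIndex
    nodes: dict = field(default_factory=dict)  # node name -> DataNode

    def attach(self, node: DataNode) -> None:
        self.nodes[node.name] = node

    def request(self, dataset_id: str) -> bytes:
        # Look up where the data live, then forward the request to that node.
        node_name = self.index.locations[dataset_id]
        return self.nodes[node_name].fetch(dataset_id)


# Usage: one gateway, one node, one made-up dataset.
index = GlobalIndex()
gateway = Gateway("PCMDI", index)
node = DataNode("Los Alamos")
gateway.attach(node)
node.publish("ocean.temperature.example", b"placeholder bytes", index)

found = index.search("ocean")        # discovery through Tier 1
data = gateway.request(found[0])     # brokering through Tier 2 to Tier 3
```

The point of the sketch is that users never contact a data node directly: discovery goes through the shared index, and retrieval goes through a gateway that knows which of its associated nodes holds the data.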
To be ready for the expected data onslaught, a distributed testbed must be in place by mid-2009. To this end, five institutions have been selected for the testbed: PCMDI, NCAR, and ORNL will function as ESG gateways (and also as nodes), while Los Alamos and Lawrence Berkeley national laboratories will function as ESG nodes. Strong international interest, particularly in IPCC, is expected to lead quickly to expansion of the testbed to include several ESG nodes at international locations, paving the way for a truly global network of climate data services.