DOE SciDAC Review, Office of Science
OPEN SCIENCE GRID
A National High-Throughput Facility for Science
The Open Science Grid strives to help satisfy the ever-growing computing and data management requirements of scientific researchers, especially collaborative science requiring high-throughput computing. It is a consortium of software, service, and resource providers and researchers from universities, national laboratories, and computing centers across the United States. The members' independently owned and managed resources make up the distributed facility, agreements between them provide the glue for it, their requirements drive its evolution, and their contributed effort makes it happen.

Open Science Grid
Scientific discovery in the 21st century is tightly coupled to computational activity of enormous complexity and throughput. Distributed facilities that provide effective access for processing and storage reduce the intellectual distance between the globally distributed researchers and the resources they require. The Open Science Grid (OSG) consortium is a leader in providing dependable and scalable access to a shared national high-throughput computational facility benefiting science, research, and education. OSG provides an open environment for the engagement of multi-disciplinary scientists and researchers, information technology (IT) providers, software developers, educators, and administrators, which demonstrates tangible results across a broad range of community-based science.
OSG's goals are accomplished through a project that maintains the facility, trains communities in its use, and supports and evolves technologies and software based on scientific needs. The OSG project is jointly funded by the DOE SciDAC-2 program and the National Science Foundation (NSF) for an initial five-year program begun in September 2006 (figure 1). This multi-agency sponsorship is important in facilitating engagement across the full span of educational and research communities, from small university groups and individual PIs to the large DOE laboratory facilities with their thousands of people engaged in physics collaborations. OSG is thus not only important to the Office of Science objectives in providing the scientific community access to world-class computation and network facilities, but also essential to the NSF mission of democratizing computing for research and education.
Figure 1.The stakeholder and project timeline, showing development of the OSG consortium.
The scale of resources, users, capacity, and performance of the OSG distributed facility is driven by the user communities, with the most challenging today being high-energy physics experiments with petascale data and processing needs. OSG is the U.S. computing infrastructure for the ATLAS and CMS experiments—part of the worldwide Large Hadron Collider (LHC) Computing Grid. OSG contributes to this global infrastructure, supporting access to more than 20 petabytes of archived data, enabling access to those data through up to 30 petabytes of disk cache, and managing processing by more than 100,000 computers across 100 sites worldwide.
Other physics experiments, which already have mature data distribution and analysis systems, are adapting their applications to take advantage of this shared infrastructure. The Laser Interferometer Gravitational Wave Observatory (LIGO), the Fermilab Tevatron experiments (CDF and DZero; figure 2), and the STAR Relativistic Heavy Ion Experiment are the most prolific. There is active engagement in OSG by scientists from other areas including computer science, biology, and astrophysics. This mix of applications is ensuring the evolution of a generic cyber-infrastructure to support the needs of additional scientific communities.
Figure 2. The limits for the frequency of B_s decay, as determined by the two Tevatron collaborations, CDF and DZero, have changed as measurements become more sensitive to supersymmetry and other new physics. The latest limits are closer to the Standard Model expectation.
The aim is to provide a common distributed computational facility, with end-to-end capabilities, which makes individual and collaborative scientific research more effective and which accommodates large, dynamic peak demands across a robust production steady state infrastructure. The effort to bring technologies, applications, methods, and processes to quality production use is valued research of its own.
The scope of OSG is to operate a secure, heterogeneous infrastructure and to address end-to-end distributed computing needs for high-throughput computational science by engaging existing and new users. To provide this access, OSG makes available common software technologies, extends the capabilities and capacities of the facility, enables cross-campus infrastructures, partners with national and international peers, and educates and engages students, faculty, and the workforce. Outside the scope is the acquisition and operation of the resources, such as processing farms, storage caches, and software artifacts. These continue to be owned and operated by the consortium members themselves. Also outside the scope is the development of software (middleware, applications, and research algorithms and methods) used by or on the facility. These continue to be the responsibility of external projects and collaborations. Similarly, the scientific data archives continue to be owned and operated by the research communities.
Applications running on OSG span data simulation and analysis of small-scale (CPU days) or large-scale (CPU centuries) scientific applications. The facility's architecture has special utility for high-throughput computing applications: large ensembles of loosely coupled parallel applications for which the overhead in placing the application and data on a remote resource is a fraction of the overall processing time, and for which the computations are sufficiently loosely coupled to be able to take advantage of opportunistic resources. Many examples exist today of large-scale simulations and event processing in experimental and observational sciences (figure 3).
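The suitability criterion described above, that placement overhead should be a small fraction of overall processing time, can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the job durations are assumptions chosen for the example, not OSG measurements.

```python
def remote_efficiency(compute_hours: float, overhead_hours: float) -> float:
    """Fraction of a job's total time spent on useful computation.

    overhead_hours stands for the time needed to place the application
    and its data on a remote resource (a hypothetical figure here).
    """
    if compute_hours < 0 or overhead_hours < 0:
        raise ValueError("durations must be non-negative")
    total = compute_hours + overhead_hours
    return compute_hours / total if total > 0 else 0.0


# Hypothetical jobs, each paying the same half hour of staging overhead.
long_job = remote_efficiency(compute_hours=10.0, overhead_hours=0.5)
short_job = remote_efficiency(compute_hours=0.1, overhead_hours=0.5)

print(f"long job efficiency:  {long_job:.2f}")
print(f"short job efficiency: {short_job:.2f}")
```

Under these assumed numbers the long job spends most of its time computing, while the short job is dominated by staging, which is why long-running, loosely coupled work is the better match for remote and opportunistic resources.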
Figure 3. Opportunistic computing and reuse of experiment frameworks. Molecular dynamics simulations reveal that the staphylococcal nuclease protein residue can twist into another conformation, usually forcing water molecules to leave. But which conformation is adopted and for what proportion of the time? And what happens to the water structure around these conformations? Many more simulations are required to find answers.
OSG benefits computer science research. The large-scale distributed infrastructure provides a real-world laboratory for the measurement of the effectiveness of advances in computer science, and researchers use OSG to develop and analyze new distributed algorithms and methodologies.
OSG lives in the computational eco-system of many projects and organizations. The consortium collaborates closely with those of most importance to its users and providers—the most active being the Enabling Grids for E-sciencE (EGEE) project in Europe and TeraGrid in the United States.
Finally, OSG expands the value of the infrastructure and expertise it deploys by reaching out to other scientific domains and communities. This is painstaking as it must be matched to the development cycle and culture of the organizations involved. Researchers have little time to learn how to use and master new technologies. OSG must demonstrate early competitive advantages to the user by increasing productivity and tackling problems and research areas not previously addressed.

Background
OSG began as a grass-roots organization that emerged from a collaboration of SciDAC-1 and NSF Information Technology Research (ITR) projects. In 2003, the leadership of the Particle Physics Data Grid, Grid Physics Network, and International Virtual Data Grid Laboratory projects (later joined by the DOE Science Grid) formed "Trillium," an ad hoc community to build and operate a nationwide distributed infrastructure, Grid3, for the benefit of their participants. This was so successful that Grid3 was maintained for the next two years as an operational infrastructure and the community evolved into the OSG consortium (figure 1).
OSG has an open membership with a low fee to join. The consortium now has about 100 registered member organizations, including computer science groups, software development projects, large DOE and university IT facilities, research collaborations, regional and campus infrastructures, and even universities in South America and Asia. OSG's achievements rely to a large extent on the commitment of the members to the work of the organization and its goals. OSG's leadership is truly inter-disciplinary, bringing together computer and domain scientists and IT experts.

Methods and Techniques
OSG aims for a system providing thousands of users access to more than a hundred thousand processing cores and many tens of petabytes of storage located across hundreds of heterogeneous and autonomous sites. This system must be robust against failure and unavailability of any one or any set of components. The system must also support dynamic integration of new resources and applications and respond to a dynamic, diverse workload.
Distributed computing principles provide the foundation for the implementation, architecture, and design of the facility. Principles of symmetry and recursion inform the approach to providers (computing, networks, data storage, databases, and software agents), consumers (software agents, middleware, users, and applications), security, and fault tolerance.
The "open source" paradigm is at the core of OSG methodology. It is defined as an open system improved and extended by community contributions and involvement, staffed with a core of dedicated experts ensuring quality, performance, and effectiveness.
Conceptually, at the heart of the model is the Virtual Organization (VO) collaborative group whose scope includes people, resources, technologies, and policies connected with a common purpose and goal. VOs contribute, share, and exchange support, resources, and services subject to agreements and well-defined interfaces. VOs retain control over their policies and contributions, contributing to and incorporating those that are OSG-wide. OSG is equivalent to the "mother VO" providing people, technologies, policies, and services accessible by all others.
The economic usage model of OSG supports direct use by the owners, guaranteed use by other communities with whom the owners have made short- or long-term agreements, and opportunistic use of resources on an "as available" basis. The owners retain control over their resources while sharing them with the community. Contributors typically provide access to 10%-20% of their overall capacity. This economic model is viable for three reasons: many communities' computational needs are cyclic; the technologies support some level of automated dynamic reassignment of work; and the contributors to OSG are actively committed to the success of the model.
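The three kinds of use described above (owner, agreement-backed, and as-available) can be read as a simple priority ordering over a site's capacity. The following is a minimal sketch of that reading, not a description of how any OSG scheduler is implemented; the tier names, the `allocate` function, and the community names are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List


class Tier(IntEnum):
    OWNER = 0          # direct use by the resource owner
    AGREEMENT = 1      # guaranteed use under a short- or long-term agreement
    OPPORTUNISTIC = 2  # use on an "as available" basis


@dataclass(frozen=True)
class Request:
    community: str
    tier: Tier
    slots: int


def allocate(capacity: int, requests: List[Request]) -> Dict[str, int]:
    """Grant slots in tier order until the site's capacity is exhausted."""
    granted: Dict[str, int] = {}
    remaining = capacity
    for req in sorted(requests, key=lambda r: r.tier):
        give = min(req.slots, remaining)
        granted[req.community] = granted.get(req.community, 0) + give
        remaining -= give
    return granted


# Hypothetical site with 100 slots and three competing communities.
requests = [
    Request("visitor-vo", Tier.OPPORTUNISTIC, 50),
    Request("owner-vo", Tier.OWNER, 70),
    Request("partner-vo", Tier.AGREEMENT, 20),
]
grants = allocate(100, requests)
print(grants)
```

In this example the owner and the agreement holder are served in full, and the opportunistic community receives only what is left. When the owner's demand drops, as in the cyclic pattern the text mentions, the same function hands the freed slots to the opportunistic tier.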

Immersive Engagement and Training
Collaborative and multidisciplinary computational science develops by direct, embedded engagement of OSG experts with researchers adapting their applications to use the common infrastructure. This model of immersion and interaction leads to close working relationships between OSG and domain experts. The transmitted knowledge is reused vertically within the engaged community, with the trained members helping each other locally, and horizontally with OSG experts moving to help new communities over time. Similarly, OSG experts work directly with campus organizations in preparing their faculty and IT facilities to develop a shared common resource equivalent to OSG itself. Once self-organized locally, campuses interact as needed with OSG to use similar computing resources remotely.
In addition, expansion of a trained and emerging workforce continues with a series of hands-on workshops for students and faculty from universities as well as those from the commercial sector. These workshops provide education in the techniques and technologies of OSG and its partners. Once trained, students are encouraged to continue to develop, use, and research distributed computing within their local environments. Also, OSG is a full partner in the annual two-week International Grid School in Europe (figure 4), which is praised for the accessibility of respected educators who teach the principles and theories on which distributed computing is based as well as the practical techniques.

These methods establish an understanding of CI fundamentals alongside tangible practical techniques, advance our understanding of the role of mentoring, and allow scaling of the transfer of knowledge.
Conceptually, OSG provides a plethora of cyber-infrastructures intersecting and contributing to the global network. The goal is to retain autonomy over the parochial system while facilitating federation and negotiation. Gateways provided between OSG and other infrastructures allow users to transparently access data, processing, and services across administrative domains. Alan Blatecky, a member of the OSG executive board, reflects: "We are opening up gateways to enable communication between grids so that scientists can run the same application on multiple grids. Interoperability is further increasing the resources available to individual scientists."
Figure 4. Students hard at work at a Grid School. The FIGS'08 school included about 35 participants from more than 10 universities worldwide, ranging from undergraduate students to researchers and faculty.

The OSG Facility
To sustain a robust production-quality facility, OSG furnishes round-the-clock operational and support services including security and troubleshooting. To support the often complex applications and workflows, OSG provides a comprehensive suite of common middleware for job management, data, and collaborative scientific processes. To enable managed evolution of the capabilities and capacities, the facility also provides a sizable distributed test bed for validation of software and applications. And, to support expansion in domains and scale of use, OSG engages with new communities of users through immersive collaboration between experts and researchers.

The Virtual Data Toolkit
Dependable and secure software is essential to effectively manage and use OSG. The facility packages and distributes more than 60 modules from the open source community for the benefit of consortium members and partners as the Virtual Data Toolkit (VDT; figure 5). The software modules are built, tested, and distributed for the more than 12 Linux platforms accessed by OSG users and other communities. The VDT includes core grid middleware from Condor and Globus, as well as software from over 10 different development groups, including EGEE and OSG members. The goal is to make software easy to install, configure, operate, and update. Expertise gained is proving invaluable in providing timely response to updates addressing security vulnerabilities and software failures. The OSG team maps dependencies between software modules, works with providers for bug fixes, provides automatic configurations, and understands security issues. There are several packaged collections of modules for different needs (for example, processing farms, storage services, and VO administrators), with components selected as needed.
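Mapping dependencies between modules, as mentioned above, is commonly done with a dependency graph from which a safe build or install order is derived. The sketch below illustrates that general technique using the Python standard library (`graphlib`, Python 3.9 or later); it is not the VDT's actual tooling, and the module names and dependency edges are invented for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical modules, each mapped to the set of modules it depends on.
dependencies = {
    "job-manager": {"security-libs", "grid-transport"},
    "grid-transport": {"security-libs"},
    "storage-service": {"grid-transport"},
    "security-libs": set(),
}

# static_order() yields every module after all of its dependencies,
# and raises graphlib.CycleError if the graph contains a cycle.
install_order = list(TopologicalSorter(dependencies).static_order())
print(install_order)
```

Any order produced this way installs `security-libs` before the modules that rely on it. A packaged collection for a particular need could, under the same assumptions, be derived by restricting the graph to the selected components and their dependencies.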
The number of software releases, including patches for security and software fixes, has now risen to more than 20. Software provisioning and problem diagnosis in such complex end-to-end systems that include application components as well as the common middleware and site-specific operating and management software pose significant challenges. Efforts over the past year have reduced the time from identification of a needed critical software patch to distribution of the software update from weeks to days—but there is yet more to be accomplished.
Figure 5. VDT distributions. The number of software components in the VDT.
Components of the VDT are supported for an increasing set of projects—including the EGEE and TeraGrid. In the latter case, OSG and TeraGrid have worked closely to align the core middleware distributions. Alain Roy, OSG software coordinator, says, "One key element of the first phase towards interoperability of the OSG and TeraGrid infrastructures has been our agreement to align on the same version of Condor and Globus. A second key element was TeraGrid's adoption of the NSF-sponsored Metronome build and test software that OSG also uses."

Scale of the Facility
Quantifying the scale and capacity of the facility is difficult because local control and policies govern the actual use and availability of resources. While nodes of a compute farm may be accessible through OSG, local policies may be such that in practice the nodes are locked down for local work. While OSG may account for thousands of CPU hours of processing, much of it may in practice be by users running on their own organizations' clusters, albeit geographically or organizationally distributed. However, characterization of the changes during 2007 gives a measure of growth of the facility: the processing capacity accessible to OSG has risen from an estimated 33 million to 50 million SpecInt2000s; CPU usage increased from 10,000 to up to roughly 15,000 CPU wall clock days per day; the number of resources registered to the facility increased from the mid-60s to the mid-70s; and the number of campus/regional infrastructures federating with OSG in the United States has increased from three (Wisconsin, Purdue, and Fermilab) to six (with the addition of the New York State regional grid, the North West Indiana Computational Grid, and Clemson University's cross-campus infrastructure).
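The 2007 figures quoted above can be turned into percentage growth with straightforward arithmetic. The short sketch below does only that; the helper name `percent_growth` is an assumption, the input numbers are the ones given in the text, and the count of registered resources is left out because the text gives it only approximately.

```python
def percent_growth(start: float, end: float) -> float:
    """Relative change from start to end, expressed in percent."""
    return (end - start) / start * 100.0


# Start and end values for 2007, as quoted in the text.
growth_2007 = {
    "capacity (million SpecInt2000s)": percent_growth(33, 50),
    "usage (CPU wall clock days per day)": percent_growth(10_000, 15_000),
    "federated campus/regional infrastructures": percent_growth(3, 6),
}

for name, pct in growth_2007.items():
    print(f"{name}: {pct:.0f}%")
```

On these figures, capacity and usage each grew by about half during the year, while the number of federated campus and regional infrastructures doubled.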

Scientific Accomplishments
Computation and data access provided by OSG are only one part of the end-to-end systems on which the scientific and research results depend. The collaborative method, the common middleware, the collaboratively owned and opportunely available resources, and the mentoring and training by experts all contribute to scientific accomplishments. Three examples described below show various aspects of the benefit and value of shared cyber-infrastructure that is the OSG mission.

Getting Ready for the LHC
To meet the data analysis capacity and performance needed when the LHC at CERN reaches full luminosity, scientific collaborations build and test their worldwide infrastructure in stages. The goal is a coherently managed global petascale data distribution and processing facility: the Worldwide LHC Computing Grid. The challenges include keeping a 24 x 7 x 365 system ticking worldwide, flowing terabytes of data a day seamlessly across the globe, and managing transparent and robust access to data stores and processing across multiple independent infrastructures.
2007 was an exciting and challenging year in which the scale of the resources reached more than 25% and the data distribution was tested to the full extent of that needed for initial data analysis. Globus GridFTP and disk caching technologies from OSG software installations supported distribution of data from the Tier-1 facilities at Brookhaven National Laboratory and Fermilab to the more than a dozen Tier-2 university facilities across the United States (as well as two in Brazil). The OSG facility provided the common security, operations, and software infrastructures in the United States on which this challenge took place. Job scheduling and execution relied on the VDT Condor-G and Globus GRAM software. Particle event data were also simulated on opportunistically available OSG resources, including LIGO sites and campus infrastructures at the Universities of Wisconsin and Buffalo. Ian Fisk, U.S. CMS Facilities Manager, explains that OSG "has contributed more than 30% of the utilized resources for the data processing challenge in 2007, whereas the U.S. collaboration is responsible for about 25%."
Figure 6. Events processed for DZero in 2007. Total events reprocessed up to mid-May 2007 on OSG (blue), DZERO-SAM (red), and LCG (green).

Sharing Resources for the Common Good
A major accomplishment of the OSG consortium was its ability to respond to unanticipated demand when the Tevatron DZero experiment (figure 6) reprocessed its complete dataset of five million events using resources the majority of which it did not own. Through a focused joint effort of collaboration and OSG staff, in a few months the legacy DZero system of up to a million lines of code was enabled to benefit effectively from resources available at multiple OSG sites. By the summer of 2007, DZero clocked use at more than a dozen OSG sites, sustained execution of over a thousand simultaneous jobs, used more than two million hours over a period of three months, moved over 70 terabytes of data, and reported physics results dependent on OSG runs.
"This was the first major production of real high-energy physics data (as opposed to simulations) ever run on OSG resources," said Brad Abbott of the University of Oklahoma, then head of the DZero Computing group. DZero has been increasing its usage of OSG ever since then.

Enabling New Research
An example of a new science domain using OSG for production computing in 2007 is the Kuhlman Lab at RENCI, which runs local variations of Rosetta (the molecular modeling program created in David Baker's laboratory at the University of Washington) to study protein structures (figure 7). The pre-existing job submission and management scripts were adapted to run jobs both on OSG and on local campus resources, resulting in the use of more than 20 sites and more than 100,000 CPU hours a month over OSG.
The Kuhlman application demonstrated a cyclical usage pattern typical of research groups of one to a few members, where the analysis of and preparations for computational processing runs (in this example, the proteins modeled are then made in the wet lab for comparison and analysis) result in periods where little or no data processing or simulation is needed. OSG's model of dynamic allocation and response of a large, shared resource pool has proven effective in providing on-demand throughput across multiple diverse programs where time-dependent individual peak needs can be accommodated within relatively stable, sustained overall throughput.
Figure 7. A de novo designed protein-protein interface created with Rosetta.

The Future
Processing and data-intensive science is being transformed through effective access to ubiquitous shared cyber-infrastructure across campus and community infrastructures and across national and international boundaries. OSG is playing a unique role in this transformation through its support for shared high-throughput computational facilities and its engagement with end-to-end needs that range from the exploration of a single researcher's hypothesis to the simulation and analysis needs of the largest multi-disciplinary scientific programs.
OSG's model of multiple federated infrastructures results in natural growth of the distributed facility in scale and capacity as more and more campuses self-organize and embed computation into every facet of their research and education. OSG's model of engagement with collaborative science enables viral transmission of knowledge and adoption into an increasingly diverse set of domains and communities.
In practical terms, the next few years will see a significant increase in the capacity and capability needs of core physics stakeholders, with the ramp-up of data-taking at the LHC and the start of advanced LIGO. The targets of 100,000 processing cores accessible through OSG and 1,000 individuals using it seem attainable. While many challenges remain in making the ensemble effective with regard to cost, usability, and robustness, the focus on making small steps successful is leading to fundamental advances in the abilities of computational science.

Acknowledgements: This work is supported by the Office of Science, U.S. Department of Energy, SciDAC program under Contract DE-FC02-06ER41436 and the National Science Foundation Cooperative Agreement, PHY-0621704.