 


Planck will provide the definitive measurement of the temperature anisotropies, as well as the most detailed polarization observations to date. These results will be an essential complement to a number of follow-on experiments currently being developed to improve our understanding of dark energy (indeed, the projected Planck results are routinely assumed as given by those experiments). Beyond Planck, first suborbital experiments and ultimately another satellite mission will search for the faintest CMB signal, known as its B-mode polarization. In an extraordinary convergence of physical processes across scales and systems, this is expected to carry the imprint of gravitational waves generated during the inflation of the Universe following the Big Bang.
As the science goals of CMB observations have evolved, they have required ever-higher resolution observations of ever-fainter signals (figure 7, p21). Achieving sufficient signal-to-noise to extract the science has necessarily required an increase in the volume of data being gathered, and their analysis has become a significant computational challenge. Over the last decade, NERSC has become the leading provider of high-performance computing (HPC) resources for CMB experiments worldwide (sidebar "CMB Data Analysis at NERSC," p24).


Correlation is Key, but Computationally Intractable
The key feature of CMB data is that they are correlated, meaning that individual observations are not independent of one another. Moreover, each of the three major components of an observation—the instrument noise, the foreground signal, and the CMB signal—is correlated in a different domain. Detector noise is correlated in the time domain; foreground signals, such as the emission from dust in the Galactic plane, are spatially correlated; and the CMB signal itself is correlated on different angular scales—indeed, it is precisely the spectrum of these angular correlations that scientists want to determine.

These correlations preclude the kind of divide-and-conquer, embarrassingly parallel approaches used in other data-intensive fields, such as accelerator physics. Instead, any CMB analysis has to be able to manipulate an entire dataset simultaneously and coherently, ideally keeping track not just of the data but also of all of the details of these correlations as they are reduced. In practice, tracking the correlations exactly has become computationally intractable, and instead a range of approximate methods has evolved, with the degree of approximation being set by the size of the dataset and the available computing resources.

Scaling
In determining the suitability of a particular approach for the analysis of a dataset, an important concept is that of scaling. Scaling tells us the computational cost of a calculation for a particular data size, and typically consists of a function of the data size (often a power law) and a numerical prefactor. For example, an algorithm with an N^3 power law means that doubling the data size results in an eightfold increase in computing cost, while the cost of a calculation scaling as 10N^3 would be ten times that of one scaling as N^3 for the same data size. As a rule of thumb, the power law depends on the algorithm while the prefactor depends on the implementation.
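To make the distinction concrete, here is a minimal sketch (with hypothetical numbers) of how the power law and the prefactor enter a cost estimate:

```python
# Illustrative only: the cost of a calculation scaling as prefactor * N**power.
def cost(n, power=3, prefactor=1):
    """Operation count implied by a power-law scaling."""
    return prefactor * n ** power

n = 1_000_000
# An N^3 power law: doubling the data size gives an eightfold cost increase.
print(cost(2 * n) / cost(n))            # -> 8.0
# A 10N^3 scaling costs ten times an N^3 scaling at the same data size.
print(cost(n, prefactor=10) / cost(n))  # -> 10.0
```

Improving the algorithm changes the power law; improving the implementation typically only shrinks the prefactor.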

Often scaling depends on several parameters that describe different dimensions of the dataset, which are typically distinguished by their subscripts. In addition, there may be several steps in a complete analysis, each of which has its own scaling behavior. In this case, we often simply cite the most computationally costly step, known as the dominant scaling, since it will determine the overall tractability of the approach (and also identify the area where efforts to improve performance should be focused).

CMB Scaling Limits
For CMB data analysis, the key parameters are the number of observation times, N_t, and the number of map pixels, N_p. The exact maximum likelihood analysis of a CMB dataset—that is, writing down a Gaussian likelihood function and maximizing it—scales as the cube of the number of pixels, N_p^3. This approach was adequate for the datasets gathered by suborbital experiments in the last decade—including the seminal BOOMERanG and MAXIMA balloon missions, the first to be analyzed at the DOE's NERSC Center. However, for more recent experiments—both suborbital missions and satellite missions like WMAP and Planck—it has become completely intractable except on small subsets of the data.

The first approximation used to address this problem has been to decouple the two phases of the analysis—map-making and power spectrum estimation—and calculate maps without attempting to keep track of the correlations between the noise in each pair of pixels. Using an iterative preconditioned conjugate gradient (PCG) method to solve for the most likely map given the observations and their noise properties scales linearly with the number of observations, N_t, with a prefactor of around 100 that includes the number of iterations required for the algorithm to converge. Making reasonable assumptions about growth in the available computing power, this approach should remain viable—at least on the largest supercomputers—for all foreseeable generations of CMB experiments.
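The structure of such a solver can be sketched in a few lines. The toy below is a hedged illustration, not the production code: it assumes white detector noise (so the system is diagonal and PCG converges almost immediately), whereas real pipelines apply the full time-domain noise correlations; the sizes and names are invented.

```python
import numpy as np

# Sketch of maximum-likelihood map-making: solve (A^T N^-1 A) m = A^T N^-1 d
# for the map m, where A is the pointing matrix (which pixel each sample hits),
# N is the noise covariance, and d is the time-ordered data.
rng = np.random.default_rng(0)
n_pix, n_samp = 50, 20_000
sky = rng.standard_normal(n_pix)             # "true" sky map (toy)
pix = rng.integers(0, n_pix, size=n_samp)    # pointing: pixel hit at each time
noise_var = 0.5                              # white-noise simplification
d = sky[pix] + rng.normal(0.0, np.sqrt(noise_var), n_samp)  # time-ordered data

def apply_lhs(m):
    """Apply A^T N^-1 A to a map (white noise: N^-1 is just 1/noise_var)."""
    tod = m[pix] / noise_var                                # N^-1 A m
    return np.bincount(pix, weights=tod, minlength=n_pix)   # A^T ...

b = np.bincount(pix, weights=d / noise_var, minlength=n_pix)   # A^T N^-1 d
hits = np.bincount(pix, minlength=n_pix)
precond = noise_var / hits                   # inverse of diag(A^T N^-1 A)

# Preconditioned conjugate gradient iteration.
m = np.zeros(n_pix)
r = b - apply_lhs(m)
z = precond * r
p = z.copy()
for _ in range(100):
    Ap = apply_lhs(p)
    alpha = (r @ z) / (p @ Ap)
    m += alpha * p
    r_new = r - alpha * Ap
    if np.linalg.norm(r_new) < 1e-10 * np.linalg.norm(b):
        break
    z_new = precond * r_new
    beta = (r_new @ z_new) / (r @ z)
    p = z_new + beta * p
    r, z = r_new, z_new
# With white noise the solution is just the noise-weighted binned average,
# so the recovered map m is close to the true sky.
```

With correlated noise, only apply_lhs and the preconditioner change; the iteration itself is untouched, and its cost is dominated by applying the noise weighting to the N_t time-ordered samples.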
Deriving the angular power spectrum from this map is now complicated by the absence of information about the pixel correlations. Various more-or-less approximate approaches exist to handle this, including Monte Carlo methods that involve generating and analyzing large numbers of simulated datasets with the same properties as the real data (at least as far as they are known) and using these to estimate the uncertainties imposed by the approximation. Since we typically expect uncertainties to fall as the inverse square root of the number of realizations, achieving the desired percent accuracy using this method requires ten thousand realizations, each of which requires simulating and then mapping the data. This approach has been tractable to date, being widely used in the analysis of the WMAP data, but will begin to push the limits of our capabilities with the Planck data and beyond.
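The inverse-square-root convergence that drives the ten-thousand-realization requirement is easy to demonstrate with a toy Monte Carlo. This stands in for the real simulate-and-map pipeline; the quantity being estimated here is simply a variance:

```python
import numpy as np

# Toy illustration: the scatter of a Monte Carlo estimate falls as
# 1/sqrt(n_real), so percent-level accuracy needs ~10,000 realizations.
rng = np.random.default_rng(42)
true_power = 1.0

def estimate(n_real):
    """Estimate a 'power' (here just a variance) from n_real realizations."""
    samples = rng.normal(0.0, np.sqrt(true_power), size=n_real)
    return np.mean(samples ** 2)

# Measure the fractional scatter of the estimator at two realization counts.
scatter = {}
for n_real in (100, 10_000):
    ests = [estimate(n_real) for _ in range(200)]
    scatter[n_real] = np.std(ests) / true_power
# scatter[10_000] is roughly ten times smaller than scatter[100],
# reaching the percent level only at ten thousand realizations.
```

In the real pipeline each "realization" is a full simulated dataset that must itself be mapped, which is what makes the ten-thousand-realization requirement computationally demanding.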
A number of other approximate methods, including Gibbs and Hamiltonian sampling techniques and a hierarchical maximum likelihood approach, have been proposed, although none has yet been successfully applied to real experimental data. The analysis of the coming generation of experiments remains an open challenge that will likely be met by a hybrid approach using multiple methods to try to control the errors introduced by the approximations inherent in each. 

Data Proliferation
Since the publication of the BOOMERanG and MAXIMA results in early 2000, the use of NERSC resources by CMB experiments has increased very significantly. In any allocation year, there are now about a dozen experiments and 100 data analysts (many on multiple teams) sharing an allocation of upwards of one million CPU-hours.

One challenge presented by this proliferation came from the fact that every experiment adopts its own way (and often more than one way) of storing its data, in terms of both the data distribution within and between files, and the format or formats of those files. To an analysis applications programmer who wants to be able to apply his or her code to any data, these details are at best an irrelevance and at worst an encumbrance. To circumvent this we have developed a data abstraction layer, the M3 library, which allows the application programmer to request any particular interval or subset of any particular type of CMB data (for example, sample data, pixel data, and so on) without needing to know what files and formats those data are stored in. The details of the data files and formats for a particular analysis are held in an XML run configuration file which is parsed by the M3 library at the outset and referenced at the lowest level whenever actual input/output (I/O) is performed. Extending support to any new file format then simply requires the addition of one new function able to handle that file type. As well as providing transparency to the applications programmer, the use of the M3 layer has enabled CMB researchers to analyze the combined data from multiple experiments despite their different formats, and provided instant upgrades to all applications whenever a new data-access optimization or functionality is incorporated.
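The pattern M3 implements can be sketched as follows. This is a schematic illustration only: the real M3 API is not reproduced here, the function and dataset names are invented, and a plain dictionary stands in for the parsed XML run configuration.

```python
# Schematic data-abstraction layer: applications ask for a dataset and a
# sample interval; per-format reader functions are registered once, and a
# run configuration maps each dataset to its file and format.
READERS = {}  # format name -> reader function

def register_format(name):
    """Decorator: extend support to a new file format with one function."""
    def wrap(fn):
        READERS[name] = fn
        return fn
    return wrap

@register_format("toy_binary")
def read_toy_binary(path, start, stop):
    # Stand-in for real I/O: pretend the 'file' holds consecutive integers.
    return list(range(start, stop))

@register_format("toy_text")
def read_toy_text(path, start, stop):
    return [float(i) for i in range(start, stop)]

# In M3 this mapping comes from an XML run configuration file, parsed once
# at the outset; a plain dict stands in for it here.
run_config = {
    "detector_00/signal": {"format": "toy_binary", "path": "day1.dat"},
    "detector_00/pointing": {"format": "toy_text", "path": "day1.pnt"},
}

def read(dataset, start, stop):
    """Fetch an interval of any dataset without knowing its file format."""
    entry = run_config[dataset]
    return READERS[entry["format"]](entry["path"], start, stop)

samples = read("detector_00/signal", 10, 15)   # [10, 11, 12, 13, 14]
```

Because every application goes through read(), supporting a new experiment's format, or adding an I/O optimization, touches only the registered reader, never the analysis codes.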


Despite the very important CMB science that these experiments are doing, for many years now the Planck satellite mission has been setting the bar in terms of the computational challenge. With a dataset that is orders of magnitude larger than anything that has gone before, and with science goals that require the most precise control of errors of any CMB analysis to date, Planck represents the rising tide that floats all ships.  
As with any mission before its deployment, Planck work has necessarily involved the generation and analysis of simulated data; indeed, this is an essential exercise for a number of reasons.


In addition, Planck-scale CMB data analysis has proved to be a very useful scientific application for high-level system benchmarking at NERSC (sidebar "MADbench Software for Realistic Benchmarking," p23), and a symbiotic relationship has emerged in which new NERSC capabilities are regularly proven by performing Planck calculations at some previously intractable scale while simultaneously enabling unprecedented Planck science with the results.


In the summer of 2003, NERSC doubled the size of its SP3 system, Seaborg, to 6,000 compute processors. Unfortunately, the system's standard MPI implementation included a hard limit of 4,096 tasks, meaning that any code wanting to exploit the new capabilities of the system had to be an MPI/OpenMP hybrid. Late in 2004, a new MPI implementation became available that supported MPI across the entire system. Given the opportunity to test this with a scientific application, the LBNL team performed the first map-making analysis of the anticipated data from all 12 of the detectors at a single Planck frequency. With 75 billion simulated samples being mapped to 150 million CMB temperature and polarization pixels, the run required 2.5 TB of disk space to store the input data, and ran on all 6,000 Seaborg processors for about two hours—the first calculation to utilize the entire system.

The resulting maps not only demonstrated a critical data-processing capability for Planck, but their subsequent spectral analysis confirmed that, contrary to some claims, Planck would indeed be able to recover the large-angular-scale (low-multipole) signals despite the presence of long-range correlations in its instrument noise.
Scaling to this concurrency presented a number of problems, most specifically with the I/O subsystem. With 6,000 processors simultaneously attempting to access 2.5 TB of data, some of which were unique to a single processor and some common to all, I/O proved to take about 50% of the analysis's run time. To reduce the I/O load we looked at ways to compress the data, while keeping in mind that, given the subtlety of the signals we wanted to extract, any compression would have to result in a dataset that produced exactly the same results as the uncompressed data. A closer look at the input dataset revealed that only about 15% of it was signal from the detectors, with the remainder being the so-called pointing solution, recording where on the sky each detector was pointing at the instant each sample was taken. These pointing data offered two dimensions for effectively lossless compression.


The tradeoff is then between the disk space and I/O time required for the full pointing and the calculation time needed to reconstruct it on the fly from the generalized compressed data. Quantitatively this tradeoff is obviously system-specific; on the SP3 the Generalized Compressed Pointing (GCP) library halved the time to load the data, a saving of about 25% of the run time. The dominant cost in the reconstruction comes from trigonometric functions, and systems with vector math libraries show much better performance than those without. More significantly though, the use of the GCP library reduced the disk space and memory requirements by up to 85%—a critical saving for a dataset of this size. Indeed, as a consequence the Planck team was able to increase the size of the dataset being analyzed from 12 detectors at a single frequency to 32 detectors at three frequencies, with a corresponding increase in the quality of the results. Finally, by accessing the GCP library through the M3 data abstraction layer discussed above, the new capabilities it offered were immediately and transparently accessible to all of the analysis codes using M3.
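To illustrate the disk-versus-trigonometry tradeoff, consider reconstructing per-sample pointing vectors on the fly from a handful of scan parameters rather than reading three stored coordinates per sample. This is a generic, hypothetical scheme for illustration only, not the actual GCP compression, whose details are not given here; the scan parameters are invented.

```python
import numpy as np

# Hypothetical example: regenerate pointing from a compact scan description.
# The per-sample cost is dominated by the sin/cos calls, which is why vector
# math libraries matter for this kind of reconstruction.
n_samp = 100_000
spin_rate = 2 * np.pi / 60.0      # one scan circle per minute (made up)
opening_angle = np.radians(85.0)  # boresight opening angle (made up)

def reconstruct_pointing(start, stop):
    """Rebuild unit pointing vectors for samples [start, stop) on the fly
    from scan parameters alone, instead of reading them from disk."""
    phase = spin_rate * np.arange(start, stop)
    sin_t, cos_t = np.sin(opening_angle), np.cos(opening_angle)
    return np.stack([sin_t * np.cos(phase),
                     sin_t * np.sin(phase),
                     np.full(stop - start, cos_t)], axis=1)

vecs = reconstruct_pointing(0, n_samp)
# Full pointing stores three doubles per sample; the compressed form here is
# just two scalars, traded against the trigonometry above at read time.
full_bytes = vecs.nbytes
compressed_bytes = 2 * 8
```

The reconstruction is exact (effectively lossless), so any analysis sees precisely the data it would have read from the uncompressed files.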
Ultimately, Planck will gather data using 74 detectors at nine frequencies. The obvious next milestone would therefore be to perform an analysis of data simulated for the full mission duration across the full focal plane. Until very recently such a calculation had been beyond the capabilities of the NERSC systems, but with the arrival of Franklin—a Cray XT4 with 20,000 cores, each having 2 GB of memory and capable of 5.2 gigaflops—it entered the realm of the possible.
Once again, the LBNL team was invited to submit a previously intractable calculation to test the limits of a new computational capability, and once again we took the opportunity also to attempt an unprecedented scientific calculation for Planck. To add to the realism of the analysis, this time we included not only the CMB signal and detector noise but also many of the foreground signal components that will contaminate the real Planck data—including point-source, synchrotron, free-free, and dust emission, as well as the Sunyaev-Zel'dovich effect (a spectral distortion due to the up-scattering of CMB photons by the hot plasma within dense galaxy clusters). To maximize the science opportunities for their post-processing, these data were simulated in two sets—one containing the CMB and noise and the other all of the foregrounds. In addition, the data were stored in separate files, each holding a single day's observations, as they will be downloaded from the satellite. Overall this simulation comprised 800 billion samples stored in 50,000 files occupying 3 TB of disk space.

With this dataset in hand we first produced maps of the microwave sky at each individual frequency (figure 8). These served both as a check on the integrity of the data and as the inputs expected by the Planck component-separation working group. The most challenging calculation, however, was the previously impossible analysis of the entire focal plane data volume simultaneously, which, amongst other things, supports component separation within the time-ordered data analysis rather than at the map level. This approach has the potential to offer a significantly cleaner separation, and a comparative analysis of the outputs from the two approaches will provide the first insight into their respective strengths and weaknesses. It was the capabilities offered by Franklin that made this calculation possible.

Next-Generation Experiments
For all that the Planck data will provide, the fine details, of the CMB B-mode polarization in particular, will require a new generation of experiments. Having reduced the detector noise to the limits of quantum efficiency and shot noise, the only way to achieve the signal-to-noise necessary for the B-spectrum measurement is through a dramatic increase in the data volume. Increasing the observation time would be one way to achieve this, although this is not always practical. The alternative is to increase the number of detectors fielded by an experiment. Despite the challenges that this presents, suborbital experiments multiplexing thousands, and ultimately tens of thousands, of detectors are planned, and their first incarnations are currently being prototyped. Ultimately another satellite mission will be required to map the entire microwave sky with such detector sets, and both NASA and ESA have such missions in their long-range plans.

What the Future Holds
For CMB data analysts, the future therefore holds the prospect of a 100-fold increase in the raw data volume, to the order of one petabyte and beyond. This will undoubtedly present a new set of challenges. Two examples of areas that we have begun to work on in preparation for this are automated data management and on-the-fly simulation. With limited spinning disk space and multiple real and simulated datasets, it will not be possible to have all the data spinning continuously. However, any particular analysis requires all of its data to be accessible simultaneously, so we will need to automate the management of limited disk resources to meet the needs of the spectrum of analysis jobs as efficiently as possible. Given this disk space constraint, we will also need to couple simulations and their analyses much more tightly, replacing the current model, in which a simulation is performed once and its data are written to disk for subsequent analysis, with one in which the data are simulated and identically re-simulated on-the-fly (and entirely in-core) whenever an analysis requests them.
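The on-the-fly, identically re-simulated model relies on deterministic, reproducible random streams. Here is a minimal sketch; the names and the (realization, detector, chunk) layout are illustrative, not any particular pipeline's.

```python
import numpy as np

# Instead of writing simulated data to disk, each analysis regenerates
# exactly the same samples on request from a fixed seed hierarchy.
def simulate_chunk(realization, detector, chunk, n_samp=1024):
    """Deterministically (re)generate one chunk of simulated detector noise."""
    # Derive an independent, reproducible stream per (realization, detector, chunk).
    rng = np.random.default_rng([realization, detector, chunk])
    return rng.standard_normal(n_samp)

# Any analysis, run at any time, gets bit-identical data without touching disk:
a = simulate_chunk(realization=7, detector=3, chunk=42)
b = simulate_chunk(realization=7, detector=3, chunk=42)
assert np.array_equal(a, b)  # identical re-simulation, entirely in-core
```

Because the seed hierarchy fully determines the data, the simulation never needs to be stored, only its compact description, which is what makes petabyte-scale Monte Carlo work conceivable within limited disk resources.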



Undoubtedly other challenges will arise as these data start to flow. However, equipped with world-leading high-performance computing capacity and capability, we remain confident that we will continue to be able to meet the data analysis needs of current and future generations of CMB experiments, and so help to realize the extraordinary scientific potential of this last echo of the Big Bang.

Contributors: This work has been the result of a number of collaborations with the CMB and HPC communities. The author, Dr. Julian Borrill, would like to thank members of the BOOMERanG and MAXIMA experiments, the U.S. Planck team and Planck Working Groups 2 and 3, as well as colleagues in the LBNL Computational Research Division and the University of California, Berkeley Computer Science Division and Space Sciences Laboratory. Particular recognition goes to Christopher Cantalupo, Andrew Jaffe, Theodore Kisner, and Radoslaw Stompor.
Further Reading
"Cosmology and Computation: An Interview with George Smoot," p10

