HaloFinder AHF
--------------

1. Scenario Overview
--------------------

1.1 Background and Purpose

Cosmological simulations are nowadays the key tool for investigating the
different processes involved in the formation of the universe, from small
initial density perturbations to the galaxies and clusters of galaxies
observed today. The identification and analysis of bound objects, haloes,
is one of the most important steps in drawing useful physical information
from simulations. With the advent of ever larger simulations, a reliable
parallel halo finder that can cope with the ever-increasing data files is
a must. AHF is a freely available parallel halo finder.

1.2 More information
  * AMIGA code web page http://www.aip.de/People/AKnebe/AMIGA/

-------------------------------
2. Current Scenario description
-------------------------------

2.1 Environment

2.1.1 Hardware
- Processing
  * parallelized for use on clusters
- Storage
  * 100 MB -- 100 GB
- Network
  * cluster interconnect, file transfer
- Describe special hardware or other hardware resources that are relevant
  for the scenario.
  * N/A

2.1.2 Software
- Describe used software such as operating system, software libraries,
  e.g. HDF5-plugin for GridFTP, ...
  * POSIX, MPI-2
- What programming language is used and what compiler/linker version is
  required?
  * C99
- How is the program deployed?
  * built on the cluster head node
- How is the program compiled?
  * make
- State the program license and any commercial 3rd party licenses.
  * GPL

2.2 User Interaction

2.2.1 Initiation
- Describe how the program is started and any steps needed before the
  actual initiation.
  * copy snapshot files
  * set up parameter file
  * run program
- compilation (cf. Section 2.1.2),
  * as above
- Where is the program executed?
  * cluster compute nodes
- How is the program initiated?
  * mpiexec AHFstep

2.2.2 Monitoring/Steering/Visualization during the run-time of the program
- What type of data is produced by the program during run-time used for
  monitoring/steering/visualization?
  * log files are written during execution
  * no steering
- What methods/tools exist for accessing data produced by the program
  during run-time?
  * N/A
- Does your application support any standard for monitoring/steering?
  * N/A
- Describe any security measures related to program access for
  monitoring/steering/visualization.
  * N/A
- Who can access the running program OR run-time produced monitoring data?
  * N/A
- From where can run-time produced monitoring data be accessed?
  * N/A
- How is the program termination detected?
  * batch system indicates job finished
- How much monitoring data and how often is monitoring data transferred
  during a program run (min/max/avg)?
  * a few MB in the log file
- Does your program generate metadata and store it externally (e.g. in a
  catalog)?
  * no
- Who accesses this metadata? From where? Does your program access
  metadata generated by other programs?
- How many executions/jobs must be monitored/steered in parallel? By how
  many users?
  * N/A

2.3 Input

2.3.1 Parameters

2.3.2 Input data
- How is the input data prepared?
  * "snapshots" generated by an AMIGA or Gadget simulation
  * parameter file set up
  * arrange files in directory
  * directories tarred, transferred to head node, un-tarred
- Where is the input data stored? Describe all central and distributed
  locations.
  * comes from central storage server
- Are file-names known in advance (before the program is started)?
  * yes
- Are data locations (directory, server, ...) known in advance?
  * yes
- Describe the different ways data is accessed.
  * OS file opening
- Non-file based data access (XML, database, ...)
  * N/A
- How much data is accessed at each run?
  * 400 MB -- 32 GB
- Is it possible that a data set/file is accessed multiple times over a
  short period of time?
  * ?
- How many users are using the same data simultaneously?
  * 1
- Elaborate on the use of metadata related to input data.
  * snapshot files contain their own metadata

2.3.3 Additional Notes

2.4 Output

2.4.1 Output data
- Where is the output data stored? Describe all centralized or distributed
  locations.
  * first written to cluster filesystem
  * tarred, then written to central storage
- How is the output data structured?
  * file names according to process ID
- Describe what happens when the program finishes. How are the results
  used?
  * catalog files of halos found, and their properties
  * make plots of halos
- Describe the different ways data is created/changed.
  * ?
- Non-file based data access (XML, database, ...)
  * N/A
- How much data is written by the program at each run?
  * 100 MB -- 6 GB
- Describe the parameters which influence the amount of data and number of
  files/data sets generated.
  * the refinement criterion of the grid hierarchy affects the data size
- Elaborate on the use of metadata related to output data.
  * N/A

2.4.2 Additional Notes

2.5 Information resources
  * N/A

2.6 Data Stream Management
  * N/A

2.7 Resource Security and Access Restriction
  * N/A

2.8 Additional Information
- How long (avg) does the scenario execute (minutes, hours, days)?
  * 20 min -- 2 hrs
- How often will the scenario be executed?
  * just once, after runs to adjust compile options, parameters, and code
    features
- Are the executions time-critical?
  * no

----------------------------------------
3. Future Scenario and AstroGrid-D Usage
----------------------------------------

3.0 General goals
- use more compute resources if available
  * use AIP's AstroData storage server
  * this storage is available anywhere on the grid
  * expect the process of switching from one cluster to another to be
    much simplified
- provide data to other users
  * might be interesting
- completely new scenario

3.2 Environment
- Are there any constraints due to your participation in other projects or
  international collaborations?
  * no

3.3 User Interaction
- Which parts should be automated?
  * generation of job description files
- Which user interface are you planning to use?
  * a job monitoring interface would be nice
- Are you planning to use any standard for application
  monitoring/steering?
  * no (didn't know they existed)
- Aspects of a Portal / WWW based interface:
  . Which portal features are mandatory/optional?
    * N/A
  . How are users managed? Where is information about users defined /
    stored?
    * N/A
  . Which authentication/authorisation methods are needed?
    * N/A
  . Do you want to access specific data services (web services, databases,
    etc.) via a portal?
    * no
  . Are there any existing programs on which the user interface should be
    based OR which should be replaced by the portal?
    * no
  . Should there be a central AstroGrid portal OR do you want to set up a
    portal server for each scenario/application?
    * N/A
  . Does the scenario require any special interfaces OR is it sufficient
    to use generic interfaces?
    * N/A
- Aspects of a generic Grid Application Programming API (GAT)
  . Which GAT functionality would you like to make use of (e.g. job
    submission, file handling, resource brokering, etc.)?
    * use Globus Toolkit directly (GAT not used)
    * job submission
    * data transfer
  . What programming languages must be supported? Which platforms?
    * C99, POSIX
  . Which Grid Middleware should be supported (Globus, Unicore, gLite,
    etc.)?
    * Globus
  . For specific GAT functionality, which protocols/packages/tools should
    be supported? E.g. for job management: clusters with PBS, SGE, Condor
    * job management (any LRMS)
    * file handling

3.4 Input
- Do you handle input data manually or do you need an automated management
  of data?
  * manually

3.5 Output
- Do you handle output data manually or do you need an automated
  management of data?
  * manually

3.6 Additional Information
- How long (avg) does the scenario execute (minutes, hours, days)? Do you
  aim at a specific speedup?
  * see above for time.
    no speedup expected
- How often will the scenario be executed?
  * see above
- Which restrictions of the current approach (as described in section 2)
  do you want to overcome?
  * complications transferring data files
  * complications dealing with various cluster LRMSs

4. Bigger Picture for the far future

4.1 Organization of Multiple Runs

4.2 Handling relationships between data products

4.3 Constructing More Complex Runs
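
Note on data staging (Sections 2.3.2 and 2.4.1): the manual steps listed
there (arrange files in a directory, tar, transfer, un-tar) might be
scripted roughly as in the sketch below. It is only an illustration and
not part of AHF or AstroGrid-D. The function names and the file names
used in the usage note are hypothetical, and the transfer step is
replaced by a plain local copy; in practice it would be whatever
mechanism the site uses to reach the head node or storage server.

```python
import shutil
import tarfile
from pathlib import Path


def pack_directory(src_dir, archive_path):
    """Tar a run directory (snapshots plus parameter file) into one archive."""
    src_dir = Path(src_dir)
    with tarfile.open(str(archive_path), "w") as tar:
        # arcname stores only the directory name, not the full local path.
        tar.add(str(src_dir), arcname=src_dir.name)
    return Path(archive_path)


def transfer(archive_path, dest_dir):
    """Stand-in for the transfer step: a local copy (assumption, see lead-in)."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(str(archive_path), str(dest_dir)))


def unpack_archive(archive_path, dest_dir):
    """Un-tar the archive on the receiving side; return the restored directory."""
    with tarfile.open(str(archive_path), "r") as tar:
        # The first member is the top-level directory added by pack_directory.
        top = tar.getnames()[0].split("/")[0]
        tar.extractall(str(dest_dir))
    return Path(dest_dir) / top
```

Usage would follow the order of the manual steps, for example
unpack_archive(transfer(pack_directory("run01", "run01.tar"), "staging"),
"staging"), where "run01" and "staging" are placeholder directory names.
The same three calls apply in the opposite direction for the output
catalogs.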