=Project Info=
The Productivity from Open, INtegrated Tools (POINT) project is funded as part of the NSF's [http://www.nsf.gov/funding/pgm_summ.jsp?pims_id=5174 Software Development for Cyberinfrastructure (SDCI)] program. The goal of this project is to integrate, harden, and deploy an open, portable, robust performance tools environment for the NSF-funded high-performance computing centers. We are leveraging the widely used [http://tau.uoregon.edu TAU], [http://icl.cs.utk.edu/papi/ PAPI], [http://www.scalasca.org/ Scalasca], and [http://perfsuite.ncsa.uiuc.edu/ PerfSuite] technologies as core components, improving them as necessary to meet user and application needs.
* [[The POINT of Performance|Project News Release]]
* [[Milestones|Project Milestones]] (members only)

Four major institutions are collaborating on this project: the [http://www.uoregon.edu University of Oregon], the [http://www.utk.edu University of Tennessee at Knoxville], and the [http://www.ncsa.uiuc.edu National Center for Supercomputing Applications] are developing and integrating the performance tools, and the [http://psc.edu Pittsburgh Supercomputing Center] is leading the application engagement and outreach effort.

* [[People|Principal Researchers]]

==LCI '10==
The POINT team will be giving an all-day tutorial at this year's LCI conference. The slides for this presentation are available [http://nic.uoregon.edu/POINT_LCI_2010_v07.pdf here].

== Contact ==
We would like to hear from anyone interested in the POINT project. If you have any questions, comments, or requests, please [mailto:%70%6f%69%6e%74%40%6e%69%63%2e%75%6f%72%65%67%6f%6e%2e%65%64%75 send us an email].

=NAMD Performance Study=

NAMD is written in [http://charm.cs.uiuc.edu/ charm++] and thus has some unique attributes when profiled by TAU. For example, the charm++ scheduler, which assigns tasks to processors and helps load-balance the program, has a notion of idling while waiting for tasks to complete. TAU therefore creates an event (Idle) to capture time spent while the scheduler is in its idle state, as well as an event (Main) to account for communication latencies. The following charts show how NAMD performs on different hardware:

[[Image:intrepid-ranger-breakdown.png]]

On Intrepid (a BlueGene/P system) Idle time (red) increases as NAMD scales, while on Ranger (a Sun x86 cluster) Main time (blue) increases. This shows how Ranger's relatively slower communication layer results in larger latencies as NAMD scales than it experiences on Intrepid.
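To make the Idle event concrete, here is a minimal sketch (plain C, not charm++ or NAMD code; the timings are made up) of the kind of accounting it implies: whenever the scheduler has nothing ready to run, the waiting time is accumulated separately from useful work.

 #include <stdio.h>
 #include <time.h>
 #include <unistd.h>
 
 /* Hypothetical sketch (not charm++ or NAMD code): whenever the        */
 /* scheduler has nothing to run it waits, and that waiting time is     */
 /* accumulated into an "Idle" total, analogous to the Idle event TAU   */
 /* records for the charm++ scheduler.                                  */
 
 static double now(void) {
     struct timespec ts;
     clock_gettime(CLOCK_MONOTONIC, &ts);
     return ts.tv_sec + ts.tv_nsec * 1e-9;
 }
 
 int main(void) {
     double idle = 0.0;
     for (int step = 0; step < 5; step++) {
         usleep(2000);                 /* 2 ms of "useful work" */
 
         /* No task ready yet: stands in for waiting on remote data. */
         double t0 = now();
         usleep(3000);                 /* 3 ms with nothing to run */
         idle += now() - t0;           /* charged to the Idle event */
     }
     printf("idle time: %.1f ms\n", idle * 1e3);
     return 0;
 }

In a real run the analogous waiting comes from tasks that are blocked on messages from other processors, which is why the Idle and Main fractions track the communication behavior of the machine.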

The ability of NAMD to scale to a large number of processors depends heavily on how it is configured. Many options are provided to tune NAMD's performance for different simulation parameters and machines. So instead of focusing on NAMD's scaling behavior, we show how TAU can identify other performance aspects of NAMD. This chart shows the increasing variation across processors for various NAMD events. Notice how, after each load-balancing phase, the divergence among processors is temporarily arrested.

[[Image:namd-deviation-snapshot.png|800px]]
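One simple way to quantify this kind of divergence is to compute, for each event, the mean and standard deviation of the per-processor times at a given snapshot. The sketch below is a minimal illustration, not part of TAU or NAMD; the input array is assumed to hold one exclusive-time value per processor.

 #include <math.h>
 #include <stdio.h>
 
 /* Minimal sketch: summarize the spread of an event's time across     */
 /* processors. times[i] is assumed to be the exclusive time (seconds) */
 /* spent in one event by processor i at a given profile snapshot.     */
 void divergence(const double *times, int nprocs,
                 double *mean, double *stddev) {
     double sum = 0.0, sq = 0.0;
     for (int i = 0; i < nprocs; i++) {
         sum += times[i];
         sq  += times[i] * times[i];
     }
     *mean   = sum / nprocs;
     *stddev = sqrt(sq / nprocs - (*mean) * (*mean));
 }
 
 int main(void) {
     double t[4] = {1.0, 1.1, 0.9, 1.6};   /* made-up per-processor times */
     double m, s;
     divergence(t, 4, &m, &s);
     printf("mean %.3f s, stddev %.3f s\n", m, s);
     return 0;
 }

A growing standard deviation from one snapshot to the next corresponds to the widening spread in the chart above, and a drop immediately after a load-balancing phase is the temporary arrest noted there.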
=ENZO Performance Study Summary=

This page shows the performance results from ENZO (svn repository version). We chose this version in part to see the effects of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are [[EnzoV1Performance | here]].

==Enzo Version 1.5==

Following the release of Enzo 1.5 in November '08 we have done some follow-up performance studies. Our initial findings are similar to what we found for version 1.0.1.

The configuration files used were similar to these:

* [http://nic.uoregon.edu/~scottb/point.inits.large inits]
* [http://nic.uoregon.edu/~scottb/point.param.large param]

(The grid and particle sizes change between experiments.)

This chart shows the scaling behavior of Enzo 1.5 on Kraken:

[[Image:EnzoScalingKraken.png]]

Scaling behavior was very similar on Ranger:

[[Image:EnzoScalingRanger.png]]

This scaling behavior could be anticipated by looking at the runtime breakdown (mean over 64 processors on Ranger):

[[Image:EnzoMeanBreakdown.png]]

With this much time spent in MPI communication, increasing the number of processors beyond 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier, we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms per call in MPI_Barrier. This is much longer than can be explained by communication latencies on Ranger's InfiniBand interconnect. Most likely ENZO is experiencing a load imbalance that causes some processors to wait for others to enter the MPI_Barrier or to post the matching MPI_Send.
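The reasoning here is that profile time attributed to MPI_Barrier or MPI_Recv includes any time a rank spends waiting for slower ranks, not just wire latency. The toy MPI program below (an illustration only, not ENZO code; the sleep times are made up) shows how unequal work before a barrier shows up as long per-call MPI_Barrier times on the under-loaded ranks:

 #include <mpi.h>
 #include <stdio.h>
 #include <unistd.h>
 
 /* Toy illustration (not ENZO code): when ranks reach a barrier at     */
 /* different times, the early ranks accumulate the difference as time  */
 /* "inside" MPI_Barrier, which is what the profile attributes to MPI.  */
 int main(int argc, char **argv) {
     MPI_Init(&argc, &argv);
     int rank;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
     /* Unequal "work": each rank sleeps a different amount of time. */
     usleep(1000 * (rank % 4) * 10);   /* 0, 10, 20 or 30 ms */
 
     double t0 = MPI_Wtime();
     MPI_Barrier(MPI_COMM_WORLD);
     double wait = MPI_Wtime() - t0;
 
     /* Ranks that did the least work report the longest barrier time. */
     printf("rank %d waited %.1f ms in MPI_Barrier\n", rank, wait * 1e3);
 
     MPI_Finalize();
     return 0;
 }

A rank that reaches the barrier last measures almost no barrier time, so large mean per-call times like those reported above point at waiting rather than at the interconnect itself.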

Next we looked at how enabling load balancing affects performance. This is a runtime comparison between the non-load-balanced (blue) and load-balanced (red) simulations:

[[Image:EnzoMeanComp.png]]

Time spent in MPI_Barrier decreases but is mostly offset by the increase in time spent in MPI_Recv.

Callpath profiling gives us an idea of where most of the costly MPI communication takes place.

[[Image:EnzoCallpathMpiRecv.png]]

[[Image:EnzoCallpathMpiBarrier.png]]

The MPI_Barrier calls take place in EvolveLevel(), and the MPI_Recv calls in grid::CommunicationSendRegions().
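Conceptually, callpath profiling attributes each measurement to the chain of callers rather than to the leaf routine alone, which is what lets us separate the MPI_Recv time under grid::CommunicationSendRegions() from MPI_Recv time elsewhere. Here is a toy sketch of that bookkeeping (illustrative C, not TAU's implementation; the times are the per-call means quoted above):

 #include <stdio.h>
 #include <string.h>
 
 /* Toy sketch of callpath attribution (not TAU's implementation): time  */
 /* is charged to a key naming the caller => callee chain, so MPI time   */
 /* under different parents stays separate.                              */
 #define MAXPATHS 64
 
 struct pathentry { char path[128]; double seconds; };
 static struct pathentry table[MAXPATHS];
 static int nentries = 0;
 
 static void charge(const char *parent, const char *child, double seconds) {
     char key[128];
     snprintf(key, sizeof key, "%s => %s", parent, child);
     for (int i = 0; i < nentries; i++)
         if (strcmp(table[i].path, key) == 0) { table[i].seconds += seconds; return; }
     if (nentries < MAXPATHS) {
         snprintf(table[nentries].path, sizeof table[nentries].path, "%s", key);
         table[nentries++].seconds = seconds;
     }
 }
 
 int main(void) {
     charge("EvolveLevel()", "MPI_Barrier()", 0.0404);                  /* 40.4 ms mean per call */
     charge("grid::CommunicationSendRegions()", "MPI_Recv()", 0.0052);  /* 5.2 ms mean per call */
     for (int i = 0; i < nentries; i++)
         printf("%-50s %.4f s\n", table[i].path, table[i].seconds);
     return 0;
 }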

==Snapshot profiles==

Additionally, we used snapshot profiling to get a sense of how ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load-balancing step, so each bar represents a single phase of ENZO between two load-balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length, with some becoming much longer.

(The time spent before the first load-balancing step, which is mostly initialization, has been removed.)

For MPI_Recv:
[[Image:EnzoSnapMpiRecvPercent.png|600px]]

For MPI_Barrier:
[[Image:EnzoSnapMpiBarrierPercent.png|600px]]
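The per-phase values plotted above can be thought of as differences between consecutive snapshots, since each snapshot records the cumulative totals up to that point. A minimal sketch of that bookkeeping (illustrative C with made-up numbers, not TAU code):

 #include <stdio.h>
 
 /* Minimal sketch (not TAU code): convert cumulative per-snapshot       */
 /* values for one event into per-phase values by differencing           */
 /* consecutive snapshots. cumulative[0] covers everything before the    */
 /* first load-balancing step (mostly initialization), so it is dropped  */
 /* from the per-phase output, matching the charts above.                */
 void per_phase(const double *cumulative, int nsnapshots, double *phase) {
     for (int i = 1; i < nsnapshots; i++)
         phase[i - 1] = cumulative[i] - cumulative[i - 1];
 }
 
 int main(void) {
     double cum[5] = {2.0, 3.1, 4.3, 6.0, 9.2};   /* made-up seconds in MPI_Recv */
     double ph[4];
     per_phase(cum, 5, ph);
     for (int i = 0; i < 4; i++)
         printf("phase %d: %.1f s\n", i + 1, ph[i]);
     return 0;
 }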
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
This page shows the performance result from ENZO (svn repository version). We choose this version in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are [[EnzoV1Performance | here]].<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November '08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1.0.1. <br />
<br />
The configuration files used were like these:<br />
<br />
* [http://nic.uoregon.edu/~scottb/point.inits.large inits]<br />
* [http://nic.uoregon.edu/~scottb/point.param.large param]<br />
<br />
(The grid and particle sizes change between experiments).<br />
<br />
This chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger:<br />
<br />
[[Image:EnzoScalingRanger.png]]<br />
<br />
This scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication, increasing the number of processors allocated to more than 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latencies on Ranger's InfiniBand interconnect. Mostly likely ENZO is experiencing a load imbalance causing some processors to wait for others to enter the MPI_Barrier or MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effects performance. This a runtime comparison between non-load balanced (blue) vs. load balanced simulation (red):<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecvPercent.png|600px]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png|600px]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=180ENZO2009-06-03T21:24:03Z<p>Scottb: /* Enzo Version 1.5 */</p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
This page shows the performance result from ENZO (svn repository version). We choose this version in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are [[EnzoV1Performance | here]].<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November '08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1.0.1. <br />
<br />
The configuration files used were like these:<br />
<br />
* [http://nic.uoregon.edu/~scottb/point.inits.large inits]<br />
* [http://nic.uoregon.edu/~scottb/point.param.large param]<br />
<br />
This chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger:<br />
<br />
[[Image:EnzoScalingRanger.png]]<br />
<br />
This scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication, increasing the number of processors allocated to more than 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latencies on Ranger's InfiniBand interconnect. Mostly likely ENZO is experiencing a load imbalance causing some processors to wait for others to enter the MPI_Barrier or MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effects performance. This a runtime comparison between non-load balanced (blue) vs. load balanced simulation (red):<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecvPercent.png|600px]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png|600px]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=179ENZO2009-06-03T21:23:50Z<p>Scottb: /* Enzo Version 1.5 */</p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
This page shows the performance result from ENZO (svn repository version). We choose this version in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are [[EnzoV1Performance | here]].<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November '08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1.0.1. <br />
<br />
The configuration files used were like these:<br />
<br />
* [http://nic.uoregon.edu/~scottb/point.inits.large inits]<br />
* [http://nic.uoregon.edu/~scottb/point.param.large param]<br />
<br />
This chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger:<br />
<br />
[[Image:EnzoScalingRanger.png]]<br />
<br />
This scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication, increasing the number of processors allocated to more than 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latencies on Ranger's InfiniBand interconnect. Mostly likely ENZO is experiencing a load imbalance causing some processors to wait for others to enter the MPI_Barrier or MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effects performance. This a runtime comparison between non-load balanced (blue) vs. load balanced simulation (red):<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecvPercent.png|600px]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png|600px]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=178ENZO2009-06-03T21:16:54Z<p>Scottb: /* ENZO Performance Study Summary */</p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
This page shows the performance result from ENZO (svn repository version). We choose this version in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are [[EnzoV1Performance | here]].<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November '08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1.0.1. <br />
<br />
The configuration files used were like these:<br />
<br />
<br />
<br />
This chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger:<br />
<br />
[[Image:EnzoScalingRanger.png]]<br />
<br />
This scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication, increasing the number of processors allocated to more than 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latencies on Ranger's InfiniBand interconnect. Mostly likely ENZO is experiencing a load imbalance causing some processors to wait for others to enter the MPI_Barrier or MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effects performance. This a runtime comparison between non-load balanced (blue) vs. load balanced simulation (red):<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecvPercent.png|600px]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png|600px]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=File:EnzoMeanBreakdown.png&diff=177File:EnzoMeanBreakdown.png2009-06-03T20:42:10Z<p>Scottb: uploaded a new version of "Image:EnzoMeanBreakdown.png"</p>
<hr />
<div>Runtime breakdown of Enzo on Ranger.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=176ENZO2009-06-03T20:38:52Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
This page shows the performance result from ENZO (svn repository version). We choose this version in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are [[EnzoV1Performance | here]].<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November '08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1.0.1. For example, see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger:<br />
<br />
[[Image:EnzoScalingRanger.png]]<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication, increasing the number of processors allocated to more than 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latencies on Ranger's InfiniBand interconnect. Mostly likely ENZO is experiencing a load imbalance causing some processors to wait for others to enter the MPI_Barrier or MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effects performance. This a runtime comparison between non-load balanced (blue) vs. load balanced simulation (red):<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecvPercent.png|600px]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png|600px]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=File:EnzoScalingRanger.png&diff=175File:EnzoScalingRanger.png2009-06-03T20:06:39Z<p>Scottb: Scaling behavior of Enzo (svn version) on Ranger.</p>
<hr />
<div>Scaling behavior of Enzo (svn version) on Ranger.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=174ENZO2009-06-03T20:05:31Z<p>Scottb: /* ENZO Performance Study Summary */</p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
This page shows the performance result from ENZO (svn repository version). We choose this version in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are at [[EnzoV1Performance | here]].<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1.0.1. For example, see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger:<br />
<br />
[[Image:EnzoScalingRanger.png]]<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication increasing the number of processors allocated to more than 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's infiniband interconnect. Mostly likely ENZO is experiencing a load imbalance causing some processors to wait for others to enter the MPI_Barrier or MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effect performance. This a runtime comparison between non-load balanced (blue) vs. load balanced simulation (red):<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how the ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecvPercent.png|600px]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png|600px]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=173ENZO2009-06-03T20:04:48Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
This page shows the performance result from ENZO (svn repository version). We choose this version in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are at [[EnzoV1Performance | here]].<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1.0.1. For example, see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger:<br />
<br />
[[Image:EnzoScalingRanger.png]]<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication increasing the number of processors allocated to more than 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's infiniband interconnect. Mostly likely ENZO is experiencing a load imbalance causing some processors to wait for others to enter the MPI_Barrier or MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effect performance. This a runtime comparison between non-load balanced (blue) vs. load balanced simulation (red):<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how the ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecvPercent.png|width=500px]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png|width=500px]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=172ENZO2009-06-03T17:39:47Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
This page shows the performance result from ENZO (svn repository version). We choose this version in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are at [[EnzoV1Performance | here]].<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1.0.1. For example, see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication increasing the number of processors allocated to more than 64 is unlikely to result in a much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's infiniband interconnect. Mostly likely ENZO is experiencing a load imbalance causing some processors to wait for others to enter the MPI_Barrier or MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effect performance. This a runtime comparison between non-load balanced (blue) vs. load balanced simulation (red):<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how the ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation and become progressively more varied in length with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecvPercent.png]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=171ENZO2009-06-03T01:14:47Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we will show the performance result for ENZO (svn repository version). We choose this version in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are at [[EnzoV1Performance | here]].<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1.0.1. For example, see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication increasing the number of processors above 64 is unlikely to result in much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's infiniband interconnect. Mostly likely then ENZO is experiencing a load imbalance causing some processors to wait for others to enter the Barrier or MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effect performance. This a runtime comparison between non-load balanced vs. load balanced simulation:<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how the ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and brief at the beginning of the simulation becoming progressively more varied with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecvPercent.png]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=170ENZO2009-06-03T01:13:34Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we will show the performance result for ENZO (svn repository version). We choose this version in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are at [[EnzoV1Performance | here]].<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1.0.1. For example, see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication increasing the number of processors above 64 is unlikely to result in much lower total execution time. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's infiniband interconnect. Mostly likely then ENZO is experiencing a load imbalance causing some processors to wait for others to enter the Barrier or MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effect performance. This a runtime comparison between non-load balanced vs. load balanced simulation:<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how the ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing phases. The first thing to notice is that these phases are regular and brief at the beginning of the simulation becoming progressively more varied with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecvPercent]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=File:EnzoSnapMpiRecvPercent.png&diff=169File:EnzoSnapMpiRecvPercent.png2009-06-03T00:55:21Z<p>Scottb: Snapshot breakdown for MPI_Recv in Enzo.</p>
<hr />
<div>Snapshot breakdown for MPI_Recv in Enzo.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=File:EnzoSnapMpiBarrierPercent.png&diff=168File:EnzoSnapMpiBarrierPercent.png2009-06-03T00:54:45Z<p>Scottb: Snapshot breakdown for MPI_Barrier in Enzo.</p>
<hr />
<div>Snapshot breakdown for MPI_Barrier in Enzo.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=File:EnzoCallpathMpiRecv.png&diff=167File:EnzoCallpathMpiRecv.png2009-06-03T00:54:01Z<p>Scottb: Callpath for MPI_Recv in Enzo.</p>
<hr />
<div>Callpath for MPI_Recv in Enzo.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=File:EnzoCallpathMpiBarrier.png&diff=166File:EnzoCallpathMpiBarrier.png2009-06-03T00:49:43Z<p>Scottb: Callpath for MPI_Barrier in Enzo.</p>
<hr />
<div>Callpath for MPI_Barrier in Enzo.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=File:EnzoMeanComp.png&diff=165File:EnzoMeanComp.png2009-06-03T00:42:14Z<p>Scottb: Comparison between load balanced and non-load balanced runs of Enzo.</p>
<hr />
<div>Comparison between load balanced and non-load balanced runs of Enzo.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=164ENZO2009-06-03T00:41:06Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page will show the performance result for ENZO from the svn repository. We did this in par to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are at [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
with this much time spent in MPI communication increasing the number of processors is unlikely to result in much faster simulations. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's infiniband interconnect. Mostly likely then EZNO is experiencing a load imbalance causing some processors to wait for others to enter the Barrier or send via MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effect performance. This a runtime comparison between non-load balanced vs. load balanced simulation.<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpathMpiRecv.png]]<br />
<br />
[[Image:EnzoCallpathMpiBarrier.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how the ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of the ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation becoming progressively more varied with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecPercent]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=File:EnzoMeanBreakdown.png&diff=163File:EnzoMeanBreakdown.png2009-06-03T00:23:03Z<p>Scottb: uploaded a new version of "Image:EnzoMeanBreakdown.png"</p>
<hr />
<div>Runtime breakdown of Enzo on Ranger.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=File:EnzoMeanBreakdown.png&diff=162File:EnzoMeanBreakdown.png2009-06-03T00:21:30Z<p>Scottb: Runtime breakdown of Enzo on Ranger.</p>
<hr />
<div>Runtime breakdown of Enzo on Ranger.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=161ENZO2009-06-02T23:51:05Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page will show the performance result for ENZO from the svn repository. We did this in par to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are at [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
with this much time spent in MPI communication increasing the number of processors is unlikely to result in much faster simulations. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's infiniband interconnect. Mostly likely then EZNO is experiencing a load imbalance causing some processors to wait for others to enter the Barrier or send via MPI_Send.<br />
<br />
Next we looked at how enabling load balancing effect performance. This a runtime comparison between non-load balanced vs. load balanced simulation.<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent MPI_Barrier decrease but is mostly offset by the increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpath.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how the ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of the ENZO between two load balancing phases. The first thing to notice is that these phases are regular and short at the beginning of the simulation becoming progressively more varied with some becoming much longer. <br />
<br />
(The time spent before that first load balancing has been removed--mostly initialization)<br />
<br />
For MPI_Recv:<br />
[[Image:EnzoSnapMpiRecPercent]]<br />
<br />
For MPI_Barrier:<br />
[[Image:EnzoSnapMpiBarrierPercent.png]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=160ENZO2009-06-02T23:40:20Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page will show the performance result for ENZO from the svn repository. We did this in par to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are at [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
with this much time spent in MPI communication increasing the number of processors is unlikely to result in much faster simulations. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's infiniband interconnect. Mostly likely then EZNO is experiencing a load imbalance causing some processors to wait for others to enter the Barrier or send via MPI_Send.<br />
<br />
Next we looked at how enabling load balancing affects performance. This is a runtime comparison between the non-load-balanced and load-balanced simulations.<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent in MPI_Barrier decreases but is mostly offset by an increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpath.png]]<br />
<br />
MPI barriers take place in EvolveLevel(), and MPI_Recv takes place in grid::CommunicationSendRegions().<br />
<br />
==Snapshot profiles==<br />
<br />
Additionally, we used snapshot profiling to get a sense of how ENZO's performance changed over the course of the entire execution. A snapshot was taken at each load balancing step such that each bar represents a single phase of ENZO between two load balancing steps.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=159ENZO2009-06-02T23:03:19Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we show the performance results for ENZO from the svn repository. We did this in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are available [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication, increasing the number of processors is unlikely to result in much faster simulations. Looking more closely at MPI_Recv and MPI_Barrier, we see that on average 5.2 ms is spent per call in MPI_Recv and 40.4 ms per call in MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's InfiniBand interconnect. Most likely, then, ENZO is experiencing a load imbalance that causes some processors to wait for others to enter the barrier or to send via MPI_Send.<br />
<br />
Next we looked at how enabling load balancing affects performance. This is a runtime comparison between the non-load-balanced and load-balanced simulations.<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent in MPI_Barrier decreases but is mostly offset by an increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
[[Image:EnzoCallpath.png]]<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=158ENZO2009-06-02T23:01:54Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we show the performance results for ENZO from the svn repository. We did this in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are available [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication, increasing the number of processors is unlikely to result in much faster simulations. Looking more closely at MPI_Recv and MPI_Barrier, we see that on average 5.2 ms is spent per call in MPI_Recv and 40.4 ms per call in MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's InfiniBand interconnect. Most likely, then, ENZO is experiencing a load imbalance that causes some processors to wait for others to enter the barrier or to send via MPI_Send.<br />
<br />
Next we looked at how enabling load balancing affects performance. This is a runtime comparison between the non-load-balanced and load-balanced simulations.<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent in MPI_Barrier decreases but is mostly offset by an increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.<br />
<br />
MPI Barriers take place in EvolveLevel(). And MPI_Recv takes place in grid::CommunicationSendRegions().</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=157ENZO2009-06-02T22:49:31Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we show the performance results for ENZO from the svn repository. We did this in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are available [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication, increasing the number of processors is unlikely to result in much faster simulations. Looking more closely at MPI_Recv and MPI_Barrier, we see that on average 5.2 ms is spent per call in MPI_Recv and 40.4 ms per call in MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's InfiniBand interconnect. Most likely, then, ENZO is experiencing a load imbalance that causes some processors to wait for others to enter the barrier or to send via MPI_Send.<br />
<br />
Next we looked at how enabling load balancing affects performance. This is a runtime comparison between the non-load-balanced and load-balanced simulations.<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent in MPI_Barrier decreases but is mostly offset by an increase in time spent in MPI_Recv.<br />
<br />
Callpath profiling gives us an idea where most of the costly MPI communications are taking place.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=156ENZO2009-06-02T22:00:38Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we show the performance results for ENZO from the svn repository. We did this in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are available [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication, increasing the number of processors is unlikely to result in much faster simulations. Looking more closely at MPI_Recv and MPI_Barrier, we see that on average 5.2 ms is spent per call in MPI_Recv and 40.4 ms per call in MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's InfiniBand interconnect.<br />
<br />
Next we looked at how enabling load balancing affects performance. This is a runtime comparison between the non-load-balanced and load-balanced simulations.<br />
<br />
[[Image:EnzoMeanComp.png]]<br />
<br />
Time spent in MPI_Barrier decreases but is mostly offset by an increase in time spent in MPI_Recv.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=155ENZO2009-06-02T22:00:10Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we show the performance results for ENZO from the svn repository. We did this in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are available [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
With this much time spent in MPI communication, increasing the number of processors is unlikely to result in much faster simulations. Looking more closely at MPI_Recv and MPI_Barrier, we see that on average 5.2 ms is spent per call in MPI_Recv and 40.4 ms per call in MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's InfiniBand interconnect.<br />
<br />
Next we looked at how enabling load balancing affects performance. This is a runtime comparison between the non-load-balanced and load-balanced simulations.<br />
<br />
<br />
Time spent in MPI_Barrier decreases but is mostly offset by an increase in time spent in MPI_Recv.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=154ENZO2009-06-02T21:40:10Z<p>Scottb: </p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we show the performance results for ENZO from the svn repository. We did this in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are available [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
with this much time spent in MPI communication increasing the number of processors is unlikely to result in much faster simulations. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 5.2ms is spent per call in MPI_Recv and 40.4ms for MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's infiniband interconnect.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=153ENZO2009-06-02T21:37:48Z<p>Scottb: /* ENZO Performance Study Summary */</p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we show the performance results for ENZO from the svn repository. We did this in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are available [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]<br />
<br />
Scaling behavior was very similar on Ranger.<br />
<br />
This poor scaling behavior could be anticipated by looking at the runtime breakdown (mean of 64 processors on Ranger):<br />
<br />
[[Image:EnzoMeanBreakdown.png]]<br />
<br />
with this much time spent in MPI communication increasing the number of processors is unlikely to result in much faster simulations. Looking more closely at MPI_Recv and MPI_Barrier we see that on average 6.9ms is spent per call in MPI_Recv and 38.5ms for MPI_Barrier. This is much longer than can be explained by communication latency on Ranger's infiniband interconnect.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=EnzoV1Performance&diff=152EnzoV1Performance2009-06-02T18:25:13Z<p>Scottb: New page: =ENZO version 1 performance results= This is a short overview of the performance result from the ENZO application. For each experiment we used these inits/param files: * [http://giusto.ni...</p>
<hr />
<div>=ENZO version 1 performance results=<br />
This is a short overview of the performance results from the ENZO application. For each experiment we used these inits/param files:<br />
<br />
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly.inits inits]<br />
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly_amr.param param]<br />
<br />
This is a relatively small experiment but was sufficient to generate some interesting performance results. For this study we used the [http://tau.uoregon.edu TAU Performance System®] to gather information about ENZO's performance; in particular, we are interested in the performance of the AMR simulation at scale. We ran these experiments on NCSA's Intel 64 Linux Cluster (Abe).<br />
<br />
==TAU Measurement overhead==<br />
Here is a short table listing the run-times for various experiments and the instrumentation overhead observed. Each run was on 64 processors (8 nodes).<br />
<br />
{|<br />
|-<br />
! Run Type <br />
! Runtime (seconds) <br />
! Overhead %<br />
|-<br />
|Uninstrumented runtime <br />
|1072 <br />
|NA<br />
|-<br />
|Trace of only MPI event <br />
|1085 <br />
|4.8%<br />
|-<br />
|Profile of all significant events <br />
|1136 <br />
|6.0%<br />
|-<br />
|Profile with Call-path information <br />
|1196 <br />
|11.6%<br />
|-<br />
|Profile of each Phase of execution <br />
|1208 <br />
|12.7%<br />
|}<br />
<br />
==Runtime Breakdown on 64 processors==<br />
<br />
Here is a chart showing the contribution each function makes to the overall runtime. Notice that MPI communication time accounts for over 60% of the total runtime.<br />
<br />
[[Image:MeanFunctionLinux.png]]<br />
<br />
==Experiment Scalability==<br />
<br />
These charts show the relative efficiency for grid sizes of 128^3 and 256^3. Relative efficiency measures how much slower a run of the application is compared to ideal scaling. In this case, ideal scaling would mean that doubling the processor count reduces the runtime by half.<br />
<br />
[[image:Scaling128.png]] [[image:Scaling256.png]]<br />
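<br />
As a concrete reading of the relative efficiency definition above, the short sketch below computes it from runtimes; the processor counts and times are made up for illustration and are not the measured ENZO numbers:<br />
<br />
<pre>
#include <cstdio>

// Relative efficiency with respect to the smallest measured run:
//   E(p) = (T(p0) * p0) / (T(p) * p)
// E(p) == 1.0 corresponds to ideal scaling (doubling processors halves runtime).
double relative_efficiency(double t_base, int p_base, double t, int p) {
    return (t_base * p_base) / (t * static_cast<double>(p));
}

int main() {
    const int    procs[]  = {16, 32, 64, 128};              // hypothetical
    const double time_s[] = {1800.0, 1000.0, 640.0, 450.0}; // hypothetical
    for (int i = 0; i < 4; ++i)
        printf("%4d processors: relative efficiency %.2f\n", procs[i],
               relative_efficiency(time_s[0], procs[0], time_s[i], procs[i]));
    return 0;
}
</pre>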
<br />
<br />
This chart shows the breakdown of the runtime of different functions across different numbers of processors (128^3 grid size). MPI communication time, as in the 64-processor case, continues to dominate the runtime--and to an even greater extent when a larger number of processors is involved.<br />
<br />
[[image:MeanRuntineAtScale2.png]]<br />
<br />
==Experiment Trace==<br />
This graphic shows how load imbalances cause long wait times for MPI_Allreduce. Some processors experience as much as 8 seconds of wait time per reduction.<br />
<br />
[[Image:trace.png|1000px]]<br />
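<br />
One way to quantify the imbalance behind these MPI_Allreduce waits is to time the local work on each processor and compare the maximum against the average; the gap is roughly the wait that the faster ranks accumulate in the next collective. This sketch is illustrative and not part of ENZO:<br />
<br />
<pre>
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double t0 = MPI_Wtime();
    // ... the local computation for one timestep would go here ...
    double local = MPI_Wtime() - t0;

    // The slowest rank gates the collective: (max - avg) approximates the
    // time the other ranks spend waiting inside MPI_Allreduce.
    double sum = 0.0, max = 0.0;
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&local, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("work imbalance: max %.2f s vs avg %.2f s\n", max, sum / size);

    MPI_Finalize();
    return 0;
}
</pre>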
<br />
==Experiment Call-Paths==<br />
We observe the following relationships in the experiment call-paths:<br />
<br />
* Almost all the time spent in MPI_Bcast occurs when it is called from MPI_Allreduce.<br />
* Almost all the time spent in MPI_Recv occurs when it is called from grid::CommunicationSendRegion.<br />
* Almost all the time spent in MPI_Allgather occurs when it is called from CommunicationShareGrids.<br />
* Almost all the time spent in MPI_Allreduce occurs when it is called from CommunicationMinValue.<br />
<br />
This chart shows the details:<br />
<br />
[[Image:CallpathRuntime3.png]]<br />
<br />
==Experiment Phases==<br />
We also looked at ENZO's runtime through each iteration of the main loop in EvolveHierarchy. Here is the CommunicationShareGrids function, representing the computation work done during the consecutive loop iterations (time is in microseconds):<br />
<br />
[[Image:CommunicationShareGrids2.png]]<br />
<br />
Notice that some iterations are involved in writing out the grid (lots of time spent in WriteDataHierarchy).<br />
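<br />
Per-iteration data like this is typically gathered by marking each pass through the main loop as a TAU phase. The sketch below is a generic illustration (EvolveOneIteration() is a placeholder, not ENZO code) and assumes TAU's dynamic phase macros as described in the TAU documentation:<br />
<br />
<pre>
#include <TAU.h>
#include <cstdio>

// Placeholder for the work of one EvolveHierarchy iteration.
static void EvolveOneIteration(int /*iter*/) { /* ... */ }

int main(int argc, char **argv) {
    TAU_PROFILE("main", "", TAU_DEFAULT);
    for (int iter = 0; iter < 100; ++iter) {
        char name[64];
        snprintf(name, sizeof(name), "Iteration %d", iter);
        // Each iteration becomes its own phase, so MPI and I/O time can be
        // attributed to the specific iteration in which it occurred.
        TAU_PHASE_CREATE_DYNAMIC(phase, name, "", TAU_USER);
        TAU_PHASE_START(phase);
        EvolveOneIteration(iter);
        TAU_PHASE_STOP(phase);
    }
    return 0;
}
</pre>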
<br />
Here is a breakdown, by function, of the time spent over the course of the experiment. The Y-axis is exclusive time spent in each function, and the X-axis is overall elapsed runtime:<br />
<br />
[[Image:snapshot.png|1000px]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=151ENZO2009-06-02T18:24:54Z<p>Scottb: /* ENZO Performance Study Summary */</p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we show the performance results for ENZO from the svn repository. We did this in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are available [[EnzoV1Performance | here]].<br />
<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=150ENZO2009-06-02T18:24:41Z<p>Scottb: /* ENZO Performance Study Summary */</p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we show the performance results for ENZO from the svn repository. We did this in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are available [[EnzoV1Performance | here]].<br />
<br />
<br />
This is a short overview of the performance result from the ENZO application. For each experiment we used these inits/param files:<br />
<br />
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly.inits inits]<br />
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly_amr.param param]<br />
<br />
This is a relatively small experiment but was sufficient to generate some interesting performance results. For this study we used the [http://tau.uoregon.edu TAU Performance System®] to gather information about ENZO's performance, in particular we are interested in the performance of the AMR simulation at scale. We ran these experiments on NCSA's Intel 64 Linux Cluster (Abe).<br />
<br />
==TAU Measurement overhead==<br />
Here is a short table listing the run-times for various experiments and the instrumentation overhead observed. Each run was on 64 processors (8 nodes).<br />
<br />
{|<br />
|-<br />
! Run Type <br />
! Runtime (seconds) <br />
! Overhead %<br />
|-<br />
|Uninstrumented runtime <br />
|1072 <br />
|NA<br />
|-<br />
|Trace of only MPI event <br />
|1085 <br />
|4.8%<br />
|-<br />
|Profile of all significant events <br />
|1136 <br />
|6.0%<br />
|-<br />
|Profile with Call-path information <br />
|1196 <br />
|11.6%<br />
|-<br />
|Profile of each Phase of execution <br />
|1208 <br />
|12.7%<br />
|}<br />
<br />
==Runtime Breakdown on 64 processors==<br />
<br />
Here is a chart showing the contribution each function makes to the overall runtime. Notice that MPI communication time takes over 60% of the total runtime. <br />
<br />
[[Image:MeanFunctionLinux.png]]<br />
<br />
==Experiment Scalability==<br />
<br />
These charts show the relative efficiency for grid sizes of 128^3 and 256^3. Relative efficiency measures how much slower a run of the application is compared to ideal scaling. In this case, ideal scaling would mean that doubling the processor count reduces the runtime by half.<br />
<br />
[[image:Scaling128.png]] [[image:Scaling256.png]]<br />
<br />
<br />
This chart shows the breakdown of the runtime of different functions across different numbers of processors (128^3 grid size). MPI communication time, as in the 64-processor case, continues to dominate the runtime--and to an even greater extent when a larger number of processors is involved.<br />
<br />
[[image:MeanRuntineAtScale2.png]]<br />
<br />
==Experiment Trace==<br />
This graphic shows how load imbalances cause long wait times for MPI_Allreduce. Some processors experience as much as 8 seconds of wait time per reduction.<br />
<br />
[[Image:trace.png|1000px]]<br />
<br />
==Experiment Call-Paths==<br />
We observe the following relationships in the experiment call-paths:<br />
<br />
* Almost all the time spent in MPI_Bcast occurs when it is called from MPI_Allreduce.<br />
* Almost all the time spent in MPI_Recv occurs when it is called from grid::CommunicationSendRegion.<br />
* Almost all the time spent in MPI_Allgather occurs when it is called from CommunicationShareGrids.<br />
* Almost all the time spent in MPI_Allreduce occurs when it is called from CommunicationMinValue.<br />
<br />
This chart shows the details:<br />
<br />
[[Image:CallpathRuntime3.png]]<br />
<br />
==Experiment Phases==<br />
We also looked at ENZO's runtime through each iteration on the main loop in EvolveHierarchy. Here is the CommunicationShareGrids function representing the computation work done during the consecutive loops (time is microseconds)<br />
<br />
[[Image:CommunicationShareGrids2.png]]<br />
<br />
Notice that some iterations are involved in writing out the grid (lots of time spent in WriteDataHierarchy).<br />
<br />
Here is a breakdown, by function, of the time spent over the course of the experiment. Y-axis is exclusive time spend in each function, and X-axis is overall elapsed runtime:<br />
<br />
[[Image:snapshot.png|1000px]]<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=149ENZO2009-06-02T18:23:48Z<p>Scottb: /* ENZO Performance Study Summary */</p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
On this page we show the performance results for ENZO from the svn repository. We did this in part to see the effect of load balancing (not enabled in version 1.5) on scaling performance. The previous performance results for ENZO version 1 are at EnzoV1Performance.<br />
<br />
<br />
This is a short overview of the performance result from the ENZO application. For each experiment we used these inits/param files:<br />
<br />
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly.inits inits]<br />
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly_amr.param param]<br />
<br />
This is a relatively small experiment but was sufficient to generate some interesting performance results. For this study we used the [http://tau.uoregon.edu TAU Performance System®] to gather information about ENZO's performance, in particular we are interested in the performance of the AMR simulation at scale. We ran these experiments on NCSA's Intel 64 Linux Cluster (Abe).<br />
<br />
==TAU Measurement overhead==<br />
Here is a short table listing the run-times for various experiments and the instrumentation overhead observed. Each run was on 64 processors (8 nodes).<br />
<br />
{|<br />
|-<br />
! Run Type <br />
! Runtime (seconds) <br />
! Overhead %<br />
|-<br />
|Uninstrumented runtime <br />
|1072 <br />
|NA<br />
|-<br />
|Trace of only MPI event <br />
|1085 <br />
|4.8%<br />
|-<br />
|Profile of all significant events <br />
|1136 <br />
|6.0%<br />
|-<br />
|Profile with Call-path information <br />
|1196 <br />
|11.6%<br />
|-<br />
|Profile of each Phase of execution <br />
|1208 <br />
|12.7%<br />
|}<br />
<br />
==Runtime Breakdown on 64 processors==<br />
<br />
Here is a chart showing the contribution each function makes to the overall runtime. Notice that MPI communication time takes over 60% of the total runtime. <br />
<br />
[[Image:MeanFunctionLinux.png]]<br />
<br />
==Experiment Scalability==<br />
<br />
These charts show the relative efficiency for grid sizes of 128^3 and 256^3. Relative efficiency measures how much slower a run of the application is compared to ideal scaling. In this case, ideal scaling would mean that doubling the processor count reduces the runtime by half.<br />
<br />
[[image:Scaling128.png]] [[image:Scaling256.png]]<br />
<br />
<br />
This chart shows the breakdown of the runtime of different functions across different numbers of processors (128^3 grid size). MPI communication time, as in the 64-processor case, continues to dominate the runtime--and to an even greater extent when a larger number of processors is involved.<br />
<br />
[[image:MeanRuntineAtScale2.png]]<br />
<br />
==Experiment Trace==<br />
This graphic shows how load imbalances cause long wait times for MPI_Allreduce. Some processors experience as much as 8 seconds of wait time per reduction.<br />
<br />
[[Image:trace.png|1000px]]<br />
<br />
==Experiment Call-Paths==<br />
We observe the following relationships in the experiment call-paths:<br />
<br />
* Almost all the time spent in MPI_Bcast occurs when it is called from MPI_Allreduce.<br />
* Almost all the time spent in MPI_Recv occurs when it is called from grid::CommunicationSendRegion.<br />
* Almost all the time spent in MPI_Allgather occurs when it is called from CommunicationShareGrids.<br />
* Almost all the time spent in MPI_Allreduce occurs when it is called from CommunicationMinValue.<br />
<br />
This chart shows the details:<br />
<br />
[[Image:CallpathRuntime3.png]]<br />
<br />
==Experiment Phases==<br />
We also looked at ENZO's runtime through each iteration on the main loop in EvolveHierarchy. Here is the CommunicationShareGrids function representing the computation work done during the consecutive loops (time is microseconds)<br />
<br />
[[Image:CommunicationShareGrids2.png]]<br />
<br />
Notice that some iterations are involved in writing out the grid (lots of time spent in WriteDataHierarchy).<br />
<br />
Here is a breakdown, by function, of the time spent over the course of the experiment. Y-axis is exclusive time spend in each function, and X-axis is overall elapsed runtime:<br />
<br />
[[Image:snapshot.png|1000px]]<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=File:EnzoScalingKraken.png&diff=148File:EnzoScalingKraken.png2009-05-13T00:49:54Z<p>Scottb: ENZO 1.5 scaling on Kraken. Result for two different problem sizes.</p>
<hr />
<div>ENZO 1.5 scaling on Kraken. Result for two different problem sizes.</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=147ENZO2009-05-13T00:48:36Z<p>Scottb: /* Enzo Version 1.5 */</p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
This is a short overview of the performance result from the ENZO application. For each experiment we used these inits/param files:<br />
<br />
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly.inits inits]<br />
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly_amr.param param]<br />
<br />
This is a relatively small experiment but was sufficient to generate some interesting performance results. For this study we used the [http://tau.uoregon.edu TAU Performance System®] to gather information about ENZO's performance, in particular we are interested in the performance of the AMR simulation at scale. We ran these experiments on NCSA's Intel 64 Linux Cluster (Abe).<br />
<br />
==TAU Measurement overhead==<br />
Here is a short table listing the run-times for various experiments and the instrumentation overhead observed. Each run was on 64 processors (8 nodes).<br />
<br />
{|<br />
|-<br />
! Run Type <br />
! Runtime (seconds) <br />
! Overhead %<br />
|-<br />
|Uninstrumented runtime <br />
|1072 <br />
|NA<br />
|-<br />
|Trace of only MPI event <br />
|1085 <br />
|4.8%<br />
|-<br />
|Profile of all significant events <br />
|1136 <br />
|6.0%<br />
|-<br />
|Profile with Call-path information <br />
|1196 <br />
|11.6%<br />
|-<br />
|Profile of each Phase of execution <br />
|1208 <br />
|12.7%<br />
|}<br />
<br />
==Runtime Breakdown on 64 processors==<br />
<br />
Here is a chart showing the contribution each function makes to the overall runtime. Notice that MPI communication time takes over 60% of the total runtime. <br />
<br />
[[Image:MeanFunctionLinux.png]]<br />
<br />
==Experiment Scalability==<br />
<br />
These charts show the relative efficiency for grid sizes of 128^3 and 256^3. Relative efficiency measures how much slower a run of the application is compared to ideal scaling. In this case, ideal scaling would mean that doubling the processor count reduces the runtime by half.<br />
<br />
[[image:Scaling128.png]] [[image:Scaling256.png]]<br />
<br />
<br />
This chart shows the breakdown of the runtime of different functions across different numbers of processors (128^3 grid size). MPI communication time, as in the 64-processor case, continues to dominate the runtime--and to an even greater extent when a larger number of processors is involved.<br />
<br />
[[image:MeanRuntineAtScale2.png]]<br />
<br />
==Experiment Trace==<br />
This graphic shows how load imbalances cause long wait times for MPI_Allreduce. Some processors experience as much as 8 seconds of wait time per reduction.<br />
<br />
[[Image:trace.png|1000px]]<br />
<br />
==Experiment Call-Paths==<br />
We observe the following relationships in the experiment call-paths:<br />
<br />
* Almost all the time spent in MPI_Bcast occurs when it is called from MPI_Allreduce.<br />
* Almost all the time spent in MPI_Recv occurs when it is called from grid::CommunicationSendRegion.<br />
* Almost all the time spent in MPI_Allgather occurs when it is called from CommunicationShareGrids.<br />
* Almost all the time spent in MPI_Allreduce occurs when it is called from CommunicationMinValue.<br />
<br />
This chart shows the details:<br />
<br />
[[Image:CallpathRuntime3.png]]<br />
<br />
==Experiment Phases==<br />
We also looked at ENZO's runtime through each iteration on the main loop in EvolveHierarchy. Here is the CommunicationShareGrids function representing the computation work done during the consecutive loops (time is microseconds)<br />
<br />
[[Image:CommunicationShareGrids2.png]]<br />
<br />
Notice that some iterations are involved in writing out the grid (lots of time spent in WriteDataHierarchy).<br />
<br />
Here is a breakdown, by function, of the time spent over the course of the experiment. Y-axis is exclusive time spend in each function, and X-axis is overall elapsed runtime:<br />
<br />
[[Image:snapshot.png|1000px]]<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling behavior of Enzo 1.5 on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=ENZO&diff=146ENZO2009-05-13T00:47:55Z<p>Scottb: /* ENZO Performance Study Summary */</p>
<hr />
<div>=ENZO Performance Study Summary=<br />
<br />
This is a short overview of the performance result from the ENZO application. For each experiment we used these inits/param files:<br />
<br />
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly.inits inits]<br />
* [http://giusto.nic.uoregon.edu/~scottb/SingleGrid_dmonly_amr.param param]<br />
<br />
This is a relatively small experiment but was sufficient to generate some interesting performance results. For this study we used the [http://tau.uoregon.edu TAU Performance System®] to gather information about ENZO's performance, in particular we are interested in the performance of the AMR simulation at scale. We ran these experiments on NCSA's Intel 64 Linux Cluster (Abe).<br />
<br />
==TAU Measurement overhead==<br />
Here is a short table listing the run-times for various experiments and the instrumentation overhead observed. Each run was on 64 processors (8 nodes).<br />
<br />
{|<br />
|-<br />
! Run Type <br />
! Runtime (seconds) <br />
! Overhead %<br />
|-<br />
|Uninstrumented runtime <br />
|1072 <br />
|NA<br />
|-<br />
|Trace of only MPI event <br />
|1085 <br />
|4.8%<br />
|-<br />
|Profile of all significant events <br />
|1136 <br />
|6.0%<br />
|-<br />
|Profile with Call-path information <br />
|1196 <br />
|11.6%<br />
|-<br />
|Profile of each Phase of execution <br />
|1208 <br />
|12.7%<br />
|}<br />
<br />
==Runtime Breakdown on 64 processors==<br />
<br />
Here is a chart showing the contribution each function makes to the overall runtime. Notice that MPI communication time takes over 60% of the total runtime. <br />
<br />
[[Image:MeanFunctionLinux.png]]<br />
<br />
==Experiment Scalability==<br />
<br />
These charts show the relative efficiency for grid sizes of 128^3 and 256^3. Relative efficiency measures how much slower a run of the application is compared to ideal scaling. In this case, ideal scaling would mean that doubling the processor count reduces the runtime by half.<br />
<br />
[[image:Scaling128.png]] [[image:Scaling256.png]]<br />
<br />
<br />
This chart shows the breakdown of the runtime of different functions across different numbers of processors (128^3 grid size). MPI communication time, as in the 64-processor case, continues to dominate the runtime--and to an even greater extent when a larger number of processors is involved.<br />
<br />
[[image:MeanRuntineAtScale2.png]]<br />
<br />
==Experiment Trace==<br />
This graphic shows how load imbalances cause long wait times for MPI_Allreduce. Some processors experience as much as 8 seconds of wait time per reduction.<br />
<br />
[[Image:trace.png|1000px]]<br />
<br />
==Experiment Call-Paths==<br />
We observe the following relationships in the experiment call-paths:<br />
<br />
* Almost all the time spent in MPI_Bcast occurs when it is called from MPI_Allreduce.<br />
* Almost all the time spent in MPI_Recv occurs when it is called from grid::CommunicationSendRegion.<br />
* Almost all the time spent in MPI_Allgather occurs when it is called from CommunicationShareGrids.<br />
* Almost all the time spent in MPI_Allreduce occurs when it is called from CommunicationMinValue.<br />
<br />
This chart shows the details:<br />
<br />
[[Image:CallpathRuntime3.png]]<br />
<br />
==Experiment Phases==<br />
We also looked at ENZO's runtime through each iteration on the main loop in EvolveHierarchy. Here is the CommunicationShareGrids function representing the computation work done during the consecutive loops (time is microseconds)<br />
<br />
[[Image:CommunicationShareGrids2.png]]<br />
<br />
Notice that some iterations are involved in writing out the grid (lots of time spent in WriteDataHierarchy).<br />
<br />
Here is a breakdown, by function, of the time spent over the course of the experiment. Y-axis is exclusive time spend in each function, and X-axis is overall elapsed runtime:<br />
<br />
[[Image:snapshot.png|1000px]]<br />
<br />
<br />
==Enzo Version 1.5==<br />
<br />
Following the release of Enzo 1.5 in November 08 we have done some follow up performance studies. Our initial findings are similar to what we found for version 1. For example see this chart showing the scaling on Kraken:<br />
<br />
[[Image:EnzoScalingKraken.png]]</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=Sc09samples/&diff=145Sc09samples/2009-04-21T01:55:21Z<p>Scottb: </p>
<hr />
<div>=Demonstration Performance Data=<br />
<br />
This page has a collection of performance data from the tools in the POINT project.<br />
<br />
There is a large data set from a performance study done on the S3D application [http://tau.uoregon.edu/s3d here]. In addition, we have included some more downloads with snapshots on this page as well.<br />
<br />
==Extra Performance Data==<br />
<br />
First some data from TAU. The links below will allow you to start TAU's ParaProf profile viewer from the web using Java Web Start.<br />
<br />
<br />
TAU flat profile of NAS parallel benchmarks [http://tau.nic.uoregon.edu/trial/launch_paraprof?datasource=1020&workspace=48 full 8 processor profile].<br />
<br />
[[Image:paraprof.png|400px]]<br />
<br />
TAU profile of S3D data [http://tau.nic.uoregon.edu/trial/launch_paraprof?datasource=2075&workspace=14 full 1728 processor profile].<br />
<br />
[[Image:S3D_3d_view.png|400px]]<br />
<br />
<br />
Vampir trace<br />
<br />
[[Image:trace.png|400px]]<br />
<br />
<br />
Scalasca</div>Scottbhttp://www.nic.uoregon.edu/mediawiki-point/index.php?title=Sc09samples/&diff=144Sc09samples/2009-04-21T01:51:56Z<p>Scottb: </p>
<hr />
<div>=Demonstration Performance Data=<br />
<br />
Here is a collection of performance data from the tools in the POINT project.<br />
<br />
There is a large data set from a performance study done on the S3D application [http://tau.uoregon.edu/s3d here]. In addition we have included some more downloads with snapshot on this page as well.<br />
<br />
==Extra Performance Data==<br />
<br />
First some data from TAU. The links below will allow you to start TAU's ParaProf profile viewer from the web using Java Web Start.<br />
<br />
<br />
TAU flat profile of NAS parallel benchmarks [http://tau.nic.uoregon.edu/trial/launch_paraprof?datasource=1020&workspace=48 full 8 processor profile].<br />
<br />
[[Image:paraprof.png|400px]]<br />
<br />
TAU profile of S3D data [http://tau.nic.uoregon.edu/trial/launch_paraprof?datasource=2075&workspace=14 full 1728 processor run].<br />
<br />
[[Image:S3D_3d_view.png|400px]]<br />
<br />
<br />
Vampir trace<br />
<br />
[[Image:trace.png|400px]]<br />
<br />
<br />
Scalasca</div>Scottb