FY 2003 User Survey Results: Hardware Resources
Legend:
| Satisfaction | Average Score |
| Very Satisfied | 6.5 - 7 |
| Mostly Satisfied | 5.5 - 6.4 |
| Somewhat Satisfied | 4.5 - 5.4 |
| Significance of Change |
| significant increase |
| significant decrease |
| not significant |
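As a reading aid, the satisfaction legend can be expressed as a small lookup. This is a minimal sketch, not part of the survey methodology: the function name `satisfaction_band` is hypothetical, and assigning scores that fall in the gaps between bands (for example 6.43, between 6.4 and 6.5) to the lower band is an assumption. Scores below 4.5 have no label in the legend, so the sketch returns `None` for them.

```python
from typing import Optional

# Lower bounds of each band, taken from the legend above.
# Assumption: a score in the gap between two bands belongs to the lower band.
BANDS = [
    (6.5, "Very Satisfied"),
    (5.5, "Mostly Satisfied"),
    (4.5, "Somewhat Satisfied"),
]


def satisfaction_band(average: float) -> Optional[str]:
    """Return the legend label for an average score, or None if unlabeled."""
    for lower_bound, label in BANDS:
        if average >= lower_bound:
            return label
    return None  # the legend defines no label below 4.5


if __name__ == "__main__":
    # Averages copied from the tables in this report.
    for name, avg in [("SP Overall", 6.43), ("HPSS Reliability", 6.61),
                      ("SP Batch Wait Time", 5.24)]:
        print(f"{name}: {avg} -> {satisfaction_band(avg)}")
```

Under this reading, SP Overall (6.43) falls in the Mostly Satisfied band and HPSS Reliability (6.61) in the Very Satisfied band.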
Satisfaction - Compute Platforms
Sorted by average score
| Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001 |
| SP Overall | 192 | 6.43 | 0.78 | 0.05 | 0.61 |
| SP Uptime | 191 | 6.42 | 0.83 | -0.14 | 0.89 |
| PDSF Overall | 68 | 6.41 | 0.87 | 0.15 | NA |
| PDSF Uptime | 62 | 6.35 | 1.04 | -0.16 | NA |
| SP Disk Configuration and I/O Performance | 156 | 6.15 | 1.03 | 0.18 | 0.48 |
| PDSF Queue Structure | 59 | 6.00 | 0.96 | 0.03 | NA |
| PDSF Batch Wait Time | 61 | 5.93 | 1.12 | 0.19 | NA |
| PDSF Ability to Run Interactively | 64 | 5.77 | 1.39 | -0.41 | NA |
| PDSF Disk Configuration and I/O Performance | 59 | 5.69 | 1.15 | 0.06 | NA |
| SP Queue Structure | 177 | 5.69 | 1.22 | -0.23 | 0.50 |
| SP Ability to Run Interactively | 162 | 5.57 | 1.49 | 0.10 | 0.86 |
| SP Batch Wait Time | 190 | 5.24 | 1.52 | -0.17 | 0.32 |
Satisfaction - Compute Platforms
Sorted by Platform
| Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001 |
| SP Overall | 192 | 6.43 | 0.78 | 0.05 | 0.61 |
| SP Uptime | 191 | 6.42 | 0.83 | -0.14 | 0.89 |
| SP Disk Configuration and I/O Performance | 156 | 6.15 | 1.03 | 0.18 | 0.48 |
| SP Queue Structure | 177 | 5.69 | 1.22 | -0.23 | 0.50 |
| SP Ability to Run Interactively | 162 | 5.57 | 1.49 | 0.10 | 0.86 |
| SP Batch Wait Time | 190 | 5.24 | 1.52 | -0.17 | 0.32 |
| PDSF Overall | 68 | 6.41 | 0.87 | 0.15 | NA |
| PDSF Uptime | 62 | 6.35 | 1.04 | -0.16 | NA |
| PDSF Queue Structure | 59 | 6.00 | 0.96 | 0.03 | NA |
| PDSF Batch Wait Time | 61 | 5.93 | 1.12 | 0.19 | NA |
| PDSF Ability to Run Interactively | 64 | 5.77 | 1.39 | -0.41 | NA |
| PDSF Disk Configuration and I/O Performance | 59 | 5.69 | 1.15 | 0.06 | NA |
Max Processors Used and Max Code Can Effectively Use
| Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001 |
| SP Processors Can Use | 139 | 609.41 | 1006.23 | 63.41 | -141.59 |
| Max SP Processors Used | 161 | 444.84 | 733.31 | 273.84 | 242.84 |
| Max PDSF Processors Used | 35 | 13.06 | 43.25 | -21.94 | NA |
| PDSF Processors Can Use | 34 | 10.26 | 21.37 | -86.74 | NA |
Satisfaction - HPSS
| Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001 |
| Reliability | 126 | 6.61 | 0.77 | 0.10 | -0.02 |
| Uptime | 126 | 6.54 | 0.79 | 0.17 | 0.21 |
| Performance | 126 | 6.46 | 0.88 | 0.11 | 0.10 |
| HPSS Overall | 134 | 6.46 | 0.84 | 0.07 | -0.04 |
| User Interface | 127 | 5.98 | 1.24 | 0.03 | -0.04 |
Satisfaction - Servers
| Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001 |
| Escher | 13 | 5.23 | 1.30 | -0.15 | 0.15 |
| Newton | 15 | 5.20 | 1.37 | -0.24 | -0.27 |
Satisfaction - Networking
| Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001 |
| LAN | 114 | 6.54 | 0.67 | NA | NA |
| WAN | 100 | 6.12 | 1.02 | NA | NA |
Summary of Hardware Comments
Comments on NERSC's IBM SP
[Read all 51 responses]
Comments on NERSC's PDSF Cluster
[Read all 17 responses]
Comments on NERSC's HPSS Storage System
[Read all 29 responses]
Comments about NERSC's math and vis servers
[Read all 5 responses]
Comments on NERSC's IBM SP: 51 responses
- Good machine:
-
Great! Now where is the power 4 version? But really, this is a great machine.
... I appreciate the fact the machine is rarely down.
It is an amazing machine. It has enabled us to do work that would be entirely
out of reach otherwise.
This is a very useful machine for us.
... The system is so good and so well managed in just about every other
way [except that there are not enough interactive resources].
On a positive note, I am very happy that NERSC opted to expand the POWER3 system
rather than moving to the POWER4 or another vendor. Seaborg is very stable and,
the above comment notwithstanding, very well managed. It is also large enough
to do some serious work at the forefront of parallel computing. This strategy
is right in line with the aims of my research group and, I believe, in line
with a path that will lead to advancements in supercomputing technology. ...
Great Machine!
SP is very user friendly.
It is doing its job as expected
Nice system when everything works.
NERSC's facilities are run in impressive fashion. ...
My code uses MPI-2 one sided primitives and I have been pleasantly surprised by
the dual plane colony switch performance. Previously we have worked extensively
on Compaq SCs with quadrics, and while the IBM performance is not quite equal,
there are far fewer penalties for running n MPI processes on n processors of a
node (on the SC it is often better to run only 3 processes per 4 processor node).
Although the latency of the IBM is quite high (measured performance is at least
x3 of quadrics) we can accept this penalty and still achieve good performance.
...
Excellent! ...
The IBM SP is very powerful.
It is a very fast machine. ...
Great!
Excellent system with the most advanced performance!!!!!!!!!!
- Queue issues:
-
- long turnaround for smaller node jobs:
Batch can be very slow and seems to have gotten worse in the past year despite
the increase in CPU # -- I often use the "debug" queue to get results when the
job is not too big. Things seem to have worsened for users requesting 1 or 2
nodes --- perhaps because of the increasing emphasis on heavy-duty users.
Wait time for short runs with many processors sometimes takes too long.
The queue structure is highly tilted towards large jobs. While it is not as bad
as at some other supercomputer centers, running jobs that only use 64-128
processors with reasonable wait times still requires extensive use of the
premium queue. An 8 hour job on the regular queue using 64 processors generally
requires several days of waiting time. Such low compute efficiencies are not
only frustrating, they make it very difficult to use a reasonably sized
allocation in a year.
Batch turnaround time is long, but this is unavoidable because of the number of
users. The premium queue is necessary for running benchmark/diagnostic
short-time jobs.
.... Also, the current queue structure favors many-node jobs which is a
disadvantage for me. I am integrating equations of motion in time, I don't
need many nodes but would like to see queue wait time go down. Currently my
jobs may spend more time waiting in the queue than actually running.
The batch queues take much longer than last year (probably because there are
more jobs submitted). The queue can also stall when a large long job (>128
nodes, >8 hours) is at the head of the queue while the machine waits for enough
processors to free up. Is it possible to allow small jobs to start and then
checkpoint so there are fewer wasted cycles? It would also be good if
the pre_1 queue had higher priority, which could be offset by a higher cost for
using it.
The job stays in queue too long. Sometimes, I need to wait 2-3 days to run a
one-hour job.
sometimes there are so many big nodes jobs running it sort of locks out
everyone else.
- Long turnaround / increase wall limit for "reg_1l" class:
With the old queue structure, we seldom had to wait more than a day for a job
to start. With the new queue structure the wait has increased to of order a
week. If this continues into FY2004, we will not be able to get our work done.
...
Waiting time for small jobs is quite frequently excessively long; there should
also be a queue for small but non-restartable jobs with a limit longer than
24 hours.
- very long waits at low priority:
My students' complaints are usually due to the waiting time in queues, especially
for medium-size jobs, if using priority 0.5, which is a must if we want to
optimize the use of our allocation.
- long waits in September:
I am very satisfied with the average batch job wait time, but I noticed that
recently (as of 09/12/03) the waiting time is too long.
- long wait time for large memory jobs:
... and waiting for a 64-node
job on the 32 GB nodes is often not practical.
- wants pre-emption capability:
... Alternatively [to providing a Linux cluster for smaller users] is there any
kind of scheduling algorithm that would pre-empt long-running 4-processor jobs
when the bigger users need 1000+ processors? ...
- has to divide up jobs:
I have found that to get things to run, you have to break the job into smaller
pieces. ...
- Scaling comments:
-
- limits to scaling:
Please inform me if the 4k MPI limit is not existent anymore
... I have noticed a large increase in memory usage as the number of MPI
processes is increased. This is quite a concern since very often we run in a
memory limited capacity. I would like to be able to provide further comment
about the performance of the machine when using 4096 P but as yet I have only
been able to partially run my code on this many processors. I'd like to see a
focus on increasing network bandwidth and reducing latency in the next
generation machine. I am not convinced that clusters of fat SMPs is an
effective model unless the connecting pipes are made much larger.
I'd VERY MUCH like to see more large-memory nodes put into service, or memory
upgrades for the existing nodes. Using 16 MPI tasks per node, the 1
GB/processor limit is a definite constraint for me.
... The IBM SP is capable of only coarse-grained parallelism. This limits the
number of processors that it can use efficiently on our jobs. Its capabilities
are now exceeded by large PC (Linux) clusters, which probably have a better
price/performance ratio than the SP. Of course such comparisons are unfair,
since the SP is no longer new. I should thus be comparing new clusters with
Regatta class IBM machines. I am unable to do this since, on the only Regatta
class machine to which I have access, I am limited to running on a single (32
processor) node.
I can't believe any one can actually make use of this machine with an
efficiency level that's any more than pathetic.
- general scaling comments:
We're improving our code to make use of many more processors - e.g. 300-600
within the next year.
Currently, our jobs do not yet require more than 64 processors. Our code scales
quite well with number of processors up to 64.
Max. number of processors your code can effectively use per job depends on the
size of the problem I am dealing with.
The max. # of processors really depends on how big a given lattice is. For next
year, we may have access to 40^3x96 lattices, so I expect the max number of
nodes to increase to 32, maybe more.
We could use more processors, but our current simulation size dictates that
this is the most efficient use of resources. In the upcoming year we plan to
use more processors, and perform some larger simulations.
- too much emphasis on large number of processors:
See earlier comments on promoting super large jobs, for reasoning behind our
reservations.
[Although we successfully tested large jobs, I do not believe these jobs could
serve our scientific goals well. I could easily see using 8 nodes of seaborg
for our activation energy barriers determination jobs, but using more nodes
than that would not be efficient or necessary. In fact, I see a significant
threat to excellent quality supercomputing research by expanding in the
direction of using more and more nodes per job. I suspect that a good fraction
of the work pursued at seaborg, although excellent, because of the very nature
of the problems handled, one cannot expect linear scaling to very many nodes.
We believe that 25% of the resources devoted to these super large jobs is
already too much.]
Please, see my comments regarding the scaling initiative.
[... Although we are using state-of-the-art parallel linear algebra software
(SuperLU_DIST), scaling to
increase speed for a given problem has limits. Furthermore, when solving
initial-value problems, the
time-dimension is completely excluded from any 'domain' decomposition. Our
computations typically scale well to
100-200 processors. This leaves us in a middle ground, where the problems are
too large for local Linux clusters and
too small to qualify for the NERSC queues that have decent throughput. My
opinion is that it is unfair for NERSC to
have the new priority initiative apply to all nodes of the very large flagship
machine, since it weights the "trivially
parallel" applications above the more challenging computations, which have
required a more serious approach to achieve parallelism.]
- Provide more interactive and debugging resources:
-
It is virtually impossible to do any interactive work on seaborg. This is a
major shortcoming of its configuration, not of the IBM SP architecture. With a
system of seaborg's size, it should be straightforward and not ultimately
detrimental to system throughput to allow more interactive work. I usually
debug interactively on a Linux cluster prior to moving an application to the
SP. Often in moving it over, there are IBM-specific issues that I need to
address prior to submitting a long run. Being forced to do this final bit of
debugging by submitting a succession of batch jobs is not a good use of my time
nor optimal for job throughput on seaborg itself. PLEASE....fix this.
I commented in the 2002 survey that interactive access to seaborg was terrible. It
still is. YOU NEED TO HAVE DEDICATED _SHARED_ACCESS_ NODES TO SOLVE THIS
PROBLEM!!!!! I am sick to death of trying to run Totalview and being DENIED
because of "lack of available resources" error messages. GET WITH IT
ALREADY!!
NERSC response:
Please see the "-retry" and "-retrycount" flags to poe (man poe). These flags
can set your interactive job to retry its submission automatically so that you
won't have to do so manually due to "lack of available resources".
We are developers so interactive, benchmarking time even for very large node
configurations is often important.
Ability to run interactively after-hours is sometimes < desirable. Because of
the dedicated nodes, it is generally satisfactory during office hours, but grad
students are not restricted to office hours... :-)
... (Except the ability for interactive runs)
For interactive jobs, 30 min limit is too short, I feel. Would you increase
this limit to 1 or 2 hours?
Should reduce the waiting time for the debug queue
... Debug class should be given priority during the weekends and the nights.
NERSC response:
One of the major constraints of the batch queue system on seaborg is the
speed with which resources can be shifted from one use to another. This
impacts how quickly we can take the system down and very directly impacts
the speed at which we can provide interactive/debug resources. Resource
demands for debug and interactive tend to be very spiky and our approach
thus far has been to try to estimate demand based on past usage.
In order to best meet future demand for debug and interactive we monitor
the utilization of the resources currently devoted to the debug and interactive
classes.
We also allow debug and interactive work to run anywhere in the machine if such
an opportunity arises. This is somewhat at odds with the fact that utilization
in the main batch pool is very high, but every little bit helps and we do what
we can within the given constraints.
The hours prior to system downtimes are an excellent time to do debug cycles.
The cycles lost while the machine is drained in the hours before a downtime can
be recovered in part by debug and interactive work that is allowed to
proceed during that time. This is a very limited time frame, but may be
useful to users who can schedule a time to do development/debug work on their
parallel codes.
Batch queue policies and system issues are often discussed at the NERSC Users
Group meeting. If you feel debug/interactive classes are not working, we
encourage you to participate in NUG's work to improve the system. Your ideas
and suggestions for how to work within the constraints inherent in the machine
are welcome.
- Provide more serial services:
-
I am still partly on the stage of porting the codes to the AIX system, and it
would be good to have some nodes available as single processors.
For some time a number of users have been asking for you to set aside some
processors for serial jobs that run for longer than 30 minutes. I also would
find this useful for calculations that are not readily parallelizable and need
the software environment provided by NERSC. Given the large number of
processors that your system now has this would seem to have a very minimal
impact on your overall throughput. You need to be responsive to this need in
your queue structure.
Need to run on one processor for a parallel code (e.g., MPI code), and to be
charged for only one processor. The problem is many programs do not have
serial versions, but nevertheless need to be run for small systems (on a single
processor) from time to time. One solution is to have a special queue run on a
few nodes (1-2) and to be run in a time-sharing fashion. So, an MPI job can
always be run (no waiting time), and the performance for a single processor job
will be okay, and the charge is based on a single processor.
I'd like to see a high priority queue with very long CPU time limit (days or
weeks) [this project only ran 1 processor jobs]
- User environment issues:
-
The IBM SP is not native 64 bit and this has created headaches which were
nonexistent on Cray hardware.
My comment concerns the output, writing to a file. It happened that my program
exceeded time limits. It was terminated (as it should be) and it returned no
data in the output files. I lost many hours of valuable computing. Is there any
way to do something about it? It would be useful to have an output even when
the program exceeds the time limit. Other than that I am very satisfied with
I/O performance.
... Finally, for such a large system, there is frighteningly little disk space for
checkpoint files, graphics dump files etc. I have been in a situation where a
job of mine is going to generate more output than $SCRATCH can accommodate. This
leaves me to sit there and quickly offload files interactively to HPSS. I would
like to see this rectified in the current system, if possible. Additionally, I
would like this problem to be kept in mind when budgeting is done for follow-on
systems.
- Other:
-
... But I don't understand the regular down time due to maintenance. Is it possible
to shut down only the troubled nodes and keep the rest of the machine running?
Processors are getting older. You need to get new, faster systems
Comments on NERSC's PDSF Cluster: 17 responses
- Disk, I/O and file issues:
-
I've only had a little trouble with the disks; I've had jobs die with
uninterruptible-sleep status that is probably due to disk access problems. Also
the "effective" number of CPUs I can use for a job (well, a set of jobs) is
limited by disk access.
... One of the main issues I have faced as a user is disk I/O for lots of jobs
running against a common dataset. The dvio resource tool helps keep things
running smoothly but it has a complicated syntax/interface and limits the
number of active jobs, thus slowing down the performance of my work.
... Disk vaults seem to crash quite often and it takes a long time for them to come
back. Since all my data is sitting on one vault this causes delays of my
analysis.
There is still work needed on the overall data management coupled with I/O
performance issues.
Problems with the disk to which I was writing put constraints on the number of
processors that I could use at a time to run my jobs.
... situation with NFS on the datavaults not very satisfactory ...
The commodity disk vaults are not as reliable as we would like them to be. The
reliability has been markedly better this past year than previously, but we
still lose a few days a year to disk vault failures. In addition, the
simultaneous load caps for the disk vaults are starting to impact us as we scale
up our processing. ...
I think the individual machines could have larger swap space.
Also the disk vault system seems to be a bit unreliable, perhaps it is NFS.
... Slow IO
get file list could be implemented
- Batch issues:
-
It sometimes takes over a day for a job to start. ...
... a new batch queue between short and medium, say 8 hours of CPU,
would be nice.
... We would also like an intermediate queue on PDSF in between short and medium.
The jump between 1-hour and 24-hours is a large jump and we have a number of
jobs of a few hours or less but more than 1 hour which would benefit from such
an intermediate queue.
... I also understand that it is not good to allow jobs to run for too long.
... The only reason I didn't put "Very" satisfied in some of my answers above is
that the LSF software has some "features" which I don't like that well (you
have to write a script if you want to selectively kill a large number of jobs,
for example), but I don't think it's NERSC's fault. It's a pretty good
batching system overall. ...
LSF sometimes terminates my job when the network bandwidth gets slow. This is
unpredictable, because it depends not only on my jobs, but other people's jobs
that share the network. I'm not sure if there's a fix for this, but I'd like
to know about it if there is.
- Good system:
-
Very nice system, well maintained and running smoothly. System admins try and
maintain the system mostly 'behind the scenes' which is a great relief
compared to some other large scale clusters. The well running of PDSF was
essential in achieving our scientific goals.
A well-oiled operation!
great system. ...
pdsf is well maintained and very useful. ...
- Provide more interactive and debugging resources:
-
We need better interactive response. ...
... Oftentimes when I am running interactively, the wait is very, very long to
do anything (even if I just type "ls").
- interactive nodes overloaded
- when someone is using HSI on an interactive
node, the node is basically unusable ...
No problems except: - No working debugger for STAR software - Slow IO
- Down time issues:
-
Recently I got the impression that a lot of nodes were down and therefore
eating up my jobs.
... We also would like to decrease the down time.
- Other:
-
The machines are somewhat too slow. I understand that there are cost
considerations. ...
My own code is usually rather utilitarian and not run many times, KamLAND data
processing is excluded from this statement.
Comments on NERSC's HPSS Storage System: 29 responses
- Good system:
-
HPSS is the best mass storage I've ever used or heard of.
Better than it's ever been.
HPSS is very useful. ...
Just like with the PDSF system - very well run, with little visible
interference. The Storage Group has gone out of their way to help us out doing
our science. Staff has contacted us on various occasions to help optimize our
usage.
Very fast and very useful.
I am very happy with the "unix-like" interface.
very good
Fast, efficient, and simple.
This system is great. Of all of the mass storage systems I have used at
multiple sites around the world, this is the best.
It could be that I've been around long enough to have experienced the old
storage system and thus have very low expectations, but I think this is a great
system.
I have been impressed with the relative ease of use and efficiency of HPSS.
HPSS is great, ...
I like it.
I couldn't get much done without this.
- Hard to use / user interface issues:
-
web documentation is overwhelming for a first time user of HPSS and HSI.
My only real substantive comment is that I find the hsi interface to be
unnecessarily user unfriendly. Why can it not be endowed with some basic UNIX
shell functionality (examples: (1) a history mechanism, (2) command-line
editing, (3) recall of previous commands with emacs and vi editing capability,
(4) more mnemonic command names in parallel to UNIX.) None of these improvements
is rocket science and could be easily implemented to make everybody's life
easier.
HSI is a truly godawful tool, and useless for manipulation of data on mass
storage by programs. Thus I am restricted to FTP for any work involving data on
HPSS. Thank goodness there's a Perl module.
You can't backspace when you are inside hpss, which means I either have to be
very slow and precise in how I type my commands (which is a pain for navigating
through multiple directory levels) or I end up typing them more than once. Can
this be fixed? ...
A transparent user interface i.e. direct use of ls & cp would be a nice
feature.
I did not use this system for quite a while. I remember the user interface was
not very good before. It is not easy to get multiple files using one command.
... I haven't gotten to spend enough time with HSI 2.8 to see if it offers
improvements with its scheduling.
- Performance improvements needed:
-
... It is a little slow in fetching data, but I guess that is to be expected
given the amount of data stored there.
... Otherwise, I find hpss useful and even though transfer rates are slower than my
patience would like, I understand the time limitations of transferring from tape
to disk and can work around the speed issue.
... the only negative is the occasional long wait to access something I have put on
there a long time previously, but this is understandable. ...
... Faster file retrieval would always be nice. ...
KamLAND has worked with the HPSS people to improve throughput by taking
advantage of the fact that we read out data in big chunks, but that could still
be utilised further.
- Authentication is difficult:
-
Very difficult to understand. I stopped using it because every time I have to
use it again, I forget the whole Jazz of the login/password. For example, today
I couldn't use it! Why not make it simpler?
please think about the initial password setup. maybe it is possible to make
this easier for the user. Once it is setup it works really well.
I have never used it because it sounds so complicated to use it for the first
time.
- Don't like the down times
-
HPSS's weekly maintenance occurs in the middle of the work day for both the
East and the West Coast. To my eyes it would make more sense to take advantage
of time zones and inconvenience fewer users. (And yes, I know this comes off as
self-serving since the most likely alternative --- afternoon on the West Coast
--- is a great help for those of us in the East. However, I still think the
idea is sound.)
The Tuesday maintenance is always irritating, but I understand that if it needs
to be done it's better to have it scheduled during the day when people will
notice rather than at night when people's jobs will fail because they'll
forget. ...
- Network / file transfer problems:
-
I was having problems with not getting complete file transfers from HPSS to
Seaborg sometimes, I think. I would have to reload from HPSS to get complete
large files.
I transferred about 800Mb of data between the HPSS at NERSC and ORNL. Apart from
some bugs in hsi which were eventually resolved, it worked well, except the
link between NERSC and ORNL would die every few hours, which meant more
intensive baby-sitting on my part. I don't know what would cause the link to
die. It may just have been random hiccups.
- Other:
-
Looking forward to Grid access tools.
... Further, the accounting for HPSS, i.e. SRU units, is a bit strange in that
prior year information stored counts so much (Gb x 4).
Comments about NERSC's math and vis servers: 5 responses
- Network connection too slow:
-
The only problem is that the network connection from Germany makes interactive
visualization impractical. But I'm not sure you can do anything about this.
From my point of view as a remote user, the visualization resources are not
always convenient to use or well projected to those of us in the outside world
(network response times are too slow). ...
It is too slow, so I was not able to use them in the last year.
- Good service:
-
The math server is very helpful for me.
I don't use escher as much as I'd like, but it has certainly been nice and
easy to use when I have done anything on it.
- Remote licenses:
-
... You need to better develop the floating
license approach and make it easier to use for remote users.