HPC Steering Committee - Meeting Notes, February 27, 2009
last edited by Brian Davison on Wednesday, 03/25/2009 10:50 PM

 

Attendees - Bruce Taggart, Brian Davison, Kamil Klier, Terry Delph, Ben Felzer, Gale Fritsche, Liangjie Hong, Steve Lidie, Imre Polik, Slava Rotkin, David Myers, Brandon Leeds (minutes)

Guests - Sougata Roy (ATLSS Research Laboratory)

Excused - Peter Bryan, Bruce Dodson

Agenda

  • Opening Remarks -- Brian
  • Operations
1- Super user privileges for GCM cluster - Gale
Ben - The model is running, but with roadblocks; he has no specific questions for Jim E. right now. Not having root rights can stall troubleshooting. The lack of a mechanism to give root privileges to a cluster's owner may influence future purchases.
Gale - This is a higher-level issue of access rights and stalled research. The needs of systems staff have outweighed the needs of owners.
Brian - Understands the tradeoffs between the needs of owners and those of systems staff. What are the best practices? He doesn't support root privileges for owners, but response time can be made faster, while still recognizing LTS's limited resources. Brian agrees to the systems training program to allow users
Steve L. - On sensitive subnet, number 2. He wants to know whether responsiveness is the issue. Steve will arrange fast response to address some of Ben's concerns, and will try to work out a special solution for Ben's situation.
Slava - Raised the question of who has the power to change the computer access policy. Suggested perhaps a phased approach for people who show competence in systems administration. He believes this is not a special problem, and that services need to be spelled out in detail in the placement policy, including details on access privileges. Even a turnaround of hours is unacceptable. He supports full access for owners.
Imre - Suggested perhaps a phased configuration access approach: once configuration is done, move to the standard support policy. He wanted to know whether users could install a system that is configured before it enters the machine room. The answer has been no; LTS-configured machines must be used.
 
Two changes to policy - configuration time and transition to next stage; and LTS configures machines.
Whether a researcher may provide an operations person has to be considered by LTS / SNA, not the HPCSC.
 
2- Altair performance update - Steve L.
This question came up while discussing dedicated use of HPC resources.  The original problem has not been corrected.
I/O related - SNA. Stale jobs policy: limits on how long a job runs, or scheduled reboots. Brandon will resend the kernel modifications for CPU sets and for tying processes down, to improve cache performance.

3- Software upgrades update - Steve L.

Skipped

4- Ganglia monitoring utility update - Steve L.

Here is the URL for access to the new monitoring tool: http://blaze.cc.lehigh.edu/

5- Suggestions for use of remaining Egenera blades - Steve L.

Used for license servers, general purpose. Public Condor pool connection (one-way). Steve L. will look into what it will take to create a one-way flocking mechanism that allows jobs dispatched to the private cluster pool (Blaze as host) to migrate to idle cycles on the public pool, another Condor pool that includes all the LEAF nodes, which are usually idle.

6- Quotas on limited filesystems - Steve L. and Brandon

Mail, AFS, and altair have separate quotas because they are "public" access machines (no pay). Blaze (a private system behind a pay wall) does not. We don't want to restrict people, but need to constrain usage. This was not a problem on blaze until recently. We could ask people to remove old data, or add new disk, and move people to /Projects. SNA will put out a quota directive for /home on blaze. There is a question of whether large storage users can bring storage resources that they pay for. This leads into the budget discussion at the next meeting about charge-back models for HPC resources.
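As a rough illustration of how large users of /home might be identified before a quota directive goes out, here is a minimal Python sketch. It is not an SNA tool: the function names `usage_by_directory` and `over_limit` and the example limit are hypothetical, and it assumes one top-level directory per user under the given root.

```python
import os


def usage_by_directory(root):
    """Return {directory_name: total_bytes} for each top-level directory under root."""
    totals = {}
    for entry in os.scandir(root):
        # Only count real directories (assumed to be one per user); skip symlinks.
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    # lstat so that symlinks are not followed and double counted.
                    total += os.lstat(path).st_size
                except OSError:
                    # File vanished or is unreadable; skip it.
                    pass
        totals[entry.name] = total
    return totals


def over_limit(totals, limit_bytes):
    """Return the sorted names whose usage exceeds limit_bytes (a hypothetical limit)."""
    return sorted(name for name, used in totals.items() if used > limit_bytes)
```

For example, `over_limit(usage_by_directory("/home"), 50 * 1024**3)` would list directories above an assumed 50 GiB limit. The sketch sums apparent file sizes, so it can differ from what a filesystem quota tool reports for sparse files or block rounding.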

  • Other Issues
1- Scientific application support for specific titles - problem with running ABAQUS in batch mode on clusters under Condor - Sougata Roy
Sougata described how he currently has to run ABAQUS (which uses MPI for distributed-memory models). He is unable to use Condor because it is not supported by ABAQUS, which wants to use OpenPBS as its batch scheduling and submission system. Since Lehigh only has an academic version of the license, technical support is not available to help fix this problem. A quotation will be requested from ABAQUS for the cost of technical support to resolve this problem, on a per-incident basis if possible; if not, an annual support cost quotation will be requested.

2- IMSL usage - Gale

We are no longer going to provide IMSL software to users due to the prohibitive cost of licensing this software.

 

  • Issues considered by sub-committees: (None of these topics were reached. Topics tabled until the next meeting on March 27, 2009.)
  • Policies and Procedures (Peter Bryan, chair)
1- HPC storage quota policy need - Brandon and Steve L.
  • Outreach and Education (Ben Felzer, chair)
1- HPC Day status - Brandon
  • Future Systems (Slava Rotkin, chair)
1- New hardware purchase (SMP for enhanced level 2 access / storage?) - Brian
  • Strategic Planning (Brian Davison, chair)
1- Update on RCEAS HPC subcommittee
  • Proposal Development ()
1- IBM SUR update - Brian

  • Next Meeting -- Brian
March 27, 2009 at 11:10am, Room 625 EWFM