We are GDIT. We stay at the forefront of innovation to solve complex technical challenges.
GDIT has an opportunity available for a talented and innovative High Performance Computing (HPC) Linux Systems Administrator within our High Performance Computing Center of Excellence to provide continuing on-site support for our NOAA Research and Development High Performance Computing Systems (RDHPCS) customer at the NOAA Global Systems Laboratory in Boulder, Colorado.
The qualified candidate will bring hands-on technical and system administration expertise on-site to maintain the operational readiness and availability of NOAA's RDHPCS high performance computing systems at the Boulder, Colorado facility, manage and support new technology insertions, and provide remote technical support to and collaborate with our other supported NOAA sites at Fairmont, West Virginia and Princeton, New Jersey.
Responsibilities and Duties:
Daily monitoring and management of medium to large HPC cluster environments; working and contributing as a member of a small team for local systems activities and a large team for cross-site support.
Maintain overall situational awareness of the HPC environment to identify and correct issues before they impact operations.
Plan, prepare, and execute timely security patching and software upgrades during scheduled maintenance periods; this includes building open source software packages and installing commercial software and license managers.
Install and/or rebuild HPC cluster nodes (including front ends, compute, and administration nodes) in diskful, disklite, and diskless environments; deliver consistent configuration management within the HPC environment.
Monitor, troubleshoot, and assist with capacity planning for network fabrics such as InfiniBand and Ethernet, and file systems such as NFS and Lustre.
Manage, extend, and develop customized scripts to support the HPC user, monitoring, and system administration environments.
Follow formal change and configuration management practices, ingrained into the daily operations, to ensure changes and configuration control implementations are properly documented and approved.
Support HPC system users through the helpdesk ticketing system.
Develop, improve, and enhance online documentation repositories for users and system administrators; utilize existing platforms to support the various documentation efforts; contribute to the selection of and migration to new platforms, if needed.
Qualifications:
Bachelor's degree in a computer-related field (Computer Science preferred) or equivalent years of experience.
5+ years of overall IT experience to include at least 3 years of experience in Linux Systems Administration.
5+ years of experience in High Performance Computing Systems.
U.S. Citizenship required.
Hands-on experience with computer hardware maintenance, such as replacing processors, DIMMs, disk drives, PCIe cards and other field-replaceable components.
Proficiency with installing and removing software for Linux, including pre-built packages, and compiling from source.
Proficiency with documentation, collaboration, and task management tools, such as the Microsoft Suite (Project, Visio, Word, and Excel), Google Docs, MediaWiki, and Trac.
Experience with configuration and use of provisioning tools, such as xCAT.
Experience managing NFS, Lustre, or other NAS and parallel file systems.
Familiarity with monitoring and metrics gathering tools such as Nagios, Zabbix, collectl, rancid, and PCP.
Familiarity with basic networking technologies, components, and tools, with a solid understanding of subnet and routing concepts.
Experience with HPC batch scheduling and queuing systems.
Programming experience and familiarity with scripting languages in the Linux/Unix environment such as bash, awk, Perl, Python, and expect; proficiency in one or more of the above.
Experience with high-speed/low-latency networks such as InfiniBand; proficiency with debugging and tuning Mellanox IB and configuration of OpenSM is a strong plus.
Team player with the ability to work with a diverse team in both local and remote technical support environments.
Disciplined troubleshooting skills balanced with creative problem-solving skills, to tackle highly complex large-scale technical problems.
WHAT GDIT CAN OFFER YOU
Autonomy, career mobility, challenging work, and team environment
This position requires being fully vaccinated against COVID-19 by January 18, 2022 or the start date, if after January 18. Individuals who work in or reside in Texas or Montana or work outside of the United States may be excluded from this requirement.
The likely salary range for this position is $100,000 - $150,000. This is not, however, a guarantee of compensation or salary; rather, salary will be set based on experience, geographic location, and possibly contractual requirements, and could fall outside of this range.
We are GDIT. The people supporting some of the most complex government, defense, and intelligence projects across the country. We deliver. Bringing the expertise needed to understand and advance critical missions. We transform. Shifting the ways clients invest in, integrate, and innovate technology solutions. We ensure today is safe and tomorrow is smarter. We are there. On the ground, beside our clients, in the lab, and everywhere in between. Offering the technology transformations, strategy, and mission services needed to get the job done.
GDIT is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or veteran status, or any other protected class.