We are looking for an experienced High-Performance Computing (HPC) Systems Engineer to support complex system design, integration, monitoring, and diagnostics by applying a deep understanding of both physical and logical system architectures.
Position Description
- Maintain a comprehensive understanding of the system’s end-to-end physical and logical architecture to effectively apply hardware modeling and diagnostics (HMD) monitoring tools.
- Leverage HMD monitoring tools to identify, narrow, and triage system issues, directing detailed problems to the appropriate diagnosticians or vendors for resolution.
- Develop deep expertise in the HMD product and monitoring architecture to identify gaps, inefficiencies, and opportunities to enhance diagnostic effectiveness.
- Collaborate with developers, analysts, and monitoring tool owners to propose, design, and implement improvements to monitoring solutions, increasing system reliability and operational visibility.
- Analyze system logs, metrics, and telemetry—primarily using Splunk—to determine root causes, understand system behavior, and identify anomalous conditions.
- Interpret hardware and system performance data, including graphs and trends, to diagnose system behavior and inform troubleshooting activities.
- Guide the development of Splunk dashboards, health indicators, and diagnostic scripts to monitor critical data flows, system performance, and failure signatures.
- Review and evaluate relevant technical documentation; ask clarifying questions and build expertise in hardware design to support accurate and timely system diagnosis.
- Provide recommendations for testing strategies and develop documentation of issue signatures to enable and accelerate diagnostics development.
- Collaborate closely with diagnosis teams and external vendors to troubleshoot complex hardware and system-level issues.
- Track issues through resolution using JIRA, validate fixes, and confirm that corrective actions resolve the underlying problems.
Experience
- Demonstrated experience in one or more of the following technical domains, with a strong willingness and aptitude to expand expertise as required: System Architecture and Design, Power Systems, Printed Circuit Board (PCB) Design, Cooling Infrastructure, Signal Integrity, and System Reliability.
- Proven experience in system health monitoring, diagnostics, and operational support for complex hardware and integrated systems.
- Ability to read, interpret, and analyze detailed hardware documentation, including specifications, data sheets, schematics, and design artifacts.
- Strong analytical and troubleshooting mindset, with a natural curiosity and willingness to ask probing questions to identify root causes and systemic issues.
- Excellent communication skills, including the ability to engage vendors and internal stakeholders to extract detailed technical information related to system design, performance, and anomalies.
- Experience with, or demonstrated ability to quickly learn, system monitoring and diagnostics tools, including but not limited to Chiplink, Splunk, lspci, and iostat.
- Self-motivated and capable of completing complex technical tasks with minimal supervision while managing priorities effectively.
- Ability to clearly convey technical findings, risks, and recommendations to both technical and non-technical audiences through written documentation and oral briefings.
- Proven ability to work effectively within cross-functional teams, including engineering, diagnostics, operations, and vendor partners.
- Comfortable learning and adapting to new tools, technologies, and processes as mission and system needs evolve.
Qualifications
- An active TS/SCI clearance with polygraph is required; the most recent polygraph must have been completed within the last 5 years.
Employee Freedom of Choice
Our focus is on people first. We offer comprehensive and flexible compensation packages that match the best the industry has to offer and can be customized to fit your needs.
Our Benefits:
- 100% company-paid health, dental, and vision premiums
- Automatic company contribution to a Health Savings Account (HSA), up to $3,900 for families
- Up to 7 weeks of Paid Time Off (PTO)
- Automatic 401(k) investment
- 11 paid federal holidays
- BlueCross BlueShield Health Insurance
- Tuition/Training Reimbursement
- Access to club-level Ravens season tickets
- Company-paid golf events, covering both your time and course fees
We provide equal employment opportunities to all employees and applicants for employment and prohibit discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, pregnancy, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws.
This policy applies to all terms and conditions of employment, including recruiting, hiring, placement, promotion, termination, layoff, recall, transfer, leaves of absence, compensation, and training.
The projected compensation range for this position is $175K - $250K. There are differentiating factors that can impact a final salary/hourly rate, including, but not limited to, Contract Wage Determination, relevant work experience, skills and competencies that align to the specified role, geographic location, education and certifications, as well as Federal Government Contract Labor categories. In addition, we invest in our employees beyond just compensation. Our benefit offerings include, dependent upon position, Health Insurance, Dental and Vision Coverage, Health Savings Account (HSA), Paid Time Off, Holiday Pay, Short Term and Long Term Disability, Life Insurance, 401(k) Plan, Safe Harbor 401(k) Investment, Learning and Development opportunities, Referral Bonuses, and Flex Time.