At IBM Infrastructure & Technology, we design and operate the systems that keep the world running. From high-resiliency mainframes and hybrid cloud platforms to networking, automation, and site reliability, our teams ensure the performance, security, and scalability that clients and industries depend on every day. Working in Infrastructure & Technology means tackling complex challenges with curiosity and collaboration. You’ll work with diverse technologies and colleagues worldwide to deliver resilient, future-ready solutions that power innovation. With continuous learning, career growth, and a supportive culture, IBM provides the opportunities to build expertise and shape the infrastructure that drives progress.
As a Senior AI Engineer on the IBM Z team, you will lead the exploration, design, and implementation of next-generation AI capabilities for enterprise infrastructure. This role is suited to engineers who are passionate about solving complex systems challenges, shaping technical direction and driving innovation across AI and enterprise computing platforms.
You will work across hardware, firmware, systems software and AI runtime technologies to evaluate emerging approaches, develop advanced prototypes and influence the future direction of AI enablement on IBM Z platforms.

AI Systems Research and Innovation
Lead the investigation and evaluation of emerging AI and LLM technologies relevant to enterprise infrastructure environments.
Design and prototype scalable approaches for deploying and optimising LLM inference workloads on IBM Z systems and Spyre hardware accelerator technologies.
Research advanced runtime architectures, memory optimisation strategies and resource orchestration techniques for large-scale AI workloads.
Contribute to technical strategy and platform direction through experimentation, benchmarking, comparative analysis and proof-of-concept development.

Architecture and Solution Design
Drive the architecture and technical design of AI integration frameworks, enabling enterprise applications and cloud-native services to securely and efficiently consume AI capabilities.
Define scalable APIs, runtime interfaces and platform integration patterns for enterprise AI systems.
Collaborate with system architects, hardware engineers and platform teams to influence long-term AI infrastructure capabilities and engineering direction.
Evaluate trade-offs across performance, scalability, reliability, maintainability and operational efficiency.

Performance Engineering and Analysis
Lead performance investigations across AI inference workloads, hardware accelerators, runtimes and operating environments.
Design benchmarking methodologies and profiling frameworks to analyse latency, throughput, memory efficiency, power consumption and system utilisation.
Identify optimisation opportunities across hardware acceleration, parallel execution, batching strategies and model execution pipelines.
Produce technical findings and engineering recommendations to improve enterprise AI platform capabilities.

Advanced Debugging and Reliability Engineering
Investigate and resolve complex cross-stack issues involving firmware, drivers, runtimes, orchestration layers and AI applications.
Drive root cause analysis activities for performance regressions, scaling limitations and system stability challenges.
Contribute to resilient system design through automation, regression analysis, fault detection and operational validation strategies.

Observability and Operational Intelligence
Design telemetry, instrumentation and observability capabilities for AI workloads running in production environments.
Develop frameworks for capturing system behaviour, inference characteristics, hardware utilisation, reliability metrics and operational trends.
Create dashboards, reporting models and monitoring strategies to support operational visibility and continuous optimisation.

Technical Leadership and Collaboration
Provide technical leadership across cross-functional engineering initiatives spanning AI, infrastructure, hardware acceleration and enterprise systems.
Lead technical discussions, architecture reviews and innovation workshops with engineering stakeholders.
Produce high-quality technical documentation, whitepapers, design proposals and knowledge-sharing material.
Remain current with advances in LLMs, AI acceleration, distributed systems and enterprise AI infrastructure, applying insights to influence future platform direction.

Professional and Technical Expertise
Demonstrated professional experience in AI/ML engineering, distributed systems, platform engineering or performance-focused software development.
Strong programming skills in Python and working experience with C/C++.
Deep understanding of transformer-based architectures, inference systems and large-scale AI workload execution.
Strong knowledge of computer architecture, operating systems, memory hierarchies, parallel processing and I/O systems.
Experience designing or evaluating scalable system architectures and distributed computing environments.
Hands-on experience with profiling, benchmarking, performance analysis and optimisation of complex systems.
Experience conducting technical investigations, comparative evaluations and prototype development for emerging technologies.
Strong Linux systems expertise, including command-line tooling, scripting, debugging and systems analysis.
Knowledge of observability, telemetry, monitoring and operational analytics concepts.
Strong analytical thinking, problem-solving capability and ability to communicate complex technical concepts effectively.
Experience with PyTorch, TensorFlow, Hugging Face Transformers, or related AI frameworks.
Experience with hardware acceleration technologies such as GPUs, NPUs, or AI accelerators.
Knowledge of model optimisation techniques including quantization, pruning, distillation, and runtime optimisation.
Familiarity with inference frameworks such as ONNX Runtime, TensorRT, TorchServe, or vLLM.
Experience with observability platforms, including Prometheus, Grafana, ELK, Splunk, or OpenTelemetry.
Understanding of distributed systems, orchestration technologies, Kubernetes, Docker, and CI/CD pipelines.
Exposure to enterprise computing platforms including IBM Z and z/OS environments.
Experience contributing to technical strategy, platform innovation or advanced engineering initiatives.
Experience working within highly reliable, secure and performance-sensitive enterprise environments.

Ireland | Infrastructure & Technology | Hybrid | Professional | Waterford, IE