At IBM Software, we transform client challenges into solutions, building the world's leading AI-powered, cloud-native products that shape the future of business and society. We are building the next generation of watsonx.data—a GPU-accelerated, open data lakehouse engineered to deliver category-leading price-performance for analytics and AI workloads. Working in Software means joining a team fueled by curiosity and collaboration, where you'll design distributed components—metadata services, coordination layers, state management systems, and data movement pipelines—that handle high-throughput, fault-tolerant, and strongly consistent workloads at petabyte scale. With a culture that values innovation, growth, and continuous learning, IBM Software places you at the heart of IBM's product and technology landscape. Here, you'll have the tools and opportunities to advance your career while creating software that changes the world.
As a Software Engineer specializing in large-scale, stateful distributed systems, you will design, develop, test, and deliver the foundational infrastructure that makes watsonx.data reliable, consistent, and performant at petabyte scale.
You will work in an Agile, collaborative environment to understand stakeholder requirements and set the reliability and scalability ceiling for the platform.
Your primary responsibilities will include:
- Design Distributed Components: Architect and implement metadata services, distributed schedulers, catalog backends, and state coordination layers for petabyte-scale, high-throughput, low-latency data.
- Build Stateful & Fault-Tolerant Infrastructure: Implement replication, automatic failover, distributed consensus (Raft/Paxos), snapshot/restore, and exactly-once processing for correctness under failure.
- Contribute to CI/CD Pipeline: Contribute to the automated CI/CD pipeline, instrumenting components with structured logging, distributed tracing, and metrics that make failure modes observable.
- Debug Distributed Failures: Design, develop, and unit test fixes for customer-reported and production issues, building diagnostic tooling and driving post-mortems to resolution.
- Collaborate in Agile Environment: Partner with query engine, storage, GPU acceleration, and AI/ML teams to surface constraints early, conduct reviews, and document consistency decisions.
- Distributed Systems Experience: 6+ years of professional software engineering experience, including at least 2 years designing and operating large-scale distributed systems (data platforms, databases, streaming, or comparable infrastructure).
- Systems Programming Proficiency: Strong skills in Java, Go, C++, or a comparable systems language, with experience writing and reviewing production distributed-system code.
- Consistency & Consensus Depth: Hands-on knowledge of consistency models (eventual, strong, causal), replication, quorum systems, leader election, and consensus protocols (Raft or Paxos) from direct implementation or deep operational experience.
- Fault Tolerance & Operations: Experience designing fault-tolerant systems with automatic failover, idempotent operations, and durable recovery, plus distributed observability (OpenTelemetry/Jaeger, Prometheus/Grafana) and stateful workloads on Kubernetes.
- Communication & Education: Clear written communication, with the ability to produce design documents, post-mortems, and capacity analyses; Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
- Lakehouse & Streaming Internals: Hands-on experience with petabyte-scale data movement, compaction, or tiering; stateful exactly-once streaming (Kafka, Flink, Pulsar); and open table format transaction protocols (Iceberg, Delta, Hudi).
- Advanced Distributed & GPU Topics: Familiarity with vector or hybrid logical clocks, contributions to open-source distributed systems (Iceberg, Trino, Kafka, etcd, Flink), GPU-accelerated data processing (NVLink/PCIe topology), and FinOps for stateful workloads.

United States | Software Engineering | Hybrid | Professional | Multiple Cities