
Site Reliability Architect

Qode
Location: Austin, TX, USA
Published: 6/14/2022
Technology
Full Time

Job Description

Site Reliability Engineer (SRE Architect) – Unified Observability & AIOps
Location: Austin, TX

Role Summary
We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.
Key Responsibilities

Observability & Reliability Engineering

  • Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
  • Define and manage SLIs, SLOs, and error budgets aligned to business outcomes (see the error-budget sketch after this list)
  • Build actionable dashboards for operations, engineering, and leadership
  • Implement alerting strategies using static and dynamic thresholds
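
As context for the SLI/SLO and error-budget work above, here is a minimal sketch of how an error budget can be tracked against a request-success SLI. The SLO target, traffic numbers, and function name are illustrative assumptions, not figures from this role:

```python
# Hedged sketch: remaining error budget for a 99.9% availability SLO
# over a 30-day window. All numbers are illustrative.
SLO_TARGET = 0.999            # 99.9% of requests must succeed
WINDOW_REQUESTS = 10_000_000  # requests observed in the 30-day window

def error_budget_remaining(successful: int, total: int, slo: float = SLO_TARGET) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <0 = blown)."""
    allowed_failures = (1 - slo) * total  # 0.1% of traffic may fail
    actual_failures = total - successful
    return 1 - actual_failures / allowed_failures

# 9,993,500 successes out of 10,000,000: 6,500 of the 10,000 allowed
# failures are spent, so roughly 35% of the budget remains.
print(error_budget_remaining(9_993_500, WINDOW_REQUESTS))  # ≈ 0.35
```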

Proactive Detection & AIOps

  • Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
  • Transition monitoring from reactive alerts to proactive insights
  • Implement noise reduction, alert correlation, and root cause analysis
  • Apply baseline modeling, seasonality detection, and anomaly scoring (a minimal sketch follows this list)
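
To make baseline modeling and anomaly scoring concrete, here is one possible sketch that keys latency baselines by hour-of-week, so weekly seasonality is built into the baseline. The data, bucket scheme, and cutoff are illustrative assumptions:

```python
# Hedged sketch: hour-of-week baselines with z-score anomaly scoring.
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(samples):
    """samples: (hour_of_week, latency_ms) pairs -> per-bucket (mean, std)."""
    buckets = defaultdict(list)
    for hour_of_week, latency in samples:
        buckets[hour_of_week].append(latency)
    return {h: (mean(v), stdev(v) if len(v) > 1 else 1.0) for h, v in buckets.items()}

def anomaly_score(baseline, hour_of_week, latency):
    """Z-score against the seasonal bucket; higher means more anomalous."""
    mu, sigma = baseline.get(hour_of_week, (latency, 1.0))
    return abs(latency - mu) / max(sigma, 1e-9)

# Monday 09:00 normally runs ~120 ms; 480 ms scores far above a
# typical alerting cutoff of ~3 standard deviations.
history = [(9, 115.0), (9, 120.0), (9, 125.0), (9, 118.0)]
print(anomaly_score(build_baseline(history), 9, 480.0))  # ≈ 85.8
```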

Distributed Systems & Dependency Analysis

  • Monitor and troubleshoot multi-service architectures involving:
      • Microservices
      • Downstream APIs
      • Kafka / streaming platforms
      • Cloud infrastructure (Terraform, IaC)
  • Identify whether issues originate from:
      • Upstream/downstream dependencies
      • Streaming platform
      • Infrastructure
      • Application code

Tooling & Platforms

  • Deep hands-on experience with Dynatrace (mandatory)
  • Experience with:
      • OpenTelemetry
      • Prometheus / Grafana
      • ELK / EFK
      • Cloud-native monitoring (AWS/Azure/GCP)
  • Strong JSON-based telemetry manipulation and enrichment (see the sketch after this list)
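
As one illustration of JSON telemetry manipulation and enrichment, the sketch below normalizes a raw event whose source systems disagree on field names, then enriches it with routing metadata. The field names and service-to-team map are assumptions for the example:

```python
# Hedged sketch: normalize and enrich a JSON telemetry event.
import json

SERVICE_OWNERS = {"payments-api": "team-payments", "quote-stream": "team-market-data"}

def enrich_event(raw: str) -> dict:
    event = json.loads(raw)
    # Normalize: different sources emit "svc", "serviceName", or "service".
    service = event.pop("svc", None) or event.pop("serviceName", None) or event.get("service")
    event["service"] = service
    # Enrich: attach ownership and a coarse severity bucket for alert routing.
    event["owner"] = SERVICE_OWNERS.get(service, "unassigned")
    event["severity_bucket"] = "page" if event.get("error_rate", 0) > 0.05 else "ticket"
    return event

raw = '{"svc": "payments-api", "error_rate": 0.08, "p99_ms": 950}'
print(json.dumps(enrich_event(raw), indent=2))
```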

GenAI & LLM Enablement

  • Apply GenAI / LLMs for:
      • Incident summarization
      • Root cause explanation
      • Runbook recommendations
      • Auto-remediation suggestions
  • Collaborate with platform teams to operationalize GenAI safely (a redaction sketch follows this list)
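
One concrete piece of operationalizing GenAI safely is redacting sensitive fields from incident context before it ever reaches a model. The sketch below illustrates the idea; the key list and payload shape are hypothetical:

```python
# Hedged sketch: redact sensitive keys from incident context before LLM use.
SENSITIVE_KEYS = {"account_number", "ssn", "api_key", "customer_email"}

def redact_for_llm(payload: dict) -> dict:
    """Recursively mask sensitive fields so only operational data leaves the boundary."""
    clean = {}
    for key, value in payload.items():
        if key in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact_for_llm(value)
        else:
            clean[key] = value
    return clean

incident = {
    "service": "payments-api",
    "error": "timeout calling downstream",
    "context": {"customer_email": "a@b.com", "region": "us-east-1"},
}
print(redact_for_llm(incident))  # customer_email becomes "[REDACTED]"
```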


Required Skills & Experience

  ✅ 15+ years in SRE / Production Engineering
  ✅ Strong Unified Observability background (not infra-only)
  ✅ Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
  ✅ SLI/SLO engineering experience in production systems
  ✅ Experience implementing dynamic thresholds and anomaly detection
  ✅ Knowledge of AI/ML concepts applied to Ops (AIOps)
  ✅ Distributed systems troubleshooting expertise
  ✅ Experience with Kafka or streaming data platforms
Differentiators (Highly Valued)

  • Experience in financial services or regulated environments
  • Proven reduction of alert noise and MTTR using AIOps
  • GenAI / LLM integration into operations workflows


Interview Question Bank (Mapped to LPL Expectations)

1. Dashboards, SLAs, and Reliability Targets
Purpose: Identify true SREs vs dashboard builders

  • How do you design dashboards differently for engineers vs leadership?
  • Explain how SLIs and SLOs differ from SLAs. Which do you operationalize?
  • How do you map SLOs to alerting without creating noise?
  • What KPIs would you track for a critical trading or advisor-facing platform?

Red Flag: Talks only about CPU, memory, uptime
2. Alerting Strategy & Threshold Design
Purpose: Assess signal-to-noise maturity

  • How do you decide when to use static vs dynamic thresholds?
  • Explain how you prevent alert storms during high traffic or seasonal spikes.
  • What makes an alert actionable?
  • How do you design alerts for early symptom detection?

Follow-up

  • What happens after an alert fires? Walk me through the lifecycle.


3. Dynamic Thresholds & Anomaly Detection
Purpose: Validate AIOps fundamentals

  • How do dynamic thresholds work under the hood?
  • How do you account for baseline drift and seasonality?
  • What risks do dynamic thresholds introduce?
  • How would you tune sensitivity to avoid false positives?

Expected Concepts
  ✅ Baselines
  ✅ ML models
  ✅ Adaptive learning
  ✅ Time-series analysis
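
To ground the "under the hood" question, here is one minimal sketch of a dynamic threshold: an exponentially weighted moving average (EWMA) baseline with an adaptive band. The smoothing factor, band width, and data are illustrative choices, not the only answer a candidate might give:

```python
# Hedged sketch: EWMA-based dynamic threshold with an adaptive band.
def ewma_threshold(values, alpha=0.2, band_k=3.0):
    """Yield (value, upper_bound, breached) as the baseline adapts to the stream."""
    mean, var = values[0], 0.0
    for v in values:
        upper = mean + band_k * (var ** 0.5)
        yield v, upper, v > upper and var > 0
        # Update after checking, so the anomaly cannot mask itself.
        diff = v - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)

stream = [100, 102, 98, 101, 99, 240, 101]  # one obvious spike
for value, upper, breached in ewma_threshold(stream):
    print(f"{value:>5}  upper≈{upper:7.1f}  {'ALERT' if breached else 'ok'}")
# Note the risk the question probes: the 240 spike inflates the baseline,
# which is why production engines add robust estimators and seasonality models.
```
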
4. Multiplexing (Metrics, Signals, Streams)
Purpose: Test system observability depth

  • What is multiplexing in observability?
  • How do multiple telemetry signals strengthen diagnosis?
  • Provide an example where one signal was misleading.
  • How do you correlate metrics, traces, logs, and events?


5. JSON Tooling & Proactive Detection
Purpose: Ensure hands-on operational telemetry skills

  • How have you used JSON-based event payloads to enrich observability?
  • How do you normalize data across heterogeneous sources?
  • How do structured logs improve proactive detection?
  • How do you extract signals from high-volume telemetry?


6. Proactive vs Reactive Detection
Purpose: Directly aligned to LPL concern

  • Give an example where you predicted an incident before customer impact.
  • What indicators help you identify impending failures?
  • How do you measure the success of proactive detection?


7. Multi-Service Failure Diagnosis (Critical Question)
Purpose: Core differentiator at LPL

Scenario Question
A user-facing issue is reported. The architecture includes:

  • Frontend
  • Backend microservices
  • Downstream APIs
  • Kafka streams
  • Terraform-managed infrastructure

Ask:

  • How do you determine if the issue is:
      • Application-related?
      • Kafka or streaming lag?
      • Downstream API latency?
      • Infrastructure drift via Terraform?

Expected Approach
  ✅ Dependency mapping
  ✅ Golden signals
  ✅ Trace correlation
  ✅ Change analysis
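
As a minimal illustration of the trace-correlation step, the sketch below attributes end-to-end latency to the tier with the most self-time in a single distributed trace. The span shape and tier names are assumptions, not a specific tracing API:

```python
# Hedged sketch: find the tier whose self-time dominates a slow trace.
# self-time = own duration minus time spent waiting on child calls.
spans = [
    {"tier": "frontend",       "duration_ms": 1210, "child_ms": 1180},
    {"tier": "backend-svc",    "duration_ms": 1180, "child_ms": 1130},
    {"tier": "downstream-api", "duration_ms": 1130, "child_ms": 40},
    {"tier": "kafka-consume",  "duration_ms": 40,   "child_ms": 0},
]

def slowest_tier(spans):
    self_times = {s["tier"]: s["duration_ms"] - s["child_ms"] for s in spans}
    return max(self_times, key=self_times.get), self_times

tier, breakdown = slowest_tier(spans)
print(breakdown)                 # downstream-api holds ~1090 ms of self-time
print(f"suspect tier: {tier}")   # points to downstream API latency, not Kafka
```
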
8. Dynatrace (Mandatory)
Purpose: Address explicit gap in feedback

  • What Dynatrace features have you used most?
  • How does Davis AI determine root cause?
  • How do you implement service-level baselining in Dynatrace?
  • How do you reduce alert noise using Dynatrace?

Red Flag: “I’ve mostly used dashboards”
9. AI/ML & AIOps Fundamentals
Purpose: Ensure non-theoretical knowledge

  • What ML techniques are commonly used in AIOps?
  • How do supervised vs unsupervised models differ in Ops?
  • Where does AI fail in observability?
  • How do you validate AI-based decisions?


10. GenAI & LLM Use Cases for SRE
Purpose: Explicit LPL requirement

  • Where do you see GenAI adding value in SRE?
  • Have you used LLMs for incident response?
  • How would you integrate GenAI without introducing risk?
  • What data would you restrict from LLM exposure?

Expected Use Cases
  ✅ Incident summarization
  ✅ RCA explanation
  ✅ Runbook suggestions
  ✅ MTTR reduction
