Job Description

Salary:

This role serves as the operational leader of the Product Management function, establishing clear ownership, coaching Product Managers, improving operating rhythms, and building a high-performing product organization that consistently translates strategy into execution.

The DevOps Engineer Lead owns the clarity, reliability, security, and repeatability of how our systems are built, deployed, and operated. This role designs and maintains automated, scalable, secure, and cost-effective infrastructure across production, development, and test environments, while advancing the internal platform capabilities engineers rely on every day, including the shared AI and LLM infrastructure that supports modern product delivery. This is a deeply hands-on role responsible for executing and improving deployments, observability, AI-enabled delivery tooling, and core operational practices to reduce risk caused by opaque processes, undocumented knowledge, and single points of failure. The DevOps Engineer Lead turns deployment and infrastructure from siloed knowledge into understandable, well-documented, observable systems that teams can confidently use and improve, including platform patterns for safe and reliable LLM usage.

The role leads through practice by mentoring engineers, establishing standards, improving processes, and removing operational obstacles. Working independently and in close partnership with Engineering, this role reduces operational burden, increases delivery confidence, and builds platform capabilities that scale reliably. This role provides technical leadership through ownership and execution and does not include formal people-management responsibilities.
Experience Required:
- 6-10 years of hands-on experience in DevOps, infrastructure, or platform engineering supporting production systems at scale, with experience enabling modern automation-heavy engineering organizations.
- Strong, hands-on experience operating workloads in AWS, with responsibility for reliability, security, performance, and day-to-day operations across services that include both traditional application workloads and AI-enabled platform capabilities.
- Proven production experience with Kubernetes, including deploying, operating, and troubleshooting containerized workloads.
- Strong programming experience with Python (or similar), with the ability to write and maintain production automation and work fluently in code-first operational workflows.
- Deep hands-on expertise with Terraform and infrastructure-as-code practices, with experience using broader DevOps tooling such as CloudFormation and Ansible.
- Strong proficiency with Git-based source control, including code reviews, collaborative workflows, and infrastructure/code ownership.
- Extensive experience building, operating, and improving CI/CD pipelines for provisioning, deployment, scaling, and automated verification, including practical integration of AI-assisted tooling in delivery workflows.
- Strong Linux/Unix expertise, including administration, scripting, troubleshooting, and operational monitoring in production environments.
- Hands-on experience implementing monitoring and log aggregation platforms (ELK, Graylog, Graphite, Prometheus, etc.).
- Experience implementing test automation and AI-assisted tooling to improve deployment quality, reliability, and operational efficiency, including workflows that use LLM-based assistants responsibly.
- Experience deploying and managing application infrastructure such as web or application servers, load balancers, queues, and caches, with an emphasis on scalability, resiliency, and operational transparency, plus familiarity with shared gateway or proxy patterns for external AI/LLM services.
- Must be authorized to work in the U.S.

Nice to Have:
- Hands-on experience with networking concepts such as VPNs, firewall rules, or hybrid-cloud connectivity, and the ability to apply these concepts to secure AI service integrations.
- Security experience in regulated or compliance-driven environments (e.g., SOC 2 and HIPAA familiarity), including governance considerations for AI and LLM-backed workflows.
- Database administration experience, including performance and reliability fundamentals.
- Experience supporting or deploying 12-Factor applications, internal developer platforms, or AI-enabled engineering workflows; experience with LLM gateways, model routing, or usage/cost controls is a strong plus.

What you'll be doing

Platform, Deployment, and Operations Ownership
- Own day-to-day DevOps operations, including infrastructure health, monitoring, logging, patching, security posture, and maintenance, ensuring systems are observable and failures are diagnosable through strong metrics, logging, root-cause visibility, and effective incident response.
- Own and execute deployment processes end-to-end, ensuring they are secure, repeatable, transparent, and well documented with clear failure signals, automated rollback strategies, and release evidence that supports fast, safe decision-making.
- Design, build, and maintain automated, scalable, secure, and cost-effective infrastructure across production, development, and test environments using infrastructure-as-code and platform engineering best practices, including shared services that enable AI-assisted and agentic engineering workflows.
- Build, operate, and continuously improve CI/CD pipelines with reliable verification stages, clear failure signals, recovery paths, and rollback strategies, including automation hooks that support AI-enabled development workflows without weakening quality gates.
- Own application-level networking and infrastructure concerns, including network configuration, access controls, and connectivity required to support development and production environments, including secure connectivity for AI and LLM-backed services.
- Own infrastructure and networking concerns, including the configuration and troubleshooting of site-to-site VPNs, firewall rules, and secure connectivity required for county-level integrations and remote access.

Security, Reliability, and Standards
- Access Analysis & Least Privilege: Perform regular access analysis across all systems, managing secrets, credentials, and IAM roles to ensure strict adherence to security best practices, including secure handling of AI provider credentials and service tokens.
- Audit Readiness & Evidence: Proactively support compliance requirements (such as SOC 2 and HIPAA) by maintaining auditable operational practices and generating technical evidence and reports for software and security audits, including traceability of AI and LLM service usage where required.
- Vulnerability Management: Enforce security posture through proactive patching, encryption, and vulnerability management across web servers, load balancers, data stores, runtime dependencies, and AI integration surfaces.

Enablement, Leadership, and Continuous Improvement
- Partner with software engineers during deployments and operational work to build shared understanding and enable safe, independent troubleshooting.
- Deploy, manage, and scale web and application servers, load balancers, queues, and caches through automated, repeatable workflows, and provide robust platform primitives for AI-enabled services and internal engineering automation.
- Identify, prioritize, and deliver improvements that reduce operational risk, remove bottlenecks, improve efficiency, increase delivery confidence, strengthen engineering throughput, and improve cost visibility across both cloud and AI usage.
- Document systems and processes with a focus on explaining both how they work and why, including clear runbooks and operational standards.
- Take proactive ownership of workload while ensuring strong coordination and transparency across the team, and coach engineers on practical use of platform, infrastructure, and AI-enabled engineering tools, patterns, and guardrails.
- Perform other job-related duties as assigned to support departmental goals and continuous improvement initiatives.

You may be a good fit for our team if you have the following skills
- Strong ability to understand and operate systems end-to-end, including application architecture, infrastructure, deployment workflows, production operations, and AI/LLM service integration patterns.
- Proven ability to troubleshoot and resolve complex production issues across infrastructure, CI/CD pipelines, Kubernetes, and runtime environments.
- Strong understanding of observability practices, including metrics, centralized logging, alerting, tracing, and root-cause analysis, with the ability to extend observability and diagnostics to AI-enabled services.
- Deeply hands-on operator with sound technical judgment; able to assess situations quickly and clearly recommend solutions (what we should do and why).
- Strong sense of ownership and accountability, with the ability to prioritize work that improves reliability, reduces risk, controls cost, and ensures follow-through across both core infrastructure and AI-enabled platform services.
- Ability to collaborate effectively with software engineers and communicate clearly with both technical and nontechnical stakeholders.
- Ability to lead through influence by pairing, mentoring, documenting, and establishing practical standards across platform and delivery engineering.
- Self-starter comfortable operating in environments where structure must be built, not inherited, with a focus on clarity, measurable outcomes, and execution.
- Strong security mindset, with hands-on experience in secrets management, access controls, encryption, patching, vulnerability management, and secure service integration, including third-party AI and LLM providers.
- Hands-on experience with network topology, including the ability to configure and troubleshoot site-to-site VPNs, firewall rules, and hybrid-cloud connectivity.

What we provide
- Medical (includes H.S.A. option with employer contribution), dental, and vision insurance
- Short- and long-term disability
- Company-paid basic life insurance
- 401(k) with 4% company match and immediate vesting
- Free financial education and consultation
- Wellness program that helps you earn lower premiums
- Robust EAP program that includes free therapy sessions, lifestyle coaching, legal/ID theft services, and more
- 12 weeks fully paid parental leave
- Up to $5,000 adoption fee reimbursement
- $500 wellness reimbursement after 60 days of employment
- Generous PTO policy and 10 company-paid holidays
- Company-paid cell phone plan

Find yourself checking a lot of these boxes but doubting whether you should apply?
Studies have shown that women and people of color are less likely to apply to jobs unless they meet every qualification. At Northwoods, we are dedicated to building a diverse, inclusive, and authentic workplace, so if you are excited about this role but your experience doesn't align perfectly with every qualification in the job description, we encourage you to apply anyway. You may be just the right candidate for this or other roles.

Northwoods is committed to diversity in its workforce and is proud to be an equal opportunity employer. We are excited to work with talented people, period. All employment decisions are based on business needs, job requirements, and individual qualifications, without regard to race, color, religion or belief, national or ethnic origin, gender, age, disability, sexual orientation, gender identity and/or expression, marital or civil status, political affiliation, family or parental status, or any other status protected by the laws or regulations in the jurisdictions in which we operate.

Who is Northwoods?
Northwoods makes software solutions that improve the quality of life of case managers and social workers in the Health and Human Services (HHS) field.
Recognized as one of Columbus's top places to work, Northwoods offers the chance to be part of a team that's passionate about making an impact on the lives of HHS professionals and the families they serve. We believe in creating a culture of inclusivity and accountability, seeking to hire professional, passionate, and driven individuals who share our values:
- Curiosity - Willing to test assumptions, courage to ask questions, and active listening.
- Community - Helping and mentoring each other, celebrating diversity, and acknowledging our team members' contributions.
- Resourcefulness - Willing to try and fail. Asking for help and trusting the expertise of our team.
- Stewardship - Safeguarding Northwoods' values, culture, mission, and resources.

We believe that our team members are all accountable adults, not only to themselves but to each other, and we treat them that way. Our team works incredibly hard, as shown by their dedication to their craft and our mission.

Our Solutions
Our products are designed for state and county social care program areas, including child welfare, childcare, child support, economic assistance, and adult & aging agencies. These solutions leverage technology to allow case workers and social workers to easily collect, store, manage, and share case content and data more efficiently. By simplifying processes, our customers can spend more time engaging with the families they serve, make better-informed decisions, and achieve better outcomes.

Traverse - A SaaS solution that allows for easy, on-the-go access to case files, case work, and interview forms, along with intelligent insights on case content and materials, which lead to better engagement and better outcomes.

Compass - Compass is among our longer-standing software solutions and supports multiple markets for both state and county customers. Our focus on reducing time wasted on paperwork and administrative burdens allows case workers to better serve families in need.

Case Aide Services - Our trusted team of experienced child welfare professionals becomes an extension of agency staff, supporting them with administrative tasks such as referral and records requests, compiling documents for court, and document organization. Workers can focus on fostering healthier families without sacrificing their own well-being.