Site Reliability Engineer (SRE) - Ops Level 2
Adex International, a global enterprise (ISO 9001:2015 and ISO/IEC 27001:2013 certified), has an integrated portfolio of IT products and services to provide the best solutions and help businesses meet their objectives.
Responsibilities
1. Incident Management & Resolution (70%):
- Serve as an escalation point for complex incidents that cannot be resolved by Level 1 support, driving efficient resolution to minimize downtime and prevent further escalations.
- Utilize advanced monitoring, observability (e.g., Prometheus, Grafana, ELK Stack), and automation tools to rapidly diagnose, troubleshoot, and resolve critical issues across cloud (AWS, Azure, GCP), Linux, and Kubernetes environments.
- Implement and maintain automated remediation workflows and playbooks to accelerate incident resolution and reduce manual toil.
- Participate in a periodic on-call rotation to respond to critical incidents outside of business hours.
- Contribute to achieving a high resolution rate for Level 2 incidents with minimal escalations to Level 3.
2. Operational Excellence & Automation (20%):
- Proactively identify and implement automation opportunities using scripting (Python, Go, Bash), configuration management (Ansible), and orchestration platforms (e.g., Jenkins, ArgoCD) to enhance operational efficiency and reduce manual workloads.
- Develop, refine, and maintain comprehensive Standard Operating Procedures (SOPs), runbooks, and troubleshooting guides for common and complex operational issues.
- Collaborate with engineering teams to integrate operational insights into system design, contributing to more resilient and observable systems (Shift-Left SRE).
3. Incident Analysis & Collaboration (10%):
- Conduct deep-dive analysis of incident trends to identify root causes, recurring problems, and systemic weaknesses. Propose and implement preventative measures and long-term solutions.
- Facilitate strong collaboration and communication with NOC, Engineering, Product, and other support teams to ensure alignment, effective knowledge transfer, and continuous improvement.
- Contribute to post-incident reviews (PIRs/RCAs) to extract learnings and drive actionable improvements.
Skills
Key Technical Skills & Experience:
1. Deep Expertise in Kubernetes (5+ years of experience):
- Extensive experience troubleshooting and resolving complex issues within Kubernetes clusters (e.g., pod connectivity, OOMKilled errors, DaemonSets, StatefulSets).
- Proficiency in Kubernetes administration, including managing namespaces, deployments, services, ingress, and persistent volumes (PVs/PVCs).
- Hands-on experience with Kubernetes autoscaling (HPA, VPA, Cluster Autoscaler) and familiarity with modern cluster management tools like Karpenter.
- Understanding of Kubernetes control plane components (API Server, etcd, Scheduler, Controller Manager) and their purpose.
- Ability to diagnose and resolve inter-pod communication issues within the same or different namespaces.
- Must have hands-on experience creating Kubernetes clusters from scratch and extensive experience troubleshooting production issues under a customer-defined RTO of 15 minutes.
2. Cloud Infrastructure & Administration (AWS/Azure/GCP):
- 5-10 years of hands-on experience troubleshooting and managing resources in public cloud environments (AWS strongly preferred).
- Proficiency in diagnosing and resolving issues related to EC2, VPC, Load Balancers (ALB/NLB), Route 53, S3, RDS, Lambda, and other core cloud services.
- Experience with Infrastructure as Code (IaC) tools like CloudFormation or Terraform, including performing rollbacks.
- Understanding of backend Lambda communication patterns with global services.
3. Linux System Administration (Expert Level):
- In-depth knowledge of Linux/Unix operating systems, including process management, file systems (e.g., inodes), networking, and troubleshooting tools (strace, tcpdump, lsof, top, vmstat, iostat).
- Strong understanding of memory management in Linux and ability to diagnose related issues.
- Experience diagnosing and resolving issues on remote servers across different regions.
4. Networking & Load Balancing:
- Solid understanding of OSI, TCP/IP, DNS, HTTP/S, ARP and network troubleshooting.
- Experience with web server management (Nginx, Apache) and global traffic management configurations.
- Ability to build skills in troubleshooting network switches, firewalls, and VPNs (e.g., during datacenter outages).
5. Monitoring & Observability:
- Expertise in setting up, configuring, and utilizing monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack, Splunk, Datadog) to identify, investigate, and resolve infrastructure issues.
6. Automation & CI/CD:
- Proficiency in scripting languages (Python, Go, Bash) and automation platforms (Ansible, Chef, Puppet).
- Experience with CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions) for automated deployments and infrastructure changes.
- Strong version control skills with Git (GitHub).
7. Security & Compliance:
- Familiarity with secrets management (e.g., HashiCorp Vault) and certificate management (SSL/TLS).
Core Soft Skills:
- Exceptional Problem-Solving: Ability to rapidly analyze complex technical incidents under pressure, identify root causes, and implement effective solutions.
- Strong Communication & Collaboration: Excellent verbal and written communication skills to articulate complex technical issues clearly to diverse audiences (technical and non-technical). Proven ability to work effectively with cross-functional teams.
- Adaptability & Resilience: Thrive in a fast-paced, dynamic environment, handling multiple priorities and quickly adapting to new technologies and challenges.
- Time Management & Initiative: Proactive approach to identifying and addressing operational inefficiencies, prioritizing critical escalations, and driving continuous improvement.
- Customer/Stakeholder Focus: A deep commitment to maintaining system uptime, optimizing resolution efficiency, and ensuring a positive experience for internal and external stakeholders.
Qualifications
- Bachelor’s degree in Computer Science, IT, Data Science, Engineering, or related field.
- 5+ years of professional experience in this field.
Why Join Us?
Adex is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status. We carefully select candidates, test them for technical competency, and emphasize communication skills.
- Great Learning & Development Opportunities
- Industry Leading People and Policies
- Work-life Balance
- Fun and Learning Fridays
- Employee wellbeing
- Interesting compensation and benefits
- Stellar opportunity to work with a rising company
- A fast-paced tech environment
- Weekends off (Saturday & Sunday)
- Attractive fringe benefits
How to Apply?
We’re always on the hunt for awesome, supercharged people who want to join our squad and bring their ninja skills to the table. So if you’re a powerhouse of talent and excitement, come hang out with us and let’s rock this team thing together!
To apply, email your updated resume to careers@adex.ltd.