Our client is seeking a Site Reliability Engineer (SRE) who will be responsible for ensuring the reliability, performance, and scalability of its software, websites, and applications. This role requires a combination of software engineering and systems administration skills to monitor, control, and automate systems. The ideal candidate will have a deep understanding of cloud infrastructure, automation tools, and best practices for maintaining high availability and performance. This position plays a critical role in maintaining the overall health and efficiency of the client's platform.
Key Responsibilities:
System Monitoring and Maintenance:
- Monitor the performance and reliability of Kubernetes clusters, software, websites, and applications.
- Automate routine maintenance tasks to ensure system stability and performance.
Incident Response and Troubleshooting:
- Respond to and resolve incidents in a timely manner, minimizing downtime and impact on users.
- Conduct root cause analysis to identify and address underlying issues.
- Develop and implement strategies to prevent future incidents and improve system resilience.
Automation and Infrastructure Management:
- Design, build, and maintain automated systems and processes to improve efficiency and reduce manual intervention.
- Manage cloud infrastructure, including provisioning, scaling, and optimizing resources.
- Collaborate with development teams to ensure seamless deployment and integration of new features and updates.
Performance Optimization:
- Analyze system performance and identify areas for improvement.
- Implement performance tuning and optimization techniques to enhance system efficiency.
- Collaborate with cross-functional teams to ensure optimal performance of all components.
Security and Compliance:
- Ensure compliance with security best practices and industry standards.
- Implement and maintain security measures to protect systems and data.
- Conduct regular security audits and vulnerability assessments.
Documentation and Reporting:
- Maintain accurate and up-to-date documentation of systems, processes, and procedures.
- Generate and analyze reports on system performance, incidents, and other key metrics.
- Provide regular updates to management and stakeholders on system health and performance.
Continuous Improvement:
- Identify opportunities for improving system reliability, performance, and scalability.
- Stay up-to-date with industry trends and best practices in site reliability engineering.
- Participate in training and development opportunities to enhance skills and knowledge.
Qualifications:
- Deep expertise in Kubernetes and containers.
- Strong understanding of cloud infrastructure, automation tools, and best practices for maintaining high availability and performance.
- Experience with monitoring and logging tools such as Loki and Grafana.
- Minimum of 3 years of experience in site reliability engineering, Kubernetes administration, or a related role.
- Excellent problem-solving skills and attention to detail.
- Strong communication and interpersonal skills, with the ability to work effectively with cross-functional teams.