Site Reliability Engineer

Omniscius Consulting

United States

Our client is seeking a Site Reliability Engineer (SRE) who will be responsible for ensuring the reliability, performance, and scalability of its software, websites, and applications. This role requires a combination of software engineering and systems administration skills to monitor, control, and automate systems. The ideal candidate will have a deep understanding of cloud infrastructure, automation tools, and best practices for maintaining high availability and performance. This position plays a critical role in maintaining the overall health and efficiency of the client's platform.

Key Responsibilities:

System Monitoring and Maintenance:
- Monitor the performance and reliability of Kubernetes clusters, software, websites, and applications.
- Automate routine maintenance tasks to ensure system stability and performance.

Incident Response and Troubleshooting:
- Respond to and resolve incidents in a timely manner, minimizing downtime and impact on users.
- Conduct root cause analysis to identify and address underlying issues.
- Develop and implement strategies to prevent future incidents and improve system resilience.

Automation and Infrastructure Management:
- Design, build, and maintain automated systems and processes to improve efficiency and reduce manual intervention.
- Manage cloud infrastructure, including provisioning, scaling, and optimizing resources.
- Collaborate with development teams to ensure seamless deployment and integration of new features and updates.

Performance Optimization:
- Analyze system performance and identify areas for improvement.
- Implement performance tuning and optimization techniques to enhance system efficiency.
- Collaborate with cross-functional teams to ensure optimal performance of all components.

Security and Compliance:
- Ensure compliance with security best practices and industry standards.
- Implement and maintain security measures to protect systems and data.
- Conduct regular security audits and vulnerability assessments.

Documentation and Reporting:
- Maintain accurate and up-to-date documentation of systems, processes, and procedures.
- Generate and analyze reports on system performance, incidents, and other key metrics.
- Provide regular updates to management and stakeholders on system health and performance.

Continuous Improvement:
- Identify opportunities for improving system reliability, performance, and scalability.
- Stay up-to-date with industry trends and best practices in site reliability engineering.
- Participate in training and development opportunities to enhance skills and knowledge.

Qualifications:
- Deep expertise in Kubernetes and containers.
- Strong understanding of cloud infrastructure, automation tools, and best practices for maintaining high availability and performance.
- Experience with monitoring and logging tools such as Loki and Grafana.
- Minimum of 3 years of experience in site reliability engineering, Kubernetes administration, or a related role.
- Excellent problem-solving skills and attention to detail.
- Strong communication and interpersonal skills, with the ability to work effectively with cross-functional teams.
