Remote Senior Site Reliability Engineer ML Platforms
Are you passionate about building and maintaining large-scale production systems that support advanced data science and machine learning applications? Do you want to join a team at the heart of NVIDIA's data-driven decision-making culture? If so, we have a great opportunity for you! NVIDIA is seeking a Senior Site Reliability Engineer (SRE) for the Data Science & ML Platform(s) team. The role involves designing, building, and maintaining services that enable real-time data analytics, streaming, data lakes, observability and ML/AI training and inferencing. The responsibilities include implementing software and systems engineering practices to ensure high efficiency and availability of the platform, as well as applying SRE principles to improve production systems and optimize service SLOs. Additionally, collaboration with our customers to plan and implement changes to the existing system, while monitoring capacity, latency, and performance, is part of the role. To succeed in this position, a strong background in SRE practices, systems, networking, coding, capacity management, cloud operations, continuous delivery and deployment, and open-source cloud enabling technologies like Kubernetes and OpenStack is required. Deep understanding of the challenges and standard methodologies of running large-scale distributed systems in production, solving complex issues, automating repetitive tasks, and proactively identifying potential outages is also necessary. Furthermore, excellent communication and collaboration skills, and a culture of diversity, intellectual curiosity, problem solving, and openness are essential. As a Senior SRE at NVIDIA, you will have the opportunity to work on innovative technologies that power the future of AI and data science, and be part of a dynamic and supportive team that values learning and growth.
The role provides the autonomy to work on meaningful projects with the support and mentorship needed to succeed, and contributes to a culture of blameless postmortems, iterative improvement, and risk-taking. If you are seeking an exciting and rewarding career that makes a difference, we invite you to apply now!

What you'll be doing:
* Develop software solutions to ensure reliability and operability of large-scale systems supporting machine-critical use cases.
* Gain a deep understanding of our system operations, scalability, interactions, and failures to identify improvement opportunities and risks.
* Create tools and automation to reduce operational overhead and eliminate manual tasks.
* Establish frameworks, processes, and standard methodologies to enhance operational maturity, team efficiency, and accelerate innovation.
* Define meaningful and actionable reliability metrics to track and improve system and service reliability.
* Oversee capacity and performance management to facilitate infrastructure scaling across public and private clouds globally.
* Build tools to improve our service observability for faster issue resolution.
* Practice sustainable incident response and blameless postmortems.

What we need to see:
* Minimum of 10 years of experience in SRE, Cloud platforms, or DevOps with large-scale microservices in production environments.
* Master's or Bachelor's degree in Computer Science or Electrical Engineering or CE or equivalent experience.
* Strong understanding of SRE principles, including error budgets, SLOs, and SLAs.
* Proficiency in incident, change, and problem management processes.
* Skilled in problem-solving, root cause analysis, and optimization.
* Experience with streaming data infrastructure services, such as Kafka and Spark.
* Expertise in building and operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus).
* Proficiency in programming languages such as Python, Go, Perl, or Ruby.
* Hands-on experience with scaling distributed systems in public, private, or hybrid cloud environments.
* Experience in deploying, supporting, and supervising services, platforms, and application stacks.

Ways to stand out from the crowd:
* Experience operating large-scale distributed systems with strong SLAs.
* Excellent coding skills in Python and Go and extensive experience in operating data platforms.
* Knowledge of CI/CD systems, such as Jenkins and GitHub Actions.
* Familiarity with Infrastructure as Code (IaC) methodologies and tools.
* Excellent interpersonal skills for identifying and communicating data-driven insights.

NVIDIA leads the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for exceptional people like you to help us accelerate the next wave of artificial intelligence.

The base salary range is 224,000 USD - 425,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

NVIDIA is the world leader in accelerated computing.
NVIDIA pioneered accelerated computing to tackle challenges no one else can solve. Our work in AI and digital twins is transforming the world's largest industries and profoundly impacting society. Learn more about NVIDIA.

#Salary and compensation
No salary data published by company, so we estimated the salary based on similar Python, DevOps, Cloud, Senior and Engineer jobs:

$60,000 - $135,000/year

#Benefits
* 401(k)
* Distributed team
* Async
* Vision insurance
* Dental insurance
* Medical insurance
* Unlimited vacation
* Paid time off
* 4 day workweek
* 401k matching
* Company retreats
* Coworking budget
* Learning budget
* Free gym membership
* Mental wellness budget
* Home office budget
* Pay in crypto
* Pseudonymous
* Profit sharing
* Equity compensation
* No whiteboard interview
* No monitoring system
* No politics at work
* We hire old (and young)

#Location
US, CA, Santa Clara

Please mention that you found the job on Remote OK; this helps us get more companies to post here. Thanks!
When applying for jobs, you should NEVER have to pay to apply. You should also NEVER have to pay to buy equipment which they then pay you back for later. Also never pay for trainings you have to do. Those are scams! NEVER PAY FOR ANYTHING! Posts that link to pages with "how to work online" are also scams. Don't use them or pay for them. Also always verify you're actually talking to the company in the job post and not an imposter. A good idea is to check the domain name for the site/email and see if it's the actual company's main domain name. Scams in remote work are rampant, be careful! Read more to avoid scams. When clicking on the button to apply above, you will leave Remote OK and go to the job application page for that company outside this site. Remote OK accepts no liability or responsibility as a consequence of any reliance upon information on there (external sites) or here.
Remote Senior DevOps Lead Cloud & Autonomous System
About Cyngn
Based in Menlo Park, CA, Cyngn is a publicly traded autonomous vehicle company. Whether at a warehouse floor, mine, or construction site, our self-driving technology can be deployed in various commercial domains across various vehicle form factors. To build this emergent technology, we seek innovative, motivated, and experienced leaders to join our team and move this field forward. If you like to build, tinker, and create with a team of trusted and passionate colleagues, then Cyngn is the place for you. Key reasons to join Cyngn:

We are Small and Big.
With under 100 employees, Cyngn is still a company that operates with the energy of a startup. On the other hand, we are publicly traded. Combined, our employees not only work in close-knit teams with close mentorship from company leaders, but they also get access to the liquidity of our publicly traded equity. This gives our small team the opportunity to make a big impact in industries that other people aren't touching, without taking on the risks associated with untested organizations.

We Build Today and Deploy Tomorrow.
Our employees aren't just researchers but are creating reality. In other words, the autonomous vehicles we're building are designed to go to real clients right away. We are driven by our passion for innovation, our ability to see the entire product, and the real impact of our work in the real world. At Cyngn, the distance between the theoretical and the actual is razor-thin.

We aren't robots. We just build them.
Read our Glassdoor reviews, and you'll find that one of the best things about working here is the people. We are an inclusive, diverse team of top talent with exceptional synergy. We thrive on open collaboration and a trusting and creative work environment that is fueled by our passion for the industry. At Cyngn, everyone's voice is valued, and each of our unique perspectives is celebrated. It's the people that allow our company to continue to grow bigger and better every day.

About this Role:
As a Senior DevOps Lead at Cyngn, you will play a vital role in architecting and managing infrastructure across cloud and autonomous vehicle systems. This position combines traditional cloud DevOps leadership with specialized expertise in robotics and autonomous systems infrastructure. You will bridge the gap between cloud operations and edge computing while leading a team of DevOps engineers to build and maintain scalable, reliable infrastructure for our autonomous vehicle platform.

What you will do in this role
* Lead and architect cloud and vehicle infrastructure initiatives across AWS and ROS/Linux environments
* Design and implement scalable solutions for both cloud services and autonomous vehicle systems
* Establish and maintain DevOps best practices, CI/CD pipelines, and infrastructure as code
* Drive observability, monitoring, and incident response strategies
* Optimize performance and cost efficiency of cloud and edge computing resources
* Mentor team members and foster a developer-friendly environment
* Manage on-call rotations and incident response processes
* Architect solutions for processing and storing large-scale vehicle telemetry data
* Lead security initiatives and compliance efforts across infrastructure
* Design and implement solutions for both cloud services and autonomous vehicle systems
* Optimize system performance for real-time processing of high-bandwidth sensor data
* Develop and maintain documentation for system architecture and integration procedures

Who you are
* 10+ years of relevant DevOps/Infrastructure experience
* Proven track record as a technical lead in platform or infrastructure teams
* Advanced expertise in AWS services, infrastructure as code (Terraform), and Kubernetes
* Strong experience with service mesh (Istio) and Helm/Kustomize
* Deep understanding of ROS/ROS2 and Linux kernel configurations
* Experience with GPU configurations and ML infrastructure
* Expertise in ARM and NVIDIA CUDA platform configurations
* Strong programming skills in Python and shell scripting
* Experience with infrastructure automation (Ansible)
* Expertise in CI/CD tools (Jenkins, GitHub Actions)
* Strong system architecture and design skills
* Excellence in technical documentation
* Outstanding problem-solving abilities
* Strong leadership and mentoring capabilities

Nice to haves
* Experience with autonomous vehicle systems
* Track record of optimizing GPU-based ML infrastructure
* Experience with large-scale IoT deployments
* Contributions to open-source projects
* Experience with real-time systems and low-latency requirements
* Expertise in security implementations including SSO, IdP, and AWS Cognito
* Experience with JFrog artifactory and container registry management
* Proficiency in AWS IoT Greengrass
* Experience with container resource management on edge devices
* Understanding of CPU affinity and priority scheduling
* Track record of implementing cost optimization strategies
* Experience with scaling systems both horizontally and vertically

Benefits & Perks
* Health benefits (Medical, Dental, Vision, HSA and FSA (Health & Dependent Daycare), Employee Assistance Program, 1:1 Health Concierge)
* Life, Short-term, and long-term disability insurance (Cyngn funds 100% of premiums)
* Company 401(k)
* Commuter Benefits
* Flexible vacation policy
* Stock options for all full-time employees
* Sabbatical leave opportunity after five years with the company
* Paid Parental Leave
* Daily lunches for in-office employees and fully stocked kitchen with snacks and beverages
* Monthly meal and tech allowances for remote employees

$180,000 - $240,000 a year

#Salary and compensation
No salary data published by company, so we estimated the salary based on similar Design, Python, DevOps, Cloud, Senior and Linux jobs:

$45,000 - $75,000/year

#Location
Menlo Park, CA
Who You Are
We are seeking an experienced and talented Cloud Engineer with a heavy network focus to join our growing team. As a Cloud Engineer, you will play a crucial role in designing, deploying, and managing cloud infrastructure across large environments. Leveraging Infrastructure as Code (IaC) principles, you will ensure seamless deployment and scalability. If you are passionate about cloud technologies and automation, and have expertise in Azure, Windows, Linux (RHEL), Kubernetes, Terraform, Ansible, and Python, we want to hear from you!

What You'll Do
* Implement and maintain a seamless hybrid cloud network spanning Azure, Cisco Routing, Cisco Nexus ACI, Palo Alto Panorama, and A10.
* Design, deploy, and maintain cloud infrastructure in Azure, ensuring scalability and reliability.
* Implement CI/CD pipelines using Azure DevOps or similar tools.
* Implement Infrastructure as Code (IaC) practices using Terraform and Ansible.
* Work with both Windows and Linux (RHEL) operating systems.
* Manage Kubernetes clusters for containerized applications.
* Utilize Python scripting for automation and custom tooling.
* Implement secure vaulting solutions for sensitive data.
* Collaborate with cross-functional teams to define and implement best practices.

What You'll Need
* Bachelor's degree in Computer Science, Information Technology, or related field.
* Proven experience as a Cloud Engineer or similar role.
* Azure certification (e.g., AZ-900, AZ-104) is a plus.
* Strong understanding of cloud architecture and IaC principles.
* Excellent problem-solving skills and attention to detail.
* Ability to work independently and collaboratively in a fast-paced environment.
* Strong communication and interpersonal skills.
* Experience with VMware for virtualization.
* Familiarity with Kafka, Redis, and MongoDB.
* Proficiency in MSSQL and database administration.
* Strong knowledge of the Go programming language.
* Storage Area Networking (iSCSI) experience.
* Load balancing knowledge (A10 preferred).

#Salary and compensation
No salary data published by company, so we estimated the salary based on similar Python, DevOps, Finance, Cloud and Engineer jobs:

$50,000 - $90,000/year
About Phaidra

Phaidra is building the future of industrial automation.

The world today is filled with static, monolithic infrastructure. Factories, power plants, buildings, etc. operate the same way they've operated for decades, because the controls programming is hard-coded. Thousands of lines of rules and heuristics define how the machines interact with each other. The result of all this hard-coding is that facilities are frozen in time, unable to adapt to their environment while their performance slowly degrades.

Phaidra creates AI-powered control systems for the industrial sector, enabling industrial facilities to automatically learn and improve over time. Specifically:

* We use reinforcement learning algorithms to provide this intelligence, converting raw sensor data into high-value actions and decisions.
* We focus on industrial applications, which tend to be well-sensorized with measurable KPIs, making them perfect for reinforcement learning.
* We enable domain experts (our users) to configure the AI control systems (i.e. agents) without writing code. They define what they want their AI agents to do, and we do it for them.

Our team has a track record of applying AI to some of the toughest problems. From achieving superhuman performance with DeepMind's AlphaGo, to reducing the energy required to cool Google's Data Centers by 40%, we deeply understand AI and how to apply it in production for massive impact.

Phaidra is based in the USA but 100% remote; we do not have a physical office. We hire employees internationally with the help of our partner, OysterHR. Our team is currently located throughout the USA, Canada, UK, Norway, Italy, Spain, Portugal, and India.

**Please only apply to one opening. If you are a better fit for another opening, our team will move your application. Candidates who apply to multiple openings will not be considered.**

Who You Are

We are looking for a very experienced Software Engineer with a focus on MLOps tech leadership to be a part of our growing AI Platform team. You are bold and creative, and have deep empathy for customers. You will design and implement significant parts of the code base and will have the opportunity to make an immediate impact with your work and guide the product and team as we grow.

You are curious and like to understand technologies and their tradeoffs in depth, providing technical guidance to the team and peers as and when required. Leading by example, you have accumulated a wealth of insights and experiences from your hands-on involvement in the field, and you are committed to rolling up your sleeves and getting work done. You like joining and supporting other engineers in their work to learn from them as well as letting them benefit from your expertise and experience.

You have the motivation and skills to identify technical product needs, initiate projects and own their delivery, including the involvement of engineering peers as needed. You are comfortable with challenging the status quo respectfully to drive and deliver technical excellence in the team.

* We are seeking a team member located within one of the following areas: USA/Canada/UK/EU

Responsibilities

The AI Platform team you are joining is responsible for building the core platform that powers model training, inference and decision making in our products. Furthermore, the team owns MLOps and the services hosting our AI capabilities. Productionizing results from Research, as well as extending our systems and providing support according to our customer needs, fall into team responsibilities as well.
You will join this team as a very experienced engineer with a focus on MLOps solutions to grow our expertise in that area, but also contribute as a software engineer more widely in the team.

As an organization, we strongly believe in expertise across the stack. As such, you will experience flavors of Machine Learning, Software Engineering, Distributed Systems, MLOps and DevOps.

In particular, you will:

* Design, build and lead the MLOps initiatives and vision for the AI Platform to strengthen automation, orchestration, versioning, observability, monitoring and collaboration for the platform.
* Build and design scalable components for the AI Platform to allow high throughput training and inference for RL agents doing realtime inference for autonomous control of industrial systems.
* Contribute to the design and implementation of the product backend by writing REST & gRPC API services and scalable event-driven backend applications.
* Design clear, extensible software interfaces for the team's customers and maintain a high release quality bar.
* Perform DevOps duties of CI/CD, Release & Deployment management.
* Be a part of our global production on-call team, and own & operate your services in production, meeting Phaidra's high bar for operational excellence.
* Lead cross-functional initiatives collaborating with engineers, product managers and TPM across teams.
* Mentor your peers and be a technical role-model in the team.

Onboarding

In your first 30 days...

* You will be immersed in an onboarding program that introduces you to Phaidra and our product.
* You will spend time in the Engineering org, learning how the teams operate, interact, and approach problems.
* You will read various parts of our handbook and familiarize yourself with the documentation culture at Phaidra.
* You will set up your development environment and start working on an onboarding exercise that will introduce you to various parts of our code base.
* You will learn about how we use agile and be able to navigate our sprint boards and backlogs.
* You will learn about various team standards and development & release processes.
* You will start to learn about our system architecture and infrastructure.
* You will start picking up a few good "first-tasks" to get yourself accustomed to the end-to-end release flow.

In your first 60 days...

* You will get a solid understanding of what Phaidra does and how we do it.
* You will meet with team members across Phaidra and start building relationships that will help you be successful at your job.
* You will complete the onboarding exercise and will be on your way to completing your first production task.
* You will take ownership of the MLOps work on the team, identify gaps and propose roadmap items on the topic.

In your first 90 days...

* You will be fully integrated in the team and with team members across the company.
* You will have a more in-depth understanding of our system architecture and infrastructure.
* You will complete your first on-call experience helping monitor and improve our production environments.
* You will become an expert with our tooling.
* You will start to contribute to knowledge sharing throughout Phaidra and the team.
* You will proactively drive MLOps topics in the team and represent it technically throughout the company.

Key Qualifications

* 10+ years of work experience.
* Proven record of impact as a Tech Leader and bar-raiser for ambitious Software Engineering teams
* Strong experience in designing and implementing MLOps solutions for AI production systems
* Extensive experience with platform Software Engineering with the ability to contribute on all levels as an individual contributor and tech leader
* Strong expertise in building, operating and monitoring large scale multi-tenant systems with high availability, fault tolerance, performance tuning, monitoring, and metrics collection
* Ability to take ownership of realtime production systems: aligning technical with business requirements, raising the bar for operational excellence and on-call incident handling
* Strong expertise in Python and Cloud environments
* Very good grasp of Machine Learning (especially Deep Learning) fundamentals
* Ability to collaborate and communicate effectively in an all-remote setting
* Doing your work with curiosity, ownership, transparency & directness, outcome orientation, and customer empathy.

Bonus

* Experience with building applications that can be deployed in cloud, as well as in hybrid or on-prem environments
* Exposure to Reinforcement Learning or other in-depth knowledge on modern ML applications
* Experience with industrial applications, industrial control systems, IoT, sensor time series applications, or similar

Relevant Technologies from our Stack

* Python, Go
* PyTorch, PyTorch Lightning
* Ray.io, Prefect, mlflow
* REST & gRPC micro-services
* Docker, Kubernetes, Terraform & Kapitan
* GCP - GKE, PubSub, CloudSQL, BigTable, Postgres, etc.
* Grafana Cloud, Prometheus
* Poetry, Pants
* Gitlab CI, ArgoCD, Atlantis

General Interview Process

All of our interviews are held via Google Meet, and an active camera connection is required.

* Meeting with Operations (30 minutes): The purpose of this interview is to meet you, learn more about your background, discuss what you are looking for in a new position and cover formalities around your application.
* Tech Lead interview (60 minutes): This interview is a combination of technical and cultural fit assessment. We will cover your technical experience and the skills as an engineer and a tech lead while discussing projects that you have worked on in the past.
You will meet the manager for the role as well as our VP of Engineering, with the opportunity to ask any questions about the team, role and engineering at Phaidra.
* ML system design & SRE (90 minutes): In this interview, we will go over a real world MLOps problem. You can expect to draw architecture diagrams using boxes & arrows in your browser. We will talk about system design, scalability and monitoring.
* ML interview (60 minutes): This interview will focus on Machine Learning approaches, algorithms and theory. You will be asked about ML algorithms you are familiar with, how they work under the hood and how to use them in an applied setting.
* Culture fit interview with Phaidra's co-founders (30 minutes): This interview focuses on alignment with Phaidra's values and the mutual cultural fit.

Base Salary

* US Residents: $156,000-$234,000/year
* UK Residents: £108,000-£162,000/year

Salary ranges for EU countries will vary based on the market rate for the location.

This position will also include equity.

These are best faith estimates of the base salary range for this position. Multiple factors such as experience, education, level, and location are taken into account when determining compensation.

Benefits & Perks

* Fast-paced and team-oriented environment where you will be instrumental in the direction of the company.
* Phaidra is a 100% remote company with a digital nomad policy.
* Competitive compensation & equity.
* Outsized responsibilities & professional development.
* Training is foundational; functional, customer immersion, and development training.
* Medical, dental, and vision insurance (exact benefits vary by region).
* Unlimited paid time off, with a minimum of 20 days off per year requirement.
* Paid parental leave (exact benefits vary by region).
* Home office setup allowance and company MacBook.
* Monthly remote work stipend.

On being Remote

We are thoughtful about remote collaboration. We look to the pioneers, like Gitlab, for inspiration and best practices to create a stellar remote work environment. We have a documentation-first culture and actively practice asynchronous communication in everything we do. Our team stays connected through tools like Slack and video chat. Most teams meet daily, and we have dedicated all-hands meetings bi-weekly to build strong relationships. We hold virtual team building events once per month, and even hold virtual socials to watch rocket launches! We have a yearly in-person, all-company summit in locations like Seattle, Athens, Goa, and Barcelona.

Equal Opportunity Employment

Phaidra is an Equal Opportunity Employer; employment with Phaidra is governed on the basis of merit, competence, and qualifications and will not be influenced in any manner by race, color, religion, gender, national origin/ethnicity, veteran status, disability status, age, sexual orientation, gender identity, marital status, mental or physical disability, or any other legally protected status. We welcome diversity and strive to maintain an inclusive environment for all employees. If you need assistance with completing the application process, please contact us at [email protected].

E-Verify Notice

Phaidra participates in E-Verify, an employment authorization database provided through the U.S. Department of Homeland Security (DHS) and Social Security Administration (SSA). As required by law, we will provide the SSA and, if necessary, the DHS, with information from each new employee's Form I-9 to confirm work authorization for those residing in the United States.

Additional information about E-Verify can be found here.

#LI-Remote

WE DO NOT ACCEPT APPLICATIONS FROM RECRUITERS.

#Salary and compensation
The company did not publish salary data, so we estimated the salary based on similar jobs related to Design, Python, DevOps, Cloud, API, Engineer and Backend:\n\n
$70,000 — $105,000/year\n
\n\n#Benefits\n
401(k)\n\nDistributed team\n\nAsync\n\nVision insurance\n\nDental insurance\n\nMedical insurance\n\nUnlimited vacation\n\nPaid time off\n\n4 day workweek\n\n401k matching\n\nCompany retreats\n\nCoworking budget\n\nLearning budget\n\nFree gym membership\n\nMental wellness budget\n\nHome office budget\n\nPay in crypto\n\nPseudonymous\n\nProfit sharing\n\nEquity compensation\n\nNo whiteboard interview\n\nNo monitoring system\n\nNo politics at work\n\nWe hire old (and young)\n\n
\n\n#Location\nSeattle, Washington, United States
Please mention that you found the job on Remote OK; this helps us get more companies to post here. Thanks!
When applying for jobs, you should NEVER have to pay to apply. You should also NEVER have to pay to buy equipment which they then pay you back for later. Also never pay for trainings you have to do. Those are scams! NEVER PAY FOR ANYTHING! Posts that link to pages with "how to work online" are also scams. Don't use them or pay for them. Also always verify you're actually talking to the company in the job post and not an imposter. A good idea is to check the domain name for the site/email and see if it's the actual company's main domain name. Scams in remote work are rampant, be careful! Read more to avoid scams. When clicking on the button to apply above, you will leave Remote OK and go to the job application page for that company outside this site. Remote OK accepts no liability or responsibility as a consequence of any reliance upon information on there (external sites) or here.
Who You Are\n\nWe are looking for a driven Software Engineer (MLOps) to be a part of our growing AI Platform team. You are bold and creative, and have deep empathy for customers who may not be tech-savvy. You will design and implement significant parts of the code base and will have the opportunity to make an immediate impact with your work and guide the product and team as we grow.\n\nYou are curious and like to understand technologies and their tradeoffs in depth - providing technical guidance to the team and peers as and when required. Leading by example, you have accumulated a wealth of insights and experiences from your hands-on involvement in the field, and you are committed to rolling up your sleeves and getting work done. You like joining and supporting other engineers in their work to learn from them as well as letting them benefit from your expertise and experience.\n\nYou have the motivation and skills to identify technical product needs, initiate projects and own their delivery, including the involvement of engineering peers as needed. You are comfortable with challenging the status quo respectfully to drive and deliver technical excellence in the team.\n\nWe are seeking a team member located within one of the following areas: USA/Canada/UK\nResponsibilities\n\nThe AI Platform team you are joining is responsible for building the core platform that powers model training, inference and decision making in our products. Furthermore, the team owns MLOps and the services hosting our AI capabilities. Productionizing results from Research, as well as extending our systems and providing support according to our customer needs, also fall within the team's responsibilities. You will join this team as an experienced engineer with a focus on MLOps solutions to grow our expertise in that area, but also contribute as a software engineer more widely in the team.\n\nAs an organization, we strongly believe in expertise across the stack. 
As such, you will experience flavors of Machine Learning, Software Engineering, Distributed Systems, MLOps and DevOps.\n\nIn particular, you will:\n\n\n* Design, build and lead the MLOps initiatives and vision for the AI Platform to strengthen automation, orchestration, versioning, observability, monitoring and collaboration for the platform.\n\n* Build and design scalable components for the AI Platform to allow high throughput training and inference for RL agents doing realtime inference for autonomous control of industrial systems.\n\n* Contribute to the design and implementation of the product backend by writing REST & gRPC API services and scalable event-driven backend applications.\n\n* Design clear, extensible software interfaces for the team's customers and maintain a high release quality bar.\n\n* Design and optimize data storage & retrieval mechanisms for high throughput, security & ease of access.\n\n* Perform DevOps duties of CI/CD, Release & Deployment management.\n\n* Be a part of our global production on-call team, and own & operate your services in production, meeting Phaidra's high bar for operational excellence.\n\n* Lead cross-functional initiatives collaborating with engineers, product managers and TPM across teams.\n\n* Mentor your peers and be a technical role-model in the team.\n\n\n\nOnboarding\n\nIn your first 30 days…\n\n\n* You will be immersed in an onboarding program that introduces you to Phaidra and our product.\n\n* You will spend time in the Engineering org, learning how the teams operate, interact, and approach problems.\n\n* You will read various parts of our handbook and familiarize yourself with the documentation culture at Phaidra.\n\n* You will set up your development environment and start working on an onboarding exercise that will introduce you to various parts of our code base.\n\n* You will learn about how we use agile and be able to navigate our sprint boards and backlogs.\n\n* You will learn about various team standards 
and development & release processes.\n\n* You will start to learn about our system architecture and infrastructure.\n\n* You will start picking up a few good “first-tasks” to get yourself accustomed to the end-to-end release flow.\n\n\n\n\nIn your first 60 days…\n\n\n* You will get a solid understanding of what Phaidra does and how we do it.\n\n* You will meet with team members across Phaidra and start building relationships that will help you be successful at your job.\n\n* You will complete the onboarding exercise and will be on your way to completing your first production task.\n\n* You will take ownership of the MLOps work on the team, identify gaps and propose roadmap items on the topic.\n\n\n\n\nIn your first 90 days…\n\n\n* You will be fully integrated in the team and with team members across the company.\n\n* You will have a more in-depth understanding of our system architecture and infrastructure.\n\n* You will complete your first on-call experience helping monitor and improve our production environments.\n\n* You will become an expert with our tooling.\n\n* You will start to contribute to knowledge sharing throughout Phaidra and the team.\n\n* You will proactively drive MLOps topics in the team and represent it technically throughout the company.\n\n\n\nKey Qualifications\n\n\n* 7+ years of work experience.\n\n* Bachelor's or Master's in Computer Science, or equivalent experience.\n\n* Strong experience in designing and implementing MLOps solutions for AI production systems\n\n* Expertise with production Software Engineering - relational and non-relational data modelling, micro-services, understanding of event driven systems, etc.\n\n* Strong experience building large scale multi-tenant systems with high availability, fault tolerance, performance tuning, monitoring, and statistics/metrics collection.\n\n* Strong expertise in Python and Cloud environments\n\n* Good grasp of Machine Learning (especially Deep Learning) fundamentals.\n\n* Ability to 
collaborate and communicate effectively in an all-remote setting\n\n* Doing your work with curiosity, ownership, transparency & directness, outcome orientation, and customer empathy.\n\n\n\nBonus\n\n\n* Experience as a service owner of a realtime production system - operating & monitoring services in production, including using observability tooling such as Prometheus, Grafana, Tempo or equivalent offerings and incident management.\n\n* Experience with building applications that can be deployed in cloud, hybrid or on prem environments\n\n* Exposure to Reinforcement Learning\n\n\n\nOur Stack\n\n\n* Languages - (Backend) Python, Go; (Frontend) JavaScript/TypeScript, React; Customer SDK & Clients - C# .NET\n\n* PyTorch\n\n* Cypress\n\n* Docker, Kubernetes, Terraform & Kapitan\n\n* Gitlab CI, ArgoCD, Atlantis, Vercel\n\n* GCP - GKE, PubSub, CloudSQL, BigTable, Postgres, etc.\n\n* Ray.io\n\n* REST & gRPC micro-services\n\n* Poetry, Pantsbuild\n\n\n\nGeneral Interview Process\n\nAll of our interviews are held via Google Meet, and an active camera connection is required.\n\n* Initial screening interview with a People Operations team member (30 minutes): The purpose of this interview is to meet you, learn more about your background, and discuss what you are looking for in a new position.\n\n* Hiring manager interview (30 minutes): The purpose of this meeting is for you to get to know the manager for the role. This chat will mainly focus on your previous experience and technical background. You can expect to talk about projects that you have worked on in the past and ask any questions about the team & role.\n\n* Technical Interview 1 (60 minutes): The purpose of this interview is to assess your skills in Machine Learning and related mathematics.\n\n* Technical Interview 2 (90 minutes): In this interview, we will go over a real world MLOps problem. You can expect to draw architecture diagrams using boxes & arrows in your browser. 
We will talk about system design, scalability and monitoring.\n\n* Meeting with VP of Engineering (30 minutes): This interview is a combination of technical and cultural fit assessment. You will cover the technical experience and the skills that you bring, and have an opportunity to ask any questions about the team's culture or vision.\n\n* Culture fit interview with Phaidra's co-founders (30 minutes): This interview focuses on alignment with Phaidra's values\n\n\nBase Salary\n\nUS Residents: $115,200-$208,800/year\n\nUK Residents: £96,400-£144,000/year\n\nThis position will also include equity.\n\nThese are good faith estimates of the base salary range for this position. Multiple factors such as experience, education, level, and location are taken into account when determining compensation. \n\n#Salary and compensation\n
The company did not publish salary data, so we estimated the salary based on similar jobs related to Design, Python, DevOps, Cloud, API, Senior, Engineer and Backend:\n\n
$65,000 — $110,000/year\n
\n\n#Benefits\n
401(k)\n\nDistributed team\n\nAsync\n\nVision insurance\n\nDental insurance\n\nMedical insurance\n\nUnlimited vacation\n\nPaid time off\n\n4 day workweek\n\n401k matching\n\nCompany retreats\n\nCoworking budget\n\nLearning budget\n\nFree gym membership\n\nMental wellness budget\n\nHome office budget\n\nPay in crypto\n\nPseudonymous\n\nProfit sharing\n\nEquity compensation\n\nNo whiteboard interview\n\nNo monitoring system\n\nNo politics at work\n\nWe hire old (and young)\n\n
\n\n#Location\nSeattle, Washington, United States