AWS launches Amazon DevOps Guru

New machine learning powered operations service provides tailored recommendations to improve application availability.


Amazon Web Services has introduced Amazon DevOps Guru, a fully managed operations service that uses machine learning to make it easier for developers to improve application availability by automatically detecting operational issues and recommending specific actions for remediation. Informed by years of Amazon.com and AWS operational excellence, Amazon DevOps Guru applies machine learning to automatically analyze data like application metrics, logs, events, and traces for behaviors that deviate from normal operating patterns. When Amazon DevOps Guru identifies anomalous application behavior that could cause potential outages or service disruptions, it alerts developers with issue details to help them quickly understand the potential impact and likely causes of the issue, with specific recommendations for remediation. Developers can use remediation suggestions from Amazon DevOps Guru to reduce time to resolution when issues arise and improve application availability—all with no manual setup or machine learning expertise required. There are no upfront costs or commitments with Amazon DevOps Guru, and customers pay only for the data Amazon DevOps Guru analyzes. To get started with Amazon DevOps Guru, visit: aws.amazon.com/devops-guru

As more organizations move to cloud-based application deployment and microservice architectures to scale their businesses, applications have become increasingly distributed, and developers need more automated practices to maintain application availability and reduce the time and effort spent detecting, debugging, and resolving operational issues. Application downtime caused by faulty code or config changes, unbalanced container clusters, or resource exhaustion (e.g. CPU, memory, or disk) inevitably leads to bad customer experiences and lost revenue. Companies invest considerable developer resources, time, and money to deploy multiple monitoring tools, often managed separately, and then have to develop and maintain custom alerts for common issues like spikes in load balancer errors or drops in application request rates. Setting thresholds to identify and alert when application resources are behaving abnormally is difficult to get right, involves manual setup, and requires continual updates as application usage changes (e.g. an unusually large number of requests during a sales promotion). If a threshold is set too high, developers don’t see alarms until operational performance is severely impacted. If a threshold is set too low, developers get too many false positives, which they are prone to ignore. Even when developers are alerted to a potential operational issue, identifying the root cause can still prove difficult. Using existing tools, developers often have difficulty triangulating the root cause of an operational issue from graphs and alarms, and even when they find the root cause, they are often left without the right information to fix it. Each troubleshooting attempt is a cold start where teams must spend hours or days identifying problems, and this time-consuming, tedious work lengthens the time to resolve an operational failure and can prolong application disruptions.
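The threshold trade-off described above can be illustrated with a small sketch. It is not how Amazon DevOps Guru works internally (the announcement does not describe the model); it simply contrasts a fixed threshold with a rolling baseline, using a trailing-window z-score as a stand-in. The function names, the sample data, and the `window`, `k`, and `min_sigma` parameters are all assumptions made for illustration.

```python
from statistics import mean, stdev

def static_alerts(series, threshold):
    """Indices where the metric exceeds a fixed, hand-picked threshold."""
    return [i for i, value in enumerate(series) if value > threshold]

def baseline_alerts(series, window=5, k=3.0, min_sigma=1.0):
    """Indices where a value deviates from the trailing-window mean
    by more than k standard deviations (a simple rolling z-score)."""
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        # Floor the spread so a perfectly flat history does not
        # turn every tiny wobble into an alert.
        sigma = max(stdev(history), min_sigma)
        if abs(series[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts

# Hypothetical per-minute error counts with one genuine spike at index 7.
errors = [10, 12, 11, 9, 10, 11, 10, 40, 11, 10]

print(static_alerts(errors, threshold=50))  # too high: [] (spike missed)
print(static_alerts(errors, threshold=10))  # too low: [1, 2, 5, 7, 8]
print(baseline_alerts(errors))              # learned baseline: [7]
```

With the fixed threshold set too high the spike goes unnoticed, and set too low it is buried among false positives, while the baseline derived from recent history flags only the spike.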

Amazon DevOps Guru’s machine learning models leverage over 20 years of operational expertise in building, scaling, and maintaining highly available applications for Amazon.com. This gives Amazon DevOps Guru the ability to automatically detect operational issues (e.g. missing or misconfigured alarms, early warning of resource exhaustion, or config changes that could lead to outages), provide context on resources involved and related events, and recommend remediation actions. With just a few clicks in the Amazon DevOps Guru console, historical application and infrastructure metrics like latency, error rates, and request rates for resources are automatically ingested from a user’s AWS applications and analyzed to establish normal operating bounds. Amazon DevOps Guru then uses a pre-trained machine learning model to identify deviations from this established baseline (e.g. under-provisioned compute capacity, database I/O over-utilization, or memory leaks). When Amazon DevOps Guru analyzes system and application data to automatically detect anomalies, it also groups this data into operational insights that include anomalous metrics, visualizations of application behavior over time, and recommendations on actions for remediation, all easily viewable in the Amazon DevOps Guru console. Amazon DevOps Guru also correlates and groups related application and infrastructure metrics (e.g. web application latency spikes, running out of disk space, or bad code deployments) to reduce redundant alarms and help focus users on high-severity issues. In a dashboard in the Amazon DevOps Guru console, customers can see configuration change histories and deployment events alongside system and user activity, which Amazon DevOps Guru uses to generate a prioritized list of likely causes for an operational issue.

To help customers resolve issues quickly, Amazon DevOps Guru provides intelligent recommendations with remediation steps and integrates with AWS Systems Manager for runbook and collaboration tooling, giving customers the ability to more effectively maintain applications and manage infrastructure for their deployments. For example, when an analytics application using Amazon Relational Database Service (RDS) begins to exhibit degraded latencies, Amazon DevOps Guru will detect the change by automatically analyzing the relevant metrics across the application stack, identify the underlying root cause (e.g. increased number of concurrent compute instances writing to RDS), and provide a recommendation to resolve the issue (e.g. increase the provisioned RDS capacity and IOPS storage to handle the higher load).
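To make the idea of grouping related anomalies into a single insight more concrete, here is a toy sketch. The announcement does not say how Amazon DevOps Guru correlates anomalies, so the rule used here (anomalies that occur close together in time belong to the same insight) is purely an assumption, as are the `Anomaly` type, the `group_into_insights` function, and the `max_gap` parameter.

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    timestamp: int  # seconds since some reference point
    resource: str   # e.g. "rds" or "web"
    metric: str     # e.g. "write_latency"

def group_into_insights(anomalies, max_gap=300):
    """Group anomalies into insights: an anomaly joins the current insight
    if it occurs within max_gap seconds of the previous anomaly in it."""
    insights = []
    for anomaly in sorted(anomalies, key=lambda a: a.timestamp):
        if insights and anomaly.timestamp - insights[-1][-1].timestamp <= max_gap:
            insights[-1].append(anomaly)
        else:
            insights.append([anomaly])
    return insights

# Hypothetical anomalies: three close together, one much later.
anomalies = [
    Anomaly(2000, "web", "cpu"),
    Anomaly(0, "web", "latency"),
    Anomaly(120, "rds", "write_latency"),
    Anomaly(200, "rds", "iops"),
]

for number, insight in enumerate(group_into_insights(anomalies), start=1):
    print(number, [(a.resource, a.metric) for a in insight])
```

In this sketch the three anomalies that occur within a few minutes of each other surface as one insight rather than three separate alarms, which is the effect the grouping described above is meant to achieve.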

“Customers continue to ask AWS for more services that enable them to take advantage of our decades of operational excellence in improving application availability running Amazon.com,” said Swami Sivasubramanian, Vice President, Amazon Machine Learning, AWS. “With Amazon DevOps Guru, we have taken that expertise and built specialized machine learning models to detect, troubleshoot, and prevent operational issues long before they impact customers and without dealing with cold starts each time an issue arises. Amazon DevOps Guru immediately provides customers the benefits of operational best practices we have learned running Amazon.com, and we designed Amazon DevOps Guru to be so simple that turning it on would be an easy choice for every AWS customer.”

With a few clicks in the AWS Management Console, customers can enable Amazon DevOps Guru to begin analyzing account and application activity within minutes to provide operational insights. Amazon DevOps Guru gives customers a single-console experience to visualize their operational data by summarizing relevant data across multiple sources (e.g. AWS CloudTrail, Amazon CloudWatch, AWS Config, AWS CloudFormation, AWS X-Ray) and reduces the need to switch between multiple tools. Customers can also view correlated operational events and contextual data for operational insights within the Amazon DevOps Guru console and receive alerts via Amazon SNS. Additionally, Amazon DevOps Guru supports API endpoints through the AWS SDK, making it easy for AWS Partner Network Partners and customers to integrate Amazon DevOps Guru into their existing solutions for ticketing, paging, and automatic notification of engineers for high-severity issues. PagerDuty and Atlassian are among the AWS Partners that have integrated Amazon DevOps Guru into their operations monitoring and incident management platforms, and customers who use their solutions can now benefit from operational insights provided by Amazon DevOps Guru. Amazon DevOps Guru is available today in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm), with availability in additional regions in the coming months.
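As a rough sketch of the kind of SDK integration mentioned above, the function below pulls ongoing reactive insights and keeps the high-severity ones, which a ticketing or paging tool could then act on. It assumes the `list_insights` operation of the DevOps Guru API as exposed by boto3, with its `StatusFilter` parameter and `ReactiveInsights` and `NextToken` response fields; the helper name `high_severity_insights` is made up for illustration, and the request and response shapes should be checked against the current SDK documentation before use.

```python
def high_severity_insights(client):
    """Return (id, name) pairs for ongoing reactive insights with HIGH severity.

    `client` is expected to behave like boto3.client("devops-guru").
    """
    found = []
    kwargs = {"StatusFilter": {"Ongoing": {"Type": "REACTIVE"}}}
    while True:
        page = client.list_insights(**kwargs)
        for insight in page.get("ReactiveInsights", []):
            if insight.get("Severity") == "HIGH":
                found.append((insight["Id"], insight["Name"]))
        token = page.get("NextToken")
        if not token:
            return found
        # Ask for the next page of results.
        kwargs["NextToken"] = token

# Example usage (requires boto3 and AWS credentials, so left as a comment):
#   import boto3
#   client = boto3.client("devops-guru", region_name="us-east-1")
#   for insight_id, name in high_severity_insights(client):
#       print(insight_id, name)
```

Passing the client in as an argument keeps the function easy to exercise with a stub, and leaves the choice of region and credentials to the caller.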

Together with Amazon CodeGuru—a developer tool powered by machine learning that provides intelligent recommendations for improving code quality and identifying an application’s most expensive lines of code—Amazon DevOps Guru provides customers the automated benefits of machine learning for their operational data so that developers can more easily improve application availability and reliability.

Teams at more than 194,000 companies rely on Atlassian products to make teamwork easier, and help them organize, discuss, and complete their work. “Atlassian is excited that our customers are implementing an AIOps strategy using Amazon DevOps Guru to manage the operational performance of their cloud applications,” said Emel Dogrusoz, Head of Product at Opsgenie. “With our new Opsgenie and Jira Service Management integration, the right teams are notified the instant Amazon DevOps Guru discovers a potential issue and prioritizes it by the severity of the incident using machine learning (ML). This integration ensures that every team can quickly respond to, resolve using ML-powered recommendations, and learn from every incident.”

