Deployment Tracking and Exception Tracking: monitoring design patterns for cloud-native applications

Monitoring a system over time is as important as ever with the increasing use of cloud-native software architectures. This paper expands the set of patterns published in a previous paper (Liveness Endpoint, Readiness Endpoint and Synthetic Testing) with two solutions that support teams in diagnosing issues as they occur: Deployment Tracking and Exception Tracking. These patterns advise tracking relevant events that occur in the system. The Deployment Tracking pattern provides a means to narrow down the possible sources of an anomaly, and the Exception Tracking pattern makes a specific class of anomalies visible so that a team can act on them. Both patterns help practitioners identify the root cause of an issue, which is instrumental in fixing it. They can help even less experienced professionals improve their monitoring processes and reduce the mean time to resolve problems in their applications. These patterns draw on documented industry best practices and existing tools. To help the reader find other patterns that supplement the ones suggested in this study, relations to already-existing monitoring patterns are also examined.


INTRODUCTION
Cloud-native software systems are those designed to reside in the cloud and take the most benefit from it. This encompasses the use of architectures and technologies such as microservices, container orchestrators, and auto-scaling. Cloud adoption continues to grow, and is expected to do so for the next couple of years [15]. Organisations are embracing cloud-native systems, moving more workloads to the cloud and advancing on their cloud journey [11]. As adoption matures, the concerns with the cloud also change. A report by GitLab [16] identifies observability/monitoring as the fourth most common DevOps practice, so it remains a relevant task amongst cloud practitioners.
The term observability has become popular recently but, in this paper, we mostly use monitoring. These terms have different definitions and are perceived differently by different authors. For the sake of this paper, we opted for Newman's definition of observability as "the extent to which you can understand the internal state of the system from external outputs" [22, p. 310], and monitoring as an activity where practitioners keep track of their system. In essence, observability is a characteristic of the system, whereas monitoring is an action performed by people [22].
Existing works show us that monitoring processes are indeed changing [22, 39], as many systems now make extensive use of the cloud, and monitoring a monolithic application deployed on premises is a different challenge from monitoring a cloud-native one. Considering that, from a sample of 1,300 IT-related people, "90% of respondents said that observability is critical to the success of their business, and 94% state that it's critical to their role" [21, p. 8], we believe it is important to have a way to communicate and reuse best practices on monitoring and observability.
Many of these practices organically surface in teams dealing with monitoring challenges, and find their way into grey literature [10, 17, 22, 24, 25]. However, we may ask ourselves whether the way in which they are described is rigorous enough and under which conditions they should be adopted. Therefore, in this paper, we refine our previous works on monitoring design patterns [2, 3] and propose two new pattern descriptions that focus on monitoring cloud applications, more precisely those that track relevant events of the system: Deployment Tracking and Exception Tracking.
We argue that documenting these practices as design patterns can play a key role in their correct and deliberate adoption. In addition, it allows all practitioners, regardless of experience, to apply and communicate them much more easily.
For the most part, we follow the same methodology and structure as in our previous paper [3]: we review related works to uncover relevant industry practices and existing patterns, evaluate them, summarise our findings, and then write new or more detailed patterns based on that research.

RELATED WORK
In order to ensure the patterns reflect good solutions commonly employed in the real world [12, p. 10], we start this work by looking into what we know about existing monitoring practices. In this section, we review how existing works approach different practices, establishing our starting point for the pattern mining process.
The works we analysed were found through Google Scholar and grey literature, including books, industrial reports, blog posts, and web pages, as professionals and companies often share solutions in these formats. All in all, we found eleven monitoring practices well suited to be described as design patterns [2], which we introduce in Section 3.
In this paper, we focus specifically on two patterns for diagnosing possible issues: Deployment Tracking and Exception Tracking. The first encourages the tracking of deployments and relevant changes to the system to make it easier to relate issues to the changes that caused them. The second is about having dedicated tracking for exceptions, as they pose unique challenges and are of great importance in troubleshooting issues. A summary of the sources we found about these two practices is presented in Table 1. In bold, we highlight works that already describe these practices as design patterns.

Design patterns
We assess whether the patterns we found already described in the literature clearly identify the context, problem, forces, solution and consequences, and whether their description is detailed enough to provide actionable information for practitioners.
Within the scope of this paper, in particular, we found patterns in two resources by Richardson, a book [24] and its companion wiki [25]. The former presents 44 design patterns to address many microservices challenges, but we find that the patterns do not clearly distinguish the context, forces and problem, which we believe may hinder communication and reusability. The latter includes 52 patterns that overlap the 44 in the book but adopt a different format, with all pattern elements clearly identified, although the level of detail of each pattern varies widely. Overall, we consider that neither of these resources is detailed enough to help practitioners. Therefore, we set ourselves to extend these patterns. In the following sections, we explain the extent to which we base our patterns on the existing literature, and how the two patterns in this paper match the practices in Table 1.

Deployment tracking
We found three main references describing a practice we call Deployment Tracking. The first is the wiki page written by Richardson [25] that describes a design pattern named Log deployments and changes. Its solution describes exactly what the name suggests: that the practitioner should "log every deployment and every change to the (production) environment" [25]. We argue this is a very specific way to track deployments, hence our decision to rewrite and expand this pattern as Deployment Tracking. Richardson's pattern is shallowly described, containing only one sentence for most of its sections. Nonetheless, it does present a formal structure that clearly separates context, problem, forces and solution.
Waseem et al.'s study on the design, test, and monitoring of microservices in the industry mentions six monitoring practices that the authors found mostly in grey literature [39]. One of them is the aforementioned Log deployments and changes [25], which the authors do not describe further. Instead, they conduct an empirical study with 106 survey responses and 6 interviews, in which they found that this practice is the fourth most used monitoring practice amongst practitioners, with 51.9% of them stating they use it often or very often. Furthermore, the authors conclude, with a statistically significant difference, that "the Experience > 2 years group is more likely to use Health check API and Log deployment and changes for monitoring microservices systems than the Experience ≤ 2 years group" [39, p. 23]. We believe this shows how experience plays an important role in the adoption of practices. By writing detailed and structured design patterns, we hope to make it easier for everyone to adopt monitoring practices.
Lastly, Brittain [5], at the time an engineer at Etsy, wrote a thorough blog post explaining how the company tracks releases and changes. He describes an example that contextualises the need for said practice, exploring how issues first arose and what the team did to find the root of those issues. Even though this is far from an academic source, the amount of detail used to describe the practice and the images used by the author are of great importance to our own description. Brittain mentions many aspects of a pattern (context, problem, solution, and example), lacking mostly the consequences of its adoption.

Exception tracking
The second practice, Exception Tracking, is mentioned in three works. Richardson describes this practice as a pattern with a relevant level of detail in his book [24], and less so on his wiki [25]. The author argues that "a service should rarely log an exception" [24, p. 376] because exceptions involve special concerns that logs are not prepared to handle. Thus, he suggests the use of a central service to track exceptions that will deal with de-duplication, alerting, and issue management. The description we provide for this pattern in Section 5 builds upon Richardson's work to improve the pattern in two ways. On the one hand, we provide a more formal structure for the pattern, separating it into different sections (as explained in Section 3). On the other hand, we dive into further detail on the pattern's consequences, known uses, and relation to other patterns to make it more accessible for less experienced practitioners.
As mentioned in Section 2.2, Waseem et al. [39] identify six monitoring practices from grey literature. One of them is Exception Tracking, which they define as "a monitoring practice with which exceptions are identified, understood, and resolved with monitoring tools" [39, p. 3]. Although the authors do not provide more detail on the practice itself, their empirical study shows that Exception Tracking is the second most used monitoring practice amongst the participants, with 60.4% of them stating they use it often or very often. Such a high degree of usage further accentuates the importance of formalising this practice as a structured design pattern.
Finally, Daineka [8] explains in a blog post what Error Monitoring and Error Tracking are, and what tools exist to tackle them. The terms he uses for those practices are, as we see it, different names for the same underlying practice, which is the same as Exception Tracking. The author analyses the pros and cons of nine different tools available for these practices. Daineka defines them as "a set of instruments to proactively find, triage and fix errors in different applications, mostly on the web" [8], but we consider this definition too broad for the practice at hand. Nonetheless, the number of tools that tackle Exception Tracking reiterates how common and relevant this practice is in the industry.

ABOUT THE PATTERNS
In our research, we discovered a set of monitoring practices that have not been formalised as patterns before or that have incomplete pattern equivalents. The two patterns presented in this paper are part of this collection. More precisely, they are instrumental in producing data for generating metrics, and they support Distributed Tracing practices.
We show the complete collection of practices in Figure 1 as a pattern map, together with the main relationships between them, and identify their purpose through the following one-line solution statements:
(1) Audit Logging - Record user activity and relevant system changes in a data store to help customer support, ensure compliance and detect suspicious behaviour [2].
(2) Standard Logging - Implement a consistent logging format across all services and teams, so logs are understood by everyone and consumed by other services independently of their origin [2].
(3) Log Sampling - Sample and prioritise logs to reduce the number of logs that need to be stored and processed while maintaining enough information for effective troubleshooting [2].
(4) Distributed Tracing - Assign each external request a unique ID and record how it flows through the system from one service to the next in a centralised server that provides visualisation and analysis, making troubleshooting the application faster and less complicated [2].
(5) Deployment Tracking - Track every deployment and change to the production environment, making it possible to relate effects observed in the system to changes that caused them.
(6) Exception Tracking - Send exceptions to a centralised exception tracking service that aggregates exceptions, tracks their resolution, and creates alerts.
(7) Liveness Endpoint - Implement a specialised endpoint that responds to requests without side effects. Then, configure another system (e.g. service, tool, load balancer) to periodically check that endpoint and take action when that fails, providing an automatic way to detect that the instance is unable to respond [3].
(8) Readiness Endpoint - Implement a specialised endpoint that checks if the service is ready to accept and process traffic. Then, configure another system (e.g. service, tool, load balancer) to periodically check that endpoint and stop routing traffic to the service when the check fails [3].
(9) Synthetic Testing - Create or pick a subset of existing test cases and periodically run them against the production environment, ensuring the application behaves as expected and detecting issues before they affect end-users [3].
(10) Application Metrics - Instrument the application to gather business and performance metrics. Collect these metrics in a centralised service that provides aggregation and visualisation, allowing deeper insight into the application's performance [2].
(11) Infrastructure Metrics - Instrument the server and runtimes to capture relevant metrics of the operating system and underlying infrastructure and collect them in a centralised server, allowing the team to get a real-time overview of the application's environment [2].
The list above summarises a pattern catalogue proposed by the authors [2] and points to where each pattern can be found in its most recent version at the time of writing of this paper. As part of our research, we also found other monitoring-related design patterns that we believe are described well enough for effective use by practitioners. These are: Log Aggregation [35], Preemptive Logging [35], Automated Recovery [36], External Monitoring [37], Query Engine [6] and Correlation ID [6].
For the current paper, we focus on the two patterns highlighted in Figure 1, Deployment Tracking and Exception Tracking, and the reason is twofold.
Firstly, the use of these patterns can help provide the information needed to diagnose issues that may occur during the operation of a software system. They do so by tracking events in the system that are essential to troubleshoot an issue. These events merely differ in their nature and goal, but the overall approach is similar. Secondly, they both provide information that can be used to generate application or infrastructure metrics.
We should also note that, although we designed these patterns with cloud-native systems in mind, we believe they can be used in other contexts with the appropriate adjustments.
To wrap up this section, it is relevant to mention that the pattern structure that we use is similar to the one in our previous paper [3], and combines elements from Gamma et al.'s [14] and Buschmann et al.'s [7] pattern descriptions. This means that for each pattern, we present the following sections:
• Name - an intuitive name for the pattern, immediately followed by a summary of its intent.
• Also Known As - a list of other names given to the pattern.
• Context - contextualisation of the pattern; provides background on the problem and may also refer to other design patterns that can be considered before the current one.
• Problem - a brief description of the problem as a question.
• Forces - a list of forces constraining the solution to point in a certain way and not another.
• Solution - starts with a sentence in italics that captures the gist of the solution, and goes on to describe the solution for the problem, where we highlight in bold certain keywords that represent the roles of different components or modules; these components and their interactions are then depicted in a figure by the end of the solution section.
• Consequences - a bullet-point description of the pattern's main advantages and disadvantages that should be considered when adopting the pattern.
• Example - an illustrative example, either real or fictional, of the pattern in action.
• Known Uses - a succinct description of real-world cases that use the pattern; throughout this section, we also briefly mention existing tools that can be used to adopt the pattern.

DEPLOYMENT TRACKING
Issues in the system can correlate with the changes it goes through, software-related or not, hence the interest in tracking them. When metrics indicate the existence of system anomalies, the team should know whether such anomalies might correlate with a recent deployment or change to the environment. Therefore, the team must track every deployment and change to the production environment, making it possible to relate the effects observed in the system to the changes that caused them.

Context
Nowadays, code gets released faster and faster to meet customer demands, so system changes can be deployed many times a day. By introducing new features or tweaking existing ones, there is the risk of error and, even with a robust suite of tests, faults can make it into production, possibly degrading some quality attributes (e.g. performance, security). By collecting metrics following, for example, the Application Metrics or Infrastructure Metrics [2] patterns, engineers can notice when there is something wrong with the system. However, metrics represent effects and not root causes. In other words, they denote what failures or warnings are occurring in the system but do not correlate these to possible disruption causes.

Problem
When metrics indicate the existence of system anomalies, how does the team know if such an anomaly might correlate with a recent deployment or change to the environment?

Forces
Addressing this problem is subject to the following forces:
• Changes affect end-users from both the technical (e.g. slow requests) and business (e.g. disappearing purchase history) points of view.
• Software release cycles can vary a lot, so any correlation strategy must be able to cope with just a few releases per year or many releases per day.
• The development and infrastructure costs increase with each new component that is added to a system.
• Recording events increases data storage costs and introduces data management overheads.
• Having too much information can make its analysis more complicated, but if the information allows possible correlations to be detected, it makes the diagnosing process faster.

Solution
Track every deployment and change to the production environment, making it possible to relate effects observed in the system to changes that caused them.
The team should start tracking deployments and relevant changes to the system. Things like code releases, changes in environment variables, updates to system dependencies and infrastructure changes (e.g. instances scaling in or out) can help the team in their debugging efforts, so it is important to store them as Deployment Records. These records can be stored as logs or metrics, or represented as events of the system with an associated event handler to register them. For the sake of clarity, by logs we mean one-line entries of information, usually text, in a file; metrics are key-value pairs that commonly represent a measurable property of the system; and events are a broader term to identify anything relevant that happens in the system, which can be processed into other outputs (including logs and/or metrics).
The team needs to choose one of the above options as the format for their deployment records. Each option has its pros and cons: logs are easy to generate from almost any tool, but quickly pile up and become hard to manage; metrics are easy and cheap to store, but can be harder to integrate into the Deployment Tool or pipeline, since most of these already generate logs out-of-the-box; generic events are more versatile, but extracting actionable information from them requires more complex logic because they can contain virtually anything. Additionally, to make this decision the team must take into account the system's release frequency. In other words, systems that get released or changed every day will generate more data, which can be hard to manage, while longer release cycles generate less data that may need to be complemented with more information. The team should consider these trade-offs and select what suits them best.
Nonetheless, it is crucial that the deployments and changes are correlatable to other sources of data. If the team opts for using metrics as their deployment records, then the Metrics Service will provide said correlation. Otherwise, the team should integrate this information into an existing Metrics Service and display it in the metrics' dashboards (cf. Figure 2). By plotting deployments and changes together with existing metrics graphs, it becomes trivial to understand if a release or change caused the metrics to skew up or down and, if so, at what time that happened. With this increase in visualisation, the team can better understand if issues are being caused by a change in the system, such as a deployment, or not. If they are, then they can focus their diagnosing efforts on the faulty changes, speeding up the process of finding the root cause of the issue.
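As a minimal illustration of this integration, the following Python sketch shows a post-deploy hook that records a Deployment Record as a dashboard annotation so it can be overlaid on existing metric graphs. It assumes a Grafana-style annotations HTTP API; the endpoint URL, token, service name and tags are hypothetical placeholders, and teams using a different Metrics Service would adapt the call accordingly.

import time
import json
import urllib.request

GRAFANA_URL = "https://grafana.example.com/api/annotations"  # hypothetical annotations endpoint
API_TOKEN = "replace-me"                                      # assumed service-account token

def record_deployment(service: str, version: str, change_type: str = "deploy") -> None:
    """Send a Deployment Record as a dashboard annotation (Grafana-style API assumed)."""
    record = {
        "time": int(time.time() * 1000),           # epoch milliseconds
        "tags": ["deployment", service, change_type],
        "text": f"{service} released version {version}",
    }
    req = urllib.request.Request(
        GRAFANA_URL,
        data=json.dumps(record).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:      # raises on network/HTTP errors
        resp.read()

if __name__ == "__main__":
    # Typically invoked as the last step of the deployment pipeline,
    # so every release automatically produces a record.
    record_deployment("checkout-service", "v2.14.1")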
Moreover, tracking deployments and changes and comparing them to business-related metrics can provide insight into how releases affect the end-users. Request rates or daily logins may increase if a major feature is introduced in a new system version. By plotting releases and business-oriented metrics together, managers can also benefit from the increased observability of the system.

Consequences
This pattern has the following advantages:
• Once problems with the system are identified, it becomes faster to pinpoint changes that could be the cause of the issue, effectively decreasing the mean time to detect (MTTD).
• Releases can be correlated with other metrics, encouraging data-driven decision-making.
• When deployments are plotted with metrics, visualisation becomes richer, and a better system overview is made available to the stakeholders.
• Teams can have more confidence when releasing a piece of software because, even if faults make it into production, these will be noticed quickly.
This pattern also has the following drawbacks:
• To take full advantage of this pattern's benefits, metrics should already be collected and visualised; implementing all of this from the beginning requires considerable time and money.
• Plotting a binary metric, such as deployment logs, may not be supported by all tools; the team should ensure their Metrics Service supports drawing the releases as vertical lines.

Example
In a blog article, Mike Brittain [5] explains how Etsy, an e-commerce company and global online marketplace, decided to adopt deployment tracking. Since they were already collecting many system metrics, they would notice when something was going wrong with the system. However, they were missing the actual cause of the problem: the deployments. Since issues in the system usually occur after changes, they started tracking deployments. To do so, they tweaked their deployment tool to emit a deployment event every time a new release was made. The event consisted simply of a name field (events.deploy.website), a placeholder value (1, for example) and the timestamp (in their case, 1287106599 in Unix time). By sending this metric to their monitoring tool, Graphite, they could plot the changes together with the metrics, as depicted in Figure 3. Given the new information, it became clear that the warnings were due to a release at 16h, and they were fixed with the two subsequent releases. The ability to visualise the changes makes it possible to identify release trends. This example was focused on a technical metric. However, Brittain also provides an example of how the number of new posts in Etsy's forums increased after a product launch (see Figure 4), revealing how tracking releases can also be helpful for business managers and customer support.
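Brittain's post does not show the exact code Etsy used, but a deployment marker like the one described can be emitted with a few lines of Python against Graphite's standard plaintext (carbon) listener; the host name below is a placeholder.

import socket
import time

CARBON_HOST = "graphite.example.com"  # placeholder host; Etsy's internal endpoint is not public
CARBON_PORT = 2003                    # Graphite's plaintext (carbon) listener

def emit_deploy_event(metric: str = "events.deploy.website") -> None:
    # Graphite's plaintext format is "name value timestamp",
    # e.g. "events.deploy.website 1 1287106599" as in Brittain's example.
    line = f"{metric} 1 {int(time.time())}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
        sock.sendall(line.encode("ascii"))

emit_deploy_event()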

Known uses
Brittain's [5] example, described in the previous section, is itself a known use of this pattern, in which Graphite was used to implement it in a real-world system.
In a blog post, Will Sewell, a backend engineer at Monzo, explains how they deploy to production more than 100 times a day [29]. The author goes into detail to explain the many things that contribute to this high frequency of deployments, including company culture and the tools used. We focus on the fact that they track the number of deployments they do each day, which allows the company to plot, for example, the average weekly deployments per engineer. Moreover, their "deployment pipeline records events of what happened, and when" [29], so they can "easily find out why things went wrong and stop it happening again" [29].
Finally, in a blog post about Raygun's Deployments feature, Freyja [13] reports that they interviewed some customers to understand how they use the feature to track deployments and version releases. According to the author, most teams find it "most helpful when investigating which errors have been introduced with the new release, which are still occurring, and which have been fixed" [13]. The tool provides visibility over the deployment pipeline so teams can avoid error propagation across deployments. The author recommends configuring "Raygun with deployments so that you can correlate error spikes with releases" [13], which greatly resembles this pattern's solution.

Related patterns
When adopting Canary Releases [40] or other specific deployment strategies (e.g. A/B testing, Feature Toggles), deployments can be tracked with additional information about which version of the system the end-user is utilising. Plotting the deployments in existing metric graphs may reveal more information about the impact of canaries and either support or oppose their full release.
Applying this pattern allows practitioners to also track the relevant deployed version in their Distributed Tracing system.This means that they can now see which version of the system the requests are flowing through.
Deployments and changes can be treated as events and monitored that way, so this pattern follows an event-based monitoring strategy. Similarly, deployments can be treated as binary metrics that are collected and correlated with other system metrics. Without metrics that illustrate system anomalies, the usefulness of Deployment Tracking is much lower. Hence, the team should consider adopting Infrastructure Metrics, Application Metrics, or both before implementing this pattern.
Nevertheless, if the visualisation or collection of metrics is not yet developed in the project, merging the deployment logs with other system logs by adopting Log Aggregation [30] may already be enough to correlate deployments with system anomalies. Logs are the simplest way to implement this pattern and may provide enough value for teams that do not have the time or money to invest in more robust monitoring solutions.

EXCEPTION TRACKING
Logging exceptions to a log file does not support de-duplication or issue tracking. Yet, teams usually need a way to efficiently record exceptions to maximise their value for troubleshooting processes. Therefore, they should send exceptions to a centralised exception tracking service that aggregates exceptions, tracks their resolution, and creates alerts.

Also known as
Error Monitoring [8], Error Tracking

Context
Applications can generate a lot of operational control data, often errors, but many times simply informational. Such data is often stored as log files, providing a historical record of events that happened in the system. Information in log files could be of great use to help diagnose issues, but in practice, it may be hard to use it effectively for this purpose.
Some of these events are exceptions. Most of the time, the log's severity level is adjusted, making it easier to identify erroneous events. An exception's stack trace is vital for the team to fix errors. On the one hand, if they use logs as single lines of text, stack traces do not fit in the log's body, and crucial information may be lost. On the other hand, if the stack trace is dumped into the log, it most likely clutters the log files with long lists of method calls. Furthermore, logging lacks fundamental capabilities (e.g. de-duplication, issue tracking) that make exceptions much easier to use for troubleshooting.
Standard Logging [2] helps to make better use of logs, and can even help record exceptions without the challenges of capturing them in single-line text-based logs, but it is not enough by itself to diagnose issues.

Problem
How can the team record exceptions to maximise their value for troubleshooting processes?

Forces
Addressing this problem is subject to the following forces:
• By nature, exceptions capture unexpected behaviours of a system, which often point to the root cause of existing faults.
• Exceptions usually contain long stack traces that span multiple lines and that, therefore, are hard to store in a log file in a way that preserves readability.
• Many exceptions can be generated simultaneously by the same root issue (e.g. multiple users interacting with a broken feature), creating duplicated entries that take up space but do not add any extra information for debugging.
• Logs are easy to generate and supported by many monitoring tools, but they are harder to read and extract information from when they contain large or duplicate entries.
• Exceptions generated by a system can help any team member diagnose issues. However, these exceptions may contain sensitive information, potentially posing a security risk depending on how they are stored, who accesses them, and the circumstances of access.
• Developing a custom solution to handle exceptions is more versatile, but incurs more costs (e.g. maintenance, infrastructure, development) than a third-party one.

Solution
Send exceptions to a centralised exception tracking service that aggregates exceptions, tracks their resolution, and creates alerts.
The Exception Tracking Service handles all major concerns of error tracking. First, it deals with all Exceptions from the production systems, both on the web and in native applications. The team should instrument the source code using an Exception Tracking Library to forward every exception that gets thrown in the Application to the tracking service (cf. Figure 5). The library takes care of the details of getting the exception from the running environment, so it works for all platforms alike. Moreover, it is common for the library to be available in many programming languages to support various sources.
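To make these roles concrete, here is a minimal, library-agnostic sketch in Python of what such instrumentation can look like: a small helper serialises an exception together with some context and forwards it to a hypothetical Exception Tracking Service endpoint, and a global hook captures whatever is not handled explicitly. The endpoint URL and service name are assumptions; in practice, an off-the-shelf exception tracking SDK provides this plumbing.

import json
import sys
import traceback
import urllib.request

TRACKER_URL = "https://exceptions.example.com/api/report"  # hypothetical service endpoint
SERVICE_NAME = "checkout-service"                           # assumed service name

def capture_exception(exc: BaseException, context: dict | None = None) -> None:
    """Serialise the exception and forward it to the Exception Tracking Service."""
    report = {
        "service": SERVICE_NAME,
        "type": type(exc).__name__,
        "message": str(exc),
        "stacktrace": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "context": context or {},
    }
    req = urllib.request.Request(
        TRACKER_URL,
        data=json.dumps(report).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req).read()

def _unhandled_hook(exc_type, exc, tb):
    capture_exception(exc, {"handled": False})
    sys.__excepthook__(exc_type, exc, tb)   # keep the default behaviour (print and exit)

sys.excepthook = _unhandled_hook            # forward every uncaught exception in the main thread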
Secondly, the tracking service handles duplicate entries. Since exceptions are usually thrown and sent upstream, it is common for multiple instances to surface. Thus, the service should either drop duplicate exceptions or aggregate them in a single report showing similar occurrences. The latter provides more information for troubleshooting but requires more storage and processing. Storing a single exception case may be representative enough for troubleshooting. Hence, the team should start with the simplest scenario and explore a more complex one only if it is necessary.
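On the service side, de-duplication is usually driven by a fingerprint computed from each report. The sketch below, building on the report format of the previous sketch, shows one simple aggregation strategy; the fingerprinting rule is an assumption, and real services apply more elaborate stack-trace normalisation.

import hashlib
from collections import defaultdict

def fingerprint(report: dict) -> str:
    """Group exceptions by type and the last frames of the stack trace."""
    top_frames = "".join(report["stacktrace"][-5:])   # the last frames identify the failing call site
    raw = f'{report["type"]}|{top_frames}'
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

groups: dict[str, dict] = defaultdict(lambda: {"count": 0, "sample": None})

def ingest(report: dict) -> str:
    """Aggregate duplicates: keep one representative report plus an occurrence count."""
    key = fingerprint(report)
    group = groups[key]
    group["count"] += 1
    if group["sample"] is None:                        # store only the first occurrence
        group["sample"] = report
    return key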
Additionally, the team should collect as much context as possible when the exception occurs. This is usually done by the instrumentation library because it is running in the production systems. There are no mandatory properties that every exception tracking system should collect, so the team should discuss what they want to include in an exception report. These may involve information about the source of the exception (e.g. the name of the service, its IP address), some additional specifications of the user's machine, a record of the user actions some moments before the exception occurred, source code snippets from the calls that originated and propagated the exception, and a set of logs generated close to when the exception was thrown. All of these properties provide additional views of the system to make the troubleshooting process as straightforward as possible.
Finally, the exception tracking service should provide alerts and automatic issue tracking. Even though alerts are out of this paper's scope, every exception is likely a critical error. Thus, it is vital to have alerts for this level of importance. As is typical with alerts, the way they are delivered, their thresholds and their descriptions should be configurable. Issue tracking is not a must-have in the tracking service, but it does prove to be a valuable feature. The exception tracking service can make the whole troubleshooting process faster and more organised by automatically creating an issue and associating the error report with it. Depending on the team's needs, the issue management can be internal to the service or handled by a third-party tool. All of these aspects improve the team's capability to effectively diagnose the issues that the exceptions expose.
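Continuing the previous sketch, the ingestion step can be extended so that the first occurrence of a new fingerprint opens an issue and raises an alert; the open_issue and send_alert helpers below are placeholders for calls to whatever issue tracker and alerting channel the team uses.

def open_issue(title: str, body: str, labels: list[str]) -> None:
    # Placeholder: in practice this would call the issue tracker's API (Jira, GitHub, ...).
    print(f"[issue] {title} labels={labels}")

def send_alert(message: str) -> None:
    # Placeholder: in practice this would page or notify the on-call channel.
    print(f"[alert] {message}")

def ingest_with_issue_tracking(report: dict) -> str:
    """Open an issue and raise an alert the first time a fingerprint is seen."""
    key = fingerprint(report)                   # fingerprint, groups, ingest: see previous sketch
    first_occurrence = groups[key]["count"] == 0
    ingest(report)
    if first_occurrence:
        open_issue(
            title=f'{report["type"]}: {report["message"][:80]}',
            body="".join(report["stacktrace"]),
            labels=["exception", report["service"]],
        )
        send_alert(f'New exception in {report["service"]}: {report["type"]}')
    return key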

Consequences
This pattern has the following advantages:
• Exceptions are aggregated and stored in a single repository, easily accessible by the team.
• Collecting exceptions from different platforms and devices becomes trivial and does not require additional tooling.
• Aggregating similar exceptions reduces overall noise and makes them easier to process.
• Troubleshooting errors becomes faster, reducing the system's mean time to repair (MTTR) and the overall number of errors.
• When a record of user actions is included in the error report, the team can troubleshoot the exception without needing to exchange emails with the users.
This pattern also has the following drawbacks:
• Instrumenting the source code to catch all exceptions and call library functions makes it more complex.
• The exception tracking service, either self-hosted or as a managed SaaS, introduces new costs in the system.
• Storing the exceptions and their contexts may require additional infrastructure.

Example
Suppose we are developing a platform available on the web and as native mobile applications for Android and iOS. All frontend applications use the same backend services. We naturally start logging exceptions to log files and soon realise that this strategy only works for our backend services. Exceptions that occur solely on the frontend are not showing up in the log files. For example, misconfigured navigational routes on the mobile applications that receive a null value generate a NullPointerException and crash the user's application. These exceptions dramatically degrade the user experience and should be fixed as soon as possible. However, we cannot notice the problem with our current error-tracking strategy.
Therefore, we decide to use Sentry due to its integrations with Android and iOS and its JavaScript library for the web application frontend code. "Sentry's SDKs report an error automatically whenever a thrown error or exception goes uncaught" [27] in the application. Even if the tool cannot send the exception report immediately due to application instability or network issues, "the report is guaranteed to send once the application is started again" [27].
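For the backend services in this example, enabling Sentry can be as simple as initialising its Python SDK at process start-up; uncaught exceptions are then reported automatically, and handled ones can be forwarded explicitly. The DSN, release string and function below are placeholders, and the web, Android and iOS frontends would be configured analogously with their respective SDKs.

import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN from the project settings
    release="backend@1.4.2",                               # ties events to a specific deployment
    environment="production",
)

def charge_order(order_id: str) -> None:
    try:
        ...  # business logic that may raise
    except Exception as exc:
        sentry_sdk.capture_exception(exc)   # report handled exceptions explicitly
        raise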
With the tool set up, we now have an overview of incoming exceptions from our frontends (see Figure 6 for an example). Not only did we get visibility over a significant section of our user base, but we also improved our ability to troubleshoot exceptions and improve the overall quality of the application.

Known uses
Gerhard Jacobs, in a post on Sentry's blog [18], explains how monday.com uses that tool to monitor their errors. The author puts great emphasis on the importance of customising the tool. He states that monday.com used Breadcrumbs, a Sentry feature that "[shows] a timeline of actions that led to an error, reducing the time required to resolve it" [18], and "added their own custom events, allowing for more granular investigations" [18]. This feature reduces the need for user feedback since most information is already conveyed in the collected events. According to the author, monday.com's team was able to "reduce the time it takes to resolve an issue from between 30-45 minutes to 10 minutes" [18] using exception tracking.
In a post on Raygun's blog, Penney [23] writes about a presentation he did explaining how Raygun, the company, uses Raygun, the tool, to monitor its deployments. The author reveals that their deployables "[feed] into separate Raygun apps" [23], and developers can subscribe only to the notifications from projects they are involved in. The tool also provides a Slack integration to notify the team of occurring issues. The author claims he uses the error details page to triage and assign issues and a feature called User Tracking to prioritise issues [23]. Managing their systems so closely allows Raygun's teams to provide quick customer support and retain customers.
Finally, in another blog post, Borders [4], at that time working at Collage.com, explains some of the problems the company faced when scaling exception tracking. They were using TrackJS as an exception-tracking tool, and the teams would fix some of the tracked exceptions each sprint. However, the author states that it quickly became evident that fixing everything was unfeasible, and the company struggled with keeping track of the priority of each exception. Thus, they "created a system to synchronise exceptions with Jira tickets. This system creates Jira tickets when new exceptions pop up but also synchronises the number of affected users (or total count for back-end exceptions without a user ID) during the past 24 hours and adjusts the ticket priority on a continuous basis" [4]. Additionally, they implemented alerts that trigger when a ticket reaches maximum priority, ensuring that one of the on-call engineers would be on top of the occurrence.

Related patterns
Adopting Distributed Tracing along with this pattern provides an even more complete view of the events before an exception. The team can follow the execution traces, analyse the information inside each span as contextual information, and have the exception tracking service complement that with the error's stack trace and additional context.
Even though logs make troubleshooting exceptions challenging, it is still relevant to keep logging exceptions as historical long-term storage of system events [28]. Thus, the team may consider adopting Log Aggregation [30] in case they need to keep these events stored (e.g. for audits).

CONCLUSIONS
With this paper, we propose two new design patterns for monitoring cloud-native applications.While Deployment Tracking suggests that deployments should be recorded and used to find the cause of an issue, Exception Tracking encourages practitioners to use dedicated systems to keep track of exceptions so they are easier to use in troubleshooting processes.
Both patterns consist of tracking relevant events in the system and aim at helping teams resolve issues as they occur. Moreover, keeping a record of these events is tightly related to storing them as metrics. Hence, these patterns are good sources of information for metrics collection systems, such as the ones suggested in the Application Metrics and Infrastructure Metrics [2] patterns.
The patterns were based on existing academic and grey literature, so they have practical foundations. After all, we believe that "a key part of patterns is that they're rooted in practice" [12, p. 10]. Moreover, the design patterns we propose are mere theories [19] of what we perceive and know about industry best practices on the subject of monitoring and observability. Thus, this paper opens the way for empirical studies of the proposed patterns, such as the ones carried out by Sousa et al. [34], Vale et al. [38] and Albuquerque [2], in order to strengthen them or ultimately refute them as valid theories [19]. Additionally, the proposed patterns can serve as the foundation or inspiration for other works, and we strongly encourage other authors to refer to them or even revise them as needed.
Finally, this paper adds to our previous work on proactive monitoring design patterns for cloud-native applications [3]. All of these patterns are part of a larger pattern catalogue from another work developed by the authors [2]. The latter served as a large inspiration for this paper and provides a definition for all patterns shown in Figure 1.

Figure 1: Overview of the monitoring design pattern candidates proposed by the authors. The patterns highlighted with a dashed red line are the ones explored in this paper.

Figure 2: Overview of the structure for the Deployment Tracking pattern. The production environment and application are shown just for better context.

Figure 3: Plot of the number of PHP warnings over time and code deployments for Etsy's system. The vertical lines seem to indicate the cause of the horizontal line's behaviour.

Figure 4: Plot of the number of posts in Etsy's forums and code deployments. After code deploys (vertical lines), the number of new posts to the help forum seems to increase.

Figure 5: Overview of the structure for the Exception Tracking pattern. As exceptions occur, the library exports them to the exception tracking service.

Figure 6: Example Sentry dashboard with a visualisation of events before an exception using the tool's Breadcrumbs feature [26].

Table 1: Related works about Deployment tracking and Exception tracking. The ones already proposing design patterns are highlighted in bold.

Practice | Sources
Deployment tracking | Richardson [25], Waseem et al. [39], Brittain [5]
Exception tracking | Richardson [24, 25], Waseem et al. [39], Daineka [8]