Building Apps around the Event Mesh
Architecture
The architecture of an event mesh-enabled system is a paradigm shift from traditional transactional designs to an eventual consistency model. This design aligns better with real-world processes, where different parts of a system may operate asynchronously yet collaboratively.
In this section, we explore the technologies, flow, and structure that make the event mesh architecture resilient, scalable, and developer-friendly.
1. Technology Stack
Here’s the list of technologies used in this solution and its examples:
- Red Hat supported products:
  - Red Hat OpenShift — Orchestrates containerized applications. Based on Kubernetes.
  - Red Hat OpenShift Serverless — Provides the Event Mesh and Serverless capabilities. Based on the Knative project.
  - Streams for Apache Kafka — (Optional) Provides persistence for the Event Mesh, likely needed in production. Based on the Strimzi project.
- Other open source technologies:
  - CloudEvents — Provides a standard for event metadata.
  - OpenTelemetry — (Optional) Facilitates tracing for observability.
  - Rust and Java — Implementation examples.

Kubernetes, Knative, Strimzi, CloudEvents, and OpenTelemetry are CNCF projects.
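As a sketch of how these pieces fit together, the Event Mesh entry point provided by OpenShift Serverless is a Knative Broker, declared as a plain Kubernetes resource. The resource name and namespace below are illustrative assumptions:

```yaml
# A minimal Knative Broker - the HTTP entry point of the Event Mesh.
# When Streams for Apache Kafka is installed, a Kafka-backed Broker
# class can be selected for production-grade persistence.
apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  name: default
  namespace: event-mesh-demo   # illustrative namespace
```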
2. An in-depth look at the solution’s architecture
Building applications around the Event Mesh is a solution that can be applied to both existing and new projects. For existing applications, domain boundaries can guide the division into modular components, which may evolve into separate services. These modules generate commands, sent as events to the Event Mesh, which then routes them to their designated endpoints. The Event Mesh allows for such a transition, gradually making the system not only more responsive but also better suited to real-world business logic.
2.1. Problem Analysis
Traditional systems often enforce strict transactional consistency, which can impede application performance and compromise business logic. For instance, upon the completion of a ride-sharing service at the end-user’s destination, the system should reliably capture this real-world event. The capture should be performed regardless of any potential operational disruptions affecting dependent services (e.g., invoicing).
In such scenarios, transactional applications, which typically encompass a number of steps in a user workflow, roll back all the steps when an error is returned. This prevents any data from being persisted, discards the real-world intent of the end user, and results in an adverse user experience and a deviation from the genuine business process, potentially leading to customer dissatisfaction.
2.2. Solution Breakdown
The concept of employing the Event Mesh as a central, reliable hub for dispatching commands, as events, lies at the heart of this solution. This approach aligns closely with the Command Query Responsibility Segregation (CQRS) pattern, which distinctly categorizes commands and queries. Commands, in this context, are modeled as asynchronous events, designed to undergo processing by the Event Mesh. On the other hand, queries are synchronous operations, safe to retry, ensuring no loss of data integrity due to transient errors.
The primary responsibility of the Event Mesh is twofold. Firstly, it persists the incoming events, thereby maintaining a record of changes in the system’s state. Secondly, it routes these events to their respective endpoints, ensuring that the appropriate microservices are notified and can subsequently update their internal states based on the event data.
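The routing half of this responsibility can be sketched as a Knative Trigger, which subscribes an endpoint to events matching given CloudEvents attributes. The names and the event type below are illustrative assumptions:

```yaml
# A Trigger routes events from the Broker to a subscriber
# based on CloudEvents attributes (here: the "type" attribute).
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: invoicing-trigger
spec:
  broker: default
  filter:
    attributes:
      type: com.example.ride.finished   # illustrative event type
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: invoicing                   # illustrative service name
```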
The mesh’s inherent resilience is further bolstered by its built-in retry strategies (linear or exponential backoff), which it employs when encountering operational failures. This mechanism ensures that the system retries the operation until it succeeds, thus mitigating the risk of data loss or system disruption due to transient issues.
By integrating the Event Mesh into the system architecture, several architectural benefits are achieved:
- Decomposition of the application into independently functioning services: This approach facilitates a division of labor, with each service handling specific responsibilities — the Domain-driven design approach fits here quite well. This not only enhances maintainability but also fosters scalability, as services can be independently scaled based on their demands.
- Improved business alignment: By embracing an eventual consistency model, the Event Mesh aligns closely with the inherent nature of most real-world business processes. Unlike traditional transactional systems that strive for immediate, irreversible consistency, the Event Mesh allows for a more flexible and adaptive approach to data consistency. This results in better alignment with business requirements, as it supports scenarios where multiple services collaborate and synchronize their operations, making the whole state eventually consistent, without the constraint of strict, synchronous consistency.
- Improved resilience: The Event Mesh's error-handling mechanism, through retries and event persistence, aids in minimizing the impact of failures on the end user. This is crucial as it prevents bugs and outages from becoming visible to the user, thereby preserving the system's perceived responsiveness.
- Enhanced system performance: The system becomes more responsive, as the end user no longer needs to wait for multiple, often independent, operations to complete successfully. The Event Mesh's event-driven model, coupled with the retries and event persistence, ensures that critical state changes are propagated swiftly and reliably, thereby improving the overall user experience.
2.3. Event Mesh Flow
The event-driven flow enables eventual consistent collaboration and state synchronization between services, fostering a resilient, scalable, and developer-friendly system architecture.
A usual flow may look like this:

- An end-user application sends an HTTP request to the Event Mesh. Such a message can be understood as a Command-type event.
- The Event Mesh (Broker) persists the event in a queue (such as an Apache Kafka topic, although the implementation is hidden from the user). After the Event Mesh has safely persisted the data, it returns a successful HTTP response with the 202 Accepted status code. At this point, the operation can already be considered successful from the end user's point of view; it will eventually settle correctly in all downstream systems.
- The Event Mesh routes the event to the appropriate endpoint based on the CloudEvent's metadata and the configured triggering rules.
- The endpoints receive the events and process them, updating their internal states and potentially emitting new events for downstream consumers. These new events are transmitted back to the Event Mesh.
- The dispatch loop continues until the event queue is empty and all the events have been processed successfully. Failures are automatically retried by the Event Mesh, with increasing pauses between retries to avoid overloading the system.
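The first step of the flow — an application POSTing a command event to the Event Mesh and treating 202 Accepted as success — can be sketched in Java using only the standard library. The broker URL, event type, source, and payload are illustrative assumptions; in the CloudEvents binary content mode, the context attributes travel in ce-* HTTP headers:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

public class CommandPublisher {

    /**
     * Builds an HTTP request carrying a command as a CloudEvent in
     * binary content mode: context attributes go into ce-* headers,
     * while the domain data stays in the request body.
     */
    public static HttpRequest buildCommandRequest(String brokerUrl, String type, String json) {
        return HttpRequest.newBuilder(URI.create(brokerUrl))
                .header("ce-specversion", "1.0")
                .header("ce-id", UUID.randomUUID().toString())
                .header("ce-type", type)                     // used by Trigger filters
                .header("ce-source", "/apps/ride-frontend")  // illustrative source
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildCommandRequest(
                "http://broker-ingress.example/default",     // illustrative broker URL
                "com.example.ride.finished",
                "{\"rideId\":\"r-123\"}");
        System.out.println(req.method() + " " + req.uri());
        // Sending this request with HttpClient against a real Broker is
        // expected to return 202 Accepted once the event is persisted.
    }
}
```

Sending the request is a one-liner with `HttpClient.newHttpClient().send(req, ...)`; from then on, delivery is the Event Mesh's responsibility.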
The diagram illustrates the example flow of events between the applications, the Knative Event Mesh, and the datastores which persist the settled state of the system.
Note: the applications aren't pulling the events from a queue; in fact, they aren't aware of any. The Event Mesh is the one controlling the flow, and retrying when needed. No additional libraries are needed to consume events from the Event Mesh: it pushes the events as CloudEvents encoded as REST messages.
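A minimal sketch of such a push-based consumer, using only the JDK's built-in HTTP server — the port and the handling logic are illustrative assumptions:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;

public class EventReceiver {

    /** Starts an HTTP endpoint that the Event Mesh can push CloudEvents to. */
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            // CloudEvents context attributes arrive as plain ce-* headers.
            String type = exchange.getRequestHeaders().getFirst("ce-type");
            System.out.println("Received event of type: " + type);
            // A 2xx response acknowledges the event; a failure status
            // tells the Event Mesh to retry the delivery later.
            exchange.sendResponseHeaders(202, -1);
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8080); // illustrative port
        System.out.println("Listening for pushed CloudEvents on :8080");
    }
}
```

Note that the endpoint is an ordinary HTTP handler: there is no event-mesh client library, subscription loop, or offset management in the application code.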
Note: the exponential backoff algorithm used by the Event Mesh is configurable. With the exponential policy, the pause before retry number N is calculated as backoffDelay * 2^N, where backoffDelay is the configured base delay. A dead letter sink can also be configured to receive events in case they exceed the maximum retry number, which is also configurable.
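In Knative, this retry and dead-letter behavior is expressed through a delivery spec on a Broker or Trigger. A sketch with illustrative values (the sink service name is an assumption):

```yaml
# Fragment of a Trigger (or Broker) spec configuring delivery guarantees.
spec:
  delivery:
    retry: 5                    # illustrative: maximum number of retries
    backoffPolicy: exponential  # or "linear"
    backoffDelay: PT0.5S        # base delay, as an ISO 8601 duration
    deadLetterSink:             # receives events that exhaust all retries
      ref:
        apiVersion: serving.knative.dev/v1
        kind: Service
        name: event-dead-letter # illustrative sink service
```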
2.4. Work Ledger analogy
A good way of thinking about the Event Mesh and its persistent queue backend is the Work Ledger analogy. In the olden days, a clerk kept his to-do work in a Work Ledger (e.g. a tray of paper forms). He would pick up the next form and process it, making changes within the physical file cabinets. In case of rare and unexpected issues (e.g. the invoicing department being unavailable), the clerk would simply put the form back into the Work Ledger to be processed later.
The Event Mesh processes data in a very similar fashion. The data is held in the Event Mesh only until the system successfully consumes it.
2.5. Differences from the Event Sourcing
The Event Mesh pattern could be mistaken for Event Sourcing, as both are event-driven architecture (EDA) approaches. However, the Event Mesh addresses a few shortcomings of the Event Sourcing approach.
The data is held in the Event Mesh only until the system successfully consumes it, settling the data in various datastores to a consistent state. This effectively avoids the need to keep the applications backward compatible with every event ever emitted. Introducing a breaking change in the event schema is as easy as making sure all the events of a given type have been consumed from the Event Mesh. This also works for systems that can't have downtime windows: the applications can keep short-lived backward-compatibility layers for such situations. Once all the old events are processed, the backward-compatible code may be removed, simplifying maintenance.
Because, in the long term, the regular datastores are the source of truth for the system, all traditional techniques for application maintenance still apply. It is also easier for developers to understand, as it avoids sophisticated event-handler logic and the reconciliation into a read-database abstraction.
2.6. Differences from the Service Mesh
The differences from the Service Mesh pattern are also worth pointing out. The Service Mesh pattern is intended to improve the resilience of synchronous communications that return a response. A Service Mesh effectively raises the uptime of dependent endpoints through retry and backoff policies. The uptime can't be raised to 100%, though, so a Service Mesh can still lose messages.
The table below captures the key similarities and differences:

|  | Service Mesh | Event Mesh |
|---|---|---|
| Similarities | Improves resilience of communication with retry and backoff policies | Improves resilience of communication with retry and backoff policies |
| Differences | Synchronous, request-response communication; messages can still be lost when an endpoint remains unavailable | Asynchronous, event-driven communication; events are persisted until they are successfully consumed |
2.7. Supporting Legacy Systems
One of the strengths of an Event Mesh architecture is its ability to integrate seamlessly with legacy systems, making them more resilient and adaptable. Legacy applications can be retrofitted to produce and consume events through lightweight adapters. For instance:

- A monolithic legacy application can send events for specific operations, instead of handling all logic internally in a transactional fashion.
- Event listeners can be introduced incrementally, enabling the legacy app to subscribe to events without refactoring its core logic.

This approach decouples old systems from rigid workflows, allowing for gradual modernization while ensuring operational continuity.
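One hedged sketch of such a lightweight adapter: a decorator around an existing legacy operation that emits a command event after the local work succeeds. The `LegacyRideService` and `EventSender` interfaces are illustrative assumptions, not part of any existing API:

```java
import java.util.ArrayList;
import java.util.List;

public class LegacyAdapterSketch {

    /** Illustrative stand-in for an operation inside a legacy monolith. */
    interface LegacyRideService {
        void finishRide(String rideId);
    }

    /** Illustrative abstraction over "POST a CloudEvent to the Broker". */
    interface EventSender {
        void send(String type, String json);
    }

    /** Decorator: runs the legacy logic, then publishes the fact as an event. */
    static LegacyRideService withEvents(LegacyRideService inner, EventSender sender) {
        return rideId -> {
            inner.finishRide(rideId); // the legacy behavior stays untouched
            sender.send("com.example.ride.finished",
                        "{\"rideId\":\"" + rideId + "\"}");
        };
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        LegacyRideService legacy = rideId -> System.out.println("legacy: ride " + rideId);
        LegacyRideService adapted = withEvents(legacy, (type, json) -> sent.add(type));
        adapted.finishRide("r-123");
        System.out.println("events emitted: " + sent);
    }
}
```

The decorator keeps the legacy code path intact, so the adapter can be rolled out (and rolled back) without touching the monolith's core logic.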
2.8. Improving Resilience in Applications
Traditional systems often rely on synchronous calls and transactions, which can cascade failures across components. Replacing these with asynchronous event-driven communication reduces dependencies and makes the system eventually consistent.
For example, invoicing and notification services in a ride-sharing platform can process events independently, ensuring that downtime in one service does not block the entire workflow.
Retry mechanisms provided by the Event Mesh guarantee that transient failures are handled gracefully without data loss.
3. More about the Technology Stack
It’s worth noting that Knative’s Event Mesh is completely transparent to the applications. The applications publish and consume events, usually via HTTP REST, and the only thing that is required is the CloudEvents format.
The CloudEvents format provides a common envelope for events with metadata that every event needs, such as identifier, type, timestamps, or source information. The format is a CNCF standard supported by a number of projects and doesn’t enforce the use of any library.
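A sketch of that envelope: the context attributes required by the CloudEvents 1.0 specification can be carried in a plain map, with no SDK involved. The attribute names come from the specification; the values are illustrative:

```java
import java.util.Map;

public class CloudEventEnvelope {

    /** Context attributes every CloudEvent must carry (spec v1.0). */
    static final String[] REQUIRED = {"specversion", "id", "source", "type"};

    /** Checks that all required context attributes are present and non-empty. */
    static boolean isValid(Map<String, String> attributes) {
        for (String attr : REQUIRED) {
            String value = attributes.get(attr);
            if (value == null || value.isEmpty()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> event = Map.of(
                "specversion", "1.0",
                "id", "evt-42",                       // illustrative id
                "source", "/apps/ride-frontend",      // illustrative source
                "type", "com.example.ride.finished",  // illustrative type
                "time", "2024-01-01T12:00:00Z");      // optional attribute
        System.out.println("valid: " + isValid(event));
    }
}
```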
This makes the investment in Knative’s Event Mesh safe in terms of vendor lock-in. Architects can be assured that their options remain open and that solutions can be easily reconfigured down the road.
What’s more, relying on well-known and easy-to-deploy CloudEvents, typically over HTTP, makes testing simple and straightforward. Developers don’t need complex development environments because the Event Mesh integration can be easily tested with regular REST testing that most developers are familiar with.