Near-Real-Time Event Processing

(Figure: NearRealTime.png)

When devices or systems report status data more or less constantly, the value of that data is usually twofold: the first part immediate, the second much longer term. Some examples:
  • A temperature gauge in an engine reports every few milliseconds. Immediate value: ensuring the engine doesn’t overheat. Longer-term value: seeing patterns in this data in relation to other data, such as fuel efficiency (how does engine operating temperature affect miles per gallon?).
  • A GPS unit reports the position of a vehicle every few seconds. Immediate value: perhaps alerting the driver to congestion on their route. Longer-term value: aggregating with data from many more drivers, perhaps contributing to road development or traffic management plans.

In these cases there is no middle ground: the near-term value of the data drops off dramatically within minutes. Put another way, losing a few transactions doesn’t matter; more are coming, and very soon.

This split value of the data presents an opportunity: the persistence tier can be similarly bifurcated. A near-term persistence tier with no redundancy can optimize cost and speed. A long-term persistence tier can be fully redundant, optimizing cost and data availability.

Balancing the resources in the near-term persistence tier is worth some careful thought. Messages are coming in pretty fast, received by the ingest tier, which in turn pushes the relevant bits into the persistence tiers.

What’s ‘pretty fast’?  Factories can contain hundreds or thousands of machines, each with dozens of sensors reporting data.  The social sphere contains millions of devices.  These multipliers can quickly result in many thousands of messages per second for a service to process.

That persistence tier is also handling requests for the data in service of its “immediate value” uses. It must be “big enough” in terms of CPU, memory, and network resources to accommodate both requirements. This resource balancing can only be achieved through effective load testing.

The Solution

The implementation provided (see Source Code) uses OWIN to provide the WebAPI infrastructure, running on an Azure Worker Role. A standard ASP.NET MVC 4 WebAPI project template works well in this configuration; both the Visual Studio 2012 and 2013 templates were tested successfully. Project Katana implements OWIN on the Microsoft stack and is available via NuGet.
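To make the hosting model concrete, here is a minimal sketch of how a Katana self-host might be bootstrapped inside a worker role. It assumes the Microsoft.AspNet.WebApi.OwinSelfHost NuGet package (which brings in Microsoft.Owin.Hosting and the UseWebApi extension); the Startup class follows the standard Katana convention, while the class names and hard-coded port are illustrative only. A real worker role derives from RoleEntryPoint and reads its endpoint from the role configuration.

    using System;
    using System.Web.Http;
    using Microsoft.Owin.Hosting;
    using Owin;

    public class Startup
    {
        // Katana calls Configuration() when the host starts.
        public void Configuration(IAppBuilder app)
        {
            var config = new HttpConfiguration();
            config.Routes.MapHttpRoute(
                name: "DefaultApi",
                routeTemplate: "api/{controller}/{id}",
                defaults: new { id = RouteParameter.Optional });
            app.UseWebApi(config);
        }
    }

    public class WorkerRole // derives from RoleEntryPoint in the actual role
    {
        private IDisposable webApp;

        public bool OnStart()
        {
            // Illustrative endpoint; normally taken from the role's input endpoint.
            webApp = WebApp.Start<Startup>("http://*:8080");
            return true;
        }

        public void OnStop()
        {
            if (webApp != null) webApp.Dispose();
        }
    }

Because the OWIN host sits on HttpListener while the same controllers also run under IIS, the IIS-vs-OWIN comparison in the Performance section can use identical application code.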

The “immediate value” persistence tier is provided by Windows Azure In-Role Cache. The configuration chosen is the dedicated (as opposed to co-located) option.
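As a rough illustration of how the ingest path might talk to that cache, the sketch below uses the In-Role Cache client API (DataCacheFactory and DataCache from Microsoft.ApplicationServer.Caching) and assumes the cache client is configured in app.config. The SensorReading type and the five-minute time-to-live are assumptions for illustration, not taken from the sample.

    using System;
    using Microsoft.ApplicationServer.Caching;

    [Serializable] // items placed in the cache must be serializable
    public class SensorReading
    {
        public string DeviceId { get; set; }
        public double Value { get; set; }
        public DateTime TimestampUtc { get; set; }
    }

    public static class NearTermStore
    {
        // DataCacheFactory is expensive to create; build it once and reuse it.
        private static readonly DataCacheFactory Factory = new DataCacheFactory();
        private static readonly DataCache Cache = Factory.GetDefaultCache();

        public static void Save(SensorReading reading)
        {
            // A short time-to-live matches the "value drops off within minutes"
            // model: stale readings simply age out, with no cleanup job needed.
            Cache.Put(reading.DeviceId, reading, TimeSpan.FromMinutes(5));
        }

        public static SensorReading GetLatest(string deviceId)
        {
            return (SensorReading)Cache.Get(deviceId);
        }
    }

Keying by device id keeps only the latest reading per device, which fits the no-redundancy trade-off described earlier: losing a cache instance loses only data that was about to expire anyway.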

The “longer term” persistence tier is provided by Event Tracing for Windows (ETW) logging plus Azure Diagnostics Custom Directory, which moves the logs from the compute instances to Azure Blob Storage.
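The sketch below shows the shape such an ETW source might take using the .NET 4.5 EventSource class. The event source name, event id, and payload are illustrative; getting the events from ETW into files under a local resource folder (for example via an out-of-process collector or an in-process event listener) is a separate concern, and it is those files that the Azure Diagnostics custom directory configuration copies to blob storage.

    using System.Diagnostics.Tracing;

    [EventSource(Name = "Contoso-Telemetry-Ingest")] // name is illustrative
    public sealed class TelemetryEventSource : EventSource
    {
        public static readonly TelemetryEventSource Log = new TelemetryEventSource();

        private TelemetryEventSource() { }

        [Event(1, Level = EventLevel.Informational)]
        public void ReadingReceived(string deviceId, double value, string timestampUtc)
        {
            if (IsEnabled())
            {
                WriteEvent(1, deviceId, value, timestampUtc);
            }
        }
    }

An ingest controller action can then perform both writes per message: NearTermStore.Save(reading) for the immediate-value queries and TelemetryEventSource.Log.ReadingReceived(...) for the long-term archive.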

Performance

Several variations on the implementation were load tested so we could learn where the bottlenecks would be and under what conditions they would start to show up. For example, we tested the WebAPI tier with no code (all it did was return a 200), then with caching, then with caching plus logging. We tested both the IIS and OWIN configurations, and both the Visual Studio 2012 and 2013 templates. Throughout, we monitored the behavior of caching and logging to see how they reacted under load. All testing was done with a single small compute instance ingesting the data; the caching tier was likewise a single small compute instance.

NOTE: Your mileage may vary.  These numbers are provided as a guide, not a guarantee of performance.

When testing the empty WebAPI alone, we found that OWIN was marginally faster than IIS, processing about 1300 TPS vs. 1200 TPS with the VS 2013 ASP.NET 4.5 template. IIS, however, is much better at queuing requests when it gets busy, so when load ebbs and flows around the peak it is more likely to process requests without errors. When added to the tests, the caching and logging functions had no trouble keeping up. With the VS 2012 template, OWIN had a much greater advantage over IIS, reaching about the same 1300 TPS vs. roughly 500 TPS for IIS.

We added instances to the ingestion tier several times and found that the solution as a whole scales linearly. The same holds true when adding instances to the cache tier. Logging had no observable effect on performance.
