Wednesday, November 9, 2016

20+ Questions You Want to Ask About a Stream Processing Framework

About a year ago I started wandering around in the world of stream processing. Having been what I would call an "average" Java developer, building plain old Java applications, libraries, desktop and web-based stuff, the transition to the "big data"-ish stack of technologies was definitely an eye-opener - in both positive and sometimes negative ways.

In this post, I would like to present a number of questions you might find useful to ask about the next stream processing framework you are about to adopt. There are also some answers here, based on my personal (and my colleagues') experiences taking the first steps with Apache Flink and Apache Spark Streaming. Obviously, the questions selected here and many of the answers are highly subjective and may not reflect your exact needs or previous experience. Nevertheless, if you have just recently started gazing upon the big data landscape and found (near real-time) stream processing to be of some interest, you might also be able to take advantage of the experiences here.

The questions are divided into five sections, roughly following the lifecycle of an application, from building it to monitoring it in its working environment.


Where do I start from?

Spark users might grab the Spark Streaming Programming Guide and go through some examples. There's also a quick start guide available for Scala, Python and Java. Flink users can have a look at the Basic API Concepts and the DataStream API Programming Guide. The Maven quickstart example is also a great way to set up your first Flink program.

Where does my code run? How does it get there?

In both Spark and Flink, your code runs either on the "master" node (Spark Driver / Flink JobManager) or one of the "worker" nodes (Spark Worker / Flink TaskManager). The code that sets up your processing topology is usually executed on the master node, while the topology itself is serialized and transported to the worker nodes for execution.
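To make the "serialized and transported" part concrete, here is a framework-free sketch in plain Java (all class and method names are invented for this example): any function you ship to a worker must survive a serialization round-trip, which is exactly why Spark and Flink require your operators to be serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

// Sketch: like Spark and Flink, we serialize a user function on the "master"
// side and deserialize it on a "worker" before applying it to data.
public class ShippedFunction {

    // The function must be Serializable, or shipping it to workers fails --
    // the same requirement Spark and Flink place on your operators.
    public interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

    public static byte[] serialize(SerializableFunction<?, ?> fn) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(fn);
        }
        return bytes.toByteArray();
    }

    @SuppressWarnings("unchecked")
    public static <T, R> SerializableFunction<T, R> deserialize(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SerializableFunction<T, R>) in.readObject();
        }
    }
}
```

If you try the same round-trip with a function that holds a reference to a non-serializable object, the serialization step throws - which is the local-mode equivalent of the "task not serializable" errors you would otherwise first meet on a cluster.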

Where do I get my data from?

From an implementation of a Receiver (Spark data source) or a SourceFunction (Flink data source). Many of those are available out of the box or as extensions e.g. for reading from file systems, sockets, Kafka, Flume, Kinesis etc.

Nevertheless, you can always implement a custom source. The approach is similar for both Spark (custom receivers) and Flink - you have to implement a receiver function running on a thread that waits for incoming data and pushes the received data into the stream upon arrival.
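The receiver pattern described above can be sketched without either framework, in plain Java (names invented for the example): a receiver runs on its own thread, blocks until data arrives, and pushes each record downstream. In Spark you would do the pushing via the Receiver's store() method; in Flink, via the SourceContext passed to SourceFunction.run().

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Framework-free sketch of a custom stream source: a thread that waits for
// incoming data and emits each record downstream upon arrival.
public class PollingReceiver implements Runnable {

    private static final String POISON = "<poison>"; // shutdown marker for the sketch

    private final BlockingQueue<String> input;   // stands in for a socket, Kafka, etc.
    private final List<String> downstream;       // stands in for the emitted stream
    private volatile boolean running = true;

    public PollingReceiver(BlockingQueue<String> input, List<String> downstream) {
        this.input = input;
        this.downstream = downstream;
    }

    @Override
    public void run() {
        try {
            while (running) {
                String record = input.take();  // block until data arrives
                if (POISON.equals(record)) {
                    break;
                }
                downstream.add(record);        // push into the stream
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public void cancel() { // mirrors SourceFunction.cancel() / Receiver.onStop()
        running = false;
    }

    // Convenience demo: feed a fixed batch through the receiver thread.
    public static List<String> collect(List<String> records) {
        BlockingQueue<String> input = new LinkedBlockingQueue<>(records);
        input.add(POISON);
        List<String> downstream = Collections.synchronizedList(new ArrayList<>());
        Thread thread = new Thread(new PollingReceiver(input, downstream));
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return downstream;
    }
}
```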

If streams are unbounded, how can my streaming operations reach a result?

There aren't many operations that can be applied directly to an unbounded stream; usually some sort of windowing is needed on the data. Spark Streaming applies windows to your data automatically, dividing it by arrival time into batches of a fixed length that you provide when setting up your processing topology. In Flink, you need to explicitly specify the window operations to be applied to your data.
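A framework-free sketch of what a time-based tumbling window does to an unbounded stream (plain Java, names invented): elements are grouped into fixed-length batches by arrival time and an aggregate - here, a count - is produced per window. Spark Streaming does this batching for you; in Flink you request it explicitly with a window assigner such as a tumbling processing-time window.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: counts elements per fixed-length window, given their arrival
// timestamps (milliseconds). Assumes timestamps arrive in order.
public class TumblingWindow {

    public static List<Integer> countPerWindow(List<Long> arrivalTimes, long windowMillis) {
        List<Integer> counts = new ArrayList<>();
        if (arrivalTimes.isEmpty()) {
            return counts;
        }
        // Align the first window to a multiple of the window length.
        long windowStart = (arrivalTimes.get(0) / windowMillis) * windowMillis;
        int count = 0;
        for (long t : arrivalTimes) {
            if (t < windowStart + windowMillis) {
                count++;                       // still inside the current window
            } else {
                counts.add(count);             // close the window, emit its result
                windowStart = (t / windowMillis) * windowMillis;
                count = 1;
            }
        }
        counts.add(count);                     // emit the final, partial window
        return counts;
    }
}
```

This is also why results become possible at all: each window is a bounded slice of the unbounded stream, so a result can be emitted every time a window closes.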

How can I make my design modular?

  • Extract complex functions into separate classes.
  • Separate parts of your topology into factory functions. You can use IoC containers, such as Google Guice or Spring IoC to wire these functions together. This allows you to test subsections of your topology separately.
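The factory-function idea can be sketched independently of any framework (plain Java, names invented): each factory returns one section of the pipeline as a plain Function over a stream, so sections can be wired together - by hand or by an IoC container - and tested on their own.

```java
import java.util.function.Function;
import java.util.stream.Stream;

// Sketch: a topology split into factory functions. Each section is an
// independently testable transformation; the full topology is just their
// composition.
public class Topology {

    // One sub-section of the pipeline: parsing.
    public static Function<Stream<String>, Stream<Integer>> parseSection() {
        return lines -> lines.map(String::trim).map(Integer::parseInt);
    }

    // Another sub-section: filtering.
    public static Function<Stream<Integer>, Stream<Integer>> filterSection(int threshold) {
        return numbers -> numbers.filter(n -> n >= threshold);
    }

    // The full topology: parse, then filter.
    public static Function<Stream<String>, Stream<Integer>> full(int threshold) {
        return parseSection().andThen(filterSection(threshold));
    }
}
```

Applying full(10) to the input " 3", "10 ", "42" yields 10 and 42; each section can be exercised in a unit test with plain in-memory data, without building the whole topology.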

How can I prepare for and recover from failures?

  • Do not rely on local state (static variables, caches, etc.); use the state management facilities provided by Spark / Flink.
  • In general, try to avoid recovering into a new version of your application, if possible - especially if the new version changes the topology.
  • In Flink, learn about Savepoints and Fault Tolerance to make sure your code is upgradable, in addition to being recoverable.
  • In Spark, have a look at the fault-tolerance semantics described in the programming manual. 

How can I manage state?

In Spark, you need to use the updateStateByKey or mapWithState functions to map the previous state of your stream to its new state. Your state is kept directly in the RDDs produced by your topology.

In Flink, you can implement stateful functions (both source and transform functions). The State Backend is responsible for storing and restoring the state stored by your stateful functions.
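The common idea behind both approaches can be sketched without either framework (plain Java, names invented): for each incoming (key, value) record, the previous state of that key is combined with the new value to produce the new state - which is what updateStateByKey / mapWithState do per batch in Spark, and what keyed state does per record in Flink.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a running sum per key. The map stands in for the framework's state
// backend (RDD lineage in Spark, the State Backend in Flink).
public class KeyedState {

    public static Map<String, Long> updateAll(Map<String, Long> state,
                                              List<Map.Entry<String, Long>> batch) {
        Map<String, Long> next = new HashMap<>(state);
        for (Map.Entry<String, Long> record : batch) {
            // Combine the previous state of this key with the new value.
            next.merge(record.getKey(), record.getValue(), Long::sum);
        }
        return next;
    }
}
```

Returning a new map rather than mutating the old one mirrors how both frameworks treat state snapshots: the previous state stays intact, which is what makes checkpointing and recovery possible.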


What are my “units” for unit testing?

I would recommend separating the processing components (i.e. the functions that act upon data) and the processing topology (i.e. the pipeline) into separate classes / components as much as you can. Although defining your whole stream processing pipeline in one place makes it easier to grasp, it doesn't really give you any benefits when building automated tests. If you separate your processing pipeline into sub-components, you should be able to provide test inputs to each of them separately, without the need to involve real input / output operators. This applies to both Spark and Flink.

Can I include the surrounding framework into my tests?

In both Spark and Flink, yes you can (to some extent). For example, you should be able to:

  1. Build streams from collections of objects, to facilitate data-driven testing.
  2. Build ad-hoc stream sources and outputs to test the endpoints of your topology.
  3. Run your whole topology in "local mode", without the need to deploy it to a cluster.
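Point 1 above can be sketched framework-free (plain Java, names invented): build the test input from an in-memory collection, run one section of the topology over it, and compare against the expected output. Spark (ssc.queueStream) and Flink (env.fromElements / fromCollection) offer equivalents for driving a real topology this way.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: a tiny harness for data-driven tests of one pipeline section.
public class SectionHarness {

    // Runs one pipeline section over an in-memory input collection and
    // materializes the output for assertions.
    public static <A, B> List<B> run(Function<Stream<A>, Stream<B>> section, List<A> input) {
        return section.apply(input.stream()).collect(Collectors.toList());
    }
}
```

A test then reads as: given this input collection, this section must produce exactly this output - no cluster, no sockets, no Kafka involved.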

How can I test the distributed nature of my application?

Both Spark and Flink allow you to run your processing topology in local mode, with varying degrees of parallelism. Nevertheless, this usually happens in one JVM, and certain patterns that are incompatible with the distributed nature of stream processing can go unnoticed. For example:

  • Static variables (e.g. static singletons) do not behave the same way in local mode and in a distributed cluster.
  • Some serialization and deserialization problems can go unnoticed when running locally in Spark; Flink is usually able to catch unserializable object graphs in local mode more reliably.
As always, the safest way to cover the potential edge cases of cluster-based execution is still to set up a cluster of JVMs and run your integration tests against that setup.

Do I have to test the framework itself as well?

Luckily, not too much in case of Spark Streaming and Flink. Both usually run as intended. Just make sure you are careful with experimental portions of the APIs (usually annotated or documented as such). Also, take note that there are many streaming operations that run in-memory for efficiency's sake - so make sure you monitor performance-critical sections of your code well and don't expect any miracles from the framework when dealing with copious amounts of data.


How do I include third-party dependencies?

Either use the Maven Shade / Assembly plugin to create an uber-jar, or provide the dependencies separately through a shared file system. You can trade off between quick deployments and self-containment here. The generally recommended approach seems to be to build one big artifact containing everything except Spark / Flink itself and deploy that.

How do I isolate my code and dependencies?

That can be a challenge in both Spark and Flink. In both cases, your code runs on a JVM together with dependencies included with the framework itself. There's not much built-in modularity support either. Your code (JAR) gets loaded by a separate class loader, with the framework's main class loader as the parent, so the parent's dependencies have most likely already been loaded.

Leaving options like OSGi aside (not because it would be a bad idea - I just don't have any personal experience with combining OSGi and Spark / Flink), you can use Maven's Shade plugin to relocate your required dependencies to custom packages. That helps if the versions of your dependencies don't match the ones Flink and Spark require.

One big artifact or many small ones?

With Flink, your best bet is to create one big uber-jar for deployment; this also allows for easy deployment from the JobManager UI. It is possible to add your dependencies via a custom classpath, but this requires the additional artifacts to be available on a shared path (e.g. a network share). With Spark, the command-line tools allow for more flexibility. For example, in case of a YARN deployment, the CLI tools allow additional artifacts to be uploaded and shared with the entire cluster automatically. Still, building one big jar tends to be less error-prone.


Do I need to deploy any kind of middleware?

You usually need to deploy the framework itself. Both Spark and Flink support standalone clustered mode, YARN and cloud deployments. That is, your programs do not run in isolation, but rather on Spark / Flink nodes that must be configured and deployed prior to the deployment of your application.

Does my application share its sandbox with others?

Depends on how you deploy your application. In Flink, task managers have one task slot per manager by default, which means that your application's code runs in a separate JVM. However, this does not need to be the case; you can configure task managers to run multiple slots, in which case your code is separated from others through class loader isolation only. Have a look at Cluster Execution.

In Spark, executors get their own processes which stay up for the duration of the whole application, i.e. tasks from different applications always run in different JVMs. See Cluster Mode Overview for more details.

Which deployment tools do I have at my disposal?

Out of the box, shell scripts. The Flink JobManager also provides a REST API and an HTML front end for simple uploads.

To deploy the cluster itself, you can use tools like Apache Ambari for more convenient set-up and configuration management.

What kind of infrastructure do I need?

Nothing too spectacular. Standalone clusters need some hardware capable of running Java. Common cloud providers are usually supported as well - for example, Amazon EMR for both Flink and Spark.


What are the metrics I could monitor?

The most crucial metric tends to be back pressure from downstream: incoming data overwhelms a worker node, and the limited throughput of your stream results in a buildup of unprocessed input. This is especially critical if your input source does not support throttling. There are no direct metrics for back pressure available, however - so you might need to implement your own.

At any rate, there are some great sources of information about monitoring in Spark and Flink available online.

Which tools could (or should) I use?

JMX, Graphite and other "usual suspects" work fine. Ganglia support is also available for Flink out of the box; for Spark, a custom build is required due to licensing restrictions (more details here).
It should be noted, though, that due to the distributed nature of your application, getting the best out of monitoring can be tricky. Also, since the built-in metrics do not necessarily cover all your needs, you will probably need to add your own. Unfortunately, neither Flink nor Spark specifies good implementation patterns for this purpose.

How should I handle the system log?

If you are using a resource manager (e.g. YARN), you can use its history server and related log management features out of the box. But in any case, you need to collect and store the logs on your own. There isn't anything built in for this purpose, other than what the Log4J / SLF4J implementations provide. Note that Spark is built on top of Log4J, so using a different logging implementation requires a bit more work to set up.

Monday, May 23, 2016

Embrace the Change with Event Streams

Information systems thrive on change. We track and record changes of the state of the world, we generate new knowledge by observing changes and perhaps most importantly, we predict future changes by accumulating the knowledge about changes in the past. Tracking and observing all significant changes in large information systems is not a trivial challenge, however.

How do we observe change? As one would expect, there are several different methods for doing just that. You could be an active observer, remembering the state of the world and comparing your knowledge to the current state of the world once in a while. This kind of “polling” approach can be useful in situations where changes do not happen often, the state of the world is relatively easy to observe and small enough to have your own copy of it. Another option would be to have the world “tell” you directly when it changes. This requires very little of you as an observer, but puts expectations on the one changing the state, increasing overall complexity. Plus, if a new observer arrives, the world must adapt to inform both of you on the change. Clearly, we need something in between these two. Something that allows for significant changes of the world state (what we sometimes call events) to be observed without too many restrictive requirements on the world and its observers. This is where event-driven architectures come into play.

Event Stream is a new communication platform in Playtech's Information Management Solution (IMS) that is destined to fill this middle ground and to put forth a common software platform that provides a continuous flow of events from internal event producers to all interested event consumers. The project kicked off at the end of 2015 and is currently being developed in the IMS team located in Tallinn, Estonia, with the help and input from architects and developers from our offices around the globe. As a result of this project, a new method of communication between the technical services of IMS will be delivered. This allows for further decoupling and future expansion of IMS as a set of small services. For more information on what the IMS means to us and how it is being developed, have a look at this article by one of my colleagues, Ivo.

From the perspective of a software engineer, the Event Stream platform is built on top of Apache Kafka, which is already taking root in the IMS ecosystem. We develop a custom-made API for both consumers and producers, but also provide message wire format information for implementing your own. Most message serialization is done using the open Apache Avro binary or JSON formats. In the middle of the streams stands an Event Stream Manager service, built on top of Apache Spark, that keeps watch on the consistency and general health of the stream.

Does this mean that the way we have been doing inter-service communication is wrong? Not at all, the Event Stream is not the “silver bullet” to replace current communication protocols. It does, however, add a new dimension to communication and broadens the scale of future opportunities. Imagine running a bus company. The direct service-to-service APIs act like bus drivers – they know how to drive the system, how to get from one stop to another. You most definitely need drivers to run a successful bus company. But from the business perspective, it would not hurt to also have passengers or observers. They prefer to sit on the bus and watch the scenery – events – go by, enjoying the ride and coming back for more. This “observer experience” is what the Event Stream brings to applications.

Being an observer opens up a whole new set of opportunities. First of all, it gives us motivation to sort out what exactly are the significant (business) events that happen in IMS, bringing us closer to a model of the gaming business that is unified over all the services of IMS. On the other hand, new knowledge – and what might be even more important, additional business value – can be created on the observer-side (i.e. without the involvement of the real sources of information). For example, think about our core business activity – gaming. Based on gaming, we can assign bonuses and awards, provide player support, involve players in different campaigns and perform a number of other activities. Yet, gaming by itself is a relatively localized process with a restricted set of actions involved. Involving every other potentially interested party in the gaming process directly would (and in some cases already has) rapidly increase the complexity of our information systems beyond our grasp in terms of maintainability. To keep this evolution under control, we introduce a new view of the shared timeline of all IMS services, a view where all events that have happened flow through separate “streams” which every other service is able to follow and make decisions upon.

This unified medium of delivery between services also leads us to the path of distributed data. No longer do we need – or want – to have all the data centralized in a place where most of the action happens. Instead, it will become more natural to distribute data between different tiers of accessibility, availability, structure and cost. While real-time systems can work on the usual on-line transaction processing databases, reporting and business intelligence can gather their data from bigger, less normalized data warehouses. Data aggregation can happen immediately from streamed events, but also from historical data stored in an offline distributed data store. A lot of the infrastructure can start using commodity hardware to save on costs.

As one might expect, there are substantial expectations on the quality of such an event delivery solution. Over a million events per minute, being delivered in an eventually consistent manner in a predictable order is a feat that requires specialized software components and design patterns. We aim to keep the overhead on the event producer side to a minimum, while still keeping consistency guarantees and providing near-realtime delivery to consumers. Load balancing between consumers must happen naturally and the internal technical API both consumers and producers are bound to use must be simple, yet technology-agnostic.

Most important of all, we hope to gain a lot of experience from introducing event-driven architecture to IMS and to see how the open-source technologies behind it can shape our understanding of how modern scalable online gaming systems could be built. We aim to do our best sharing this experience with you and perhaps spark an interest in Spark and other interesting technologies out there. If our wary adventures into the wonderful world of event streaming could be interesting for you to read about, feel free to leave a comment. No-one should be left behind.

Sunday, January 17, 2016

How to Design an API?

I would consider APIs or Application Programming Interfaces to be one of the cornerstones of reusable software. A well-designed API can empower a tiny bit of software to be reusable and useful for a lot of engineers out there. A poorly laid-out API, on the other hand, could mean the death of an otherwise functional and useful marvel of software engineering.

In this and related posts, I would like to bring out some of the personal design ideas that I've picked up over the years of building and maintaining APIs. These have definitely been influenced by my own experiences (and perhaps personality) a lot. Hence, do not worry if your intuition or what you might have heard elsewhere on what makes a good API does not match with the opinions presented here. I for sure won't. :)

1. Be deliberate. Design an API with an intent for others using it. Sharing is caring, but only if you share with an intention to share. Using an API should make a programmer feel all fuzzy inside and love the fact that your API is The One API ™ that makes her life that much easier. If you feel that you do not want to build an API, you do not have to. It's all cool. If you feel that you want to, be ready to put an effort in it.

2. Use meaningful names. I hate naming things. Naming is important. You should name everything in your API in a meaningful, recognizable way. In fact, I would go so far to say that a good API should not need documentation. Just good names. Did I mention I hate naming? You probably will too, when you are designing APIs. But you will love the result, after every class, method, parameter and constant has a proper name. And so will everyone else who is using your API.

3. Evolution beats revolution. Design your API with extension and evolution in mind. An API that does one thing well, is a nice thing to have. What you want is an API that can do that one thing better in the future. Doing something better often involves doing it a bit differently. But you shouldn't discard the folks who are satisfied with the current state of affairs. So don't lock them out by re-writing the API from scratch. Keep compatibility, keep your promises and keep your past decisions close to your heart. You might suffer because of your odd decisions in the past. Good. Suffering helps you be better the next time around. Just don't take it personally. There's you and then there's your code.
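A small Java illustration of evolution over revolution (all names invented for the example): the original method keeps its contract untouched, while the new capability arrives as a default method, so existing implementations keep compiling and behaving exactly as before.

```java
// Sketch: evolving an API without breaking its existing users.
public interface Greeter {

    // The original contract -- existing callers and implementations are
    // unaffected by anything added later.
    String greet(String name);

    // Evolution: a new capability added as a default method with a sensible
    // fallback, instead of rewriting the interface from scratch.
    default String greet(String name, boolean shout) {
        String greeting = greet(name);
        return shout ? greeting.toUpperCase() : greeting;
    }
}
```

An implementation written against the old single-method interface automatically gains the new overload, which is the "keep your promises" part in practice.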

4. No surprises. Some folks call it the POLA. I call it "not being an a-hole". Personally, I dislike surprises. Even when packaged as an API. If your well-named deliberate and highly-evolved API shows up to a party with a whole different face and does something developers had no idea about, it can get a bad rep. If your API does something like that in a production environment, then amazement happens on a whole different level. All-in-all, design an API to do what you have promised it will do and keep the side-effects to a minimum.

5. Freedom of choice. Allow the developer using your API to choose exactly where, when or how your API will be called. Reduce the number of expectations you set on the user. Don't hide important choices from your user. Be clear about details on how your API behaves in certain situations and allow for behaviours to be configured. Just remember to do one thing well - it's better to provide a different API than another configuration parameter changing the whole contract of the API.

6. Be respectful. Treat your user's data with care. Let your users manage and own resources; do not use more resources than absolutely necessary. Provide meaningful feedback in API responses. Do not blame the user for bad input or unexpected calls. In general, be a respectful partner to your users and make your API behave the same way with the applications interfacing with it. Your API is there to help others to get where they want to.

7. Listen to feedback. Talk to your users and listen to what they have to say about your API. Try to find some meaning in all the constructive and destructive criticism they have to offer. Drive the future of your API in a direction you want to go, but use the moments of feedback to stop on the road and look around at the scenery. You probably don't get to go see all the scenery on the way, but you just might be able to pick up someone to share this ride with.

Friday, March 20, 2015

Profiling PHP Applications with Xdebug and CacheGrind

Recently I worked on a project that required setting up a code profiler for PHP on Windows. In case you are interested in doing the same, here are some tips for you.


Prerequisites:

- A working installation of PHP (version 5.2 or above)
- Apache or any other web server that runs together with the PHP installation (assuming you are profiling web pages).


Step 1: Download the correct Xdebug DLL from the Xdebug website. Note that you have to download a DLL that matches the version and architecture of your PHP installation. This information you can get from the output of phpinfo().

Step 2: Save the DLL to a reasonable location (e.g. somewhere near or in the PHP installation directory). As this is not a PHP module, this does not have to be the PHP's extensions directory.

Step 3: Configure the php.ini file (in your PHP installation folder):
zend_extension = "[the location of the xdebug dll]\php_xdebug-2.2.7-5.3-vc9.dll"

; this turns off the profiler by default:
xdebug.profiler_enable=0
; this makes the profiler overwrite, rather than append to profiling output
; (i.e. only the profile of the last request will be stored in output)
xdebug.profiler_append=0
; this enables reading a GET/POST/COOKIE parameter named XDEBUG_PROFILE to profile a specific request
xdebug.profiler_enable_trigger=1
xdebug.profiler_output_dir="[Path where to store the profiler output]"

If everything succeeded, then after restarting Apache, there should be an Xdebug section in the output of phpinfo() and the phrase "with Xdebug ..." next to the Zend logo. If this is not so, verify that the path to the DLL is correct and that the DLL matches the version of your PHP installation.
More parameters to play with can be found on the documentation page of Xdebug. In general, if you want to profile just one specific HTTP request, it makes more sense to use the profiler_enable_trigger=1 parameter, with profiler_enable=0, so that you can pinpoint the exact request by adding an XDEBUG_PROFILE=1 parameter to the request. On the other hand, to get the total profiling output for a series of requests, you should set profiler_append=1 and possibly profiler_enable=1 (or XDEBUG_PROFILE=1 in a cookie) to get a cumulative profile of multiple queries in a row.
Note that by default, Xdebug writes the output to a single file, called cachegrind.out.[process ID].

Step 4: Install QCacheGrind. This one is a lovely Windows port of the KCacheGrind tool that allows you to analyze the profiling output. Documentation on how to use it is available on the KCacheGrind website.

Keep in mind that the profiling output can grow very fast if the profiler is constantly on. With QCacheGrind, there seems to be an upper bound to the number of profiles that are stored in one output file - from some point on, the tool tends to crash without warning. So, keep your profiling output lean or distribute it to separate files (see the xdebug.profiler_output_name setting).

Step 5: Success! Profiling should now be enabled.

Friday, October 10, 2014

MicroReview - BitFenix Prodigy Mini Tower

Disclaimer / warning: This review contains opinions. Also, if you're looking for a technical specification or unbiased information on how this piece of hardware actually fulfills its duty, you won't likely find it here. But, if you are a system builder and would like a second opinion or hints on what to keep in mind when working with this specific component, look no further. :)

The BitFenix Prodigy is a case that comes in many colors. Including white, as was mine.

It is a mini-ITX case which looks a bit like a micro-ATX case with an eating disorder, but the real reason for its chubby look is that it actually houses the motherboard in a horizontal position. Which can be quite convenient in some situations. There are no usual rubbery legs on this one, but rather both the top and the bottom of the case have "handles" - soft plastic design elements. These make the case a bit wobbly and slidey on its legs, and probably don't function that well as lifting handles when the case is full of hardware. But from design perspective, the arches are "meh" enough to like the case.

From the back-side of the case you can see that the motherboard is positioned somewhere in the middle of the case vertically, with the power supply mounted below it. This leaves enough breathing space for the processor and the graphics card - and with a large back-mounted case fan, provides a reasonably good cooling solution. On the top of the case there's a removable grill for better cooling (and noise-making).

Besides the user guide, this is what you find in the package:

The case comes with a handful of screws and a USB 2.0 to USB 3.0 cable included. Just in case your motherboard does not have the modern USB 3.0 connector available. On the inside, there are several places where you can mount your HDDs and SSDs. Not only in the drive rack, but there are special mounting areas for 2.5'' drives on the "other" side panel.

When assembling (or finding suitable components for) the case, be aware that the power supply area might not suit every PSU out there that well. I had some minor trouble fitting the cables of a Chieftec 450W power supply, for example. Luckily, there's enough room for bending and storing extra cabling, if needed.

Horizontal mounting of the motherboard can make it a bit easier to work with the case without having to turn it sideways, but with a case that small, getting access to the motherboard screw can be a bit difficult. So, either find a short screwdriver or one that can bend around corners.

Although the front panel for the case can be removed easily, be aware of the bezel for the optical drive. To remove it you HAVE to remove the whole front panel, as the bezel is attached with screws - rather than with pressure-fitted plastic pieces or any other non-breaky fixtures.
All in all, a cool and a bit unusual case to work with. Requires some fiddling and possibly some slightly fancier tools, but the end result looks really nice and clean.

Thursday, September 4, 2014

Renaming multiple files in Windows Explorer

I've been a Windows user for 10+ years by now and only today seemed to accidentally want to rename a selection of multiple files in Windows Explorer. And, as it turns out, it's actually a feature. Known from ages ago. Fun.

MicroReview - Cooler Master Elite 343 Mini Tower Case

Disclaimer / warning: This review contains opinions. Also, if you're looking for a technical specification or unbiased information on how this piece of hardware actually fulfills its duty, you won't likely find it here. But, if you are a system builder and would like a second opinion or hints on what to keep in mind when working with this specific component, look no further. :)

The Cooler Master Elite 343 is a lightweight aluminum mini-tower case that fits a microATX motherboard, up to the maximum 244 mm depth. When unpacked, this is what the case looks like:
As additional packaging material, there are three small bags of screws included: 
  • M3 screws (the ones with a tighter thread) - 16 pcs,
  • 6-32 screws (round head) - 16 pcs, and
  • 6-32-6 screws (hex head) - 17 pcs, 
plus a handful of cable ties and nine motherboard stand-offs. Also, a small buzzer (PC speaker) and a manual are included.

The front panel fingerprint-magnets are nicely covered with thin transparent film. The film is not in one piece, but rather in separate bits for each (re)movable surface it covers. This means that you can avoid removing the film until the last moment when your system is finished.
From behind, the case looks like this:
Pretty much a standard case layout, with a top-mounted power supply (not included). Both side panels are fixed with thumbscrews that are easy to remove. The side panels themselves are a bit tricky to fasten to the sides - as is often the case with softer panels - need to take care to make sure both the top and the bottom edge are properly fastened from each fastening point. On the back and the left side panel, there are mounting points for additional fans. Only the topmost cover for extension cards (the main PCIe slot, usually) is removable and re-attachable using a screw.
On the inside (at least in my case), the wiring for the front panel fan was really well hidden - I had to remove the motherboard-side panel to get to it. Also, there's an interesting thumbscrew fixing the drive bay. As it turns out, the drive bay is removable - but to remove it, you also need to unscrew four additional screws on the bottom of the case. I wouldn't recommend keeping them unscrewed for convenience, though, since the drive bay adds a reasonable amount of strength to the otherwise soft aluminum case.

The front panel wiring does not have a universal plug and is provided as separate wires. In addition to the usual LEDs and switches, there is also a plug for HD audio, an AC'97 connector and one USB connector for the front-panel USB ports.
As mentioned before, the front panel features a case fan, with both a 4-pin HDD connector and a 3-pin fan connector available.

Installing the motherboard and a PSU to the case was a breeze, no problems encountered. The 5.25'' drives are made especially simple to install, with quick fasteners.

Installing a graphics card required a little bit of additional effort - some physical force was needed to fasten the metal panel of the extension card to the back-panel of the case. This was mostly due to the motherboard being mounted very closely to the edge of the case, which left not much room for maneuvering with the card itself.

All-in-all, a solid case with a reasonably small price tag. Not much to look at, but also not something you would need to hide from your friends and family. Easy to use and assemble if you're looking to build a simple microATX system.