Mulesoft and Apache Kafka for real-time data streaming and analytics

When it comes to moving large amounts of data from one place to another rapidly and at scale, Apache Kafka is an excellent choice for enterprises. As a powerful messaging system, Apache Kafka is tailored for high-throughput use cases where vast amounts of data need to be moved in a scalable, fault-tolerant way. An ideal use case for Apache Kafka is managing data from different sources such as log streams, records being sent to a database, or key-value pairs destined for NoSQL stores like Redis, all of them applications creating data at an incredible rate. The rate at which data moves can strain existing data stores and force you to add more stores to take on the load. Furthermore, a messaging environment depends on consumers actually being able to consume at a reasonable rate. There is also the challenge of fault tolerance. Another shortcoming of many existing messaging systems is that they run on a single node or host, which generally has a limited amount of local storage; this becomes a problem when application consumers are lazy, slow, or unresponsive for whatever reason.

Using Apache Kafka, independent data-producing applications send data to a topic, and any interested consuming application can listen in and receive data from that topic, process it, and in turn publish to a different topic for others to consume. In Kafka, topics are maintained by a broker: the Kafka broker is a software process, sometimes described as an executable or daemon service, that runs on a physical or virtual machine.

To achieve high throughput, Apache Kafka lets you scale out the number of brokers, distributing the load so that messages are processed efficiently on multiple nodes in parallel (the nodes together forming a cluster), all without affecting existing producer and consumer applications. Apache ZooKeeper is what enables individual broker nodes to be grouped into a cluster, distributing message processing across multiple brokers in a fault-tolerant way.

Given this isn't really supposed to be a tutorial on Apache Kafka, I'll stop here and start talking about the real topic: how Apache Kafka can be integrated with Mulesoft as an ESB to move data from multiple inbound data systems to a target system such as Hadoop HDFS for large-scale data processing and analytics.

This is an ideal use case for an ESB to handle, and Mulesoft, with its rich set of connectors, makes it very easy to set up multiple receivers on the bus which can then send data to NoSQL data stores such as MongoDB or distributed file systems such as HDFS (a rough sketch of such a flow appears after the consumer example below). Other applications can then run analytics on this data and store the results for later review.

Before we can get to more complex examples, we need to set up a simple producer-consumer example which (a) sends messages to the bus and (b) has an endpoint which consumes messages from the bus.


Step 1 - Provide a consumer.properties file with the following information.
group.id=MyGroupId
enable.auto.commit=true
auto.offset.reset=earliest

Step 2 - Include the following entries in the mule-app.properties file.
config.bootstrapServers=localhost:9092
config.consumerPropertiesFile=consumer.properties
config.producerPropertiesFile=producer.properties

Step 3 - Provide the following consumer-specific information (referenced as placeholders in the consumer flow below).
consumer.topic=testTopic
consumer.topic.partitions=2

Step 4 - Provide a producer.properties file with the following information.
config.bootstrapServers=localhost:9093
config.keySerializer=org.apache.kafka.common.serialization.StringSerializer
config.valueSerializer=org.apache.kafka.common.serialization.StringSerializer

The bootstrap server entries above point to your Kafka broker. The topic is where messages will be posted. The serializer entries are needed to correctly serialize messages to and from the broker at runtime.

The following is a sample configuration for your broker.
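Assuming the Mule 3 Apache Kafka connector and the property placeholders defined above, the global connector configuration might look roughly like the sketch below; attribute names can vary between connector versions, so check them against your version's documentation.

<!-- Global Kafka connector configuration referenced by the producer and consumer flows below -->
<apachekafka:config name="Apache_Kafka__Configuration"
    bootstrapServers="${config.bootstrapServers}"
    consumerPropertiesFile="${config.consumerPropertiesFile}"
    producerPropertiesFile="${config.producerPropertiesFile}"
    doc:name="Apache Kafka: Configuration"/>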



The following Mulesoft code allows you to POST serialized messages over HTTP and publish them to the requested topic on the Kafka broker.

<flow name="producer-flow">
        <http:listener config-ref="HTTP_Listener_Configuration" path="/pushMessage" doc:name="Push message endpoint"/>
        <dw:transform-message doc:name="Push request transformer">
            <dw:set-payload><![CDATA[%dw 1.0
%output application/java
---
{
"message": payload.message,
"topic": payload.topic
}]]></dw:set-payload>
        </dw:transform-message>
        <logger message="Message: &quot;#[payload.message]&quot; is going to be published to topic: &quot;#[payload.topic]&quot;." level="INFO" doc:name="Data to be produced logger"/>
        <apachekafka:producer key="#[server.dateTime.getMilliSeconds()]" message="#[payload.message]" topic="#[payload.topic]" doc:name="Kafka publisher" config-ref="Apache_Kafka__Configuration"/>
        <set-payload value="Message successfully sent to Kafka topic." doc:name="Push response builder"/>

    </flow>


And finally, the following Mulesoft code allows you to consume messages off the broker.

<flow name="consumer-flow">
        <apachekafka:consumer topic="${consumer.topic}" partitions="${consumer.topic.partitions}" doc:name="Kafka consumer" config-ref="Apache_Kafka__Configuration"/>
        <logger message="New message arrived: #[payload]" level="INFO" doc:name="Consumed message logger"/>

    </flow>


