Title_Documentation

Documentation

Contact Us

Knowledge Base Display

Data Streaming

The term data streaming refers to a flow of information arriving from a source in an undefined mode. In this paradigm there are different technologies that have different objectives:

  • Efficient message sorting
  • Real-time processing of messages
  • Real-time batch processing of messages

Applications that are developed in the context of data streaming are called, precisely, Streaming Applications. There are a number of key features within the development of Streaming Applications that need to be considered so as to understand the fundamental aspects and limitations of each framework.

  • Delivery Guarantees - This refers to the level of guaranteed delivery of information. There are different types:
  • At least one: the information will be processed at least once even in case of failure. This approach can generate duplicates, if not managed properly, in downstream systems.
  • At most one: The information will be processed at most once, so it may not even arrive in case of errors in the delivery chain.
  • Exactly one: The information will be delivered once and only once, even in case of errors

We understand well how exactly one semantic is, of course, the desirable approach but it is more complex to implement in distributed systems and sometimes requires trade-offs with performance.

  • Fault Tolerance - In case of problems, such as node, network, or other malfunctions, the framework must be able to restore its state and continue processing from where it stopped. This is typically provided by some sort of checkpoint made available by some technologies e.g. offset in kafka.
  • State Management - In the case of stateful processing it is necessary to have the ability to maintain a state. For this reason, a very important aspect is also related to the support of this functionality (saving and updating the state) by the framework used.
  • Performance - Latency (how fast a record is processed), throughput (number of records per second) and scalability. The goal is always to minimize latency and maximize throughput, but we know this is difficult to achieve.
  • Advanced Features - Event time processing, Wartermarks, Windowing are very important and useful features for The most complex use cases. For example process a record based on when it was generated by the source (time processing).
  • Maturity - It is important to know that from a usage perspective a technology has been widely adopted and has proven levels of scalability and performance in the enterprise environment.

Streaming Types

two possible approaches to streaming are:

  • Native: The record is processed as soon as it arrives without waiting further. There are always-on processes (operators, tasks, bolts depending on the technology) that process the record as soon as it arrives e.g. Flink, Storm, Kafka Streams.
  • Micro-batching: also known as Fast-batching , involves grouping data into small batches for a given period of time (milliseconds, seconds) and then processing them together in a single run of the batch. Examples of these technologies are Spark Streaming, Storm-Trident.

Both approaches have advantages and disadvantages: For example, “native streaming” aims to minimize latency by processing each record as soon as it arrives. However, this involves a throughput trade-off in that for each record, one must also manage the checkpoint to have robust error handling. In this case, state-management is also easy to implement by having processes dedicated to data consumption.

“Micro-batching” is exactly the reverse: error-management is easy to implement and throughput is high since the group of records is processed by parallelized processing and the checkpoint is handled on a group of records at the same time. The cons of this approach is that, of course, latency is introduced and it will not appear as “pure” streaming. The state management could be more articulated than in the previous case.

HyperIoT Framework & Data Streaming

Currently, HyperIoT natively supports the following technologies:

  • Apache Kafka
  • Apache Storm

Depending on the technology addressed, different scenarios open up. For example, Kafka is also used internally as technologies to enable Saga transactions. These aspects will be explored later in the chapter on Integration Patterns.

Advanced WebSockets Previous

Child Articles (2)

  • Apache Kafka

    Kafka was born in LinkedIn as an open-source project and as a high-performance event-streaming technology based on commit-log. The project evolves over time by orienting its vision beyond “simple”...

  • Apache Storm

    Apache Storm is a computational framework for distributed stream processing written primarily in the Clojure programming language. Originally created by Nathan Marz and the BackType team, the...