Processing large datasets using Kafka and Spark Streaming
We have seen how data analytics has taken on an important role in the industry over the last couple of years. We used to think of data analytics as a batch process, where we wanted to extract insights from the data once a month, once a week, or even once a day. Use cases have become more demanding, and now we want to do data analysis not only in batch but also in real time. Consider fraud detection, for example, where we want to detect fraud in real time by analysing user behaviour on a web page. When designing an application that needs to cover these kinds of use cases, it is necessary to think about fault tolerance, how much data will be processed, throughput versus latency, scalability, the data format (Avro vs. Parquet), and many other concerns.
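As a rough illustration of the real-time fraud-detection scenario, the sketch below (Scala, using the spark-streaming-kafka-0-10 integration) counts click events per user over a sliding window and surfaces unusually high rates. The topic name `user-clicks`, the `userId,action` message format, and the threshold are hypothetical placeholders, not part of this proposal.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object ClickRateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("click-rate-sketch").setMaster("local[*]")
    // Micro-batch interval: each 5-second batch is processed as a small RDD
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "click-rate-sketch",
      "auto.offset.reset"  -> "latest"
    )

    // Hypothetical topic carrying one "userId,action" event per message
    val events = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("user-clicks"), kafkaParams)
    )

    // Count events per user over a sliding 60-second window (evaluated every
    // 10 seconds) and flag users whose click rate looks suspiciously high
    events
      .map(record => (record.value().split(",")(0), 1L))
      .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(60), Seconds(10))
      .filter { case (_, clicks) => clicks > 100 }
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In a real deployment the suspicious users would be written to a sink (another Kafka topic, a database, an alerting system) instead of printed, and the threshold would come from the actual fraud model rather than a hard-coded constant.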
This proposal aims to explain how to approach requirements that call for streaming processing of large datasets using Hadoop-ecosystem technologies such as Apache Kafka and Spark Streaming. It covers the whole technology stack, including Apache Avro and schema evolution, Schema Registry for storing the different schema versions of the Avro documents, and Apache Zookeeper for tracking the Kafka offsets.
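To give an idea of what schema evolution looks like in practice, here is a minimal sketch built around a hypothetical `UserEvent` Avro record (the record name, namespace, and fields are assumptions for illustration). Version 2 adds an optional field with a default value, which is the kind of backward-compatible change that Schema Registry can store and validate across versions.

```scala
import org.apache.avro.Schema

object SchemaEvolutionSketch {
  // Version 1 of a hypothetical "UserEvent" record
  val v1: Schema = new Schema.Parser().parse(
    """{
      |  "type": "record", "name": "UserEvent", "namespace": "com.example",
      |  "fields": [
      |    {"name": "userId", "type": "string"},
      |    {"name": "action", "type": "string"}
      |  ]
      |}""".stripMargin)

  // Version 2 adds an optional field with a default value. Because readers
  // using v2 can still fill in the default when reading v1 records, this is
  // a backward-compatible evolution of the schema.
  val v2: Schema = new Schema.Parser().parse(
    """{
      |  "type": "record", "name": "UserEvent", "namespace": "com.example",
      |  "fields": [
      |    {"name": "userId", "type": "string"},
      |    {"name": "action", "type": "string"},
      |    {"name": "sessionId", "type": ["null", "string"], "default": null}
      |  ]
      |}""".stripMargin)

  def main(args: Array[String]): Unit = {
    println(s"v1 fields: ${v1.getFields}")
    println(s"v2 fields: ${v2.getFields}")
  }
}
```

When both versions are registered under the same subject in Schema Registry, producers and consumers can evolve independently: a consumer compiled against v2 keeps reading messages written with v1, and the registry rejects incompatible changes (such as removing a field without a default) before they reach the topic.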