January 26, 2021
In a previous article we discussed on how to perform “Movie Data Statistics with Apache Spark” showing you the way to leverage Spark to perform Batch Processing.
Today, and as we (at SOUTHWORKS) have continued working on Big Data and Distributed Task related projects, I want to show you a sample of how you can mix and match Spark Streaming with Batch Processing capabilities to perform real-time analytics of chunks of data flowing through the system.
With the rise of requirements for processing big datasets and doing real-time analytics you need to orchestrate the right solution and, by grabbing and combining both flavors you get from Spark, you bring to life a Lambda Architecture capable of handling those massive quantities of data by taking advantage of both batch and stream-processing methods.
Before we move on, I want to mention the group of people that collaborated on creating this article, the reference implementation linked to it and/or provided valuable feedback to allow me sharing it with you: Lucas Nicolas Cabello, Santiago Calle, Pablo Costantini, Mauro Ezequiel Della Vecchia, Derek Nicolas Fernandez & Alejandro Tizzoni.
We all hope you enjoy it 😃
This time we are going to see the underlying architecture you need to be able to perform near real-time video analysis, supporting both VOD and live video, using Spark. The idea of the whole thing is to show you how to handle complex workloads in a scalable and distributed way.
The solution admits, as input, video stream from different sources such as video files, webcams, or security cameras feed. Regarding the analysis processes, it has two pre-trained neural networks and an unsupervised learning model.
We’ll delve deeper into the architecture and each of the models later on.
As streaming tool we have chosen to use Apache Spark Structured Streaming functionalities. We are not going to explain how Spark works or what is a Dataframe, but you can check that out in our previous article .
Spark Structured Streaming is a bit more complicated to understand but as simple to develop. It is a stream processing engine built on top of the Spark SQL engine, which allows us to express streaming computation in the same way that we would express a batch computation on static data. The Structured Streaming engine executes a series of small batch jobs (often referred to as micro-batches) upon data streams and uses Write Ahead Logs and checkpointing to provide exactly-once (guarantee that each record will be processed once and only once even when there is a failure) fault-tolerance guarantees
In short, as said in the official documentation : ‘Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming’.
As we previously mentioned, the scenario targets a pipeline for real-time video analysis with high degrees of parallelization — video usage here is just to get a constant flow of streamed data (as it is simply being treated as that). For the sake of it, video sent from different sources gets processed by several state-of-the-art deep learning models, performing real-time analysis over chunks of data (frames).
For this purpose, we carefully selected three models (more details below) for the following use-cases:
Disclaimer: Although we detailed the AI models of choice below— the focus of this article is not on the models nor the logic chosen to process the data being ingested, but how to leverage an ingestion/analysis real-time pipeline that does the job.
The objective is to achieve a highly scalable live analysis & inference on extracted frames from the input video.
To keep track of how does our solution perform, we defined some metrics:
The solution heavily relies on Spark capabilities for distributed processing along with Kafka , since it’s the go-to option for real-time streaming data architectures while allowing delivering messages to multiple subscribers.
In addition, Cassandra is being used as the persistence layer, which is a high velocity No-SQL database which can handle both structured and non-structured data.
Note: In this repository you can find the entire reference implementation.
If you want to run the code locally please follow steps on the README file.
It is done by leveraging a docker environment, and a container has been created for each component of the solution.
There are two Kafka topics for queueing messages, and we push only the minimum needed information for each step into the topics:
We defined a keyspace named ‘videoanalysis’. In it, we defined three tables to store all the data generated in the analysis. The first one is videos. It contains video_id, which is the primary key; video_url; status; and error, which is a user-friendly error message to display if any error occurs. The second one is metrics. It contains metric_id, which is the primary key; video_id, which is videos’ table foreign key; metric_name; and metric_value. The third one is analysis. It contains video_id, which is videos’ table foreign key; frame_number; video_timestamp; inference model, which alongside video_id and frame number compose the primary key; inference_result; and inference_time.
We decided to go with pre-trained models for our video analysis tool for two main reasons:
But we did not only use pre-trained models, as we have also used an unsupervised learning algorithm to accomplish one of the functionalities. In this case we use unsupervised learning for cluster analysis, which groups or segments datasets with shared or similar attributes in order to estimate algorithmic relationships.
YOLOv3 is a real-time object detection system. This model applies a neural network to the full image that divides it in regions and after doing so, it predicts bounding boxes and probabilities for each of the regions it produced. Bounding boxes are then weighted by the predicted probabilities of the region it corresponds to.
An easier explanation to it: if we input a video frame, YOLO calculates the relative coordinates of the objects it recognizes and assigns a probability to this prediction. Anyone can then choose what probability is considered acceptable for the result to be correct and keep or ignore gotten results.
For a more detailed view, consider visiting the official webpage .
To detect joy we had to use three models, not just one:
First, the face detection model recognizes faces within a grayscale version of the original image, and returns their coordinates.
The original image is put away, and two images are generated for each face detected:
The Gender detection model and the Emotion detection model don’t have any inter-dependency and can be run on any order. The Gender detection model runs over the RGB images, meanwhile the Emotion detection model runs over grayscale images.
Finally, a single output is generated, containing for each face detected:
To extract the color palette of the multiple frames that are streamed to our solution, we decided to choose an unsupervised learning algorithm for clustering: K-Means .
Let’s say you have some vector X that contains n data points. Running our algorithm consists of the following steps:
This can be applied to gather the different points of a frame by color, and then extract their RGB code to obtain the color palette of the clusters, and therefore the color palette of the frame itself.
The API represents the entry-point for this reference implementation. Its endpoints allow us submitting videos for analysis, query both status and results, in addition to view specific frames alongside the result of their analysis.
This API is powered by Flask .
The RTMP endpoint allows submitting live video for analysis. To start streaming, a request must be sent to /analysis/rtmp-endpoint, indicating the desired models to be executed. The response will contain the address where the bitstream must be pushed.
The ReactJS provides a graphical interface to make the user experience simpler.
The app consists of 3 tabs:
This is not a straightforward nor simple solution, nevertheless it helps us accomplish what I wanted to show you on how to leverage Spark Streaming for real-time analytics at scale. The reference implementation focus on showing Spark’s parallelization power to handle constant chunks of data concurrently while applying different processes to each one and building, piece by piece, an interesting result.
While building the scenario the team decided to use pre-trained models and brought AI into the picture to make it more compelling, although for the sake of it the models themselves (ie. processes applied to the data) are not so relevant.
As final words, we see that the solution performs quite well even though there is still room for improvement, but at last pipeline’s performance will finally depend on the logic applied to your data. With time and effort even the models could be modified to work seemingly across the pipeline, but to demonstrate that Spark Structured Streaming can be combined with Machine Learning, it is far more than enough.
Originally published by Mauro Krikorian for SOUTHWORKS on Medium 26 January 2021