During Part 1, we introduced a reference implementation on how to implement a messaging-like app focused on monitoring users’ “mental health”, leveraging AWS Serverless Computing, Data Platform and AI Services. We focused on the data ingestion and processing part of the pipeline, describing the main architecture and how each component works.
This article (Part 2), focuses on Monitoring, Alerting and Visualization, more precisely about real time notifications and mid/long-term insights. To showcase near real time alerting we added a notification pipeline that would send alerts to the user and administrator when the former submits a streak of negative messages. On the other hand, for mid & long-term insights we created a series of dashboards to provide information to the administrators about the users sentiment and some other dashboards for the performance of the sentiment analysis process.
Before moving forward I want to mention the group of people that collaborated on creating this article, the reference implementation linked to it and/or provided valuable feedback to allow me sharing it with you: Lucas Cabello, Santiago Calle, Pablo Costantini, Mauro Della Vecchia & Alejandro Tizzoni.
So without further ado, let’s dive in the services used to implement these two features.
Real Time Alerting
As previously mentioned, we implemented a new feature that generates notifications when a user sends continuous streaks of negative messages. To do so, each sentiment got assigned a numeric value. Positive ones weight +1 and negative ones -1, while mixed and neutral are considered as 0, for they do not affect the outcome. The trigger condition is an aggregated result equals or lesser than -3 within a time window of 1 hour.
While the most obvious way a user can reach it is sending 3 straight negative messages, if within a group of 5 messages there is 1 positive and 4 negatives, the condition is also met.
When the negative streak threshold is exceeded, the pipeline sends a notification using SNS and SES. This feature allows an immediate action towards caring of our users’ well-being.
Sentiment Analysis step function refactor: To begin with, we performed a refactor over the Sentiment Analysis step function where the sentiment output now is sent to a Kinesis Stream instead of being directly persisted to DynamoDB and the S3 Datalake. This allowed us to use Kinesis Data Analytics to perform real time analysis over the stream of messages that are sent to the Kinesis stream.
Kinesis Data Analytics: This service provides an easy way to transform and analyze streaming data in real time using Apache Flink or SQL-like queries. In our case we used it to first transform the stream by changing the sentiment result from a text value to a numerical one (as explained earlier) and then performing a windowed query that filters the users that surpass the allowed negative threshold. The output is sent to another Kinesis stream that will be used to trigger the Notifications step function.
Notifications Step Function: Inside this step function we first get the user data from DynamoDB to merge it with the information received from the Notifications stream. The resulting information is sent in parallel to SES and SNS to notify the users and administrators.
Mid & Long term Monitoring
To easily analyze all the information gathered by the system, we first use Kinesis Data Firehose to ingest the sentiment results stream and persist it to a S3 Data Lake. By default, Firehose store the data partitioned by event time. It creates a folder structure (year/month/day/hour/) and save the data into it creating one or multiple files according to the buffer size and time configuration. We configured Firehose to store the information in Apache Parquet format, as with this one, we can save storage by using compression while also improve the performance of several queries we run later.
Below you can see a benchmark on reading a few bunch of records with JSON and Parquet formats, which showcases our last statement:
The data stored in S3 Data Lake is being used to feed some dashboards and get an interactive view. For this we chose Athena for querying the Data Lake and Quicksight visualizing it as shareable dashboards.
For the sake of the sample we created a series of dashboards with ‘common’ use cases, but the possibilities are endless.
Pipeline’s performance dashboard
The main idea of this dashboard is to have information about how the pipeline is performing, we can check the number of messages recognized in each category for different time frames and also the amount of messages recognized by Lex or Comprehend — with this last insight we know whether or not we need to evolve the sentiment recognition model and/or start brainstorming about how to take it to the next level.
User’s evolution dashboard
This dashboard allows an administrator to quickly see how a user was feeling in the last 7 days or in any given time window.
The idea of this last dashboard is to show sentiment information for a certain group of users, here we can use any information for grouping as the user’s age, nationality, gender, etc.
The main goal for this second part of the same reference implementation, that started with ingesting and processing users’ sentiment, is to show how you can track and react in real-time to the messages being sent by participants. In addition, we wanted to show you how with Big Data Analytics and Visualization tools you can get interesting insights on the mid/long-term about sentiment evolution per user, age, country in our case — but with the intention of giving you an idea of how you can leverage the same solution on any other relevant data your business might have.
Going into the technical details, first we showed that with the use of Kinesis Data Analytics we achieved performing real time analysis over streams of data that flow through a Kinesis Stream. This AWS service can be used to analyze any kind of streaming data, with SQL-like queries that easily transform and filter the information to provide real time insights on the data being streamed.
Additionally, by leveraging Athena and Quicksight, we showed you how a S3 Data Lake can be easily queried to create useful dashboards for the administrators. We have also briefly talked about Big Data file formats and we chose Parquet as it met our reading and storage expectations better than other formats offered by the tool of our choice (Firehose in this case). Quicksight dashboards are completely customizable and can provide information about the pipeline performance and the users sentiment evolution.
Finally, and as a short recap on both articles so far, we have covered the first stages of a Big Data Analytics Pipeline, from data ingestion to processing and storage in the first one, while this one covers another mile getting into querying, visualization and real-time monitoring and alerting. In the next article, which will be the final part of these series, we will be covering the last mile to make the whole solution smarter and allow you to take better decisions — we will target sentiment model improvement going deeper into the machine learning, and supervised learning specifically, world.
If you are interested in seeing this live please refer to the following repository.
Finally! You can find the end of this amazing journey here.
Originally published by Mauro Krikorian for SOUTHWORKS on Medium 7 May 2021