Mauro Krikorian
January 8, 2021
Continuing with SOUTHWORKS’ commitment to Big Data projects and our contribution to the big community of Azure developers out there, we present this article, in which we describe an application that ingests sports events data from multiple sources, processes it, and generates an insightful dashboard with information about matches, teams and Twitter influencers from the Spanish Football League.
All this is done with the help of several managed services from the Azure Data Stack: Data Factory, Databricks & Stream Analytics — just to name a few. Let’s first introduce the project’s context/scenario and then move on to how we implemented a solution for it.
Before we move on, I want to mention the group of people that collaborated on creating this article and the reference implementation linked to it, and/or provided valuable feedback that allows me to share it with you: Lucas Gabriel Abella, Nicolás Bello Camilletti, Mariano Agustín Iglesias & Diego Satizabal.
We all hope you enjoy it 😃
Let’s suppose that 3 different actors from a sporting TV show have asked us to provide a solution for these 3 situations:
a) Luckily, they can provide us with an API containing data on the Spanish Football League. The endpoint’s responses are not precisely Big Data, but they need to be refreshed on a daily basis.
b) Also, they are just interested in Twitter as the social media source. Here the big challenge is to continuously feed our solution with this streaming content, while combining it with the API data.
Having these UCs in mind, we came up with the following solution, which we are going to discuss during the rest of the article.
Based on the detailed UCs, and as explained in the intro, the main purpose of this application is to generate a dashboard to show sporting events — in our case related to the Spanish Football League.
Here are its main responsibilities:
The following image depicts the architecture described above:
So, let’s see what each piece of the diagram shown above does within the next sections.
The first step in our solution is the semi-static data ingestion stage. This process consists of retrieving information from two endpoints of the TheSportsDB API, events & teams, on a daily basis.
📝 Note:
As stated on its website, TheSportsDB is an open, crowd-sourced database of sports artwork and metadata with a free API. It has information on different sports like football, the NBA, the NFL and more. Regarding football, it contains data about leagues all around the world, such as the English, Spanish and Argentinean ones, among several others.
We are using the information returned by this API for the Spanish Football League (also known as La Liga) as our static-data source. If you want to learn more about this API’s content, you will find all the information related to the available endpoints here.
To accomplish this we are using Azure Data Factory, a cloud-based data integration service that allows us to orchestrate and automate data movement and data transformation. We defined a Pipeline with two sequential Copy Activities to retrieve information from the endpoints (events & teams), extract the fields we are interested in, and then save the results as two intermediate CSV files in Azure Data Lake. This process also involved creating the REST & Data Lake Linked Services, as well as the corresponding Datasets shown in the image below.
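To make the two Copy Activities more concrete, here is a minimal Python sketch of what they do conceptually: call the two TheSportsDB endpoints and keep only the fields of interest as CSV. The endpoint names, league id and field names are assumptions based on TheSportsDB’s public API, and the real Pipeline writes to Azure Data Lake rather than to local files.

```python
import csv
import requests

# Placeholders/assumptions: substitute your own TheSportsDB key; 4335 is
# (to the best of our knowledge) the id of the Spanish La Liga in TheSportsDB.
API_KEY = "<your-thesportsdb-key>"
LEAGUE_ID = "4335"
BASE_URL = f"https://www.thesportsdb.com/api/v1/json/{API_KEY}"

def dump_csv(rows, fields, path):
    """Keep only the fields we are interested in and write them as CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows or [])

# Upcoming events of the league (first Copy Activity)
events = requests.get(f"{BASE_URL}/eventsnextleague.php", params={"id": LEAGUE_ID}).json()["events"]
dump_csv(events, ["idEvent", "strEvent", "strHomeTeam", "strAwayTeam", "dateEvent"], "events.csv")

# All teams of the league (second Copy Activity)
teams = requests.get(f"{BASE_URL}/lookup_all_teams.php", params={"id": LEAGUE_ID}).json()["teams"]
dump_csv(teams, ["idTeam", "strTeam", "strTeamBadge"], "teams.csv")
```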
You can read more details on how to create your own Data Factory in this first section of our GitHub Workshop, and in our step by step guide on how to ingest the API’s data with Data Factory.
For this we created a Data Flow Activity in the Pipeline. This kind of activity enables us to include several steps inside of it, and we used it to combine the information from the two previous branches (events & teams) into a single CSV output. Furthermore, we automatically added an extra column to the CSV with a list of keywords for each match; these terms will be used to collect related tweets in the upcoming section. In the next figure you can see the internal components of the joinData Data Flow, which are a series of join, select, derive and sink operations.
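If you prefer code to diagrams, the following PySpark snippet sketches roughly what the joinData Data Flow does (join, select, derive, sink). The column names, paths and the keyword-building rule are assumptions for illustration only; the real transformation lives inside the Data Factory Data Flow.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# The two intermediate CSVs produced by the Copy Activities (paths are placeholders)
events = spark.read.option("header", True).csv("/mnt/datalake/events.csv")
teams = spark.read.option("header", True).csv("/mnt/datalake/teams.csv")

matches = (
    events
    # join: enrich each event with the home team's metadata
    .join(teams, events.strHomeTeam == teams.strTeam, "left")
    # select: keep only the columns needed downstream
    .select("idEvent", "strEvent", "strHomeTeam", "strAwayTeam", "dateEvent", "strTeamBadge")
    # derive: build the keyword list later used to collect related tweets
    .withColumn("keywords", F.concat_ws(",", "strHomeTeam", "strAwayTeam"))
)

# sink: a single CSV consumed by the tweet collector
matches.coalesce(1).write.mode("overwrite").option("header", True).csv("/mnt/datalake/matches")
```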
Also, Azure Data Factory can be triggered through different mechanisms. In our case, we scheduled a trigger to run the whole Pipeline every day, as required by the scenario stated before. Feel free to check this part of our repository for more information on this Data Flow.
This piece of the solution is in charge of obtaining tweets (through the usage of the Twitter API) based on certain keywords already defined. Then, those tweets are sent to Azure Event Hubs to be analyzed in the next stage. For these tasks, we used a C# .NET Core App running inside Azure Container Instances. Below we will briefly explain the different parts involved in this process, but you can see the details here.
The keywords of interest are stored in two separate files. On one hand, they are obtained according to the upcoming events (this file is updated daily by the Data Factory); on the other hand, we handcrafted a list containing some synonyms for each team that never vary and are always valid (like Barça for Barcelona F.C.).
These files are stored in Azure Data Lake, so the first step to obtain them is to connect to the resource and retrieve them.
To achieve this we simply leveraged Twitter For Developers, created a developer account and a Twitter Application. With the keys and tokens provided, we connected to Twitter and requested tweets matching the keywords, which can be passed as a parameter.
Afterwards, within the resulting observable, a handler can be created to process each tweet obtained. This generates a stream of tweets that are processed as they arrive.
Each tweet fetched is related to a set of keywords which are explicitly associated with a particular event. This implies that we need to create a new object combining the most relevant data of a tweet with its associated event. To better illustrate the data contained within this object, we show an example of an empty JSON object below.
{
    "event_id": "",
    "full_tweet": {},
    "full_text": "",
    "retweet_count": 0,
    "favorite_count": 0
}
Once the resulting objects have been created, we start sending them to Event Hubs for later processing, which will be explained in the following sections.
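The collector itself is a C# .NET Core app, but the following Python sketch illustrates the equivalent logic: build the object shown above from an incoming tweet and push it to Event Hubs. The connection string, hub name and tweet fields are placeholders, not the actual implementation.

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders for your own Event Hubs namespace and hub
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>", eventhub_name="tweets")

def send_tweet(event_id, tweet):
    """Combine the relevant tweet fields with its associated event and send the result."""
    payload = {
        "event_id": event_id,
        "full_tweet": tweet,                               # raw tweet as returned by the Twitter API
        "full_text": tweet.get("full_text", ""),
        "retweet_count": tweet.get("retweet_count", 0),
        "favorite_count": tweet.get("favorite_count", 0),
    }
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(payload)))
    producer.send_batch(batch)
```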
From the overall diagram, we can see that there are two main Databricks components. In this section we will focus on the one at the top, which uses the streaming features provided by the underlying Spark engine. From the previous pipeline steps, we already have a stream of tweets’ data flowing into an Event Hub; now we need to perform some transformations and analysis over that data.
Let’s see what we are doing:
To perform those tasks, we use a Databricks Notebook reading the stream and performing the operations live.
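As a rough sketch, and assuming the Azure Event Hubs connector for Spark is installed on the cluster, reading the stream inside the Notebook looks roughly like this (the connection string and the payload schema are placeholders):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

connection_string = "<event-hubs-connection-string>"
ehConf = {
    # the connector expects the connection string to be encrypted
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string)
}

# Schema of the JSON payload produced by the collector app (assumed fields)
tweet_schema = StructType([
    StructField("event_id", StringType()),
    StructField("full_text", StringType()),
    StructField("retweet_count", LongType()),
    StructField("favorite_count", LongType()),
])

tweets = (
    spark.readStream.format("eventhubs").options(**ehConf).load()
    # the Event Hubs body arrives as binary, so cast it and parse the JSON
    .select(F.from_json(F.col("body").cast("string"), tweet_schema).alias("t"))
    .select("t.*")
)
```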
In simple words, text classification consists of assigning a category from a predefined set to a particular short text. For example, you may have the following categories:
The idea is to have an automated way to assign to every piece of text (sentence or paragraph) the corresponding category. As this is not an easy task, it usually requires the use of Machine Learning, Artificial Intelligence, Neural Networks & Deep Learning*.
In our case, and to keep this sample simple, we decided to implement the ‘text classifier’ as a simple manual algorithm (you can see it here). We calculate a score for the text of a given tweet based on the occurrence of words from two sets: one set of football-related words (which increase the score) and another with words not related to football (which lower it). We check each word of the given text to see whether it belongs to the football or non-football set, adding or subtracting one score point accordingly; at the end, the resulting score tells us whether the text belongs to the Football or Non-Football category.
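A stripped-down Python version of that scoring rule is shown below; the word lists are tiny illustrative samples, not the ones used in the project.

```python
# Illustrative word sets only; the real lists are larger and curated by hand
FOOTBALL_WORDS = {"goal", "match", "laliga", "barcelona", "madrid", "penalty"}
NON_FOOTBALL_WORDS = {"election", "stocks", "recipe", "concert"}

def classify(text: str) -> str:
    """Return 'Football' or 'Non-Football' based on a simple word-occurrence score."""
    score = 0
    for word in text.lower().split():
        if word in FOOTBALL_WORDS:
            score += 1      # football-related word raises the score
        elif word in NON_FOOTBALL_WORDS:
            score -= 1      # non-football word lowers it
    return "Football" if score > 0 else "Non-Football"

print(classify("What a goal in that laliga derby!"))   # -> Football
```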
Observe that the non-football words were picked carefully. One may tend to think that any word outside the small universe of football-related words would serve, but this is not true: the football keywords are limited and generally simple to pick (mentions of teams, famous players, etc.), whereas non-football keywords cannot just be any other word missing from the football dictionary. If we picked too many generic words as non-football keywords, the classifier would be biased towards the non-football category.
Without going into much detail here, as it is not the purpose of this article, we verified that in general 84% of the tweets were classified correctly (known as the classifier’s accuracy). The classification logic can be improved later by trying different approaches, but the architecture would remain the same, since it only needs to provide this ‘black box’ with inputs to be processed.
*As for the ML/AI approach mentioned above, we documented here a little experiment we did, showing the results and failures we had with it, for future reference and improvements.
In order to perform Sentiment Analysis over the tweets in the stream, first we need to detect the language of the text for two main reasons:
To detect the language, we used Azure Cognitive Services. Even though the Spark NLP library offers language detection capabilities, it did not perform well on tweets’ text. Azure Cognitive Services offers a wide variety of options to work with Artificial Intelligence, mostly through API calls. In particular, it offers a Text Analytics feature that allows us to detect the language, sentiment, key phrases and entities of a given text. We used the Language Detection feature for this implementation.
First, we created a Cognitive Services instance and then deployed a Text Analytics resource. Once we had our keys we were ready to go and, after a while, we integrated the required API calls into our Databricks Notebook.
You can check this tutorial from Microsoft, which helped us integrate Cognitive Services with Databricks so that we could use it for the stream analysis. However, we replaced the usage of Java classes with manual manipulation of the returned JSON, because we were facing problems with the original approach. This guide provides a more detailed view of this stage.
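A minimal sketch of the Language Detection call, parsing the returned JSON by hand, could look like the following; the endpoint and key are placeholders for your own Text Analytics resource.

```python
import requests

ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
KEY = "<text-analytics-key>"

def detect_language(text: str) -> str:
    """Return the ISO 639-1 code of the language detected for the given text."""
    response = requests.post(
        f"{ENDPOINT}/text/analytics/v3.0/languages",
        headers={"Ocp-Apim-Subscription-Key": KEY},
        json={"documents": [{"id": "1", "text": text}]},
    )
    body = response.json()
    # manual JSON handling instead of the Java helper classes from the tutorial
    return body["documents"][0]["detectedLanguage"]["iso6391Name"]

# In the notebook this function can be wrapped in a UDF and applied to the stream;
# e.g. detect_language("Vamos Barça, qué golazo!") would typically return "es".
```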
In the Sporting Events Application, we use Sentiment Analysis to add to the Dashboard three additional fields showing for each match the count of tweets with Positive, Neutral or Negative content.
To perform Sentiment Analysis in Databricks we used the NLP Library with a pre-trained model provided by John Snow Labs.
Remember that we already implemented Text Classification and Language Detection; now we will add this additional analysis, still within the main block that reads the stream and performs all the analysis. The first step is to include the libraries we need for the analysis; we then create the pipeline and, with it, analyze only the tweets that are in English and have been categorized as Football.
Our approach follows this idea: for each line of data (tweet) we create three columns, one each for the Positive, Neutral and Negative sentiments. Then, according to the result of the sentiment analysis for a given tweet, we place a value of 1 in the corresponding column, the three being mutually exclusive (a tweet may contain a single sentiment at a time).
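As a sketch only, assuming the ‘analyze_sentiment’ pre-trained pipeline from John Snow Labs and the language and category columns added by the previous steps to the tweets stream from the earlier snippet, this stage could look as follows (exact output column names may vary between Spark NLP versions):

```python
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql import functions as F

# Pre-trained sentiment pipeline from John Snow Labs (English)
sentiment_pipeline = PretrainedPipeline("analyze_sentiment", lang="en")

# Only English tweets already classified as Football are analyzed
# ('language' and 'category' are assumed to come from the previous steps)
english_football = tweets.filter(
    (F.col("language") == "en") & (F.col("category") == "Football"))

scored = sentiment_pipeline.transform(
    english_football.withColumnRenamed("full_text", "text"))

# Flatten the first sentiment annotation and derive three mutually exclusive flags
label = F.lower(F.col("sentiment").getItem(0).getField("result"))
flagged = (
    scored
    .withColumn("Positive", (label == "positive").cast("int"))
    .withColumn("Negative", (label == "negative").cast("int"))
    .withColumn("Neutral", ((label != "positive") & (label != "negative")).cast("int"))
)
```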
Here you have more information on how we performed Sentiment Analysis. You can also check the code and find the other parts that connect to the Event Hub to read the stream, call the Text Classification and Language Detection functions, and then apply several transformations before sending the processed stream to the next Event Hub to be read by Stream Analytics.
Stream Analytics is a real-time analytics and complex event-processing engine designed to analyze and process high volumes of fast streaming data from multiple sources simultaneously. It lets us transform and combine the data produced by the previous components.
For this process, we will use three inputs with different characteristics:
By joining these inputs we will be able to generate the outputs for:
All data arrives in a format defined by the contract with each previous component. In the case of the raw data from the C# app, the input will contain the following columns:
The processed stream contains the following information:
And, regarding the static CSV files, the following format is expected:
To transform the data, we perform the following steps (see them in more details here):
While performing all these steps we validated the versatility of Stream Analytics for transformations and its flexibility in working with static and dynamic data sources holding large volumes of data. Finally, we created two Datasets for Power BI that we will use to create the main Dashboard.
At this point we already have the Power BI datasets generated by Stream Analytics, so we can go to our Power BI account and find them. Power BI is a business analytics service provided by Microsoft. It aims to deliver interactive visualizations and business intelligence capabilities with an interface simple enough for anyone to create their own reports and dashboards.
📝 Note:
If you encounter problems accessing the datasets from the Power BI web editor, we recommend downloading the Power BI desktop editor and proceeding that way. Once in Power BI, we choose Power Platform as our source and select the Power BI datasets option to then connect to it. There, you will be able to see your dataset and create a report for it.
In the previous step we defined two Power BI outputs for the Stream Analytics job. Now, we are going to create a Power BI report to visualize the information consumed and processed by our app.
The idea is to create a series of reports like these:
To obtain these outputs you can follow our tutorial here. Nevertheless, we will provide an overview of each part of the report.
To generate the first report, we selected different Visualizations from the Power BI panels and dragged each field onto them to display it. The process is quite simple using our example as a guideline: we based our report on Card elements, but you are able to modify it as desired. In addition, Power BI lets you customize the dashboard with several formatting details (custom titles, borders, etc.).
Another special resource we added to the report is the Slicer. This element allows you to filter the displayed information based on a certain parameter; we are filtering on the match name in order to create the detailed view. Also, we added several images, some of which are dynamic (like the teams’ badges, which change based on the match selected) and others static (like the league’s logo).
To create the ranking report, we added a new page and renamed it to something like Matches’ Ranking. Then we added a Funnel visualization, dragging the match name column to the Group placeholder and the match mentions to the Values placeholder. As a result, the matches are ordered by the number of mentions each one has. As before, you can continue adding more funnels (one for each metric, such as favorites and retweets, and for the home and away teams). You can also edit the titles and colors as desired.
To create this report, we needed to connect to the corresponding dataset (not the same one we were using above). Here, we based our report on a Table visualization from the Visualizations module and dragged the name, screen name, location and followers count columns to the Values placeholder. Next, we added a Card visualization and included the text column in the Fields placeholder. After that, we configured the card to display just the first tweet from that influencer. Finally, you can adjust the size of each component following our sample reference at the top of this article, or choose your favorite format customization.
As can be seen, we have almost all the information needed to answer the questions from the project’s requirements. The only missing part is the historical analysis, which we will tackle in the next section.
The final feature of the Sporting Events Application is the capability to log all the data that was processed, for further analysis. This was done by enabling the Capture feature of Event Hubs, which writes the events that pass through it to the Data Lake so they can later be loaded for analysis; we did this using a Databricks Notebook. You can check how this was done here.
From the Azure Event Hubs documentation we can read that the Capture feature allows us not only to process stream data but also to save that data for batch processing, storing it in Azure Storage (Blob and/or Data Lake) in a scalable, fast, easy-to-use manner with no additional costs.
In fact, any Event Hub can have the Capture feature enabled; it will place all events coming into it in the selected storage, from which we can do further processing later. This can easily be turned on from the Event Hub resource by selecting the desired container. The data is stored in a date-based directory layout, in Avro format, which is the only file format supported by Capture as of today.
There are many things you could do with the captured data; we will show a simple application here using Databricks. We first load the data, convert it into a table and then perform a simple query to see how many tweets were related to Football and Non-Football:
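Here is a minimal sketch of that batch analysis, assuming the capture container is mounted in the workspace and that the captured body keeps the fields produced by the streaming stage (category, user name, verified flag, followers count); the path and field names are placeholders.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, BooleanType

# Fields assumed to be present in the captured payload
captured_schema = StructType([
    StructField("category", StringType()),
    StructField("user_name", StringType()),
    StructField("user_verified", BooleanType()),
    StructField("followers_count", LongType()),
])

captured = (
    spark.read.format("avro")
    # Capture lays files out in a date-based folder tree; this path is a placeholder
    .load("/mnt/datalake/capture/*/*/*/*/*/*/*.avro")
    # the captured Body column is binary, so cast it and parse the JSON payload
    .select(F.from_json(F.col("Body").cast("string"), captured_schema).alias("e"))
    .select("e.*")
)
captured.createOrReplaceTempView("captured_tweets")

# How many tweets ended up in the Football vs Non-Football categories
spark.sql("SELECT category, COUNT(*) AS tweets FROM captured_tweets GROUP BY category").show()
```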
You can also use this stage to get the verified users with the most followers (influencers):
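For example, reusing the captured_tweets view created in the previous snippet:

```python
# Verified users ordered by follower count, i.e. a rough list of influencers
spark.sql("""
    SELECT user_name, MAX(followers_count) AS followers
    FROM captured_tweets
    WHERE user_verified = true
    GROUP BY user_name
    ORDER BY followers DESC
    LIMIT 10
""").show()
```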
What have we shown you? Simply put, an application that integrates different Azure technologies to process big amounts of data. For the sake of the sample, we integrated static data obtained from an API endpoint providing sporting events information with near real-time data provided by the Twitter Developer API. Later, we transformed the data by using the Azure Databricks, Event Hubs and Stream Analytics services.
As you might have understood while reading this article, the Azure Data Stack provides several options to generate the outcome you need to pass from one step of the pipeline to the next. Although we could have used one component to rule them all, we decided to play with different managed services to show you the different alternatives available, the decisions we made, and how you can leverage the best of each service. A simple example is how you can analyze real-time stream data with either Databricks or Stream Analytics; both can help you achieve the same outcome but with different approaches, costs and performance, and it is up to you to decide which one fits your needs better.
Finally, and to have a compelling end-to-end scenario, we went graphical: all results were presented in a Power BI dashboard, meeting the requirements of the proposed scenario.
We hope we have given you a good glimpse of how to combine these different technologies in a real-life scenario that you can extrapolate to whatever similar challenge you might face.
Originally published by Mauro Krikorian for SOUTHWORKS on Medium 08 January 2021