Real-Time Insights from Sport Events using the Azure Data Stack

Continuing with SOUTHWORKS’ commitment to Big Data projects and our contribution to the large community of Azure developers out there, we present this article describing an application that ingests sports event data from multiple sources, processes it, and generates an insightful dashboard with information about matches, teams and Twitter influencers from the Spanish Football League.

All of this is done with the help of several managed services from the Azure Data Stack: Data Factory, Databricks and Stream Analytics, just to name a few. Let’s first introduce the project’s context and scenario, and then move on to how we implemented a solution for it.

Before we move on, I want to mention the group of people who collaborated on creating this article and the reference implementation linked to it, and/or provided valuable feedback that allows me to share it with you: Lucas Gabriel Abella, Nicolás Bello Camilletti, Mariano Agustín Iglesias & Diego Satizabal.

We all hope you enjoy it 😃

Scenario

Let’s suppose that 3 different actors from a sporting TV show have asked us to provide a solution for these 3 situations:

  • As a show producer, I need to know which trending topics to focus on the most within my programs. I also need a report to easily track real-time metrics of the main matches & teams from La Liga Española, as well as people’s reactions on social media (mentions, retweets, favorites & sentiment).

a) Luckily, they can provide us an API with data of the Spanish Football League. The endpoint’s responses are not precisely Big Data, but they need to be refreshed on a daily basis.

b) Also, they are just interested in Twitter as the social media source. Here the big challenge is to continuously feed our solution with this streaming content, while combining it with the API data.

  • As a presenter, I need an easily consumable report (dashboard-like) with information and tweets from influencers (people with a lot of followers, who are opinion formers).
  • As a sport analyst, I need to review certain information from matches live, while also having all the data at hand for later historical analysis.

Having these UCs in mind, we came up with the following solution, which we are going to discuss during the rest of the article.

Solution’s Overview

General Introduction to the Architecture

Based on the detailed UCs, and as explained in the intro, the main purpose of this application is to generate a dashboard showing sporting events, in our case related to the Spanish Football League.

Here are its main responsibilities:

  • The solution gets events’ data from a public API using Data Factory, and we use the Twitter API from a containerized .NET Core application to search for tweets associated with each event, based on a set of related keywords.
  • Then, a stream of tweets is passed on to a Databricks Notebook in which Text Classification, Language Detection and Sentiment Analysis are performed.
  • A Stream Analytics job performs some additional analysis to find mentions of the match and/or one of the teams in each tweet, and the result is placed in a Power BI dataset to be displayed in a dashboard.
  • In addition, all tweets’ data is stored after the analysis in Databricks to keep a historical record of all processed tweets, including the results of the analysis carried out in the flow (language/sentiment); this can be used later for historical analysis/reports.
  • Finally, we use a Data Lake as our general-purpose storage and Event Hubs as the message queue between the different stages of the flow.

The following image depicts the architecture described above:

The General Architecture of the Solution

So, let’s see what each piece of the diagram shown above does within the next sections.

Event Data Ingestion — Azure Data Factory

Consuming Data Providers API

The first step in our solution is the semi-static data ingestion stage. This process consists of retrieving information from two endpoints of the TheSportsDB API, events & teams, on a daily basis.

📝 Note:

As stated on its website, TheSportsDB is an open, crowd-sourced database of sports artwork and metadata with a free API. It has information on different sports like football, the NBA, the NFL and more. Regarding football, it contains data about different leagues all around the world, like the English, Spanish and Argentinean ones, among several others.

We are using the information returned by this API for the Spanish Football League (also known as La Liga) as our static-data source. If you want to learn more about this API’s content, you will find all the information related to the available endpoints here.

To accomplish this we are using Azure Data Factory, which is a cloud-based data integration service that allows us to orchestrate and automate data movement and data transformation. We defined a Pipeline with two sequential Copy Activities in order to retrieve information from the endpoints (events & teams), extract the fields we are interested in, and then save that as two intermediary CSV files into the Azure Data Lake. This process also involved the creation of the REST & Data Lake Linked Services, as well as the corresponding Datasets shown in the image below.

The Ingestion Pipeline on Azure Data Factory

You can read more details on how to create your own Data Factory in this first section of our GitHub Workshop, and in our step by step guide on how to ingest the API’s data with Data Factory.

Merging the Data

For this, we created a Data Flow Activity in the Pipeline. This kind of activity enables us to include several steps inside of it, and we used it to combine the information from the two previous branches (events & teams) into a single CSV output. Furthermore, we automatically added an extra column to the CSV with a list of keywords for each match. These terms will be used to collect related tweets in the upcoming section. In the next figure you can see the internal components of the joinData Data Flow, which are a series of join, select, derive and sink operations.

The joining Data Flow on Azure Data Factory

Also, Azure Data Factory can be triggered with different mechanisms. In our case, we scheduled a trigger to run the whole Pipeline every day, as required by the scenario stated before. Feel free to check this part of our repository for more information on this Data Flow.

Tweets Data Ingestion — Azure Container Instances

This piece of the solution is in charge of obtaining tweets (through the usage of the Twitter API) based on certain keywords already defined. Then, those tweets are sent to Azure Event Hubs to be analyzed in the next stage. For these tasks, we used a C# .NET Core App running inside Azure Container Instances. Below we will briefly explain the different parts involved in this process, but you can see the details here.

Obtaining the Keywords

The keywords of interest are stored in two separate files. On one hand, they are obtained according to the upcoming events (this file is updated daily by the Data Factory); on the other hand, we handcrafted a list containing synonyms for each team that never vary and are always valid (like Barça for Barcelona F.C.).

These files are stored in Azure Data Lake, so the first step to obtain them is to connect to the resource and retrieve them.
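To give a concrete feel for this step, here is a minimal Python sketch of reading those two files (keep in mind that the actual ingestion app is written in C# .NET Core, and that the account, container and file names below are just placeholders):

from azure.storage.filedatalake import DataLakeServiceClient

# Connect to the Data Lake account (placeholder credentials)
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<storage-account-key>")
filesystem = service.get_file_system_client("sporting-events")

def read_csv_lines(path):
    # Download a CSV file from the Data Lake and return its lines
    data = filesystem.get_file_client(path).download_file().readall()
    return data.decode("utf-8").splitlines()

daily_keywords = read_csv_lines("keywords/events-keywords.csv")   # refreshed daily by Data Factory
team_synonyms = read_csv_lines("keywords/team-synonyms.csv")      # handcrafted synonyms, e.g. Barça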

Getting the Tweets

To achieve this we simply leveraged Twitter For Developers, created a developer account and a Twitter Application. With the keys and tokens provided, we connected to Twitter and requested tweets matching the keywords, which can be passed as a parameter.

Afterwards, within the resulting observable, a handler can be created to process each tweet obtained. This will generate a stream of tweets that are being processed as they arrive.
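As an illustration of this idea (the real implementation lives in the C# app), a filtered stream with a per-tweet handler could look roughly like this in Python with Tweepy; the rule values and the process_tweet helper are hypothetical:

import tweepy

class TweetHandler(tweepy.StreamingClient):
    def on_tweet(self, tweet):
        # Called for each matching tweet as it arrives
        process_tweet(tweet)   # hypothetical handler: link to event, send to Event Hubs

client = TweetHandler("<bearer-token>")
# One rule per event, built from the keywords gathered in the previous step
client.add_rules(tweepy.StreamRule('"Real Madrid" OR Barcelona OR Barça', tag="event-602830"))
client.filter(tweet_fields=["public_metrics", "lang"])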

Linking Tweets to Events

Each tweet fetched is related to a set of keywords which are explicitly associated with a particular event. This implies that we need to create a new object combining the most relevant data of a tweet with its associated event. To better illustrate the data contained within this object, below is an example of an empty JSON object.

{
 "event_id": "",
 "full_tweet": {},
 "full_text": "",
 "retweet_count": 0,
 "favorite_count": 0
}

Sending data to Event Hub

Once the resulting objects have been created, we start sending them to Event Hubs for later processing. This will be explained in the following sections.
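The snippet below sketches the sending step in Python with the azure-eventhub package (the actual app uses the .NET SDK; the connection string and hub name are placeholders):

import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>", eventhub_name="raw-tweets")

def send_to_event_hub(tweet_event):
    # Serialize the combined tweet/event object and send it as a single event
    batch = producer.create_batch()
    batch.add(EventData(json.dumps(tweet_event)))
    producer.send_batch(batch)

send_to_event_hub({
    "event_id": "602830",
    "full_tweet": {},
    "full_text": "What a match!",
    "retweet_count": 3,
    "favorite_count": 10,
})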

Data Analysis — Databricks

From the overall diagram, we can see that there are two main Databricks components. In this section we will focus on the one at the top, which uses the streaming features provided by the underlying Spark engine. From the previous pipeline steps, we already have a stream of tweets’ data flowing into an Event Hub; now we need to perform some transformations and analysis over that data.

Let’s see what we are doing:

  • Text Classification, to filter out tweets not related to Football that may have made their way to this stage and are undesired in the next stages.
  • Language Detection, which we employ so that we only analyze sentiment over tweets in English (due to limitations of the library we use).
  • Sentiment Analysis, to detect whether the tweet has Positive, Neutral or Negative content.

To perform those tasks, we use a Databricks Notebook reading the stream and performing the operations live.
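To make this more concrete, a minimal PySpark sketch of reading the tweet stream inside such a notebook could look like this (assuming the azure-eventhubs-spark connector is installed on the cluster; the connection string is a placeholder and the schema mirrors the JSON shown earlier):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# The connector expects the connection string to be encrypted this way
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt("<raw-tweets-connection-string>")
}

tweet_schema = StructType([
    StructField("event_id", StringType()),
    StructField("full_text", StringType()),
    StructField("retweet_count", IntegerType()),
    StructField("favorite_count", IntegerType()),
])

raw_stream = spark.readStream.format("eventhubs").options(**eh_conf).load()

# The payload arrives in the binary 'body' column; parse it with the schema above
tweets = (raw_stream
          .select(F.from_json(F.col("body").cast("string"), tweet_schema).alias("tweet"))
          .select("tweet.*"))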

Text Classification

In simple words, text classification consists of assigning a category from a predefined set to a particular short text. For example, you may have the following categories:

  • Science
  • Economics
  • Politics
  • Sports

The idea is to have an automated way to assign to every piece of text (sentence or paragraph) the corresponding category. As this is not an easy task, it usually requires the use of Machine Learning, Artificial Intelligence, Neural Networks & Deep Learning*.

In our case, and to keep it simple within this sample, we decided to implement the ‘text classifier’ just as a simple manual algorithm (you can see it here). We calculate a score for the text of a given tweet based on the occurrence of words from two sets: one set of football-related words (which increase the score) and another with words not related to football (which lower it). We check the words of the given text to see whether they belong to the football or non-football set, adding or subtracting one score point accordingly; at the end, the resulting score tells us whether the text belongs to the football or non-football category.

Sample with a small subset of the words contained in the dictionaries
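A minimal Python version of this scoring logic could look like the sketch below (the word sets here are tiny illustrative samples, not the actual dictionaries):

FOOTBALL_WORDS = {"goal", "match", "league", "penalty", "messi", "barça"}
NON_FOOTBALL_WORDS = {"election", "stock", "album", "movie", "recipe"}

def classify(text):
    # Simple word-occurrence score: football words add a point, non-football words subtract one
    score = 0
    for word in text.lower().split():
        if word in FOOTBALL_WORDS:
            score += 1
        elif word in NON_FOOTBALL_WORDS:
            score -= 1
    return "Football" if score > 0 else "Non-Football"

classify("what a goal in this match")   # -> 'Football'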

Observe that the non-football words were picked carefully: one may tend to think that any word outside the small universe of football-related words would serve, but this is not true. The football keywords are limited and generally simple to pick (like mentions of teams, famous players, etc.), but non-football keywords cannot just be any other word not in the football dictionary. If we picked too many generic words as non-football keywords, the classifier would be biased towards the non-football category.

Without going into much detail here, as that is not the purpose of this article, we verified that in general 84% of the tweets were classified accurately (this is known as the classifier’s accuracy). The classification logic can be continuously improved later by trying different approaches, but the architecture would remain the same, providing this ‘black box’ with inputs to be processed.

*As for the ML/AI approach mentioned above, we documented here a little experiment we did, showing the results and failures we had with it, for future reference and improvements.

Language Detection

In order to perform Sentiment Analysis over the tweets in the stream, first we need to detect the language of the text for two main reasons:

  • The Sentiment Analysis algorithms, in general, require the language to be specified.
  • The NLP library we use only detects sentiment for text in English.

To detect the language, we used Azure Cognitive Services. Even though the Spark NLP library offers language detection capabilities, it did not perform well on tweets’ text. Azure Cognitive Services offers a wide variety of options to work with Artificial Intelligence, mostly through API calls. In particular, it offers a Text Analytics feature that allows detecting the language, sentiment, key phrases and entities of a given text. We used the Language Detection feature for our implementation.

First, we created a Cognitive Services instance and then deployed a Text Analytics resource. Once we had our keys we were ready to go and, after a while, we integrated the required API calls into our Databricks Notebook.

View of the response from Cognitive Services for Language Recognition

You can check this tutorial from Microsoft, which helped us integrate Cognitive Services with Databricks so that we could use it for the stream analysis. However, we replaced the usage of Java classes with manual manipulation of the returned JSON, because we were facing problems with the original approach. This guide provides a more detailed view of this stage.
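As a sketch of that manual approach, a call to the Text Analytics v3.0 language detection endpoint and direct handling of its JSON response could look like this in Python (the endpoint and key are placeholders):

import requests

TEXT_ANALYTICS_ENDPOINT = "https://<your-text-analytics>.cognitiveservices.azure.com"
HEADERS = {"Ocp-Apim-Subscription-Key": "<your-key>", "Content-Type": "application/json"}

def detect_language(text):
    # Returns the ISO 639-1 code of the detected language (e.g. 'en', 'es')
    body = {"documents": [{"id": "1", "text": text}]}
    response = requests.post(
        TEXT_ANALYTICS_ENDPOINT + "/text/analytics/v3.0/languages",
        headers=HEADERS, json=body)
    response.raise_for_status()
    return response.json()["documents"][0]["detectedLanguage"]["iso6391Name"]

detect_language("Qué golazo de Messi")   # -> 'es'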

Sentiment Analysis

In the Sporting Events Application, we use Sentiment Analysis to add to the Dashboard three additional fields showing for each match the count of tweets with Positive, Neutral or Negative content.

Sentiments displayed in the Dashboard

To perform Sentiment Analysis in Databricks we used the NLP Library with a pre-trained model provided by John Snow Labs.

Remember that we already implemented Text Classification and Language Detection; now we will add this additional analysis, and it will be done in the main block that reads the stream and performs all the analysis. The first step is to include the libraries we need for the analysis, then we create the pipeline, and with that we analyze only the tweets that are in English and have been categorized as Football.

Our approach follows this idea: for each line of data (a tweet) we create three columns, one each for Positive, Neutral and Negative sentiment. Then, according to the result of the sentiment analysis for a given tweet, we place a value of 1 in the corresponding column; these columns are mutually exclusive (a tweet may contain a single sentiment at a time).
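A condensed sketch of how this step could look in the notebook is shown below; it assumes Spark NLP’s pretrained analyze_sentiment pipeline and the column names used earlier, so treat the exact pipeline and label names as placeholders for whatever model you pick:

from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql import functions as F

sentiment_pipeline = PretrainedPipeline("analyze_sentiment", lang="en")

# Only English tweets already classified as Football reach this step
english_football = tweets.filter((F.col("language") == "en") & (F.col("category") == "Football"))

# The pretrained pipeline expects a 'text' column and adds a 'sentiment' annotation column
scored = sentiment_pipeline.transform(english_football.withColumnRenamed("full_text", "text"))

# One mutually exclusive flag column per sentiment
label = F.col("sentiment")[0]["result"]
result = (scored
          .withColumn("positive", F.when(label == "positive", 1).otherwise(0))
          .withColumn("negative", F.when(label == "negative", 1).otherwise(0))
          .withColumn("neutral",  F.when((label != "positive") & (label != "negative"), 1).otherwise(0)))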

Here you have more information on how we performed Sentiment Analysis. You can also check the code and find the other parts that perform the connection to the Event Hub to read the stream, the calls to the Text Classification and Language Detection functions, and the several transformations applied prior to sending the processed stream to the next Event Hub to be read by the Stream Analytics job.

Data Transformation — Stream Analytics

Defining Inputs & Outputs

Stream Analytics is a real-time analytics and complex event-processing engine designed to analyze and process high volumes of fast streaming data from multiple sources simultaneously. It lets us transform and combine the data produced by the previous components.

For this process, we will use three inputs with different characteristics:

  • The first one is a semi-static data input containing the events’ information gathered by Azure Data Factory from the API; this data changes once per day and is read from the Azure Data Lake.
  • The second input contains the processed tweets data coming from Databricks as a stream, since the data is constantly arriving through an Event Hub.
  • The third input is also a stream but contains the raw tweets coming directly from the first Event Hub.

By joining these inputs we will be able to generate the outputs for:

  • A detailed view of the events and teams data.
  • A highlighted-users view with the most popular tweets about the topic and the corresponding Twitter accounts.

Performing the Transformation

All data arrives in the format defined by the contract with each previous component. In the case of the raw data from the C# app, the input contains the following columns:

The processed stream contains the following information:

And, regarding the static CSV files, the following format is expected:

To transform the data, we perform the following steps (see them in more details here):

  1. We must first ensure that the values are correct. In this case, the values that come preprocessed from the Twitter stream should be normalized to avoid errors related to the expected data types. For that, a cast must be performed to ensure that each field has the required data type. We also filter by the category of the event to only allow football-related tweets to reach the output.
  2. The result of this query has the tweets’ data joined with the events’ data, but we also need to obtain the statistics related to the teams. For this, a match is made against the team name (using a regex) and, in case of a successful match, these statistics are added for the teams. To achieve this, a join must be done with the events.
  3. To end the query, another join must be made between the static and the processed-stream data sources. Within the SQL-like language that Stream Analytics offers, the two previous queries are expressed using a WITH clause under which the data is combined.
  4. Finally, we append a last query to process the raw incoming stream and obtain the dataset for the influencers dashboard. We only let through verified profiles with a certain number of followers.

While performing all these steps we validated the versatility of Stream Analytics for transformations and its flexibility in combining static and streaming data sources with large volumes of data. Finally, we created two Power BI datasets that we will use to create the main dashboard.

Dashboards — Power BI

At this point we already have the Power BI datasets generated by Stream Analytics, so we can go to our Power BI account and find them. Power BI is a business analytics service provided by Microsoft. It aims to deliver interactive visualizations and business intelligence capabilities with an interface simple enough for anyone to create their own reports and dashboards.

📝 Note:

If you encounter problems accessing the datasets from the Power BI web editor, we recommend downloading the Power BI desktop editor and proceeding that way. Once in Power BI, choose Power Platform as the source and select the Power BI datasets option to connect to it. There, you will be able to see your dataset and create a report for it.

In the previous step we defined two Power BI outputs for the Stream Analytics job. Now, we are going to create a Power BI report to visualize the information consumed and processed by our app.

The idea is to create a series of reports like these:

Sample Outcome. Detailed view of each match.
Sample Outcome. Ranking of all matches.
Sample Outcome. Highlighted Users (Influencers) View.

To obtain these outputs you can follow our tutorial here. Nevertheless, we will provide an overview of each part of the report.

The Detailed View

To generate the first report, we selected different visualizations from the Power BI panels and dragged each field to display it. The process is quite simple using our example as a guideline: we based our report on Card elements, but you can modify it as desired. In addition, Power BI lets you customize the dashboard with several formatting details (custom titles, borders, etc.).

Another special resource we added to the report is the Slicer. This element allows filtering the displayed information based on a certain parameter; we filter by the match name in order to create the detailed view. We also added several images, some of them dynamic (like the teams’ badges, which change based on the selected match) and others static (like the league’s logo).

The Ranking View

To create the ranking report, we added a new page and renamed it to something like Matches’ Ranking. Then, we added a Funnel visualization, dragging the match name column to the Group placeholder and the match mentions to the Values placeholder. As a result, the matches are ordered by the number of mentions each one has. As before, you can continue adding more funnels (one for each metric, favorites and retweets, and for the home and away teams). You can also edit the titles and colors as desired.

The Highlighted Users View

To create this report, we needed to connect to the corresponding dataset (not the same one we were using above). Here, we based our report on a Table visualization from the Visualizations module and dragged the name, screen name, location and followers count columns to the Values placeholder. Next, we added a Card visualization and included the text column in the Fields placeholder. After that, we configured the card to display just the first tweet from that influencer. Finally, you can adjust the size of each component following our sample reference at the top of this article, or choose your favorite format customization.

As can be seen, we have almost all the information needed to answer the questions from the project’s requirements. The only missing part is the historical analysis, which we will tackle in the next section.

Historical Logging — Databricks

The final feature of the Sporting Events Application is the capability to log all processed data for further analysis. This was done by enabling the Capture feature of Event Hubs, which writes the events passing through the hub to the Data Lake; that data can later be loaded for analysis, which we did using a Databricks Notebook. You can check how this was done here.

Event Hub Capture

From the Azure Event Hubs documentation we can read that the Capture feature allows us not only to process streaming data but also to save it for batch processing, storing the data in Azure Storage (Blob and/or Data Lake) in a scalable, fast, easy-to-use manner with no additional costs.

In fact, any Event Hub can have the Capture feature enabled; it will place all events coming into it in the selected storage, and we can then do additional processing on that data. This can easily be turned on from the Event Hub resource by selecting the desired container. The data is stored in a date-based directory structure, in Avro format, which is the only file format supported by Capture as of today.

Loading and Analyzing the Data in Databricks

There are many things you could do with the captured data; we will show a simple application here using Databricks. We first load the data, convert it into a table, and then perform a simple query to see how many tweets were related to Football and Non-Football:
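A minimal sketch of that notebook step is shown below, assuming the usual Event Hubs Capture directory layout and a category field in the captured payload (the storage path and field names are placeholders):

from pyspark.sql import functions as F

# Capture writes Avro files under <namespace>/<eventhub>/<partition>/<year>/<month>/<day>/<hour>/<minute>/<second>.avro
capture_path = "abfss://capture@<storage-account>.dfs.core.windows.net/<namespace>/<eventhub>/*/*/*/*/*/*/*.avro"
captured = spark.read.format("avro").load(capture_path)

# The original payload is kept in the binary 'Body' column of each captured record
tweets = captured.select(
    F.get_json_object(F.col("Body").cast("string"), "$.category").alias("category"))

tweets.createOrReplaceTempView("captured_tweets")
display(spark.sql("""
    SELECT category, COUNT(*) AS tweets
    FROM captured_tweets
    GROUP BY category
"""))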

Graph of Football and Non-Football tweets for the captured data

Also, you can use this stage to get the verified users with the most followers (influencers):

List of the top 10 verified users with more than 10,000 followers

Wrapping up

What have we shown you? Simply put, an application that integrates different Azure technologies to process big amounts of data. For the sake of the sample, we integrated static data obtained from an API endpoint providing sporting events information with near real-time data provided by the Twitter Developer API. Later, we transformed the data by using the Azure Databricks, Event Hubs and Stream Analytics services.

As you might have gathered while reading this article, the Azure Data Stack provides several options to generate the outcome you need at each step of the pipeline. Although we could have used one component to rule them all, we decided to play with different managed services to show you the different alternatives, the decisions we made, and how you can leverage the best of each service. A simple example is how you can analyze real-time stream data with either Databricks or Stream Analytics; both can help you achieve the same outcome, but with different approaches, costs and performance, so it is up to you to decide which one better fits your needs.

Finally, to have a compelling end-to-end scenario, we went graphical: all results were presented in a Power BI dashboard, meeting the requirements of the proposed scenario.

We hope we have provided you a good glimpse of how to combine these different technologies in a real life scenario that you can extrapolate to whatever similar challenge you might face.

Originally published by Mauro Krikorian for SOUTHWORKS on Medium 08 January 2021