In this article, we wrap up the series we started some time ago on how to build a mobile app that monitors participants' feelings daily using the Amazon Web Services platform. The application allows managers to create groups and invite participants to register in each of them. Once registered, participants are asked about their daily feelings (check-in points), and an AI automatically analyzes the messages they share, creating time-series graphs of how participants' feelings evolve and notifying a manager when immediate action is needed. As sentiment analysis is a central piece of the solution, in this article we discuss how to use AWS SageMaker for sentiment analysis using advanced machine learning techniques.
We will tackle this with two different approaches: using SageMaker’s built-in BlazingText algorithm and a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) provided by the Chainer framework. In the process, we will show how to prepare the dataset (using as a baseline a dataset of tweets that have already been labeled as positive or negative) and then build, train and deploy a model that can effectively categorize unseen inputs.
Finally, we will configure a SageMaker endpoint to replace the Lex/Comprehend bot solution shown in the previous articles (Part I & Part II), so that we have a custom trained model for the specific scenario at hand that we can keep improving over time.
Before moving forward, I want to mention the group of people who collaborated on creating this article and the reference implementation linked to it, and/or provided valuable feedback that allowed me to share it with you: Raul Benitti, Javier Caniparoli, Pablo Costantini & Mauro Della Vecchia.
AWS SageMaker is a cloud service that provides both a toolset for implementing machine learning workflows and resources to host trained models, exposing them as directly consumable services.
It allows you to develop, test and train algorithms using Jupyter notebooks or bring in custom solutions in the form of Docker images, prepare datasets, train models using custom or built-in algorithms, and deploy endpoints through which these models can be consumed.
We used SageMaker to tackle the following three main tasks:
Preparing the dataset
Training a model
Deploying the model
Text preprocessing is an important step for Natural Language Processing (NLP) tasks. The transformations done in this step are aimed at bringing the text into a more digestible, predictable and analyzable representation that allows the ML algorithms to perform better.
To clean and format the dataset, we followed these steps:
1) Remove unnecessary data: the dataset contains columns we do not need (e.g., the username of the user who wrote the tweet, date-time, id and flag). For our purposes, we only need the text and the sentiment label (positive or negative).
2) Separate the columns into two lists, one for the text and one for the sentiment labels. Both lists should have the same length, with each tweet and its associated sentiment label stored at the same position in the corresponding list.
3) Process the list of tweets removing information that can affect training: Not all tasks need the same level of preprocessing. For some tasks, you can get away without it or do it at a minimum. However, in some cases the dataset can be so noisy that, if not preprocessed enough, the training turns out to be garbage-in-garbage-out.
Lowercasing: Lowercasing all text, although commonly overlooked, is one of the simplest and most effective forms of text preprocessing. It is applicable to most text mining and NLP problems, helps when the dataset is not very large, and significantly improves the consistency of the expected output. For example, a model might treat a capitalized word at the beginning of a sentence differently from the same word in lowercase later in the sentence, which can reduce the model's accuracy.
Noise removal: This is about removing characters, digits and pieces of text that can interfere with the text analysis and produce inconsistent results in downstream tasks. Noise removal is one of the most essential text preprocessing steps and is highly domain dependent. Ways to reduce noise include removing punctuation, special characters, numbers, HTML formatting, domain-specific keywords (e.g., ‘RT’ for retweets), source code, headers, etc. It all depends on the domain you are working in and what counts as noise for the task. For example, in the tweets dataset, noise could be all special characters except hashtags, as they signify concepts that can characterize a tweet:
Replacing URLs: Links starting with “http” or “https” or “www” are replaced by “URL”.
Replacing Emojis: Emojis are replaced using a pre-defined dictionary containing emojis along with their meaning. (e.g., “:)” to “EMOJIsmile”)
Replacing Usernames: @Usernames are replaced with the word “USER”. (e.g., “@Kaggle” to “USER”)
Removing Non-Alphanumerics: Characters other than digits and letters are replaced with a space.
Removing Consecutive Letters: 4 or more consecutive identical letters are replaced by 2 letters. (e.g., “Heyyyy” to “Heyy”)
Removing Short Words: Words with fewer than 2 characters are removed.
Removing Stopwords: Stop words are the commonly used words that do not add much meaning to a sentence. They can be safely ignored without sacrificing the meaning of the sentence. Examples of stop words in English are “a”, “the”, “is”, “are”, etc. The intuition behind removing stop words is that we remove noise and can focus on the meaningful words instead.
Normalization: Text normalization is the process of transforming text into a canonical (standard) form. For example, the word “gooood” and “gud” can be transformed to the canonical form “good”. Text normalization is important for noisy texts such as social media comments, text messages and comments to blog posts where abbreviations, misspellings and use of out-of-vocabulary words (oov) are prevalent.
4) Tokenization and Word Embedding: Tokenization consists of breaking up the text into smaller chunks or segments called tokens (a letter, a word, or sub-words). These chunks of information can then be represented as discrete elements in the form of binary vectors with a dimension for each token. Word Embedding can then be used to compress these high-dimensional vectors into smaller dimensions while extracting information about the relation between the tokens (i.e., learning that words like one, two and three are semantically close).
5) Split the dataset: Training algorithms usually need input datasets to be split into training and testing datasets, with the training dataset including all the possible patterns that define the problem. As a general rule, 70% of the available data is allocated for training while the remaining 30% is used for validation and testing. Besides selecting the right ratio, it is also important that the resulting datasets are representative of the domain (the datasets should maintain the probability distribution of the domain). For Chainer, we used the train_test_split function. BlazingText, on the other hand, does not require partitioning.
6) Upload data: We upload the resulting datasets to an Amazon S3 bucket to be used as input for the training jobs.
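The cleaning rules listed in step 3 can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the reference implementation: the `EMOJIS` and `STOPWORDS` dictionaries are tiny hypothetical samples, the function name `clean_tweet` is our own, and spelling normalization (e.g., “gud” to “good”) is left out.

```python
import re

# Tiny illustrative dictionaries (assumptions); a real pipeline would use fuller lists.
# Keys are lowercase because the text is lowercased before emojis are replaced.
EMOJIS = {":)": "smile", ":(": "sad", ";)": "wink"}
STOPWORDS = {"a", "the", "is", "are", "to", "of"}

URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")
USER_PATTERN = re.compile(r"@\w+")
NON_ALNUM_PATTERN = re.compile(r"[^a-zA-Z0-9]")
REPEAT_PATTERN = re.compile(r"([a-zA-Z])\1{3,}")  # 4 or more identical letters in a row


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps described above to a single tweet."""
    text = text.lower()
    text = URL_PATTERN.sub(" URL ", text)
    for emoji, meaning in EMOJIS.items():
        text = text.replace(emoji, " EMOJI" + meaning + " ")
    text = USER_PATTERN.sub(" USER ", text)
    text = NON_ALNUM_PATTERN.sub(" ", text)
    text = REPEAT_PATTERN.sub(r"\1\1", text)
    # Drop words shorter than 2 characters and stop words.
    words = [w for w in text.split() if len(w) >= 2 and w not in STOPWORDS]
    return " ".join(words)


print(clean_tweet("@Kaggle Heyyyy this is the best :) http://t.co/abc !!!"))
```

The order matters in this sketch: URLs and emojis are replaced before non-alphanumeric characters are stripped, otherwise the punctuation they rely on would already be gone.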
Training a model
As mentioned before, we used both the built-in BlazingText algorithm and an algorithm provided by the Chainer framework.
Training a model with a custom algorithm
SageMaker offers a complete set of examples and templates for using both built-in algorithms and external frameworks (these are provided as Jupyter notebooks).
In this example we will cover the steps we followed to train our RNN with LSTM model with Chainer.
Implement the training function
This requires providing a training script that the SageMaker platform can run. The script is responsible for loading the data from the S3 location, configuring the training job with hyperparameters, training the model and saving its artifacts for deployment. The most important thing to keep in mind is to follow the interfaces required by SageMaker and the algorithm’s framework. To do so, the script must also define some standard functions that are used when the trained model is hosted (a generic implementation will be used when a function definition is missing):
model_fn(…): Specifies how to load a model from the artifacts files.
input_fn(…): Parses the input data into an object suitable for predict_fn.
predict_fn(…): Receives data returned by input_fn and then makes a prediction on it using the model returned by model_fn.
output_fn(…): This is called with the return value of predict_fn as input and is used to serialize the predictions back to the client.
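As a rough sketch of how these four functions fit together, the example below wires them around a stand-in model. Everything model-related here is hypothetical: `StubModel` replaces the trained LSTM network, and the JSON payload shape (`{"text": ...}`) is an assumption. The argument names follow the pattern documented for the SageMaker Python SDK, but they should be checked against the framework version in use.

```python
import json


class StubModel:
    """Hypothetical stand-in for the trained LSTM network (illustration only)."""

    def predict(self, text):
        return "positive" if "good" in text else "negative"


def model_fn(model_dir):
    """Load the model from the artifacts saved by the training job."""
    # A real script would deserialize the weights found under model_dir here.
    return StubModel()


def input_fn(input_data, content_type):
    """Parse the request body into the object predict_fn expects."""
    if content_type != "application/json":
        raise ValueError("Unsupported content type: " + content_type)
    return json.loads(input_data)["text"]


def predict_fn(input_object, model):
    """Run the prediction using the model returned by model_fn."""
    return {"sentiment": model.predict(input_object)}


def output_fn(prediction, accept):
    """Serialize the prediction back to the client."""
    return json.dumps(prediction)


# Local walk-through of the same chain the hosting container would run.
model = model_fn("/tmp/model")
parsed = input_fn('{"text": "good day"}', "application/json")
print(output_fn(predict_fn(parsed, model), "application/json"))
```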
This script has access to different SageMaker environment variables that should be used to complete the task. For example, the generated artifacts should be stored in the location specified by the SM_MODEL_DIR variable. Please refer to Using Chainer with the SageMaker Python SDK and the official documentation for further details.
Typically, the training scripts receive hyperparameters that can be used to define the number of epochs, timeout, batch-size, learning-rate, etc.
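A training script typically reads these values at startup. The sketch below shows one way to do it with `argparse`, assuming hyperparameters arrive as command-line arguments and data locations come from environment variables such as SM_MODEL_DIR. The hyperparameter names, their defaults, and the channel name `train` are illustrative assumptions, not the values used in the reference implementation.

```python
import argparse
import os


def parse_args(argv=None):
    """Collect hyperparameters and SageMaker-provided locations."""
    parser = argparse.ArgumentParser()
    # Hyperparameters (names and defaults are hypothetical).
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--learning-rate", type=float, default=0.001)
    # Locations taken from environment variables, with local fallbacks.
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    parser.add_argument(
        "--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    )
    return parser.parse_args(argv)


args = parse_args(["--epochs", "3"])
print(args.epochs, args.batch_size, args.model_dir)
```

Keeping the parsing in its own function makes the script easy to run locally with explicit arguments before submitting it as a training job.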
For further details, please refer to the official SageMaker documentation.
Once the training script is done, you use it to create an estimator. In this step you also determine the instance where the training will take place (count and type), the execution role, the Python and framework versions, and the training hyperparameters.
Executing training job
Finally, with the estimator created, the only thing left to do is start the training job. You can do so by calling the fit method on the estimator.
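Putting the last two steps together, creating the estimator and launching the job might look like the sketch below. It assumes version 2 of the SageMaker Python SDK (earlier versions use different parameter names such as `train_instance_count`), and the script name `train.py`, instance type, framework version and hyperparameter values are placeholders rather than the settings we actually used.

```python
def estimator_settings(role_arn, epochs=10, batch_size=64):
    """Collect the estimator configuration (all values are illustrative)."""
    return {
        "entry_point": "train.py",        # hypothetical training script name
        "role": role_arn,                 # IAM execution role for the job
        "instance_count": 1,
        "instance_type": "ml.m4.xlarge",
        "framework_version": "5.0.0",     # assumed Chainer version
        "py_version": "py3",
        "hyperparameters": {"epochs": epochs, "batch-size": batch_size},
    }


def train(role_arn, train_s3_uri):
    """Create the Chainer estimator and start the training job."""
    # Imported lazily so the settings can be inspected without the SDK installed.
    from sagemaker.chainer import Chainer

    estimator = Chainer(**estimator_settings(role_arn))
    # The "train" channel name is an assumption; it must match what the script reads.
    estimator.fit({"train": train_s3_uri})
    return estimator
```

Calling `train(...)` requires AWS credentials and an existing role, so it is not executed here.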
Training a model with a built-in algorithm
The advantage of using built-in algorithms is that you do not need to code anything related to the algorithm itself. You can create a training job directly from the AWS console, set up all the configuration needed for your use case, and after a couple of minutes you will have a model artifact stored in an S3 bucket. Then, you just need to configure the endpoint to make predictions.
Deploying the model
The training job only creates the artifacts for our model. In order to use it, you need to deploy this model to an endpoint. This endpoint is a service running on a hidden EC2 instance. You do not need to worry about setting this server up; you only need to choose things like the instance type, auto-scaling groups, how many servers you want, etc. After the deployment is done, you can see the deployed endpoint in the SageMaker console.
The advantages of using SageMaker for model deployment can be summarized in:
Host the ML model in an endpoint: You can call this endpoint as you would call any other endpoint, and you can also call the model from a Lambda function. Note: Amazon hosts the SageMaker model on a physical server that runs 24 hours a day. The server keeps running all the time like EC2, but, as with Lambda, you do not configure or manage it.
All your logs get easily stored in CloudWatch logs: You do not need to create your own logging pipeline.
Simple monitoring: You can monitor the load on your machines and scale the machines based on demand.
Once the model is deployed, you can copy the URL and try calling it using Postman or any similar REST client.
You can check the status/health of the endpoint in the Monitor panel provided by default.
You can also inspect the logs in CloudWatch to see if there is any trouble.
Once you have verified that the endpoint works as expected, you should add security configurations.
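Besides a REST client, the endpoint can also be exercised programmatically. The following is a minimal sketch using boto3's `sagemaker-runtime` client, assuming the endpoint accepts a JSON body of the form `{"text": ...}` and returns JSON; both the payload shape and the helper names are our own assumptions. Note that the Lambda functions in our solution are written in Node.js, so this Python version is only an illustration of the same call.

```python
import json


def build_request(text, endpoint_name):
    """Build the keyword arguments for invoke_endpoint (payload shape is assumed)."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps({"text": text}),
    }


def predict_sentiment(text, endpoint_name):
    """Call the deployed endpoint and decode its JSON response."""
    # Imported lazily; the call itself needs AWS credentials and a live endpoint.
    import boto3

    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(**build_request(text, endpoint_name))
    return json.loads(response["Body"].read())
```

boto3 signs the request with the caller's credentials, so no manual signature handling is needed here.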
Securing the endpoint
All SageMaker endpoints are secured with AWS Signature authentication and cannot be accessed freely through the web unless they are exposed, for example, with API Gateway.
Internally, you can apply more granular access control to the endpoint by using IAM roles and policies, by using a VPC, or both.
One way of doing it is by allowing access to certain SageMaker operations (SageMaker:InvokeEndpoint) only from specific Lambda functions. When you create an AWS Lambda function, you must create a special IAM role that has permission to execute the function. This role is called the execution role. When you set up a Lambda function, you must specify the IAM role you created as the corresponding execution role.
The execution role provides the Lambda function with the credentials it needs to run and to invoke other web services (in our case, SageMaker:InvokeEndpoint). As a result, you don't need to provide credentials to the Node.js code you write within a Lambda function.
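To make the idea concrete, the policy attached to the Lambda execution role could look like the document generated below. This is a sketch of a standard IAM policy limited to a single endpoint; the region, account id and endpoint name are placeholders, and the helper name `invoke_endpoint_policy` is hypothetical.

```python
import json


def invoke_endpoint_policy(region, account_id, endpoint_name):
    """Return an IAM policy document allowing InvokeEndpoint on one endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:InvokeEndpoint",
                "Resource": (
                    f"arn:aws:sagemaker:{region}:{account_id}:endpoint/{endpoint_name}"
                ),
            }
        ],
    }


# Placeholder values for illustration only.
print(json.dumps(invoke_endpoint_policy("us-east-1", "123456789012", "sentiment"), indent=2))
```

Scoping the `Resource` to one endpoint ARN, rather than using a wildcard, keeps the function from invoking any other model in the account.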
If you need to call the endpoint from a client like Postman, you should configure AWS Signature authorization for the request using your AccessKey and SecretKey.
Updating our solution using SageMaker’s models
Now we want to use our SageMaker models in the old step function that predicts sentiments as positive or negative. So, we just updated that part in the whole solution, as you can see below, keeping the results going to the next step in the pipeline as we already explained in Part II.
When our model in SageMaker cannot determine if sentiment is positive or negative, we tag it as unrecognized and send it to the unrecognized workflow that stores this result in an S3 bucket with the same format needed to retrain the model. After a user classifies each message as positive or negative we can retrain the model with this new dataset.
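One simple way to decide when a prediction counts as undetermined is a confidence threshold, sketched below. This is an assumption about how such routing could work, not a description of our exact rule: the threshold value of 0.6 and the function name `route_prediction` are hypothetical.

```python
def route_prediction(label, probability, threshold=0.6):
    """Return the label, or 'unrecognized' when the model is not confident enough."""
    if probability < threshold:
        return "unrecognized"
    return label


print(route_prediction("positive", 0.92))
print(route_prediction("negative", 0.51))
```

Messages routed as unrecognized would then follow the manual classification workflow described above before being added to the retraining dataset.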
After testing the different models, we got the following inference metrics:
As depicted in the table above, trained models perform better than Lex/Comprehend for our scenario.
In addition, we ran load tests and found that the trained models perform better than the Lex/Comprehend tools used in Part II (we tested the BlazingText and Chainer models using 2 different types of AWS instances, ml.t2.medium and ml.m4.xlarge):
Lex and Comprehend have a limit of 2 concurrent calls for the $LATEST bot, and a single instance starts to fail when more than a hundred concurrent requests are made. This could be mitigated by increasing the limit (which requires contacting Amazon) or by spinning up multiple instances and balancing the requests among them (which could be fairly hard to implement).
Throughout this series of articles we saw our platform evolve in many different ways and, as part of that, two different approaches to sentiment analysis were used. First, we used Lex and Comprehend, which provided an easier, although more limited, approach when considering retraining: you can easily retrain the bot to detect new intents. Now, with custom models trained specifically for our scenario, we showed that we can get more accurate predictions, making the platform more robust and reliable in terms of its main goals.
After performing all the steps required by a typical machine learning pipeline, from data preparation to model deployment, we were able to compare the old and new approaches. As we have seen, using SageMaker to train and deploy a model with built-in algorithms (like BlazingText) provided a more accurate prediction service, although retraining these models can be a bit more cumbersome, as you must collect enough data and run the process using the AWS console.
Cost/benefit-wise, Lex charges per message processed while SageMaker endpoints charge per hour of uptime. The smallest SageMaker instance costs around $0.065 per hour regardless of how many requests it processes (we saw that it can easily handle 1000), while Lex charges $0.0020 per text input. With more than 25k messages per month, SageMaker is the better choice for both cost and performance (in addition to Lex and Comprehend having scalability issues). So even when using the smallest instances, SageMaker endpoints handle at least a thousand simultaneous requests well, being both faster and more resilient than the Lex/Comprehend counterpart.
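The break-even point behind this comparison can be worked out with a few lines of arithmetic. The sketch below uses the prices quoted above and assumes a single always-on instance over a 30-day month (720 hours); the function name is our own.

```python
def break_even_messages(hourly_rate, price_per_message, hours_per_month=720):
    """Messages per month at which an always-on endpoint costs the same as per-message pricing."""
    monthly_endpoint_cost = hourly_rate * hours_per_month
    return monthly_endpoint_cost / price_per_message


# Prices quoted in the article: $0.065 per hour vs. $0.0020 per text input.
print(round(break_even_messages(0.065, 0.0020)))
```

With these figures the endpoint costs about $46.80 per month, which corresponds to roughly 23,400 Lex text inputs, in line with the approximate 25k threshold mentioned above.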
That’s all folks! Hope that you enjoyed the whole series.
If you are interested in the technical details of the techniques mentioned in this article, please refer to our reference implementation.
Originally published by Mauro Krikorian for SOUTHWORKS on Medium 25 May 2021