Text Analytics for Biomedical Applications, Part 2: Twitter Sentiment Analysis for Biomedical Applications
From the series: Text Analytics for Biomedical Applications
Learn how to use code to import data from Twitter, pre-process the tweets to remove unnecessary characteristics, convert the text to numbers, and build sentiment analysis models using machine learning and deep learning algorithms. Then deploy the code as a standalone app (or executable) using MATLAB Compiler™.
So this example highlights implementing Twitter sentiment analysis-- so performing classification explicitly with a pre-built dictionary/data set. And for this particular example, the tool that we're using is Twitter. Twitter can give us access to a large amount of text data. And to get access to the data, we're following the different stages of the workflow.
So we're going to explore preprocessing the data, building a model. The whole idea behind building this model is certain words will be deemed as having a positive connotation versus a negative connotation. So we see the labels that are here.
At the end, after we build the model, we'll test out that model and look at those results. So do we have an overwhelmingly positive sentiment related to certain words, or is there an overwhelmingly negative sentiment related to certain words? And finally, we have this tool that is going to help us with being able to use all of this work that was done, because one of the themes that may occur for a number of us-- some of us may be the developers of the code, while others may be consumers, where we're just going to use, possibly, this app to help us do some high-level exploration.
So let's continue. So my plan is-- let me go into the MATLAB environment, which is here. And I'm going to highlight a few pieces of information. Let me just change one setting. And as a part of this setting, I wanted to first explore this lexicon that's provided to us.
So what is this lexicon? So this lexicon-- going to visit this folder. This is part of the overall example that you can get access to after this presentation. We have two text files-- one that highlights what were deemed as negative words, and you have another text file that's deemed as having positive words. So that lexicon was built up for us beforehand.
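As a rough illustration of how a lexicon like this can be used, the sketch below counts how many words of a sentence appear in a positive list versus a negative list. The two short word lists are hypothetical stand-ins for the two text files (their names and contents are not given here), and the sentence is made up.

```matlab
% Hypothetical stand-ins for the positive and negative lexicon files.
positiveWords = ["good" "relief" "helped" "great"];
negativeWords = ["bad" "drowsy" "awful" "worse"];

% Made-up example text, lowercased and split on whitespace into words.
tweet  = "Zyrtec helped a lot but left me drowsy and feeling awful";
tokens = split(lower(tweet));

% Count lexicon matches; a net score above zero leans positive,
% below zero leans negative.
numPositive = nnz(ismember(tokens, positiveWords));
numNegative = nnz(ismember(tokens, negativeWords));
netScore    = numPositive - numNegative;
```

Here one word matches the positive list and two match the negative list, so the net score leans negative. A real lexicon would be far larger, and the example's own code may weigh matches differently.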
Now, in terms of going through and seeing the example in action, we will do that, starting in a few seconds. But let's actually start by seeing, I think, something very interesting. And what's interesting is the result-- so that app that we saw at the end.
So we have this file called TwitterSentimentApp. And I will press Enter. In a few seconds, my expectation is I will bring the app over to my screen so we could try to maybe see about using the app and explore what type of results we can find fairly quickly.
Now, this app window is going to appear on one of my screens that I'm moving over to this other monitor window. So just bear with me for a second. And my machine may take a few seconds to refresh, so I'm just going to pause for a quick second so my machine can refresh, and it should be fine.
So that's what we have now. And what could we do? So we could type in a term about anything and everything in this app window. In terms of the term that they have for the example, we have this reference to a particular antihistamine, and we have this brand. And I typed in the term, and I'll press Enter.
Let's see what's returned. Let's see this app hopefully be in action to give us some information. OK, so let's see what we have as a part of this output app.
So the results were obtained fairly quickly. And I see, on the left-hand side, we have a series of tweets. If you look in the center, we have a word cloud that points out many of the most popular terms that were returned related to my search term, "zyrtec."
On the right-hand side, we see a visualization that points out the frequency of the words that are there. If I place my cursor over one of the bars, it points out the value of 26. You have this category. And notice the longer this bar, that means there were more appearances of this term.
If I go back to the word cloud for a second, we see that that word, "weight," is highlighted pretty prominently. So we have this overall theme-- the longer the bar, the more popular the term was, and that particular term probably is more prominent in that word cloud. So weight, for example, is quite a bit more prominent versus, let's say, another word that I've seen that's present. It only takes the top so many words, by the way, because if I look at the small words, I'm seeing that "awake" makes an appearance, but it's nowhere near the same number of occurrences. Plus you don't see that term in a larger font, and in this case, it's colored orange.
But what's also happening here at the bottom? So at the bottom, we have, in this case, the date when this data was retrieved, different timestamps of retrieving this data, as well as, on the y-axis here, we have these numbers. What do these numbers mean?
We'll see this a little bit later in the code for what's happening in the background, but this is related to a score. So what is this particular score? So for this score-- and you'll see this, once again, in the code. But you have something called a-- it has a particular term, the VADER score. In this case, this is the way I'll highlight it here that points out, for the sentiment analysis, what are we getting in terms of these results? So VADER-- and you'll hear this a little bit later-- it's an acronym.
It's a lexicon- and rule-based sentiment analysis tool that we're using. And pretty much what the score returns is this: if you have a value that's closer to negative 1, then you're highlighting that you have a very negative connotation to terms, versus a value closer to positive 1, which highlights that you have a more positive connotation for certain terms. And this line points out the frequency of quite a few of those terms and kind of where the various words were in terms of the score.
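For reference, VADER stands for Valence Aware Dictionary and sEntiment Reasoner. A minimal sketch of computing such scores is shown below, assuming Text Analytics Toolbox is installed and using its vaderSentimentScores function. Whether the app computes its scores in exactly this way is an assumption, and the two sentences are made up.

```matlab
% Made-up example sentences, for illustration only.
str = [ ...
    "This allergy medicine is great, I feel so much better!"
    "Worst side effects ever, I feel terrible."];

% VADER uses capitalization and punctuation as cues, so the text is
% tokenized as-is rather than lowercased or stripped first.
documents = tokenizedDocument(str);

% One compound score per document, between -1 (most negative)
% and 1 (most positive).
compoundScores = vaderSentimentScores(documents);
```

The first sentence should receive a score above zero and the second a score below zero, which is the same scale shown on the y-axis of the app's plot.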
Now, this was a lot to see at once, because one of the biggest items is there's a lot of content on Twitter. So this processed this content fairly quickly, which is very promising to see.
And by the way, besides typing in zyrtec that was from the example, we could type in technically any other term. It could have a connection to a biomedical application or not. But I won't do that in this case.
But what about the code? What was happening in the background? So once again, the theme for this app is that it's usable for us, and you can get access to it after the presentation, but it gave us results at a very high level. Could we dig deeper? Of course.
For digging deeper, there are quite a few files that we have access to. If you notice you have quite a few files that start with the word demo and have various numbers. And there's an order to those numbers. So we would start with Demo00 and ideally get to the Demo04 file.
And if we open up each of the files we'll notice it divides many of the steps that were implemented that could help us out with this overall workflow. In fact, the whole idea behind this first file is to show an overview of the workflow and just point out the various files that were used for different tasks related to the workflow. So we have the workflow, this should look fairly familiar. And we actually saw the end result, the share. So we saw its final result.
But let's take a few steps back and see about performing some additional exploration. So for instance, how do we get access to this data? So if I run the section of code, there's another file that should open. And the whole thing behind this file is to import the content and possibly explore some reorganization, which is done in this file.
In order to get access to this data, there was a command called twitter. Where does this come from? So we can access this data through the Twitter API via Datafeed Toolbox. And this is quite powerful to help us get access to a large amount of data fairly quickly. In this case, it was text data.
Another component is in addition to getting access to the data, what about searching? So you have resources to help out with performing a search. And notice you have the search command.
One of the biggest items about the search command that I like to point out is that the Twitter search API retrieves a sampling of tweets published within a certain time frame, in this case the past seven days. It is not comprehensive. There is documentation not only for the functions that we're using in the MATLAB environment, but because we're using another tool, in this case Twitter, there are also references to the Twitter developer documentation for additional options for retrieving data from a much earlier time frame, and for looking at other settings as well.
But the whole theme is you have access to the data. You use those two commands. What about reorganizing the data?
So the theme for reorganizing the data is helping us with I'd say visualizing the content at a glance, so being able to see the content in a table, as well as being able to retrieve the data. So as a part of using this file you should be able to get access to data. You should be able to search as well. And once you get access to the data to do searches on different terms, then we should be able to do some more processing.
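The three steps just described (connect, search, reorganize into a table) might look roughly like the sketch below. It assumes Datafeed Toolbox and valid Twitter API credentials; the environment variable names are hypothetical, and when no credentials are set the script falls back to two made-up placeholder statuses so the reorganizing step can still run. This is not the code from the demo files.

```matlab
% Credentials are read from environment variables (hypothetical names).
consumerKey       = getenv('TWITTER_CONSUMER_KEY');
consumerSecret    = getenv('TWITTER_CONSUMER_SECRET');
accessToken       = getenv('TWITTER_ACCESS_TOKEN');
accessTokenSecret = getenv('TWITTER_ACCESS_TOKEN_SECRET');

if ~isempty(consumerKey)
    % Datafeed Toolbox: open a connection, then search recent tweets.
    c = twitter(consumerKey, consumerSecret, accessToken, accessTokenSecret);
    response = search(c, 'zyrtec', 'count', 100);
    statuses = response.Body.Data.statuses;
else
    % Offline fallback: made-up placeholder statuses, not real tweets.
    statuses = { ...
        struct('created_at', 'Mon Jan 06 10:15:00 +0000 2020', 'text', 'Placeholder tweet one'); ...
        struct('created_at', 'Mon Jan 06 11:30:00 +0000 2020', 'text', 'Placeholder tweet two')};
end

% The statuses may arrive as a cell array of structs or as a struct array.
if iscell(statuses)
    tweetTimes = cellfun(@(s) s.created_at, statuses, 'UniformOutput', false);
    tweetTexts = cellfun(@(s) s.text, statuses, 'UniformOutput', false);
else
    tweetTimes = {statuses.created_at}';
    tweetTexts = {statuses.text}';
end

% Reorganize into a table so the content can be viewed at a glance.
tweets = table(string(tweetTimes(:)), string(tweetTexts(:)), ...
    'VariableNames', {'CreatedAt', 'Text'});
```

Keeping only the timestamp and the text is a simplification chosen for this sketch; the response carries many more fields that could be added as further table variables.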
So notice in this next section of code there's another file. What is this next file? This next file points out the pre-processing tasks.
So for pre-processing with text data, things start to get very interesting very quickly, at least I think so. So after you retrieve that text data, we're seeing the usage of a word cloud that returns terms that were part of the text data that was imported. After just getting this overview-- and by the way, this is an overview of more than likely the raw data. How can we perform our pre-processing?
So for pre-processing your text data, a variety of different tasks are done. One series of popular tasks tends to be transforming text to lowercase letters. And then the rest of the tasks tend to depend on the content that you have. So in this case, they erase URLs and remove hashtags.
Now, one thing that does happen for a lot of text analytics is tokenizing the text. So consider the words as separate tokens to help with other related tasks, such as pre-processing and some additional analysis. Erasing punctuation and removing words are other popular tasks, and, finally, there are options to help us analyze word frequency. So you have bag of words and also getting the top N words to help determine a vocabulary with word counts.
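The pre-processing steps listed above can be sketched as follows, assuming Text Analytics Toolbox. The two tweets are made up, the order of the steps is one reasonable choice rather than necessarily the order used in the demo file, and dropping each whole hashtag (instead of only the # symbol) is an assumption.

```matlab
% Made-up example tweets, for illustration only.
rawTweets = [ ...
    "Zyrtec helped my allergies today! https://example.com #allergy"
    "My allergies are awful, Zyrtec made me sleepy"];

cleanText = lower(rawTweets);                  % transform to lowercase
cleanText = eraseURLs(cleanText);              % erase URLs
cleanText = regexprep(cleanText, "#\w+", "");  % drop hashtags (assumption: whole tag)

documents = tokenizedDocument(cleanText);      % split text into word tokens
documents = erasePunctuation(documents);       % erase punctuation
documents = removeStopWords(documents);        % remove common words such as "my"

bag      = bagOfWords(documents);              % word counts per document
topWords = topkwords(bag, 2);                  % table of the 2 most frequent words
```

In this small sample, "zyrtec" and "allergies" each occur twice and come out on top. The same kind of table is what drives a frequency bar chart, and passing the bag to wordcloud would draw the corresponding word cloud.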
So there are a lot of cases where if you don't see certain terms prominently, then we have to think, does it make sense to do a lot of analysis on those words that are not identified as prominently? A little bit later you have a visualization of content based specifically on the processed content that we just saw. We saw, for example, you have a histogram, but in this case the histogram's devoted to Medtronic. But we saw a histogram in relation to the term we saw in the app a little bit earlier, zyrtec.
One popular part of pre-processing that some text analytics applications explore has to do with the following-- so we have these tokens that are these individual words right now-- think of them that way-- that could be analyzed. But a big question comes up, do I need to work with just individual words or do I need to work with a phrase? So you can use counts of n-grams to say let's look into a collection of successive words. So you can look into certain phrases, which can also help out depending on your analysis and influence a host of other tasks.
So in this case, we're looking at this updated word cloud that uses counts of n-grams. Notice how you have this prominently displayed phrase. It's not just a single word; it's multiple successive words. We'll see this in a few other cases.
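To make the idea of counting successive words concrete, here is a minimal sketch, assuming Text Analytics Toolbox, that counts two-word phrases (bigrams). The three short texts are made up, and the choice of two-word phrases is only for illustration.

```matlab
% Made-up, already cleaned example texts.
documents = tokenizedDocument([ ...
    "felt sleepy after taking zyrtec"
    "still felt sleepy this morning"
    "zyrtec worked but i felt sleepy"]);

% Count pairs of successive words instead of single words.
bag = bagOfNgrams(documents, 'NgramLengths', 2);

% Table of the most frequent bigrams (variables include Ngram and Count).
topBigrams = topkngrams(bag, 3);
```

The phrase "felt sleepy" appears in all three texts, so it ranks first with a count of 3. Passing this bag to wordcloud would display whole phrases rather than individual words.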
Now, for the exploration of the sentiment itself, you have a lot of information from before based on that processed data. But we now need to make a connection between, OK, here's this particular word, and is this going to have a positive or a negative connotation? So we see this usage here, where we have labeling. You have this matching up of certain words with labels. And at the bottom of the code, you just happen to have this series of helper functions for certain tasks. But what about actually doing the next--