Text Analytics for Biomedical Applications, Part 1: Introduction to Text Analytics
From the series: Text Analytics for Biomedical Applications
Discover the basics of text analytics and the motivation behind using these capabilities for biomedical applications. Learn the typical steps involved in a text analytics workflow using MATLAB®, from data import, pre-processing, and converting text to numbers, to building predictive models using machine and deep learning algorithms.
Hello and welcome. For this presentation, we'll cover text analytics with biomedical applications using MATLAB. My name is Louvere Walker-Hannon. I'm a senior application engineer located in the Natick, Massachusetts office of MathWorks Incorporated. And let's begin.
In this particular presentation, we will see the following topics. We'll have some background information provided related to, what is text analytics and how it can be implemented? We'll also explore the motivation behind using text analytics. We'll look at at least one example of seeing text analytics used for a biomedical application. And at the end of the presentation, we'll explore looking into resources that can help you after this presentation.
Now, a question comes up for a number of individuals. And the question may be, what is text analytics? And also, questions about related topics. So one related topic is, what is natural language processing? And before we even explore what is natural language processing, we're also going to look into some definitions of other related terms. One popular related term is artificial intelligence.
So artificial intelligence is a part of computer science that's concerned with how to make machines do tasks that humans can do. What are those tasks? Such as planning, recognizing objects, interacting, moving, and doing a host of other activities. In terms of implementing artificial intelligence, also known as AI, there are several different techniques that could be used. One of these techniques is machine learning. And machine learning deals with how we can teach machines to do activities similar to what a human being could do.
In order for machine learning to be used, we give the machines ample information so that it can learn how to do these various tasks correctly. Another way of implementing artificial intelligence is to use deep learning. Deep learning is a subfield of machine learning. It is an extension of neural networks that is supposed to mimic how human brains work. Mathematically, when you run through a network of neurons, each of which has some activation function, when a specific threshold is reached, the neuron is activated and data passes through.
Why do we highlight these terms? Well, another related term is natural language processing, also known as NLP. It's a part of AI that deals with human language. Typically, it's written human language. And one more term, the term that we're going to focus in on for most of this presentation, is text analytics.
Text analytics entails using natural language processing. And if we look at this slide, if we look at the diagram, we see the workflow, or a nice overview of the workflow, of implementing text analytics. So our text data could be from a variety of sources. After we get access to this text data, we notice that we're going to be able to get some insight related to real-world questions.
So notice the theme in terms of this implementation of text analytics, this combination of natural language processing plus machine learning. Now, text analytics, as some of us may know, can be used in a wide variety of applications in many different industries. But the focus for this presentation is on the biomedical industry.
And what are the possible applications of text analytics in the biomedical industry? Well, we may first notice with these upcoming examples the overall themes entail, what insight can we obtain from various
documents based on the text in these documents? This insight is identified in the form of these following questions and others that could be explored based on content from these various documents. So we have research articles that can provide insight in terms of themes that may be prevalent and related to these various research articles-- not just the research topic, but other themes, maybe in terms of algorithm development, as one example.
Also, you have design. So there are typically documents that have technical specifications if we're designing various devices. And we could see about answering the question, does the design of a particular object meet guidelines that we find in these technical specifications? In terms of manufacturing, there could be a variety of questions that we can explore. But once again, from these documents, we could possibly explore answering the question, which batch may be defective? after getting access to data from various types of documents.
And finally, what about the following? What about operations? So we're implementing certain tasks and following certain procedures. But we may notice trends and patterns. And one question that can come up related to operations, when may the patient return to a particular facility. So through text analytics, we're going to be able to explore these topics. And we'll see possible implementations of how we could explore these topics coming up very soon.
Another question that comes up is, in general, what can I do with text analytics? And there are a number of tasks that we can implement with text analytics. Text analytics is powerful, but it also does have challenges. Many of those challenges can come from a variety of sources. So coming up in the side, we'll highlight some general overviews of what we can do with text analytics.
Number one, we can generally understand. We can see trends, patterns, and opportunities-- for example, from one source, the research articles we saw not too long ago. And then we're trying to explore, what about meeting certain standards in terms of design requirements? Manufacturing-wise, could we identify inefficiencies? Are there possible ways to improve processes that are in place? So we have those themes.
And finally, you have hidden insights related to a variety of documents. So it could be from patient records. It could be from other related documents. One item I do want to point out is many of the themes that we're seeing here happen to connect to the biomedical industry, which is the focus of this presentation. But it could be expanded to work with text analytics connected to other industries as well.
Now, there was a mention that text analytics can be powerful, but it also has challenges. And let's see about exploring, what are those possible challenges? So a major question comes up. What are some difficulties with respect to text analytics?
Now, if we think about this-- so the text that we find in our documents typically, in many cases, comes from spoken language. So we have a variety of complications related to, possibly, spoken language that may find their way into text. So the reason why I mention that is because you may see some themes related to some of those concepts related to spoken language. And they may make their way into text. And as a part of the text, we may see some problems partially related to spoken language.
So one major theme is many languages may have various dialects. A particular word may have different meetings even though it's spelled the same way and possibly pronounced the same way. Now, in terms of other possible challenges-- so machines can understand logic, But. Human beings cannot. So if we look at this example here, this is really interesting, I feel.
So if you are a friend of mine and you want to go bowling, what would I want to do? So there's a parsing of information where a certain interpretation can be taken away from what's listed here. And depending on, possibly, who the human being is and their background, there may not be the ability to parse through this content to get some sort of meaning.
And then you have quite a few other topics. So there could be a cultural context in terms of understanding what content may be written. It may highlight that in some cases, based on how something is written in text form, there's ambiguity with the, once again, interpretation of what's there and so many other factors.
If we look at this next example, we'll see the following. So you could have the statement, I had a bat while growing up. But that could have multiple meanings. Are referring to the animal? Or are we asking about a bat related to, let's say, the sport of baseball, so with which you hit a ball?
So keep in mind there are a number of challenges. But there are also quite a few benefits. And for us, we're going to use the text analytics workflow to explore information related to at least one application from the biomedical industry.
So what does the text analytics workflow entail? So we'll see multiple stages. Typically, those stages are broken down into four tasks. So the first task, as we could probably-- I'm going to say guess, entails getting access to the data.
So for getting access to the data, this can come from a variety of sources. And those sources could be a database. It could be files and from so many others where we can get access to text data. Now, one of the other themes is exploring this data. So in a number of cases, that exploration has a connection to visualizing the data. And a popular way of visualizing a lot of our text data tends to be in the form of using a word cloud. So in an upcoming example, you may see a word cloud used to assist us with exploring some imported text data.
There's a second part, a second task. So preprocessing the data-- as we can probably guess, based on us getting data from a variety of sources, there may be a lot of content that needs to be cleaned. And the reason why we're doing this second task is to help clean this data so it can get ready for building a model to help us perform certain tasks related to this text data.
In terms of what those different techniques are for cleaning the data, there are quite a few different techniques where not necessarily all of them would be used each and every time. So you may have a variety of techniques that are there for us to use. But we may not need and/or use each one each and every time. But the theme is, for preprocessing, we're trying to get access to this cleaner data, so moving items such as, let's say, punctuation as one type of item that could be cleaned from and removed from raw data, raw text data that we have access to.
A third task-- so developing predictive models. Now, this is where a number of people may ask the following question. Which approach do I use? Because if we look closely, there are a lot of different ways to develop predictive models. Do we use deep learning? Do we use machine learning? And/or we're just going to use some statistical analysis. All of these options can help us explore creating predictive models.
When to use one versus the other, there are a variety of factors that we should consider-- one of them being, for example, how much data you have access to. Think about your hardware. Do you have hardware that's helpful for training to help create a model? We would have to consider that and a variety of other factors.
So there's three options. And between the three options, there are considerations we should explore, such as the amount of data you have. Thinking about your hardware, can I realistically train to get this particular model? And other considerations exist as well.
And finally, for that fourth task, so what do you do once you get some output from your predictive model? Well, you have a number of options. One of those options could entail creating a report. Another option is, well, maybe I want someone to be able to do some level of analysis. But I don't want them to have to write the code. So you can build your own user interface, also known as apps. And these apps can be available on your desktop and/or via the web.
Finally, if you wanted to expand the reach of these apps, who can use them, you have a variety of servers that could be used to help with that particular task. What are those servers? You have production servers and web app servers that can help expand the reach of those apps and, possibly, other content that you would like to share related to your text analytics workflow.
Now we've spoken about some background related to text analytics-- what is it, why would it be used, how it could be implemented. And we have these four tasks that point out the text analytics workflow. This should bring us to the ability to explore at least one example. Let's see that here.
You can also select a web site from the following list:
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.