This example shows how to create a co-occurrence network using a bag-of-words model.
Given a corpus of documents, a co-occurrence network is an undirected graph in which nodes correspond to unique words in a vocabulary and edge weights correspond to how often two words co-occur in a document. Use co-occurrence networks to visualize and extract information about the relationships between words in a corpus of documents. For example, you can use a co-occurrence network to discover which words commonly appear with a specified word.
Import the text data. The file weekendUpdates.xlsx contains status updates containing the hashtags "#weekend" and "#vacation". Read the data using the readtable function and extract the text data from the TextData column.
filename = "weekendUpdates.xlsx";
tbl = readtable(filename,'TextType','string');
textData = tbl.TextData;
View the first few observations.
textData(1:5)
ans = 5x1 string
    "Happy anniversary! ❤ Next stop: Paris! ✈ #vacation"
    "Haha, BBQ on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"
    "getting ready for Saturday night 🍕 #yum #weekend 😎"
    "Say it with me - I NEED A #VACATION!!! ☹"
    "😎 Chilling 😎 at home for the first time in ages…This is the life! 👍 #weekend"
Tokenize the text, convert it to lowercase, and remove the stop words.
documents = tokenizedDocument(textData);
documents = lower(documents);
documents = removeStopWords(documents);
Create a matrix of word counts using a bag-of-words model.
bag = bagOfWords(documents);
counts = bag.Counts;
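To make the shape of the word-count matrix concrete, the following is a minimal sketch that builds an equivalent matrix by hand, without bagOfWords. The three toy documents and the variable names docs and vocabulary are made up for illustration and are not taken from weekendUpdates.xlsx. Each row corresponds to a document and each column to a vocabulary word.

```matlab
% Hypothetical toy corpus: each cell holds the tokens of one document.
docs = {["beach" "sun"], ["beach" "sun" "fun"], "fun"};

% Vocabulary: unique tokens in order of first appearance.
vocabulary = unique([docs{:}], 'stable');

% counts(i,j) is the number of times word j appears in document i.
counts = zeros(numel(docs), numel(vocabulary));
for i = 1:numel(docs)
    for j = 1:numel(vocabulary)
        counts(i,j) = sum(docs{i} == vocabulary(j));
    end
end
disp(counts)
```

In the example itself, bag.Counts plays the role of counts and bag.Vocabulary plays the role of vocabulary.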
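To compute the word co-occurrences, multiply the transpose of the word-count matrix by the word-count matrix. Because each row of counts corresponds to a document and each column to a word, the result is a square matrix with one row and one column per vocabulary word, where element (i,j) sums, over all documents, the product of the counts of words i and j.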
cooccurrence = counts.'*counts;
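To see why this product counts co-occurrences, consider a small hand-made count matrix. The values below are hypothetical (three documents over a three-word vocabulary) and are not taken from weekendUpdates.xlsx.

```matlab
% Hypothetical counts: rows are documents, columns are the words
% "beach", "sun", and "fun" (in that order).
counts = [1 1 0;    % document 1: beach sun
          1 1 1;    % document 2: beach sun fun
          0 0 1];   % document 3: fun

% Transpose times the matrix gives a words-by-words matrix.
cooccurrence = counts.'*counts;
disp(cooccurrence)

% "beach" and "sun" share two documents; "beach" and "fun" share one.
beachSun = cooccurrence(1,2);
beachFun = cooccurrence(1,3);
```

With 0/1 counts, each off-diagonal element is the number of documents that contain both words. The diagonal holds each word's sum of squared counts, which is why self-loops are usually dropped before building a graph.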
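Convert the co-occurrence matrix to a network using the graph function. Specify the vocabulary as the node names and omit self-loops, which would otherwise record each word co-occurring with itself.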
G = graph(cooccurrence,bag.Vocabulary,'omitselfloops');
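As a quick check of what graph produces from a symmetric co-occurrence matrix, here is a small sketch on hypothetical data. The matrix C and the three-word vocabulary are made up for illustration.

```matlab
% Hypothetical 3-word co-occurrence matrix (symmetric).
vocabulary = ["beach" "sun" "fun"];
C = [2 2 1;
     2 2 1;
     1 1 2];

% Nonzero off-diagonal entries become weighted edges;
% 'omitselfloops' ignores the diagonal.
G = graph(C, vocabulary, 'omitselfloops');

nNodes = numnodes(G);
nEdges = numedges(G);

% Weight of the edge between node 1 ("beach") and node 2 ("sun").
w = G.Edges.Weight(findedge(G, 1, 2));
disp(G.Edges)
```

Each edge weight equals the corresponding off-diagonal entry of the matrix, so heavier edges indicate word pairs that co-occur more often.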
Visualize the network using the plot function. Set the line thickness to a multiple of the edge weight.
LWidths = 5*G.Edges.Weight/max(G.Edges.Weight);
plot(G,'LineWidth',LWidths)
title("Co-occurrence Network")
Find neighbors of the word "great" using the neighbors function.
word = "great"
word = "great"
idx = find(bag.Vocabulary == word);
nbrs = neighbors(G,idx);
bag.Vocabulary(nbrs)'
ans = 18x1 string
    "next"
    "#vacation"
    "😎"
    "#weekend"
    "☹"
    "excited"
    "flight"
    "delayed"
    "stuck"
    "airport"
    "way"
    "spend"
    "😊"
    "lovely"
    "friends"
    "-"
    "mini"
    "everybody"
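The neighbor list does not indicate how strongly each word co-occurs with the query word. One possible way to rank neighbors by co-occurrence count, which is not part of the original example, is to sort the query word's row of the co-occurrence matrix. The sketch below uses hypothetical toy data; in the example above, cooccurrence and bag.Vocabulary would take the place of C and vocabulary.

```matlab
% Hypothetical co-occurrence matrix and vocabulary.
vocabulary = ["beach" "sun" "fun"];
C = [2 2 1;
     2 2 1;
     1 1 2];

word = "beach";
idx = find(vocabulary == word);

% Take the query word's row; full() also handles sparse input.
row = full(C(idx,:));
row(idx) = 0;                        % ignore the word's count with itself

% Sort by descending co-occurrence count and keep nonzero entries.
[w, order] = sort(row, 'descend');
keep = w > 0;
topWords = vocabulary(order(keep));
topWeights = w(keep);

disp(table(topWords.', topWeights.', 'VariableNames', {'Word','Count'}))
```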
Visualize the co-occurrences of the word "great" by extracting a subgraph of this word and its neighbors.
H = subgraph(G,[idx; nbrs]);
LWidths = 5*H.Edges.Weight/max(H.Edges.Weight);
plot(H,'LineWidth',LWidths)
title("Co-occurrence Network - Word: """ + word + """");
For more information about graphs and network analysis, see Graph and Network Algorithms.