Fastest way to find text keywords out of large amount of textual news sentences?
Song Decn
on 24 Jan 2021
Answered: Walter Roberson
on 8 Feb 2021
Hello, I have a database containing over 900,000 lines of news, and I want to scan these lines of text for certain keywords. I tried
tic; strfind(newsDb.SingleNewline, kws{1}); toc
tic; contains(newsDb.SingleNewline, kws{1}); toc
Both take over 0.003 s to search one news line for one keyword.
If I want to create a new database with over 20,000 keywords, it would take
900000 * 20000 * 0.003 / 60 / 60 / 24
days, i.e. over 600 days, to do this. :(
Does anyone have an idea how to do this within perhaps one or two days?
Thank you very much
6 Comments
Walter Roberson
on 28 Jan 2021
What do you want to do about substrings, plurals, upper/lowercase, and the other factors I asked about? For example, if the headline were "Elon visits Oak Hammock Marsh", is it acceptable that this would match "Mars"? And "Elon eats musk-melon"? And "Eucre trumps Bridge in recent poll"?
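To make the substring question concrete, here is a minimal sketch contrasting a plain substring search with a whole-word regular expression. The variable names are illustrative only and are not from the original post.

```matlab
headline = "Elon visits Oak Hammock Marsh";

% Plain substring search: "Mars" is found inside "Marsh"
substringHit = contains(headline, "Mars");

% Whole-word search: \< and \> anchor the match to word boundaries,
% so "Marsh" no longer counts as a hit for "Mars"
wholeWordHit = ~isempty(regexp(headline, "\<Mars\>", 'once'));

% Case-insensitive whole-word search: lets "musk" in "musk-melon" match "Musk"
caseInsensitiveHit = ~isempty(regexp("Elon eats musk-melon", "\<Musk\>", 'once', 'ignorecase'));
```

Whether hits like the last one are wanted depends on the answers to the questions above.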
Accepted Answer
Walter Roberson
on 8 Feb 2021
You can do the search phase efficiently:
S = [ "Elon Musk is the richest man on the planet"
"Elon Musk is the poorest man on Mars"
"Trump is the president of US"
"Elon eats musk-melon"
"Eucre Trumps Bridge in recent poll"
"Trump is the not president of US"]
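Tags = ["Musk" "Trump" "Mars"]
numTags = numel(Tags);
% \< and \> anchor each alternative to word boundaries, so "Mars" does not match "Marsh"
pattern = "\<(?<word>(" + strjoin(Tags, "|") + "))\>"
% One cell per line of S; each cell holds a struct array with one element per matched tag
search_results = regexp(S, pattern, 'names')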
However, the output is not really what you want: it lists the tags that were matched in each line, and it needs to be rearranged to show where each tag was found.
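% Convert each struct array of matches into a string array of the tags found in that line
tags_matched = cellfun(@(C) string({C.word}), search_results, 'uniform', 0).'
% For each tag, collect the indices of the lines in which it was found
TagWasFoundAt = cell(numTags,1);
for K = 1 : numTags
    TagWasFoundAt{K} = find(cellfun(@(C) ismember(Tags{K}, C), tags_matched));
end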
[cellstr(Tags(:)), TagWasFoundAt]
% OR: build a logical matrix with one row per line of S and one column per tag
match_bits = cell2mat(cellfun(@(C) ismember(Tags, string({C.word})), search_results, 'uniform', 0));
TagWasFoundAt = arrayfun(@(COL) find(match_bits(:,COL)).', (1:numTags).', 'uniform', 0);
[cellstr(Tags(:)), TagWasFoundAt]
It is likely that there are other ways to do the matching from tags to entries.
The first of those two is probably more efficient, but the match_bits array would be useful if you wanted a single data structure that you could easily query to find out which articles contain a particular tag, or which tags a particular article contains. The match_bits array is also good for boolean searches, for example finding articles that contain "Musk" or "Mars" but not "Trump":
(match_bits(:,1) | match_bits(:,3)) & ~match_bits(:,2)
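As a sketch of what such queries might look like, the snippet below hard-codes the match_bits matrix that the six sample sentences and three tags should produce, assuming case-sensitive whole-word matching as in the pattern above. The names articles, tagsInArticle2 and articlesPerTag are illustrative only.

```matlab
Tags = ["Musk" "Trump" "Mars"];

% Rows = sample sentences 1..6, columns = Musk, Trump, Mars
% (hard-coded here so the example runs on its own)
match_bits = logical([1 0 0
                      1 0 1
                      0 1 0
                      0 0 0
                      0 0 0
                      0 1 0]);

% Articles that mention Musk or Mars but not Trump
articles = find((match_bits(:,1) | match_bits(:,3)) & ~match_bits(:,2)).';

% Tags contained in article 2
tagsInArticle2 = Tags(match_bits(2,:));

% Number of articles mentioning each tag
articlesPerTag = sum(match_bits, 1);
```

Here articles comes out as [1 2], tagsInArticle2 as ["Musk" "Mars"], and articlesPerTag as [2 2 1].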
There might be better ways of doing the matching.
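One possible alternative, sketched here under the assumption that every keyword is a single word and that matching is case-sensitive, is to split each line into word tokens once and then test tag membership with ismember. This is not the method used above, and whether it is actually faster on 900,000 lines and 20,000 keywords would need to be timed.

```matlab
S = [ "Elon Musk is the richest man on the planet"
      "Elon Musk is the poorest man on Mars"
      "Trump is the president of US"
      "Elon eats musk-melon"
      "Eucre Trumps Bridge in recent poll"
      "Trump is the not president of US"];
Tags = ["Musk" "Trump" "Mars"];

% Tokenize every line once; \w+ matches runs of letters, digits and underscores
words = regexp(S, "\w+", 'match');   % cell array with one list of words per line

% Row K marks which tags appear as whole words in line K
match_bits = false(numel(S), numel(Tags));
for K = 1 : numel(S)
    match_bits(K,:) = ismember(Tags, words{K});
end
```

Because each line is tokenized only once, the per-line work does not involve a regular expression that grows with the keyword list. Multi-word keywords would not be found by this sketch.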