
Analyzing Pattern of Characters

Tei Newman-Lehman on 22 May 2023
Answered: Cyrus Monteiro on 16 Jun 2023
Hi, I have a large set of string data (~6K strings) made up of various random combinations of letters. Each string is associated with an output metric (a number). How can I analyze the strings to determine whether any patterns in them correlate with their associated output metrics?
  5 Comments
Image Analyst on 23 May 2023
Edited: Image Analyst on 23 May 2023
You keep forgetting to attach your data! Please attach a sample string array and metrics vector in a .mat file so we can try things.
If you have any more questions, then attach your data and code to read it in with the paperclip icon after you read this:
Tei Newman-Lehman on 23 May 2023
@Image Analyst, thanks for your help. A snippet of the data is attached.
Thanks!


Answers (1)

Cyrus Monteiro on 16 Jun 2023
To determine if there is any correlation between the strings and the associated output metrics in MATLAB, you can use the following approach:
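  1. Convert the strings into a numerical format that a machine learning algorithm can use. One approach is the bag-of-words (BoW) model, which represents each string as a vector of word frequencies. In MATLAB, the "tokenizedDocument" and "bagOfWords" functions (Text Analytics Toolbox) produce such a matrix of word frequency counts. Note that if each string is a single run of letters with no spaces, every string becomes its own "word", so counting individual characters or character n-grams is likely a better representation.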
  2. Split the data into training and testing sets. You can use the "cvpartition" function in MATLAB to create cross-validation partitions of your data.
  3. Train a supervised learning algorithm on the training set. For example, you can use the "fitrsvm" function to train a support vector regression model.
  4. Test the trained model on the testing set and calculate the correlation coefficient between the predicted output metrics and the true output metrics. You can use the "corrcoef" function in MATLAB to calculate the correlation coefficient.
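Since the question describes the strings as combinations of letters, step 1 could also be done at the character level. Here is a minimal sketch in base MATLAB that counts how often each letter occurs in each string. The sample strings and the A-Z alphabet are assumptions made up for illustration, not taken from the attached data:

```matlab
% Hypothetical sample strings; replace with your own data.
strs = ["ABCA"; "BBD"; "CAD"; "AAAA"];
alphabet = 'A':'Z';                        % assumed alphabet: letters A-Z only

% X(i,j) = number of times alphabet(j) occurs in strs(i)
X = zeros(numel(strs), numel(alphabet));
for i = 1:numel(strs)
    chars = upper(char(strs(i)));          % string -> uppercase char row vector
    % Compare every character (as a column) against every letter (as a row),
    % then sum down the columns to get one count per letter.
    X(i,:) = sum(chars(:) == alphabet, 1);
end

disp(X(:, 1:4));                           % counts of A, B, C, D per string
```

The resulting X can be used in place of the bag-of-words matrix in the remaining steps. Letter counts ignore ordering, so if the position or sequence of letters matters, n-gram or positional features would be needed instead.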
Here's some example starter code:
% Load the data
data = readtable('data.csv');
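% Convert the strings to numerical format using the BoW model
% (tokenizedDocument and bagOfWords require Text Analytics Toolbox)
docs = tokenizedDocument(string(data.Strings));
bag = bagOfWords(docs);
X = full(bag.Counts);   % documents-by-words count matrix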
% Split the data into training and testing sets
cvp = cvpartition(height(data), 'HoldOut', 0.2);   % height gives the number of table rows
idxTrain = training(cvp);
idxTest = test(cvp);
XTrain = X(idxTrain,:);
yTrain = data.OutputMetric(idxTrain);
XTest = X(idxTest,:);
yTest = data.OutputMetric(idxTest);
% Train a support vector regression model
mdl = fitrsvm(XTrain, yTrain);
% Predict the output metrics on the testing set using the trained model
yHat = predict(mdl, XTest);
% Calculate the correlation coefficient between the predicted output metrics and the true output metrics
corrCoef = corrcoef(yHat, yTest);
disp(['Correlation coefficient: ', num2str(corrCoef(1,2))]);
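A single 20% hold-out gives one estimate that depends on how the random split happens to fall. As a rough sketch of an alternative, the 'KFold' option of fitrsvm together with kfoldPredict yields an out-of-fold prediction for every row. The data below is synthetic and purely illustrative, and the choice of 5 folds and 'Standardize' are my assumptions rather than part of the original answer:

```matlab
rng(0);                                    % reproducible synthetic data
X = randi([0 5], 200, 6);                  % hypothetical count features
y = X * [1; -2; 0; 0.5; 0; 3] + 0.1*randn(200, 1);   % hypothetical metric

% Train a 5-fold cross-validated support vector regression model
cvMdl = fitrsvm(X, y, 'KFold', 5, 'Standardize', true);

% Each row is predicted by the model that did not see it during training
yHat = kfoldPredict(cvMdl);

R = corrcoef(yHat, y);
fprintf('Out-of-fold correlation: %.3f\n', R(1,2));
```

Like the starter code, this needs Statistics and Machine Learning Toolbox.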
You can experiment with different machine learning algorithms and hyperparameters to determine the best model for your data. Additionally, you can use feature selection techniques to identify the most important words in the strings that are correlated with the output metric.
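One simple way to get a first look at which features matter is to rank the columns of the feature matrix by their correlation with the metric. This is only a sketch on synthetic data: the feature names and values are hypothetical, and a plain correlation only picks up linear, one-feature-at-a-time relationships:

```matlab
rng(1);                                    % reproducible synthetic data
featureNames = ["A" "B" "C" "D"];          % hypothetical feature labels
X = randi([0 5], 100, 4);                  % hypothetical count features
y = 2*X(:,1) - X(:,3) + 0.1*randn(100, 1); % metric driven by features A and C

% Correlation of each feature column with the metric
r = zeros(1, size(X, 2));
for j = 1:size(X, 2)
    R = corrcoef(X(:,j), y);
    r(j) = R(1,2);
end

% List features from strongest to weakest absolute correlation
[~, order] = sort(abs(r), 'descend');
for j = order
    fprintf('%s: r = %+.3f\n', featureNames(j), r(j));
end
```

A column that is constant across all strings will give NaN here, so such columns are best dropped before ranking.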
