- Do you have a set of strings with one value per entry, and want to group the entries by unique string and determine, for each unique string, whether its corresponding outputs somehow "correlate" with each other, with no relationship implied or sought between two strings that differ in any way?
- Or are you hoping to analyze some kind of correlation between the different strings and values? For example, discover that "-ing endings correspond to these properties", "second letter is capital-D corresponds to this subset", and so on?
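To make the two readings concrete, here is a rough sketch of each. The strings and values are made-up toy data (nothing was attached to the question), and the variable names are only for illustration.

```matlab
% Hypothetical toy data; the real strings and metrics were not posted.
strs   = ["abD"; "running"; "abD"; "xDz"; "jumping"; "qrs"];
metric = [1.0; 5.0; 3.0; 2.0; 7.0; 4.0];

% Reading 1: group by exact string and summarize the values in each group.
[g, uniqueStrs] = findgroups(strs);
groupMean = splitapply(@mean, metric, g);

% Reading 2: test hand-picked string patterns against the values.
endsInIng = endsWith(strs, "ing");              % "-ing" ending
secondIsD = extractBetween(strs, 2, 2) == "D";  % second letter is capital D

meanWithIng    = mean(metric(endsInIng));
meanWithoutIng = mean(metric(~endsInIng));
```

In the first reading, only repeated strings carry any information. In the second, each pattern becomes a logical feature whose groups can be compared.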
Analyzing Pattern of Characters
Hi, I have a large set of string data (~6K strings) that contains various random combinations of letters. Each string is associated with an output metric (a number). How can I relate the strings to one another to determine whether there is any correlation with their associated output metrics?
5 Comments
Image Analyst
on 23 May 2023
Edited: Image Analyst
on 23 May 2023
You keep forgetting to attach your data! Please attach a sample string array and metrics vector in a .mat file so we can try things.
If you have any more questions, then attach your data and code to read it in with the paperclip icon after you read this:
Answers (1)
Cyrus Monteiro
on 16 Jun 2023
To determine if there is any correlation between the strings and the associated output metrics in MATLAB, you can use the following approach:
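- Convert the strings into a numerical format that a machine learning algorithm can use. One approach is the bag-of-words (BoW) model, which represents each string as a vector of word frequencies. For example, with Text Analytics Toolbox you can use the "tokenizedDocument" and "bagOfWords" functions to convert your set of strings to a matrix of word frequency counts. Note that a string made up of an unbroken run of letters becomes a single token, so for such data counts of individual characters may be a more useful representation.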
- Split the data into training and testing sets. You can use the "cvpartition" function in MATLAB to create cross-validation partitions of your data.
- Train a supervised learning algorithm on the training set. For example, you can use the "fitrsvm" function to train a support vector regression model.
- Test the trained model on the testing set and calculate the correlation coefficient between the predicted output metrics and the true output metrics. You can use the "corrcoef" function in MATLAB to calculate the correlation coefficient.
Here's some example starter code:
% Load the data
data = readtable('data.csv');
% Convert the strings to numerical format using the BoW model
% (tokenizedDocument and bagOfWords require Text Analytics Toolbox)
docs = tokenizedDocument(data.Strings);
bag = bagOfWords(docs);
X = full(bag.Counts);
% Split the data into training and testing sets
cvp = cvpartition(height(data), 'HoldOut', 0.2);
idxTrain = training(cvp);
idxTest = test(cvp);
XTrain = X(idxTrain,:);
yTrain = data.OutputMetric(idxTrain);
XTest = X(idxTest,:);
yTest = data.OutputMetric(idxTest);
% Train a support vector regression model
mdl = fitrsvm(XTrain, yTrain);
% Predict the output metrics on the testing set using the trained model
yHat = predict(mdl, XTest);
% Calculate the correlation coefficient between the predicted output metrics and the true output metrics
corrCoef = corrcoef(yHat, yTest);
disp(['Correlation coefficient: ', num2str(corrCoef(1,2))]);
You can experiment with different machine learning algorithms and hyperparameters to determine the best model for your data. Additionally, you can use feature selection techniques to identify the most important words in the strings that are correlated with the output metric.
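As one simple way to look at which features matter, the sketch below counts how often each character occurs in each string and ranks the characters by the absolute correlation of that count with the output metric. It uses only base MATLAB. The data is a made-up toy set (the metric is deliberately set to the number of 'a' characters), and ranking by per-feature correlation is just one possible screening step, not a full feature selection method.

```matlab
% Hypothetical toy data standing in for data.Strings and data.OutputMetric.
strs   = ["aab"; "abc"; "bcc"; "aaa"; "cab"; "bbb"];
metric = [2; 1; 0; 3; 1; 0];   % here, equal to the number of 'a' characters

% One feature column per distinct character: its count in each string.
alphabet = unique(char(join(strs, "")));
X = zeros(numel(strs), numel(alphabet));
for k = 1:numel(alphabet)
    X(:, k) = count(strs, alphabet(k));
end

% Pearson correlation of each character count with the metric.
r = zeros(1, numel(alphabet));
for k = 1:numel(alphabet)
    c = corrcoef(X(:, k), metric);
    r(k) = c(1, 2);
end

% Rank characters by absolute correlation, strongest first.
[~, order] = sort(abs(r), 'descend');
ranked = alphabet(order);
disp(ranked);
```

A character that occurs the same number of times in every string has zero variance, so its correlation comes out as NaN and should be dropped before ranking. Character counts also ignore position; if position matters, per-position indicators or short substrings would be the next thing to try.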
0 Comments