Creating the matrix of GloVe embedded vocabulary

I downloaded glove.6B.zip.
Per the documentation, the file contains 400k vocabulary words, each represented as a 300-dimensional vector.
I want to create a 400k × 300 matrix in MATLAB that holds the embedding vectors of all 400k vocabulary words. I do not need to save the word that corresponds to each vector.
What is the simplest MATLAB code to create such a matrix from glove.6B.zip?
Thanks for your anticipated help!

Accepted Answer

Shantanu Dixit on 30 Apr 2025
Edited: Shantanu Dixit on 30 Apr 2025
Hi Amos,
You can build an embedding matrix for the GloVe embeddings by preallocating a 400,000 × 300 matrix with 'zeros' (https://www.mathworks.com/help/matlab/ref/zeros.html). The archive has to be extracted first; the 300-dimensional vectors are in glove.6B.300d.txt. Then read the file line by line and store only the numeric part of each line in the matrix, discarding the word. Because the file is plain text, 'str2double' (https://www.mathworks.com/help/matlab/ref/str2double.html) can be used to convert the text to numbers. Each line in the file looks like this:
the 0.04656 0.21318 -0.0074364 -0.45854 ...
After reading each line, the corresponding vector can be stored as follows:
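unzip('glove.6B.zip');                     % extract the text files from the archive
fid = fopen('glove.6B.300d.txt', 'r');
assert(fid ~= -1, 'Could not open glove.6B.300d.txt');
embeddingMatrix = zeros(400000, 300);      % preallocate one row per word
for i = 1:400000
    textLine = fgetl(fid);                 % one line: word followed by 300 numbers
    tokens = strsplit(textLine);           % split on whitespace
    embeddingMatrix(i, :) = str2double(tokens(2:end));  % skip the word, keep the numbers
end
fclose(fid);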
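Since the words themselves are not needed, one possible alternative to the line-by-line loop is to let 'textscan' parse the whole file in a single call and skip the first field of every line. The snippet below is only a sketch of that idea: it runs on a tiny made-up file with 3 numbers per line instead of 300, and the names sampleFile, numDims and formatSpec are placeholders chosen for this illustration. It has not been checked against the full glove.6B.300d.txt file.

```matlab
% Write a tiny made-up file in the same layout as the GloVe text files:
% a word followed by space-separated numbers on each line.
sampleFile = [tempname '.txt'];            % placeholder temporary file
fid = fopen(sampleFile, 'w');
fprintf(fid, 'the 0.1 0.2 -0.3\n');
fprintf(fid, 'of -1.5 2.0 0.25\n');
fclose(fid);

numDims = 3;                               % would be 300 for glove.6B.300d.txt

% '%*s' reads the word and discards it; each '%f' reads one number.
formatSpec = ['%*s' repmat(' %f', 1, numDims)];

fid = fopen(sampleFile, 'r');
% 'CollectOutput' merges the numeric columns into one matrix.
C = textscan(fid, formatSpec, 'CollectOutput', true);
fclose(fid);
delete(sampleFile);

embeddingMatrix = C{1};                    % one row per line, numDims columns
disp(size(embeddingMatrix));
```

For the real file, the same pattern would use numDims = 300 and the path to glove.6B.300d.txt in place of the sample file. Comparing the number of rows returned with the expected 400,000 is a quick way to check that no line was parsed incorrectly.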
You can also refer to the MathWorks documentation pages for the other functions used above (fopen, fgetl and strsplit).
Hope this helps!


Release

R2023b
