Implementing Global Vectors (GloVe) in MATLAB.

Hi,
I have been trying to implement the Global Vectors model (GloVe) in MATLAB (R2018b). GloVe seems to be as popular as the models implemented in the word2vec package. The basic idea is to start from a random initialization of the target and context word matrices and the bias terms, and then optimize the model parameters with something like adaptive gradient descent (AdaGrad); this is what the authors do in their C implementation. I want to avoid writing a long question here with details of the model and its main equation, so I would ask interested readers to go through one of the blogs dissecting the model. In the referenced blog, the main equation that needs to be optimized appears just above the section "Comparison with Word2Vec". There are other resources explaining the model as well. I have been trying to implement the same model in MATLAB, but with no luck so far. This is what I have done:
  1. I create the target word and context word co-occurrence matrix using the C code provided by the GloVe project, as it is very efficient. I then convert the binary co-occurrence file that the C code produces into a text file using my own C program. Feel free to ask for the code.
  2. I read the text file line by line in MATLAB (I know this is very inefficient!) and ran the optimization with fminsearch() first, which gave all 0's when tested with the official GloVe evaluation dataset and code; these are also available from the GloVe project.
  3. I am currently testing the model with fminunc() and awaiting the results.
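As an aside on steps 1 and 2, the conversion to text and the line-by-line parsing could in principle be skipped by reading the binary file directly. The following is a minimal sketch, assuming each record in the binary file is 16 bytes: two 32-bit integers (the word ids) followed by one 64-bit double (the co-occurrence value), in the machine's native byte order. That layout is an assumption and should be checked against the GloVe C source. The toy file, its contents, and the variable names are made up for illustration.

```matlab
% Write a tiny stand-in file in the ASSUMED record layout (made-up data):
% int32 target id, int32 context id, double co-occurrence value.
toy = [1 2 4.0; 2 1 4.0; 1 3 0.5];
bin_path = [tempname, '.bin'];      % hypothetical stand-in for the real binary file
fid = fopen(bin_path, 'w');
for k = 1:size(toy, 1)
    fwrite(fid, toy(k, 1:2), 'int32');
    fwrite(fid, toy(k, 3), 'double');
end
fclose(fid);

% Read all records at once: one 16-byte record per column of raw.
fid = fopen(bin_path, 'r');
raw = fread(fid, [16, Inf], '*uint8');
fclose(fid);
delete(bin_path);

% Reinterpret the byte ranges of every record as the assumed field types.
target_ids  = double(typecast(reshape(raw(1:4, :),  [], 1), 'int32'));
context_ids = double(typecast(reshape(raw(5:8, :),  [], 1), 'int32'));
counts      = typecast(reshape(raw(9:16, :), [], 1), 'double');

disp([target_ids, context_ids, counts]);
```

Reading the whole file at once needs 16 bytes of memory per record. For a large corpus, the same idea can be applied in chunks by passing a finite size such as [16, n] to fread inside a loop.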
I was wondering whether there are better ways to implement the same model in MATLAB (without calling their C implementation from MATLAB) that give the same results. I am now looking at the Least-squares fitting page and working out how to model the problem. I also think the Neural Network Toolbox could be used, because that would give us the chance to use gradient-based methods in MATLAB directly, but I do not know how. For those who are interested, below is my code written in MATLAB.
clear
vocab_path = "..\vocab.txt"; % This is where my vocabulary file is located; it was created using the vocab_count.c code from the GloVe project.
vocabfileID = fopen(vocab_path,'r');
vocab_size = linecount(vocabfileID);%Code: https://uk.mathworks.com/matlabcentral/fileexchange/42497-linecount
fclose(vocabfileID);
vocab_size = vocab_size - 1;
vector_size = 50;
target_matrix = rand(vocab_size, vector_size);
context_matrix = rand(vocab_size, vector_size);
bias_target = rand(vocab_size , 1);
bias_context = rand(vocab_size , 1);
options = optimoptions(@fminunc,'Display','none','Algorithm','quasi-newton', 'UseParallel', false,'FunValCheck', 'on', 'MaxFunctionEvaluations',50);
%options = optimset('Display','off', 'MaxFunEvals', 50);
cooccurrence_file='..\cooccurrence.txt'; %The co-occurrence file in text format; it would be great to read the binary file directly in MATLAB that the GloVe project creates.
fileID = fopen(cooccurrence_file);
concatenated_vector = rand(2*vector_size, 1);
x_max = 10;
alpha = 0.75;
counter = 0;
while ~feof(fileID)
    counter = counter + 1;
    if mod(counter,100000)==0
        disp(counter);
    end
    thisline = fgetl(fileID);
    records = strsplit(thisline, ' ');
    target_id = str2double(records(1));
    context_id = str2double(records(2));
    cooccurrence_value = str2double(records(3));
    concatenated_vector = horzcat(target_matrix(target_id,:),context_matrix(context_id,:),bias_target(target_id),bias_context(context_id));
    if cooccurrence_value < x_max
        fun = @(concatenated_vector)(cooccurrence_value / x_max)^alpha * dot(concatenated_vector(1:vector_size),concatenated_vector(vector_size+1:2*vector_size)+concatenated_vector(2*vector_size+1)+concatenated_vector(2*vector_size+2)-log(cooccurrence_value)^2);
    else
        fun = @(concatenated_vector)dot(concatenated_vector(1:vector_size),concatenated_vector(vector_size+1:2*vector_size)+concatenated_vector(2*vector_size+1)+concatenated_vector(2*vector_size+2)-log(cooccurrence_value)^2);
    end
    %concatenated_vector = fminsearch(fun,concatenated_vector,options); %Gave 0s in word analogy results using the Python evaluation codes from the GloVe project.
    concatenated_vector = fminunc(fun,concatenated_vector,options);
    target_matrix(target_id,:) = concatenated_vector(1:vector_size); %updating the vectors
    context_matrix(context_id,:) = concatenated_vector(vector_size+1:2*vector_size);
    bias_target(target_id) = concatenated_vector(2*vector_size+1);
    bias_context(context_id) = concatenated_vector(2*vector_size+2);
end
fclose(fileID);
%Here I averaged the target and the context word vectors, and saved the file using dlmwrite(). Then used paste command in Linux to paste the first column of vocabulary file to the saved vectors. Then executed the GloVe word analogy Python evaluation code.
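For comparison, here is a minimal sketch of the AdaGrad-style training mentioned at the top of the question, written for the usual statement of the GloVe objective: the sum over co-occurring pairs of f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2, with f(x) = (x/x_max)^alpha for x < x_max and 1 otherwise. In this form the bias terms sit outside the dot product and the square applies to the whole residual. The toy triplets, the learning rate eta, the number of epochs, and the initialization scheme are assumptions made for illustration; this is not the authors' implementation.

```matlab
% Sketch: GloVe-style training with AdaGrad on a toy co-occurrence list.
% Assumed for illustration (not from the question): the toy triplets, eta,
% n_epochs and the initialisation scheme.
triplets = [1 2 4; 2 1 4; 1 3 1; 3 1 1; 2 3 12; 3 2 12];  % [target, context, count]
vocab_size  = 3;
vector_size = 5;
x_max = 10;  alpha = 0.75;        % values used in the question
eta = 0.05;  n_epochs = 200;      % assumed learning rate and epoch count

rng(0);                           % reproducible toy run
W  = (rand(vocab_size, vector_size) - 0.5) / vector_size;   % target vectors
Wc = (rand(vocab_size, vector_size) - 0.5) / vector_size;   % context vectors
b  = zeros(vocab_size, 1);        % target biases
bc = zeros(vocab_size, 1);        % context biases

% AdaGrad accumulators of squared gradients (start at 1 to avoid dividing by 0).
GW = ones(size(W));  GWc = ones(size(Wc));
Gb = ones(size(b));  Gbc = ones(size(bc));

weight = @(x) min((x / x_max) ^ alpha, 1);   % f(x): capped at 1 for x >= x_max

loss_history = zeros(n_epochs, 1);
for epoch = 1:n_epochs
    for k = randperm(size(triplets, 1))
        t = triplets(k, 1);  c = triplets(k, 2);  x = triplets(k, 3);

        % Residual of the model for this pair: w_t . w_c + b_t + b_c - log(X_tc)
        err  = W(t, :) * Wc(c, :)' + b(t) + bc(c) - log(x);
        ferr = weight(x) * err;
        % Per-pair loss 0.5 * f * err^2 (the 0.5 only rescales the objective).
        loss_history(epoch) = loss_history(epoch) + 0.5 * ferr * err;

        % Gradients of the per-pair loss with respect to the two vectors.
        gW  = ferr * Wc(c, :);
        gWc = ferr * W(t, :);

        % AdaGrad step: each coordinate is scaled by its accumulated gradient size.
        W(t, :)  = W(t, :)  - eta * gW  ./ sqrt(GW(t, :));
        Wc(c, :) = Wc(c, :) - eta * gWc ./ sqrt(GWc(c, :));
        b(t)  = b(t)  - eta * ferr / sqrt(Gb(t));
        bc(c) = bc(c) - eta * ferr / sqrt(Gbc(c));

        % Update the accumulators after the step.
        GW(t, :)  = GW(t, :)  + gW .^ 2;
        GWc(c, :) = GWc(c, :) + gWc .^ 2;
        Gb(t)  = Gb(t)  + ferr ^ 2;
        Gbc(c) = Gbc(c) + ferr ^ 2;
    end
end

embeddings = (W + Wc) / 2;        % average target and context vectors, as in the question
fprintf('loss: %.4f -> %.4f\n', loss_history(1), loss_history(end));
```

Unlike calling fminunc() once per co-occurrence record, this takes a single cheap gradient step per record and relies on repeated passes over the shuffled data, which is closer in spirit to stochastic training.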

Answers (1)

koosha salehi on 24 Oct 2020
Hi,
  • I am using the Stanford GloVe dataset and I want to design a deep network with an LSTM. I use wordEmbeddingLayer, but it doesn't work; I think the sequence input layer is causing the problem. Can anyone help me?
  • I also need a small labeled corpus and its corresponding vectors in GloVe format.
Has anyone done this before?
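For the first point, one arrangement that may be worth checking is sketched below. It assumes the network input is a sequence of word indices (one scalar per time step) rather than word vectors, so the sequence input layer has size 1 and the embedding layer does the lookup. It needs Deep Learning Toolbox and Text Analytics Toolbox; embedding_dim, num_words, num_hidden, and num_classes are placeholder values, not taken from the post.

```matlab
% Sketch only: placeholder sizes, to be replaced with values for the real data.
embedding_dim = 50;      % e.g. 50-dimensional GloVe vectors
num_words     = 10000;   % vocabulary size of the word index encoding
num_hidden    = 100;     % LSTM hidden units
num_classes   = 2;       % number of labels

layers = [
    sequenceInputLayer(1)                          % one word index per time step
    wordEmbeddingLayer(embedding_dim, num_words)   % maps each index to a vector
    lstmLayer(num_hidden, 'OutputMode', 'last')    % one output per sequence
    fullyConnectedLayer(num_classes)
    softmaxLayer
    classificationLayer];
```

If the sequences fed to the network are already GloVe vectors instead of indices, the embedding layer would not be needed and the sequence input layer size would be the vector dimension instead.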

Release

R2018b
