Iteratively search in a website (for dummies)

Hi all,
I have a list of thousands of chemical formulas (or potential formulas). What I'd like to do is loop over the list (for i=1:size(FormulaList,1)....end), insert each formula into the search bar of the website ( https://pubchem.ncbi.nlm.nih.gov/ ), and check whether I get possible matches or a message like "0 results found".
I've tried to apply the method described here ( https://it.mathworks.com/matlabcentral/answers/400522-retrieving-data-from-a-web-page ), but I could not work out how to get the "curl" (sorry: I'm completely ignorant about this!).
Cheers,
Luca
[SL: removed the parenthesis from the end of one of the hyperlinks]

Accepted Answer

Luca D'Angelo on 9 May 2024
I've found the solution.
% MassList: column vector of molecular formulas
ResNum = zeros(size(MassList,1),1); % preallocate the result counts
tic
for mass = 1:size(MassList,1)
    url = strcat('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/',MassList(mass,1),'/cids/JSON?list_return=cachekey');
    try
        jsonData = webread(url);
        ResNum(mass,1) = jsonData.IdentifierList.Size;
    catch
        % any webread error (including a "not found" response) is counted as 0 results
        ResNum(mass,1) = 0;
    end
    pause(0.205) % the website asks for max 5 requests / second
end
toc
The resulting column vector ResNum gives, for each molecular formula, the number of compounds with that formula found in PubChem.
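Since the original goal was to tell matched formulas from those with "0 results found", here is a minimal sketch of how ResNum could be turned into that yes/no answer. The formulas and counts below are made-up placeholders (not real PubChem numbers), and the names hasMatch, unmatched and summary are hypothetical, introduced only for illustration.

```matlab
% Placeholder inputs standing in for the real MassList and the ResNum
% produced by the loop above (the counts are invented, not PubChem data).
MassList = ["C9H8O4"; "C6H12O6"; "Xx9Yy9"];
ResNum   = [120; 300; 0];

% Logical flag: true where PubChem returned at least one compound.
hasMatch = ResNum > 0;

% Formulas that produced no match at all.
unmatched = MassList(~hasMatch);

% Collect everything in one table for inspection or export.
summary = table(MassList, ResNum, hasMatch, ...
    'VariableNames', {'Formula', 'NumCompounds', 'HasMatch'});
disp(summary)
```

Keep in mind that, with the try/catch above, a count of 0 can also mean the request itself failed (for example a timeout), so it may be worth re-running the unmatched formulas once before discarding them.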

More Answers (1)

Steven Lord on 3 May 2024
Your best bet is probably to use one of the access methods that PubChem provides, as described on this page. Note the usage policy. If you have thousands of requests it's likely going to take minutes or longer, or the bulk data downloads functionality linked in the usage policy may be a better fit for your needs.
From the MATLAB side of things, the functions in this documentation category likely will be of use to you as may be the functions on this documentation page. [Before you ask no, I don't have any examples specific to using those functions to access that database.]
  3 Comments
Steven Lord on 6 May 2024
You haven't shown us what values you're using for the maxAttempts and waitTime variables in your code.
Luca D'Angelo on 6 May 2024
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
maxAttempts = 10; % Maximum number of attempts
waitTime = 5; % Time to wait between attempts (in seconds)
attempt = 1;
while attempt <= maxAttempts
    jsonData = webread(apiUrl);
    if ~isfield(jsonData, 'Waiting') || isempty(jsonData.Waiting) %|| ~strcmpi(jsonData.Waiting, 'true')
        break; % Exit loop if request is not waiting anymore
    end
    attempt = attempt + 1;
    pause(waitTime);
end
% Check if the request is still processing after the loop
if isfield(jsonData, 'Waiting') && ~isempty(jsonData.Waiting) && strcmpi(jsonData.Waiting, 'true')
    disp('Your request is still processing. Please wait and try again later.');
    return;
end
if isfield(jsonData, 'Fault')
    disp(['Error: ', jsonData.Fault.Message]);
    return;
end
numResults = 0; % Initialize number of results
if isfield(jsonData, 'IdentifierList') && isfield(jsonData.IdentifierList, 'CID')
    numResults = numel(jsonData.IdentifierList.CID); % Number of search results
end
disp(['Number of results for molecular formula "', molecularFormula, '": ', num2str(numResults)]);
It doesn't really matter, actually. Most of the previous code was written by ChatGPT, but it's useless. The main lines are:
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
jsonData = webread(apiUrl);
If webread worked, I might be able to find the information I am looking for. The problem, I think, is that the function launches the search but doesn't wait for the website to 'load' the result, so it only shows 'Your request is still running'. Maybe I should find a way to launch the request, wait, and then check whether the results have 'loaded'. What do you think?
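One way the "launch, wait, then check" idea could look is sketched below. It assumes that the formula endpoint answers asynchronously with a Waiting structure containing a ListKey, and that the result can then be requested from a listkey URL of the form shown in the code; both points are my reading of the PubChem PUG REST documentation and should be checked there. The variable names (baseUrl, reply, listKey) and the timeout and wait values are arbitrary choices for illustration.

```matlab
% Sketch: start an asynchronous formula search, then poll for the result.
% Assumption: the first reply contains reply.Waiting.ListKey, and the result
% is fetched from <baseUrl>/listkey/<ListKey>/cids/JSON.
molecularFormula = 'C9H8O4'; % Example molecular formula
baseUrl = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound';
opt = weboptions('Timeout', 30, 'ContentType', 'json');

% Launch the search.
reply = webread(sprintf('%s/formula/%s/cids/JSON', baseUrl, molecularFormula), opt);

maxAttempts = 10; % Maximum number of polls
waitTime = 2;     % Seconds between polls
attempt = 1;
while isfield(reply, 'Waiting') && attempt <= maxAttempts
    pause(waitTime);
    % Ask for the stored result list instead of repeating the search.
    listKey = char(string(reply.Waiting.ListKey));
    reply = webread(sprintf('%s/listkey/%s/cids/JSON', baseUrl, listKey), opt);
    attempt = attempt + 1;
end

if isfield(reply, 'IdentifierList') && isfield(reply.IdentifierList, 'CID')
    numResults = numel(reply.IdentifierList.CID);
    fprintf('Number of results for "%s": %d\n', molecularFormula, numResults);
elseif isfield(reply, 'Waiting')
    disp('The request is still running; try again later.');
else
    disp('The reply contains no CID list.');
end
```

The difference from the loop posted above is that each poll queries the list key returned by the first request rather than calling the same search URL again, which would simply start a new search every time.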

Release

R2023a
