HTML Page source info

b on 26 Nov 2020
Commented: Rik on 3 Dec 2020
Hello, we often come across a series of numbered webpages
basePage.html?page=2
basePage.html?page=3
and so forth, wherein there are several fields identified by their labels:
<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
and so on.
How can the "textOfInterest" of one particular parameter, say Parameter2, for every Name* on all the pages,
basePage.html?page=1toInf
be exported into one text file, say Parameter2.txt?
The "textOfInterest" is often alphanumeric, with special characters such as !@#$% as well.
Thanks.
  6 Comments
b on 1 Dec 2020
Initially, I was hesitant to download this file because I thought it was religious or something similar. But I am happy to have downloaded it. It is immensely useful and 'on the money' for this thread.
My interest is in the function button_Callback in BibleDownloader.m. The webpage is saved in the variable called 'data'. Since finding <div class="pagination"> is right in the ballpark of my initial query, I was excited to see the output and experiment with the case 'NB2014' inside this function. Unfortunately, execution doesn't seem to reach this branch, since I was unable to retrieve either 'data' or the indices idx, idx2 and idx3, all of which would be useful to me. How can I reach this part of the code?
Also, perhaps you can suggest a one-line regexp to pull out 'textOfInterest' from
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
Better still, if you already have something like the BibleDownloader m-file that uses regexp to extract text between <div class> and </div>, that would be great.
Rik on 1 Dec 2020
Edited: Rik on 1 Dec 2020
The goal of Bible downloader is religious (although you can of course use the text of a Bible translation for non-religious purposes as well), but the code isn't.
Did you try adapting any of the code? I'll post some code as an answer.

Accepted Answer

Rik on 1 Dec 2020
One possibility with strfind:
%d holds the page source as a char vector (see the example d below)
close_div=strfind(d,'</div>');
param=1;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param);
position=strfind(d,pat);
position=position+numel(pat);%this will be the start of your text of interest
texts=cell(size(position));
for n=1:numel(position)
    %the first </div> at or after the start marks the end of the text
    %(>= instead of > so an empty field is handled as well)
    end_of_text=close_div(close_div>=position(n));
    end_of_text=end_of_text(1)-1;
    texts{n}=d(position(n):end_of_text);
end
Or with a regexp:
d=['<h2 class="category-heading">Name1</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name2</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name3</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
' : </label> <div class="category-related">',...
'(',... % use parentheses to capture a token
'[^<]*',... % this matches any number of characters other than <
')',...
'</div>'];
t=regexp(d,RE,'tokens');
clc
celldisp(t)
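The question also asks for the matches from every page to end up in one text file. A minimal sketch of that loop, under some assumptions: fetchPage is a hypothetical stand-in that returns canned page sources here (with real pages it could wrap webread and your actual URL), and a page without any match is taken to mean the numbering has run out.

```matlab
% Canned page sources standing in for the real site (assumption).
% With real pages, fetchPage could be something like
%   @(k) webread(sprintf('https://example.com/basePage.html?page=%d',k))
% where the URL is hypothetical.
pages = { ...
    ['<h2 class="category-heading">Name1</h2>' ...
     '<label>Parameter2 : </label> <div class="category-related">a!@#</div>'], ...
    ['<h2 class="category-heading">Name2</h2>' ...
     '<label>Parameter2 : </label> <div class="category-related">b$%</div>'], ...
    ''};                                  % a page without matches ends the loop
fetchPage = @(k) pages{k};

param = 2;
RE = sprintf(['<label>Parameter%d : </label> ' ...
              '<div class="category-related">([^<]*)</div>'], param);

allTexts = {};
k = 1;
while true
    d = fetchPage(k);
    t = regexp(d, RE, 'tokens');          % 1xN cell, each entry a 1x1 cell
    if isempty(t), break, end             % assumed stop condition
    t = [t{:}];                           % flatten to a 1xN cell of char vectors
    allTexts = [allTexts, t];             %#ok<AGROW>
    k = k + 1;
end

% One line per match in Parameter2.txt
outFile = sprintf('Parameter%d.txt', param);
fid = fopen(outFile, 'w');
fprintf(fid, '%s\n', allTexts{:});
fclose(fid);
```

The texts are passed to fprintf as arguments rather than as the format string, so special characters such as % are written out unchanged.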
You can also adapt the expression with a lookahead, (?=</div>), so you can use a lazy .*? instead of [^<]*, which helps if the text of interest itself contains a < character.
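A minimal sketch of that lookahead variant, restricted to Parameter2. The sample d is made up for illustration. Note the lazy quantifier: a greedy .* would run on to the last </div> in the page.

```matlab
% Made-up sample source; the second field contains special characters
d = ['<label>Parameter1 : </label> <div class="category-related">one</div>' ...
     '<label>Parameter2 : </label> <div class="category-related">two !@#$%</div>' ...
     '<label>Parameter2 : </label> <div class="category-related">three</div>'];

RE = ['<label>Parameter2 : </label> <div class="category-related">', ...
      '(.*?)', ...       % lazy: capture as little as possible
      '(?=</div>)'];     % lookahead: stop right before the closing tag

t = regexp(d, RE, 'tokens');
t = [t{:}];              % flatten the nested cells to a 1xN cell of char vectors
celldisp(t)
```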
  8 Comments
b on 2 Dec 2020
Thanks for the link.
Downloaded readfile from GitHub. The 'elements' output seems promising, except: what are those ->->-> arrows in front of all the fields of interest? Anyway, glad that it has brought me to this point.
But the same problem occurs with all three approaches: when the mail field is missing, how do I write 'NULL' in the output file and continue with the loop?
Name1 mail1
Name2 missing
Name3 mail3
Name4 mail4
The strfind and regexp approaches give
Name{1}='Name1'
Name{2}='Name2'
Name{3}='Name3'
Name{4}='Name4'
and
Parameter{1}='mail1'
Parameter{2}='mail3'
Parameter{3}='mail4'
How can I bypass the 'for loop' and, at the same time, print 'NULL' in the corresponding Excel row-column entry? In this example, (row=2,col=2) should be 'NULL', and (row=3,col=2) should be Parameter{2}.
It is not a question of 'skipping if not found', because numel(position) has already been evaluated: it is 4 here for the Name field and 3 for the Parameter. So it seems to be hardcoded.
Rik on 2 Dec 2020
Those arrows are probably newline characters. What release are you using?
I would suggest parsing each element separately. That way you can write an empty char or whatever you prefer in the email field for that person.
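A minimal sketch of parsing each element separately. The label text 'Mail' and the sample d are assumptions made for illustration; the idea is to split the source at every heading, so each block holds one name together with whatever fields belong to it.

```matlab
% Made-up sample: Name2 has no mail field
d = ['<h2 class="category-heading">Name1</h2>' ...
     '<label>Mail : </label> <div class="category-related">mail1</div>' ...
     '<h2 class="category-heading">Name2</h2>' ...
     '<h2 class="category-heading">Name3</h2>' ...
     '<label>Mail : </label> <div class="category-related">mail3</div>'];

% Split at every heading; each block then starts with 'NameX</h2>'
blocks = regexp(d, '<h2 class="category-heading">', 'split');
blocks(1) = [];                        % drop whatever precedes the first heading

result = cell(numel(blocks), 2);       % column 1: name, column 2: mail
for n = 1:numel(blocks)
    nameTok = regexp(blocks{n}, '^([^<]*)</h2>', 'tokens', 'once');
    mailTok = regexp(blocks{n}, ...
        '<label>Mail : </label> <div class="category-related">([^<]*)</div>', ...
        'tokens', 'once');
    result{n,1} = nameTok{1};
    if isempty(mailTok)
        result{n,2} = 'NULL';          % placeholder for a missing field
    else
        result{n,2} = mailTok{1};
    end
end
disp(result)
```

Because the name and the mail are looked up inside the same block, the rows stay aligned even when a field is absent. On R2019a or later, writecell(result,'output.xlsx') (file name hypothetical) should write the table to a spreadsheet.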

More Answers (1)

b on 3 Dec 2020
That is exactly how I am doing it. When the fields are parsed separately, there is no way to tell which Name field has its corresponding Mail field missing. The code parses all the Name fields, then all the Mail fields, as a sequential process.
What modification should be made to the code so that it prints 'Not Found' when the mail field is missing in the corresponding iteration? Is there a way to get the index values of the missing Mail fields?
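One way to get those indices is to compare positions: a mail field belongs to a name if it lies between that heading and the next one. This is only a sketch, assuming a label text of 'Mail', at most one mail field per name, and the made-up sample d below.

```matlab
% Made-up sample matching the Name1..Name4 example: Name2 has no mail field
headPat = '<h2 class="category-heading">';
mailPat = '<label>Mail : </label> <div class="category-related">';
d = [headPat 'Name1</h2>' mailPat 'mail1</div>' ...
     headPat 'Name2</h2>' ...
     headPat 'Name3</h2>' mailPat 'mail3</div>' ...
     headPat 'Name4</h2>' mailPat 'mail4</div>'];

namePos = strfind(d, headPat);
mailPos = strfind(d, mailPat);

% Each name owns the stretch from its heading up to the next heading
edges = [namePos, numel(d)+1];
hasMail = false(size(namePos));
for n = 1:numel(namePos)
    hasMail(n) = any(mailPos > edges(n) & mailPos < edges(n+1));
end
missingIdx = find(~hasMail);           % indices of names without a mail field

% Extract the mails that do exist (mailPat has no regexp metacharacters,
% so it can be used as a pattern directly) and put them in the right rows
mails = regexp(d, [mailPat '([^<]*)</div>'], 'tokens');
mails = [mails{:}];
column = repmat({'Not Found'}, size(namePos));
column(hasMail) = mails;
disp(missingIdx)
disp(column)
```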
  3 Comments
b on 3 Dec 2020
I am overwhelmed by how patiently you have worked with me on this thread. I think I will close this elaborate thread here, but not before posting this limerick:
There was once a man named Rik,
Who wrote matlab codes so quick,
To the topic, they were relevant
The codes themselves so elegant,
His m-files, sir, were completely sick!
Enjoy your freedom from this thread.
Rik on 3 Dec 2020
You're welcome (and thanks for the limerick XD).
If you have a follow-up question, feel free to post a link to it here.
