HTML Page source info

Question

0 votes

Hello, we often come across a series of numbered web pages

basePage.html?page=2
basePage.html?page=3

and so forth, wherein there are several fields identified by their labels:

<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>

and so on.

How can the "textOfInterest" of one particular parameter, say Parameter2, for every Name* on every page,

basePage.html?page=1toInf

be exported into one text file, say Parameter2.txt?

The "textOfInterest" is often alphanumeric and may also contain special characters such as !@#$%.

Thanks.

6 Comments

b on 1 Dec 2020

Initially, I was hesitant to download this file because I thought it was religious or some such thing. But I am happy to have downloaded it. It is immensely useful and 'on the money' for this thread.

My interest is in the function button_Callback in BibleDownloader.m. The webpage is saved in the variable called 'data'. Since finding <div class="pagination"> is right in the ballpark of my initial query, I was excited to see the output and experiment with the case 'NB2014' inside this function. Unfortunately, the code doesn't seem to reach this branch, since I was unable to retrieve either 'data' or the indices idx*. All of these indices (idx, idx2 and idx3) will be useful for me. How can I get to this part of the code and access them?

Also, perhaps you can suggest one regexp line to pull out 'textOfInterest' from

<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>

and better still, if you already have something like the BibleDownloader m-file, with regexp used on extracting text between <div class> and </div> type of structure, that will be great.

Rik on 1 Dec 2020

Edited: Rik on 1 Dec 2020

The goal of Bible downloader is religious (although you can of course use the text of a Bible translation for non-religious purposes as well), but the code isn't.

Did you try adapting any of the code? I'll post some code as an answer.

Answer 1

Rik on 1 Dec 2020

0 votes

One possibility with strfind (d is the page source as a char vector, such as the sample defined in the regexp example below):

close_div=strfind(d,'</div>');%start index of every closing div tag
param=1;
pat=sprintf('<label>Parameter%d : </label> <div class="category-related">',param);
position=strfind(d,pat);
position=position+numel(pat);%this will be the start of your text of interest
texts=cell(size(position));
for n=1:numel(position)
    %first closing tag at or after the start (>= also handles an empty div)
    end_of_text=close_div(close_div>=position(n));
    end_of_text=end_of_text(1)-1;
    texts{n}=d(position(n):end_of_text);
end

Or with a regexp:

d=['<h2 class="category-heading">Name1</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name2</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'...
'<h2 class="category-heading">Name3</h2>'...
'<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>'...
'<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>'];
RE=['<label>Parameter\d',... % \d matches a single digit
    ' : </label> <div class="category-related">',...
    '(',... % use parentheses to capture a token
    '[^<]*',... % this matches any number of characters other than <
    ')',...
    '</div>'];
t=regexp(d,RE,'tokens');
clc
celldisp(t)

You can also adapt the expression to use a lookahead for </div>. In that case use a lazy .*? instead of [^<]*, because a greedy .* would run on to the last </div> in the text.
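A minimal sketch of the lookahead variant, narrowed to Parameter2 as in the original question. The sample string d, its contents and the variable names param and texts are made up for illustration; they are not taken from a real page.

```matlab
% Sample page source (illustrative only), stored as a char vector
d=['<h2 class="category-heading">Name1</h2>'...
   '<label>Parameter1 : </label> <div class="category-related">abc!@#</div>'...
   '<label>Parameter2 : </label> <div class="category-related">def$%1</div>'...
   '<h2 class="category-heading">Name2</h2>'...
   '<label>Parameter2 : </label> <div class="category-related">ghi 2</div>'];
param=2; % which parameter to extract
RE=[sprintf('<label>Parameter%d : </label> ',param),...
    '<div class="category-related">',...
    '(.*?)',...     % lazy token: take as few characters as possible
    '(?=</div>)'];  % lookahead: stop right before the closing tag
t=regexp(d,RE,'tokens');
% regexp returns one nested cell per match; unwrap to a cell of char vectors
texts=cellfun(@(c) c{1},t,'UniformOutput',false);
disp(texts)
```

Because the token is lazy, special characters such as !@#$% inside the div are captured as-is; only the first following </div> ends the match.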

8 Comments

b on 1 Dec 2020

Thank you.

But I have run into a problem with the following part:

I am trying to extract two parameters simultaneously: Parameter1 and Parameter2. It often happens that Parameter1 is present but Parameter2 is missing. That is, the structure is like this:

<h2 class="category-heading">Name1</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name2</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter3 : </label> <div class="category-related">textOfInterest</div>
<h2 class="category-heading">Name3</h2>
<label>Parameter1 : </label> <div class="category-related">textOfInterest</div>
<label>Parameter2 : </label> <div class="category-related">textOfInterest</div>

The same problem occurs if I try to extract all three parameters.

When all three parameters are extracted, the objective is to get ' ' (no value) wherever a parameter is missing rather than skipping it completely, because skipping it would misalign the results. That way the corresponding entry in the exported text file is simply blank.

In the first (strfind) code, I tried to replicate the 'for loop' three times for the three parameters, but quickly ran into problems.

b on 2 Dec 2020

Thanks for the link.

I downloaded readfile from GitHub. The 'elements' output seems promising, except for one thing: what are those ->->-> arrows in front of all the fields of interest?! Anyway, I am glad that it has brought me to this point.

But the situation is the same with all three approaches: when the mail field is missing, how do I write 'NULL' in the output file and continue with the loop?

Name1    mail1
Name2    missing
Name3    mail3
Name4    mail4

The strfind and regexp approaches give

Name{1}='Name1'
Name{2}='Name2'
Name{3}='Name3'
Name{4}='Name4'

and

Parameter{1}='mail1'
Parameter{2}='mail3'
Parameter{3}='mail4'

How can I bypass the 'for loop' and at the same time print 'NULL' in the corresponding Excel row-column entry? In this example, (row=2,col=2) will be 'NULL', and (row=3,col=2) will be Parameter{2}.

It is not a question of 'skipping if not found', because numel(position) has already been evaluated: it is 4 here for the Name field and 3 for the Parameter. So it seems to be hardcoded.

Rik on 2 Dec 2020

Those arrows are probably newline characters. What release are you using?

I would suggest parsing each element separately. That way you can write an empty char or whatever you prefer in the email field for that person.
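A minimal sketch of what parsing each element separately could look like, assuming every entry starts with the <h2 class="category-heading"> tag shown in the question. The sample string and the names blocks, names, values and is_missing are only illustrative.

```matlab
% Sample source in which Name2 has no Parameter2 (illustrative only)
d=['<h2 class="category-heading">Name1</h2>'...
   '<label>Parameter1 : </label> <div class="category-related">a1</div>'...
   '<label>Parameter2 : </label> <div class="category-related">mail1</div>'...
   '<h2 class="category-heading">Name2</h2>'...
   '<label>Parameter1 : </label> <div class="category-related">a2</div>'...
   '<h2 class="category-heading">Name3</h2>'...
   '<label>Parameter2 : </label> <div class="category-related">mail3</div>'];
heading='<h2 class="category-heading">';
% Split the page into one block per heading; the piece before the first
% heading contains no entry and is discarded.
blocks=strsplit(d,heading);
blocks(1)=[];
RE_name='^([^<]*)</h2>'; % the name is at the very start of each block
RE_par='<label>Parameter2 : </label> <div class="category-related">([^<]*)</div>';
names=cell(numel(blocks),1);
values=cell(numel(blocks),1);
is_missing=false(numel(blocks),1);
for n=1:numel(blocks)
    tok=regexp(blocks{n},RE_name,'tokens','once');
    names{n}=tok{1};
    tok=regexp(blocks{n},RE_par,'tokens','once');
    if isempty(tok)
        values{n}='NULL'; % placeholder keeps the rows aligned
        is_missing(n)=true;
    else
        values{n}=tok{1};
    end
end
for n=1:numel(names)
    fprintf('%s\t%s\n',names{n},values{n});
end
```

Since the name and the parameter are read from the same block, the two lists always have the same length, and find(is_missing) gives the rows without a value.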

Answer 2

b on 3 Dec 2020

0 votes

That is exactly how I am doing it. By parsing it separately, there is no way to correlate which Name-field has the corresponding Mail-field missing. It parses all the Name-fields, then it parses all the mail-fields, as a sequential process.

What modification should be made to the code so that it prints 'Not Found' when the mail field is missing in the corresponding iteration? Is there a way to get the index values of the missing mail fields?
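One way to recover those indices from sequentially parsed results is to compare start positions: a mail field belongs to the last heading that precedes it. The sketch below assumes at most one Parameter2 per name; the sample string and the names has_par, missing_idx and aligned are hypothetical.

```matlab
% Sample source mirroring the example above: Name2 has no mail field
d=['<h2 class="category-heading">Name1</h2>'...
   '<label>Parameter2 : </label> <div class="category-related">mail1</div>'...
   '<h2 class="category-heading">Name2</h2>'...
   '<h2 class="category-heading">Name3</h2>'...
   '<label>Parameter2 : </label> <div class="category-related">mail3</div>'...
   '<h2 class="category-heading">Name4</h2>'...
   '<label>Parameter2 : </label> <div class="category-related">mail4</div>'];
pat='<label>Parameter2 : </label> <div class="category-related">';
% Sequential parse: all values plus the position where each match starts
[tok,par_start]=regexp(d,[pat '([^<]*)</div>'],'tokens','start');
Parameter=cellfun(@(c) c{1},tok,'UniformOutput',false);
name_start=strfind(d,'<h2 class="category-heading">');
% Each heading owns the text up to the next heading (or the end of d)
block_end=[name_start(2:end)-1 numel(d)];
has_par=false(size(name_start));
for n=1:numel(name_start)
    has_par(n)=any(par_start>name_start(n) & par_start<=block_end(n));
end
missing_idx=find(~has_par); % rows without a mail field
% Spread the parsed values over the rows that actually have one
aligned=repmat({'Not Found'},size(name_start));
aligned(has_par)=Parameter;
disp(missing_idx)
disp(aligned)
```

The aligned cell has one entry per name, so it can be written next to the Name column without the rows shifting.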

3 Comments

b on 3 Dec 2020

I am overwhelmed by the way you have patiently worked with me on this thread. I think I will close this elaborate thread here only, but not before posting this limerick:

There was once a man named Rik, 
Who wrote matlab codes so quick, 
To the topic, they were relevant
The codes themselves so elegant, 
His m-files, sir, were completely sick!

Enjoy your freedom from this thread.

Rik on 3 Dec 2020

You're welcome (and thanks for the limerick XD).

If you have a follow-up question, feel free to post a link to it here.
