Selecting parts of HTML file

Question

v k on 23 Nov 2020


Direct link to this question

https://nl.mathworks.com/matlabcentral/answers/658198-selecting-parts-of-html-file

Commented: v k on 26 Nov 2020

3 letter Hindi words without matra – Part 4 – Kathakar.txt

Hello,

I have a series of text files in a directory, serially numbered Part 1, Part 2, Part 3, and so on. These are actually HTML files, but I can also save them as text files. One such file, Part 4, is attached. (The '3' at the beginning of the title is part of the title, not the serial number; it occurs in every file name in the directory.) The structure of all these files is exactly the same, and the region of interest always runs from line 26 to line 40.

I wish to save all the vernacular text from lines 26 to 40 in one text file, and the words in brackets immediately following each vernacular word in another text file. The vernacular text always comes after the serial number, a full stop, a space, and an asterisk. The words that follow it are always enclosed in round brackets, separated from the vernacular text by a space.

How can I extract these into two separate text files for all of the HTML files in the directory at once?

Thanks.


Answer 1

Rik on 23 Nov 2020


Direct link to this answer

https://nl.mathworks.com/matlabcentral/answers/658198-selecting-parts-of-html-file#answer_553028


First read the HTML files. You can get my readfile function from the File Exchange (FEX); if you are using R2017a or later, you can also get it through the Add-On Manager. Alternatively, on R2020b or later you can use readlines:

data=readfile('https://www.mathworks.com/matlabcentral/answers/uploaded_files/424138/3%20letter%20Hindi%20words%20without%20matra%20%E2%80%93%20Part%204%20%E2%80%93%20Kathakar.txt');
lines_of_interest=data(26:40);
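To cover every file in a directory, the same read can be wrapped in a loop over dir. The following is only a sketch under stated assumptions: the temporary folder, the demo file contents, and the '*Part*.txt' pattern are hypothetical stand-ins for the real files, and the files are read with fopen/fread as UTF-8 rather than with readfile so that the example needs no download.

```matlab
% Demo setup (hypothetical): two small UTF-8 files in a temporary folder.
folder = tempname;
mkdir(folder);
for p = 1:2
    fid = fopen(fullfile(folder, sprintf('Part %d.txt', p)), 'w', 'n', 'UTF-8');
    for r = 1:45
        fprintf(fid, 'file %d line %d\n', p, r);
    end
    fclose(fid);
end

% Collect lines 26 to 40 from every matching file in the folder.
files = dir(fullfile(folder, '*Part*.txt'));
lines_per_file = cell(numel(files), 1);
for k = 1:numel(files)
    fid = fopen(fullfile(folder, files(k).name), 'r', 'n', 'UTF-8');
    txt = fread(fid, [1 Inf], '*char');              % whole file as one char row
    fclose(fid);
    all_lines = regexp(txt, '\r\n|\n|\r', 'split');  % split on any line ending
    lines_per_file{k} = all_lines(26:40);            % assumes at least 40 lines
end
```

Each element of lines_per_file is then a cell array that plays the role of lines_of_interest for one file. Note that dir typically returns names in plain character order, so 'Part 10' may come before 'Part 2'.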

What you need to do next is parse those specific lines. You already have the patterns you're looking for. There is an optimal way, with a regular expression, and an easy way, with several calls to strfind. If you have trouble implementing either, don't hesitate to post a comment with what you tried.

7 Comments

Rik on 23 Nov 2020

Edited: Rik on 23 Nov 2020


You're welcome. It is one of the functions I'm most proud of, although I'm of course very biased, especially given the time it took to write. (It even works better than the readlines function introduced in R2020b, in that it doesn't fail for emoji outside the Basic Multilingual Plane, although readlines does support more encodings than just ASCII and UTF-8.)

For the second part you need to take it step by step:

hindi=cell(size(lines_of_interest));
english=cell(size(lines_of_interest));
for n=1:numel(lines_of_interest)
    current_line=lines_of_interest{n};
    
    % now you can use strfind to find the starts of the patterns you describe
    pat_start_hindi='. *';
    pat_hindi_english=' (';
    pat_english_stop=')*';
    
    %what do you need to do with these indices to determine the start and stop of the two parts?
    ind1=strfind(current_line,pat_start_hindi);
    ind2=strfind(current_line,pat_hindi_english);
    ind3=strfind(current_line,pat_english_stop);
    
end

Try something and show what you tried.

First put the text in separate variables, then you can worry about writing it to a text file (for which you can find plenty of examples on Google).
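One possible way to finish the strfind approach and then write the results is sketched below. It assumes each line looks like the hypothetical samples shown (serial number, full stop, space, asterisk, vernacular word, space, bracketed word, asterisk) and that each pattern occurs at least once per line; the sample words and output file names are placeholders, not taken from the attached file.

```matlab
% Hypothetical sample lines in the layout described in the question.
lines_of_interest = {'1. *कमल (kamal)*', '2. *नमक (namak)*'};

hindi   = cell(size(lines_of_interest));
english = cell(size(lines_of_interest));
for n = 1:numel(lines_of_interest)
    current_line = lines_of_interest{n};
    ind1 = strfind(current_line, '. *');   % pattern just before the Hindi word
    ind2 = strfind(current_line, ' (');    % pattern between Hindi and English
    ind3 = strfind(current_line, ')*');    % pattern just after the English word
    % Take the first hit of each pattern and step past the pattern itself.
    hindi{n}   = current_line(ind1(1)+3 : ind2(1)-1);
    english{n} = current_line(ind2(1)+2 : ind3(1)-1);
end

% Write each list to its own UTF-8 text file, one word per line.
out_dir = tempname;
mkdir(out_dir);
fid = fopen(fullfile(out_dir, 'hindi.txt'), 'w', 'n', 'UTF-8');
fprintf(fid, '%s\n', hindi{:});
fclose(fid);
fid = fopen(fullfile(out_dir, 'english.txt'), 'w', 'n', 'UTF-8');
fprintf(fid, '%s\n', english{:});
fclose(fid);
```

Opening the output files with an explicit 'UTF-8' encoding is a deliberate choice here, so that the Devanagari text does not depend on the system default encoding.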

Rik on 24 Nov 2020


This is the general structure:

RE=[...
    '  &          ',...%start with an ampersand symbol
    '  (          ',...%capture the first token
    '      [^;]*  ',...%match anything that isn't a semicolon
    '  )          ',...%
    '  .*         ',...%match any run of characters (greedy)
    '  \*         ',...%match a literal *
    '  (          ',...%capture the second token
    '      .*     ',...%match any run of characters (greedy)
    '  )          ',...%
    '  %          ',...%end with a percent symbol
    ''];
RE=RE(~isspace(RE));%remove spaces (match actual spaces with \s)
str='do not match, but &match this; and *this%';
t=regexp(str,RE,'tokens');
celldisp(t)
 
t{1}{1} =
 
match this
 
 
t{1}{2} =
 
this
 

Regular expressions are horrible to document, so this is about the best I can do. You should be able to adapt it to your situation easily. I use this style of writing and documenting regular expressions when the actual expression is unreadable. In this case, this is the actual expression:

RE='&([^;]*).*\*(.*)%';
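As an illustration of adapting this style, here is a sketch of an expression for the line layout described in the question. The sample lines are hypothetical, and the expression assumes the vernacular word contains no opening bracket and the bracketed word contains no closing bracket. Note that the example expression above only matches text containing a literal & and %, so those anchors are replaced here.

```matlab
% Hypothetical sample lines in the layout described in the question.
lines_of_interest = {'1. *कमल (kamal)*', '2. *नमक (namak)*'};

RE = [...
    '  \.\s\*     ',...% full stop, one whitespace character, literal *
    '  (          ',...% capture the first token (vernacular word)
    '      [^(]*? ',...% as few non-( characters as possible
    '  )          ',...%
    '  \s\(       ',...% one whitespace character, literal (
    '  (          ',...% capture the second token (bracketed word)
    '      [^)]*  ',...% anything that isn't a closing bracket
    '  )          ',...%
    '  \)         ',...% literal )
    ''];
RE = RE(~isspace(RE)); % remove layout spaces; real spaces are matched with \s

% With a cell array input, regexp returns one result per line.
tok = regexp(lines_of_interest, RE, 'tokens', 'once');
tok = vertcat(tok{:});   % N-by-2 cell; assumes every line matched
hindi   = tok(:, 1);
english = tok(:, 2);
```

If a line does not match, its entry in the regexp output is empty and the rows no longer line up with the input, so checking for empty results before the vertcat call would be a sensible addition for real data.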

v k on 26 Nov 2020

I think this is a good, succinct example. However, I am not able to get 't' (it is always empty), however much I tweak the regular expression.

Anyway, written as above, I think this offers a way to resolve another query. Let me pose it as a separate question:

https://in.mathworks.com/matlabcentral/answers/662608-converting-strings-to-operators

Maybe, this type of RE structure can be helpful there.
