Selecting parts of HTML file

v k on 23 Nov 2020
Commented: v k on 26 Nov 2020
Hello,
I have a series of text files in a directory, serially numbered Part 1, Part 2, Part 3, and so on. These are actually HTML files, but I can also save them as text files. One such file, Part 4, is attached. (There is a '3' at the beginning of the title, but it is part of the title, not the serial number; '3' occurs in every file in the directory.) The structure of all these files is exactly the same. The region of interest always runs from line 26 to line 40.
I wish to save all the vernacular text from lines 26 to 40 in one text file, and the words in brackets immediately following each vernacular word in another text file. The vernacular text always comes after the serial number, a full stop, a space, and an asterisk. The words that follow it are always enclosed in opening and closing brackets, separated from the vernacular text by a space.
How can I write these to two separate text files for all of the HTML files in the directory at once?
Thanks.

Accepted Answer

Rik on 23 Nov 2020
First read the HTML files. You can get my readfile function from the FEX; if you are using R2017a or later, you can also get it through the Add-On Manager. Alternatively, on R2020b you can use readlines:
data=readfile('https://www.mathworks.com/matlabcentral/answers/uploaded_files/424138/3%20letter%20Hindi%20words%20without%20matra%20%E2%80%93%20Part%204%20%E2%80%93%20Kathakar.txt');
lines_of_interest=data(26:40);
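To cover every file in the directory rather than a single URL, the same 26:40 selection can be put in a loop. This is only a minimal sketch, not part of the original answer: the temporary demo folder, the '*Part*.txt' wildcard, and the variable names are assumptions, and fileread stands in for readfile or readlines.

```matlab
% Demo setup (hypothetical): a temporary folder holding one 40-line text file.
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder, 'demo Part 1.txt'), 'w');
fprintf(fid, 'line %d\n', 1:40);
fclose(fid);

% List the files; the wildcard is an assumption about the file names.
files = dir(fullfile(folder, '*Part*.txt'));
all_lines = cell(numel(files), 1);
for k = 1:numel(files)
    txt = fileread(fullfile(files(k).folder, files(k).name));
    file_lines = regexp(txt, '\r\n|\r|\n', 'split');   % split on any line ending
    lastLine = min(40, numel(file_lines));             % guard against short files
    all_lines{k} = file_lines(26:lastLine);            % region of interest
end
```

For the real data, drop the demo setup and point folder at the directory that holds the Part files. Depending on the MATLAB release, fileread may not detect the file encoding, so it is worth checking that the vernacular characters survive the read.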
What you need to do next is parse those specific lines. You already have the patterns you're looking for. There is an optimal way with a regular expression, and an easy way with several calls to strfind. If you have trouble implementing that, don't hesitate to post a comment with what you tried.
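The strfind route could look roughly like the sketch below for a single line. The sample line is a hypothetical placeholder in the format described in the question (serial number, full stop, space, asterisk, text, space, bracketed words), and it assumes the brackets are round parentheses and that the vernacular text itself contains no '('. The real lines in the attached file may need adjustments.

```matlab
txt = '3. *abc (first gloss)';   % hypothetical sample line, ASCII placeholder text

iStar  = strfind(txt, '*');      % positions of every asterisk
iOpen  = strfind(txt, '(');      % positions of every opening bracket
iClose = strfind(txt, ')');      % positions of every closing bracket

vernacular = '';
bracketed  = '';
% Only extract when all three markers exist and appear in the expected order.
if ~isempty(iStar) && ~isempty(iOpen) && ~isempty(iClose) ...
        && iStar(1) < iOpen(1) && iOpen(1) < iClose(end)
    vernacular = strtrim(txt(iStar(1)+1 : iOpen(1)-1));   % between '*' and '('
    bracketed  = txt(iOpen(1)+1 : iClose(end)-1);         % between '(' and ')'
end
```

Lines that do not contain all three markers in order simply leave both outputs empty, which makes it easy to skip them in a loop.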
  7 Comments
Rik on 24 Nov 2020
This is the general structure:
RE=[...
' & ',...%start with an ampersand symbol
' ( ',...%capture the first token
' [^;]* ',...%match anything that isn't a semicolon
' ) ',...%
' .* ',...%match any character
' \* ',...%match a literal *
' ( ',...%capture the second token
' .* ',...%match any character
' ) ',...%
' % ',...%end with a percent symbol
''];
RE=RE(~isspace(RE));%remove spaces (match actual spaces with \s)
str='do not match, but &match this; and *this%';
t=regexp(str,RE,'tokens');
celldisp(t)
t{1}{1} = match this
t{1}{2} = this
Regular expressions are horrible to document, so this is about the best I can do. You should be able to easily adapt it to your situation. I use this style of writing and documenting regular expressions when the actual expression is unreadable. In this case, this is the actual expression:
RE='&([^;]*).*\*(.*)%';
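One possible adaptation to the line format described in the question is sketched here. It is an illustration under assumptions, not a tested solution for the attached file: the sample lines are hypothetical ASCII placeholders, the brackets are assumed to be round parentheses, and the output file names and the use of tempdir are made up for the example.

```matlab
% Hypothetical sample lines; the third one deliberately does not match.
lines_of_interest = {'1. *abc (first gloss)'; '2. *def (second gloss)'; 'no match here'};

% number, full stop, space, asterisk, (text), space, bracketed (words)
pattern = '^\s*\d+\.\s\*\s*([^(]*?)\s*\(([^)]*)\)';
tok = regexp(lines_of_interest, pattern, 'tokens', 'once');
tok = tok(~cellfun('isempty', tok));                 % drop lines without a match

vernacular = cellfun(@(c) c{1}, tok, 'UniformOutput', false);
bracketed  = cellfun(@(c) c{2}, tok, 'UniformOutput', false);

% Write each list to its own UTF-8 text file (file names are assumptions).
vernacular_file = fullfile(tempdir, 'vernacular.txt');
bracketed_file  = fullfile(tempdir, 'bracketed.txt');

fid = fopen(vernacular_file, 'w', 'n', 'UTF-8');
fprintf(fid, '%s\n', vernacular{:});
fclose(fid);

fid = fopen(bracketed_file, 'w', 'n', 'UTF-8');
fprintf(fid, '%s\n', bracketed{:});
fclose(fid);
```

If the lines still contain HTML tags or character entities, the pattern would need extra pieces to skip them, which may be one reason a pattern that looks right returns no tokens.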
v k on 26 Nov 2020
I think this is a good, succinct example. However, I am not able to get 't' (only null), however much I tweak the expression RE.
Anyway, written as above, I think there is a way to resolve another query. Let me pose it as a separate question:
Maybe this type of RE structure can be helpful there.
