How to ignore special characters and retrieve the data prior to the character
I have 40 years of data. Unfortunately, each text file contains the special characters # or *, which mark the highest or lowest temperature for that specific day and month. My code works (apart from the regexp(minT_tbl,'#*','match') call and its counterpart). However, the special characters confuse the program and make the data wrong. Any help would be great!
close all;
clear all;
clc;
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
dataAll.Year = year(dataAll.Day);
dataAll.Month = month(dataAll.Day);
dataAll.DD = day(dataAll.Day)
%delete leap year
LY = (dataAll.Month(:)==2 & dataAll.DD(:)==29);
dataAll(LY,:) = [];
% Unstack variables
minT_tbl = unstack(dataAll,"MinT","Year","GroupingVariables", ["Month","DD"],"VariableNamingRule","preserve")
maxT_tbl = unstack(dataAll,"MaxT","Year","GroupingVariables", ["Month","DD"],"VariableNamingRule","preserve")
yrs =str2double(minT_tbl.Properties.VariableNames(3:end))';
%ignore special characters
regexp(minT_tbl,'#*','match')
regexp(maxT_tbl,'#*','match')
% find min
[Tmin,idxMn] = min(minT_tbl{:,3:end},[],2,'omitnan');
Tmin_yr = yrs(idxMn);
% find max
[Tmax,idxMx] = max(maxT_tbl{:,3:end},[],2,'omitnan');
Tmax_yr = yrs(idxMx);
% find low high
[lowTMax,idxMx] = min(maxT_tbl{:,3:end},[],2,'omitnan');
LowTMax_yr = yrs(idxMx);
% find high low
[highlowTMn,idxMn] = max(minT_tbl{:,3:end},[],2,'omitnan');
HighLowT_yr = yrs(idxMn);
% find avg high
AvgTMx = round(mean(table2array(maxT_tbl(:,3:end)),2,'omitnan'));
% find avg low
AvgTMn = round(mean(table2array(minT_tbl(:,3:end)),2,'omitnan'));
% Results
tempTbl = [maxT_tbl(:,["Month","DD"]), table(Tmax,Tmax_yr,AvgTMx,lowTMax,LowTMax_yr,Tmin,Tmin_yr,AvgTMn,highlowTMn,HighLowT_yr)]
tempTbl2 = splitvars(tempTbl)
FID = fopen('Meda 05 Temperature Climatology.txt','w');
report_date = datetime('now','Format','yyyy-MM-dd HH:mm'); % 'mm' is minutes; 'MM' is month
fprintf(FID,'Meda 05 Temperature Climatology at %s \n', report_date);
fprintf(FID,"Month DD Temp Max (°F) Tmax_yr AvgTMax (°F) lowTMax (°F) LowTMax_yr TempMin (°F) TMin_yr AvgTMin (°F) HighlowTMin (°F) HighlowT_yr \n");
fprintf(FID,'%3d %6d %7d %14d %11d %11d %15d %11d %13d %10d %13d %17d \n', tempTbl2{:,1:end}');
fclose(FID);
winopen('Meda 05 Temperature Climatology.txt')
function Tbl = readMonth(filename)
opts = detectImportOptions(filename)
opts.ConsecutiveDelimitersRule = 'join';
opts.MissingRule = 'omitvar';
opts = setvartype(opts,'double');
opts.VariableNames = ["Day","MaxT","MinT","AvgT"];
Tbl = readtable(filename,opts);
Tbl = standardizeMissing(Tbl,{999,'N/A'},"DataVariables",{'MaxT','MinT','AvgT'})
Tbl = standardizeMissing(Tbl,{-99,'N/A'},"DataVariables",{'MaxT','MinT','AvgT'})
[~,basename] = fileparts(filename);
nameparts = regexp(basename, '\.', 'split');
dateparts = regexp(nameparts{end}, '_','split');
year_str = dateparts{end}
d = str2double(extract(filename,digitsPattern));
Tbl.Day = datetime(d(3),d(2),Tbl.Day)
end
Cris LaPierre
on 7 Feb 2024
Edited: Cris LaPierre
on 7 Feb 2024
A couple issues to point out.
- Because some of your non-leap-year files have info for Feb 29, that date gets (correctly) converted to Mar 1. That means your approach to removing Feb 29 will not catch those dates. To avoid this, you need to check the month and day before the date is converted to a datetime, which means checking in the readMonth function.

- In your read function, none of the following code is used and can be deleted.
[~,basename] = fileparts(filename)
nameparts = regexp(basename, '\.', 'split');
dateparts = regexp(nameparts{end}, '_','split');
year_str = dateparts{end};
- You can combine all your missing values into a single cell array (added one you missed)
Tbl = standardizeMissing(Tbl,{-99,999,999.9,'N/A'},"DataVariables",{'MaxT','MinT','AvgT'});
- The line of code that currently calls splitvars is not actually doing anything and can be removed.
tempTbl2 = splitvars(tempTbl)
- Your MissingRule is likely not the option you want ('omitvar'). This option means do not import a variable (i.e. an entire column of data) if it contains missing data. Fortunately, any missing values have been replaced with a numeric code (e.g. 999) so they are not treated as missing, and that variable is not omitted. I'd remove this line from your code. I think you want to use the EmptyLineRule instead.
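As a minimal sketch of the combined standardizeMissing call suggested above, the snippet below uses a small made-up table (the values are hypothetical, not taken from the actual files) to show the numeric placeholder codes being replaced with NaN in a single pass:

```matlab
% Hypothetical stand-in for one imported file; 999, 999.9 and -99 are
% the placeholder codes mentioned in this thread.
T = table([1;2;3], [72;999;68], [-99;41;999.9], ...
    'VariableNames', {'Day','MaxT','MinT'});

% One call covers every placeholder code. The 'N/A' entry only applies
% to text variables and is ignored for the numeric ones here.
T = standardizeMissing(T, {-99, 999, 999.9, 'N/A'}, ...
    'DataVariables', {'MaxT','MinT'});
```

After this call, the flagged entries are NaN, so the later min, max, and mean calls with 'omitnan' skip them.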
Jonathon Klepatzki
on 7 Feb 2024
Cris LaPierre
on 7 Feb 2024
You need to check whether the date is Feb 29 before you convert Day into a datetime. That conversion happens in readMonth. Since 2/29 gets converted to 3/1 in non-leap years, you need to identify this date before the conversion to datetime.
Normally this wouldn't be an issue, but for some reason your files contain data for Feb 29 even in non-leap years.
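A minimal sketch of that check, assuming the year and month have already been parsed from the file name (as d(3) and d(2) in the original readMonth) and that Day still holds the numeric day of month. The table contents here are made up for illustration:

```matlab
% Hypothetical values standing in for what readMonth parses from a file
% name such as temp_summary.05.02_1981.txt (1981 is not a leap year).
yr = 1981;
mo = 2;
Tbl = table((27:29)', [50;52;55], [30;31;33], ...
    'VariableNames', {'Day','MaxT','MinT'});

% Drop Feb 29 while Day is still a plain number, i.e. before datetime
% has a chance to roll it over to Mar 1.
isFeb29 = (mo == 2) & (Tbl.Day == 29);
Tbl(isFeb29,:) = [];

% Only now convert the day numbers to datetime values.
Tbl.Day = datetime(yr, mo, Tbl.Day);
```

This removes Feb 29 in every year, leap or not, which matches the intent of the "delete leap year" step in the question.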
Jonathon Klepatzki
on 7 Feb 2024
Test it out. It doesn't eliminate them because the month no longer equals 2 and the day no longer equals 29. They are now 3 and 1.
dataAll = table();
dataAll.Day = datetime(1981,2,29) % Feb 29, 1981, which is a non-leap year
dataAll.Month = month(dataAll.Day);
dataAll.DD = day(dataAll.Day)
% Remove all Feb 29 dates from the table
LY = (dataAll.Month(:)== 2 & dataAll.DD(:) == 29);
dataAll(LY,:) = [ ]
As you can see, the current LY code did not remove the data.
Jonathon Klepatzki
on 7 Feb 2024
Accepted Answer
More Answers (3)
Here is one possible solution, to get the data correctly from the data file:
% Open the data file for reading
FID = fopen('temp_summary.05.03_1998.txt', 'r');
% Initialize a cell array to store the cleaned data
C_Lines = {};
% Read the file line by line
N_line = fgetl(FID);
while ischar(N_line)
    % Remove '*' and '#' characters from the line
    C_Line = strrep(N_line, '*', '');
    C_Line = strrep(C_Line, '#', '');
    % Store the cleaned line if it is not empty
    if ~isempty(C_Line)
        C_Lines{end+1} = C_Line;
    end
    % Read the next line
    N_line = fgetl(FID);
end
% Close the file:
fclose(FID);
% Convert the cell array of cleaned lines to a character array:
C_Data = char(C_Lines)
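The loop above leaves the data as text. As a rough sketch of how the cleaned lines could be turned into numbers, the snippet below strips the markers with erase and parses the result with sscanf. It assumes every row has exactly four numeric columns (Day, MaxT, MinT, AvgT); the sample lines are invented, and in practice they would come from the file (for example via the loop above or readlines):

```matlab
% Hypothetical lines standing in for the contents of one data file.
rawLines = ["1  45*  30  38"; "2  50  28#  39"; "3  48  27  37"];

% Remove every '*' and '#' in one call.
cleaned = erase(rawLines, ["*", "#"]);

% Join the rows into one space-separated string and read the numbers
% four at a time. sscanf fills column-wise, hence the transpose.
vals = sscanf(char(join(cleaned, " ")), '%f', [4 Inf])';
```

Each row of vals then holds one day's values, ready to be placed in a table.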
Cris LaPierre
on 7 Feb 2024
Edited: Cris LaPierre
on 8 Feb 2024
I think another rather straightforward approach is to treat * and # as delimiters.
I've simplified the read function for readability
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
function Tbl = readMonth(filename)
Tbl = readtable(filename,"ConsecutiveDelimitersRule","join","ReadVariableNames",false,...
"Delimiter",{' ','\t','*','#'},"LeadingDelimitersRule",'ignore',...
'EmptyLineRule','skip');
Tbl.Properties.VariableNames = {'Day' 'MaxT' 'MinT' 'AvgT'};
end
Jonathon Klepatzki
on 7 Feb 2024
Moved: Cris LaPierre
on 8 Feb 2024
Cris LaPierre
on 7 Feb 2024
Moved: Cris LaPierre
on 8 Feb 2024
The code I shared does not produce this error with any of the 12 files you have shared so far. Can you identify which file is causing this error and share it for us to test with?
Jonathon Klepatzki
on 7 Feb 2024
Moved: Cris LaPierre
on 8 Feb 2024
Cris LaPierre
on 7 Feb 2024
Moved: Cris LaPierre
on 8 Feb 2024
Hmm. Works here. Have you shared the full error message (all the red text)?
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
function Tbl = readMonth(filename)
Tbl = readtable(filename,"ConsecutiveDelimitersRule","join","ReadVariableNames",false,...
"Delimiter",{' ','\t','*','#'},"LeadingDelimitersRule",'ignore',...
'EmptyLineRule','skip');
Tbl.Properties.VariableNames = {'Day' 'MaxT' 'MinT' 'AvgT'};
end
Jonathon Klepatzki
on 7 Feb 2024
Moved: Cris LaPierre
on 8 Feb 2024
Walter Roberson
on 8 Feb 2024
To answer the original question:
An alternative way to read the files is to use FixedWidthImportOptions together with readtable() https://www.mathworks.com/help/matlab/ref/matlab.io.text.fixedwidthimportoptions.html
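A rough sketch of what the fixed-width approach might look like. The column widths (3, 6, 6, 6) are an assumption made up for this example, as is the small file written to a temporary location; the real widths would have to be read off the actual data files. The TrimNonNumeric variable option is used here to discard the trailing * or # inside a numeric field:

```matlab
% Write a small hypothetical fixed-width file: a 3-character day column
% followed by three 6-character temperature columns.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%3d%6s%6s%6s\n', 1, '45*', '30', '38');
fprintf(fid, '%3d%6s%6s%6s\n', 2, '50', '28#', '39');
fclose(fid);

% Describe the layout. The widths are assumptions for this example only.
opts = fixedWidthImportOptions('NumVariables', 4, ...
    'VariableNames', {'Day','MaxT','MinT','AvgT'}, ...
    'VariableWidths', [3 6 6 6], ...
    'VariableTypes', repmat({'double'}, 1, 4));

% Strip non-numeric characters such as '*' and '#' from the numeric fields.
opts = setvaropts(opts, {'MaxT','MinT','AvgT'}, 'TrimNonNumeric', true);

T = readtable(fname, opts);
delete(fname);
```

With this layout the flagged values should import as plain numbers (45 and 28 in the example) rather than as text or NaN.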