Deleting Characters from String and Repeating
I have a string of hexadecimal text collected from hardware, and the string contains multiple stretches of noise. Each data set begins with a header, but I am having trouble extracting the data sets from the noise. I need to find each 4-character header in the string and extract the next 96 characters. I then need to repeat this for the next header, throwing out the characters between the last character of the data and the first character of that header.
I have used the strsplit() function on the string in the attached .txt file with the header characters 'AA02', but this places each instance into a cell, and I can't figure out how to delete the characters in each cell that come after the data. Ideally, I would have all of the data in one long string after the noise is deleted.
Any suggestions on this problem would be greatly appreciated.
Thanks
2 Comments
Is the 96-character data section of the message after the header noise-free and then there's "noise" until the next header record? Or is there more-or-less random noise embedded throughout?
How can you determine what is/is not noise vis a vis valid data other than by counting characters? Or is it even possible to tell?
It would be easy enough to just truncate the strings to a given length, but will you wind up with anything at all useful if you do so?
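If plain truncation is what's wanted, a minimal sketch along these lines would pull a fixed number of characters after every header and join them into one long string. The toy string, the 8-character payload length (96 in the question) and the variable names are made up for illustration, and the implicit expansion in idx(:) + (0:nData-1) assumes R2016b or later. It makes no attempt to check that the kept characters are clean, or that 'AA02' doesn't happen to turn up inside a payload.

```matlab
% Hypothetical toy stream: header 'AA02', 8-character payloads, noise in between.
msg   = ['AA02' '11112222' 'ZZ' 'AA02' '33334444' 'QQQQQ' 'AA02' '5555'];
hdr   = 'AA02';
nData = 8;                                   % payload length (96 in the question)

idx  = strfind(msg, hdr) + numel(hdr);       % first payload character after each header
idx  = idx(idx + nData - 1 <= numel(msg));   % skip a record cut short at the end of the stream
cols = idx(:) + (0:nData-1);                 % one row of indices per record
recs = msg(cols);                            % nRec-by-nData char matrix, one record per row
data = reshape(recs.', 1, []);               % all payloads joined into one long string
```

Here the last header is followed by only four characters, so that record is skipped and data holds the two complete payloads back to back.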
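fid=fopen('1066desk.txt','r');
msg=fread(fid,'*char').';      % whole file as one char row vector
fid=fclose(fid);
txt=strsplit(msg,'AA02').';    % split at each header; column cell array, as displayed below
l=cellfun(@length,txt);
txt=txt(l>0);                  % drop empty pieces
l=cellfun(@length,txt);        % length of each remaining piece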
This returns the "messages", i.e. the strings that follow each header value. Unfortunately,
>> [min(l) max(l)]
ans =
112 661290
>>
doesn't indicate there is a single instance of only 96 characters.
u=unique(l); % how many different lengths are there?
n=histc(l,u); % count how many of each
>> [u n] % and look at the statistics
ans =
112 201
268 1
324 13
304118 1
661290 1
>>
From the above, a sizable majority (201 of 217) are 112 characters long, a few are moderately longer, and then there are a couple of instances where it looks like something really interfered for quite a while.
>> l(1:10)
ans =
268
112
112
112
112
112
112
112
112
112
>> txt(1:10)
ans =
10×1 cell array
{'287E13673A34004B7958234DFEB8FFAE080800090011FFFF00030107F7990006FFFAFFFFFEA9FFBC07D20005FFFF00060000011AF78AFFE2FFFF00073F603BD03A40393043C18000C0D0000042E80000C371800043814000C440200043008000C184000043160000C0D0000043308000C24800000000000000000000000000002B44A02100C1'}
{'2934137EAA310006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F90000C0200000431200000DD4' }
{'2A34138292330006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F90000C0200000431200000DC3' }
{'2B3413867A350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795843018000C18C0000430B00000DA1' }
{'2C34138A62350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795843018000C18C0000430B00000D8E' }
{'2D34138E4A310006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842FB00003FC00000430F00000DA6' }
{'2E34139232320006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842FB00003FC00000430F00000D94' }
{'2F3413961A330006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F9000040F0000042E600000E87' }
{'3034139A02350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F9000040F0000042E600000E76' }
{'3134139DEA350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F10000C1080000430700000E15' }
>>
There appears to be a repetitive pattern among those of the same length; possibly some pattern-recognition type logic could be written to find and resynch the data stream from those?
I had a similar problem almost 50 years ago now when we were returning data from the plant computer via punched paper tape... a mispunch or torn tape caused similar-looking issues in data streams. We had to go through and find recognizable and convertible floating point values and use those to skip the bum points until the next case.
I've not taken the time to try to match the above data stream having no further knowledge of what is/isn't expected.
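As a rough illustration of the resynch idea only: in the 112-character segments listed above, the field 0006E7FD... shows up after the first 12 characters each time. If some such field really is constant in valid records (an assumption; nothing here confirms it), one could search a long segment for it and back up by its offset to get candidate record starts even where the header itself was lost. The sketch below uses a shortened toy layout; the marker, offset and record length are hypothetical stand-ins.

```matlab
% Toy layout (hypothetical): 16-character records with a constant field at offset 4.
marker = '0006E7FD';     % field assumed constant in every valid record
offset = 4;              % characters preceding the marker within a record
recLen = 16;             % total record length

% Two records separated by noise, followed by a truncated fragment.
seg = ['29340006E7FD1111' 'ZZZ' '2A340006E7FD2222' '0006E7'];

p    = strfind(seg, marker) - offset;              % candidate record starts
p    = p(p >= 1 & p + recLen - 1 <= numel(seg));   % keep only records that fit in seg
recs = arrayfun(@(k) seg(k:k+recLen-1), p, 'UniformOutput', false);
```

Whether records recovered this way are trustworthy would still need checking against whatever else is known about the format.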
ADDENDUM:
>> cumsum(u.*n)
ans =
22512
22780
26992
331110
992400
>> cumsum(u.*n)/sum(l)
ans =
0.0227
0.0230
0.0272
0.3336
1.0000
>>
NB: If you were to just truncate after 96 or 112 characters until the next header, you would throw away more than 97% of the data in the file.
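The arithmetic behind that figure, using the length table above (u and n are retyped here so the snippet stands alone; nKeep is just an illustrative name for the truncation length):

```matlab
u = [112; 268; 324; 304118; 661290];   % distinct segment lengths from the table above
n = [201;   1;  13;      1;      1];   % number of segments of each length
nKeep = 112;                           % characters kept per segment after truncation

kept = sum(n .* min(u, nKeep)) / sum(u .* n);   % fraction of characters retained
lost = 1 - kept;                                % fraction thrown away
```

With nKeep = 112 that is 217*112 = 24304 characters kept out of 992400, so lost comes out a little over 0.97; with nKeep = 96 it is slightly higher still.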
>> numel(msg)/112
ans =
8.8685e+03
>>
which indicates there are enough characters in the received string for almost 9,000 datasets but there are only 217 instances of the header being intact.
>> numel(strfind(msg,'AA02'))
ans =
217
>>
which matches up with the result of strsplit as it should.
You'll need more sophisticated parsing or a way to clean up the transmission channel to not lose almost all the data.
Taylor Knuth
on 11 Feb 2020