# Data type conversion changes the value of the signal

Luca Ferro on 2 Feb 2023
Commented: Luca Ferro on 6 Feb 2023
Any clue why this happens?
For reference this is how the signal is defined:
I guess the offset has something to do with it.

Andy Bartlett on 3 Feb 2023
Edited: Andy Bartlett on 3 Feb 2023
Why the cast to single has a large quantization error
The cast from fixed-point to single will use single precision for constants and single precision for all operations with one exception. The exception is the cast of the input stored integer value to single. That cast has integer input and single precision output.
Generating C Code using Simulink Coder or Embedded Coder will show the details.
Y1 = (real32_T)U1 * 0.0009765625F - 2.056192E+6F;
There CAN be precision losses at each and every step.
The representation of the Slope 0.0009765625F happens to be lossless in this case.
The representation of the Bias -2.056192E+6F also happens to be lossless in this case.
(real32_T)U1 rounds the 32-bit input to the 24-bit mantissa of single, losing up to 8 bits of precision.
(real32_T)U1 * 0.0009765625F is a 24-bit mantissa times a 24-bit mantissa, so the full-precision product can require up to 48 bits. That will be quantized down to 24 bits.
The subtraction can also lose some precision but that will be small in relative terms.
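The steps above can be replayed outside of MATLAB. The following is only a sketch, assuming Python's `struct` module as a stand-in for single-precision rounding; `to_single` and `cast_fixed_to_single` are hypothetical helper names, not part of the generated code. Each operation is computed in double and then rounded to single, which for a single multiply, add, or subtract gives the same result as true single-precision arithmetic.

```python
import struct

def to_single(x):
    """Round a Python number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

SLOPE = 0.0009765625   # 2**-10, taken from the data type in this thread
BIAS = -2056192.0      # taken from the data type in this thread

def cast_fixed_to_single(stored_integer):
    """Emulate Y1 = (real32_T)U1 * 0.0009765625F - 2.056192E+6F step by step."""
    u_single = to_single(stored_integer)               # integer -> single
    product = to_single(u_single * to_single(SLOPE))   # single * single
    return to_single(product - to_single(-BIAS))       # single - single

# Slope and bias are exactly representable in single for this data type.
print(to_single(SLOPE) == SLOPE, to_single(BIAS) == BIAS)  # True True

# Stored integer of the quantized input from the thread.
print(to_single(2105540710))            # 2105540736.0
print(cast_fixed_to_single(2105540710)) # 0.125
```

In this sketch only the first step (integer to single) changes the value; the multiply and the subtraction happen to be exact for this particular input.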
Quantitative Analysis of Errors
dt1 = fixdt(0,32,0.0009765625,-2056192);
uIdeal1 = 0.1;
u = fi(uIdeal1, dt1);
uStoredInteger = u.storedInteger
uStoredInteger = uint32 2105540710
siInFloat = single(uStoredInteger)
siInFloat = single 2.1055e+09
errorInStoredIntegerCast = double(siInFloat) - double(uStoredInteger)
errorInStoredIntegerCast = 26
realWorldImpactOfErrorInStoredIntegerFloat = errorInStoredIntegerCast * dt1.Slope
realWorldImpactOfErrorInStoredIntegerFloat = 0.0254
The error converting the 32-bit integer to a 24-bit mantissa floating-point is the dominant error source in this case.
Quantized value plus cast error: 0.0996 + 0.0254 = 0.125
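For anyone without Fixed-Point Designer at hand, the same error arithmetic can be checked with a short Python sketch. It assumes the stored integer 2105540710 shown above and uses a hypothetical `to_single` helper built on `struct` to mimic the integer-to-single cast.

```python
import struct

def to_single(x):
    """Round a Python number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

stored_integer = 2105540710   # u.storedInteger from the analysis above
slope = 2.0 ** -10            # 0.0009765625
bias = -2056192

# A single has a 24-bit mantissa, so values in [2**30, 2**31) are 2**7 apart.
spacing = 2 ** (stored_integer.bit_length() - 24)

si_in_float = int(to_single(stored_integer))
cast_error = si_in_float - stored_integer
real_world_error = cast_error * slope
exact_value = stored_integer * slope + bias

print(spacing)                          # 128
print(si_in_float)                      # 2105540736
print(cast_error)                       # 26
print(real_world_error)                 # 0.025390625
print(exact_value)                      # 0.099609375
print(exact_value + real_world_error)   # 0.125
```

The stored integer sits 102 counts above a multiple of 128, so the cast rounds up by 26 counts, and 26 counts at a slope of 2^-10 is about 0.0254 in real-world units.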
To get higher precision, cast to double first
To get higher precision when converting slope-bias fixed-point to single, one approach is to first cast to double and then downcast to single. This approach is unnecessary if the fixed-point type has binary-point scaling (bias is zero and slope is an exact power of two). It is most impactful if the fixed-point type uses more than 24 bits and the slope is not an exact power of two.
The downside of this approach is that casts to double can be very costly on an embedded processor such as an ARM Cortex-M4F, which has single-precision floating-point hardware but no double-precision floating-point hardware. The double math would need to be emulated in software, which would be much slower. This is a key reason a large body of users requested that casts and operations that mix fixed-point and single-precision floating-point use only single and not double. This group of users preferred to model an explicit cast up to double when greater precision was needed.
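To illustrate the difference between the two paths, here is a hedged Python sketch. The function names are hypothetical, and `struct` is only used to mimic rounding to single; it is not how Simulink implements the cast.

```python
import struct

def to_single(x):
    """Round a Python number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def via_double_then_single(stored_integer, slope, bias):
    """Do the slope-bias math in double, then round once to single."""
    return to_single(stored_integer * slope + bias)

def all_single(stored_integer, slope, bias):
    """Round to single after every step, including the integer cast."""
    u = to_single(stored_integer)
    product = to_single(u * to_single(slope))
    return to_single(product + to_single(bias))

x = 2105540710  # stored integer from this thread

print(via_double_then_single(x, 2.0**-10, -2056192.0))  # 0.099609375
print(all_single(x, 2.0**-10, -2056192.0))              # 0.125

# With binary-point scaling (bias 0, power-of-two slope) both paths agree.
print(all_single(x, 2.0**-10, 0.0) == via_double_then_single(x, 2.0**-10, 0.0))  # True
```

For the type in this thread the double path recovers the quantized value exactly, while the all-single path returns 0.125. With the bias set to zero the two paths give the same answer for this input, consistent with the remark about binary-point scaling.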
Luca Ferro on 6 Feb 2023
Thank you for the explanation. Unfortunately, due to the nature of the application, casting to double is not possible. Still, now that the issue is clearer, I can figure out how to deal with it in the most effective way.

Andy Bartlett on 3 Feb 2023
Edited: Andy Bartlett on 3 Feb 2023
The original value 0.1 lies between two representable values (0.099609375 and 0.1005859375) of the target type fixdt(0,32,0.0009765625,-2056192). The original value is rounded to the nearest of the two representable values of the output type.
Tool to explain "any" case
The attached function provides a more detailed explanation of what happens when quantizing "any" scalar value to "any" numeric type. It should work fine for just about "any" case of interest including fixed-point, integer, and floating-point.
I put "any" in quotes because the analysis and plotting use calculations in double-precision floating-point. If the input value or data type is extreme relative to doubles, then the analysis will fall apart. For example, the maximum finite value of double is approximately 1e308. The data type fixdt(0,8,-3000) has a maximum representable value equal to 255*2^3000 or approximately 1e905, which is extreme compared to the finite range of double.
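The range limit mentioned above is easy to confirm. This is a small Python sketch using exact integers; it only illustrates the size of 255*2^3000 relative to a double and says nothing about how the attached function is written.

```python
import sys

max_double = sys.float_info.max   # largest finite double, about 1.8e308
max_fixed = 255 * 2**3000         # exact integer: maximum of fixdt(0,8,-3000)

print(len(str(max_fixed)))        # 906 decimal digits

try:
    float(max_fixed)
except OverflowError:
    print("255*2^3000 does not fit in a double")
```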
dt1 = fixdt(0,32,0.0009765625,-2056192);
uIdeal1 = 0.1;
explainQuantizationOfConstant(uIdeal1,dt1)
Data type: fixdt(0,32,0.0009765625,-2056192)
Original ideal value 0.1
Quantized value 0.099609375
Original ideal value is between two representable values in fixdt(0,32,0.0009765625,-2056192)
Representable value below 0.099609375
Representable value above 0.1005859375
The ideal input was rounded to nearest of the two values.
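The attached function itself is not reproduced here. As a rough idea of the core calculation, the following Python sketch finds the two neighboring representable values and picks the nearer one. The function name is hypothetical, saturation at the ends of the stored-integer range is ignored, and sending exact ties to the lower neighbor is an arbitrary choice for this sketch.

```python
import math

def neighbors_and_nearest(ideal, slope, bias):
    """Return (below, above, nearest) representable real-world values.

    Assumes real value = slope * stored_integer + bias and ignores
    saturation at the limits of the stored-integer range.
    """
    scaled = (ideal - bias) / slope   # position on the stored-integer axis
    q_below = math.floor(scaled)
    q_above = math.ceil(scaled)
    # Arbitrary tie rule for this sketch: exact ties go to the lower neighbor.
    q_nearest = q_below if scaled - q_below <= q_above - scaled else q_above

    def to_real(q):
        return q * slope + bias

    return to_real(q_below), to_real(q_above), to_real(q_nearest)

print(neighbors_and_nearest(0.1, 2.0**-10, -2056192.0))
# (0.099609375, 0.1005859375, 0.099609375)
```

The ideal value 0.1 falls about 0.4 counts above stored integer 2105540710, so the lower neighbor 0.099609375 is the nearer of the two.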
##### 2 Comments (1 older comment hidden)
Andy Bartlett on 3 Feb 2023
Hi Les,
Ah. Thanks for pointing out the second possible and more likely intended question.
Andy

R2022a