Fixed Point Scaling – Low Pass Filter example

For the Mesh Potato spectrum analyser I used a filter to smooth the signal strength estimates. There isn’t much written about fixed point DSP, so I decided to describe how I built the filter. This post is a mini tutorial on fixed point DSP.

Now don’t get scared, if you understand year 8 algebra, you can follow this post.

First of all, why fixed point? Well it runs faster on CPUs that don’t have floating point hardware. For example low cost DSP chips and the SoC router chip we use for the Mesh Potato. These CPUs do integer maths OK, but are slow at floating point maths as they need to call library functions. Kernel mode code (like the Linux kernel) also avoids floating point maths, even if the CPU has floating point support.

The Filter

The low pass filter is a fixed point Infinite Impulse Response (IIR) filter. Now that sounds scary. Lets break it into small pieces. The filter I used look like this:

  y(n) = 0.875*y(n-1) + 0.125*x(n). 

This means “take 1/8 of the current input sample x(n) and add it to 7/8 of the previous output sample y(n-1). This gives us the new output sample y(n)”. So if we feed a x(n) =1 (a DC signal, top plot) into it, we get y(n) (bottom plot):

IIR filter with DC input

See how y(n) responds slowly to the new input? That’s the averaging or low pass filter effect. If we feed an impulse into it (x(0)=1, x(n)=0 after that), we get:

IIR filter with impulse input

This is known as the impulse response. Because of the y(n) = 0.875*y(n-1) feedback loop the impulse response for this filter actually goes on forever. It gets smaller and smaller but never quite gets to zero as 0.875 of a very small number is still a number greater than 0. So in DSP-speak we call it an infinite impulse response (IIR) filter. I generated these plots with this Octave script. GNU Octave is very cool software for experimenting with DSP.

Fixed Point Numbers

Fixed point DSP is the art of representing real numbers with integers. Lets say we have 16 bit integers. We usually say this represents the numbers of 32767 to -32768. The binary codes map to numbers like this:

Binary Number
0000 0000 0000 0001 1
0000 0000 0000 0010 2
0111 1111 1111 1111 32767
1000 0000 0000 0000 -32768

However if we like, it could represent any range of numbers with 2^16 discrete values. For example lets assume the LSB in the integer represents 0.125, rather than 1. Then we get something like:

Binary Number
0000 0000 0000 0001 0.125
0000 0000 0000 0010 0.250
0000 0000 0000 1000 1.000
0000 0000 0000 1001 1.125
0111 1111 1111 1111 4095.875
1000 0000 0000 0000 -4096.000

This is a fixed point number, as the position of the decimal point is always fixed. One form of describing a fixed point number is Q format. For example the first table is “Q16.0” (regular integers), the second table “Q13.3”. Q-notation tells you how many bits are to the right and left of the decimal point.

Fixed Point Scaling

Now lets convert the IIR filter described above to fixed point. The filter is used for averaging Wifi signal strengths that are numbers like x = -65, -60, -63 etc. The averaged output from the filter is y. I know that for this application (filtering signal strengths) the x numbers are always in the range of 0 to -128. So for this example I have decided to use Q13.3 format. This allows me to represent numbers between 0 to -128 and gives me a few fractional bits to improve the accuracy of my maths.

However I could have chosen other formats. For example Q8.8 would be a better choice, as this lets me represent numbers between 127.996 and -128.0 and gives me 8 fractional bits to make my maths nice and precise. However I only figured this out while I was writing this post. As I am too lazy to change the actual code I am sticking to Q13.3 for this example.

I’ll start with floating point C code and work my way through using simple algebra. We start with y and x as floats:

  y = 0.875*y + 0.125*x 

Now in Q13.3 format, a 1.0 is represented by binary 1000 or an integer “code” of 2^3 = 8. So we need to multiply everything by 8. Another way of looking at this is for Q13.3 we shift everything left 3 bits. This gives us:

  8y = 0.875*(8y) + x

Note how I have kept the 8 close to the y above. This is because they are the same variable in the C code, and must have the same fixed point format. This clearer if we substitute z=8y:

  z = 0.875*z + x

Now because I cheated and used 0.875 as the filter coefficient we can re-arrange:

  z = (1 - 0.125)*z + x
  z = z - 0.125*z + x

Multiplying by 0.125 is the same as a 3 bit right shift. Shifts are efficient ways to do divides in fixed point. Now you can see why I chose 0.875 = 1 – 0.125 as the filter coefficient, it maps to a nice power of two in the filter implementation. So anyway by converting the multiply to a shift we get:

  z = z - z>>3 + x

So that’s the actual C code implementation of the fixed point IIR filter in Q3.13, it’s equivalent to the floating point C code “y = y*0.875 +0.125*x”. We can extract the filtered output y:

  z = 8y
  y = z/8
  y = z >> 3

As z is in Q13.3 we can convert it to y in Q16.0 by a right shift.

For a long time I did fixed point scaling by guesswork, but I like the approach above as I can use algebra to get an engineered solution faster. I tend to put the algebra above in C source comments so that the fixed point code looks less like Klingon when I have to maintain it.

The general approach for fixed point porting is to first choose a Q format to suit the dynamic range of the floating point signal you are working with. For example Q13.3 (or even Q8.8) suits the signal strength values above. As you convert each variable to fixed point, test carefully (say with unit tests) as each fragment is ported.


If you found this post useful take a look at this step by step fixed point port of the Goertzal tone algorithm used for CTSS tone detection.

16 thoughts on “Fixed Point Scaling – Low Pass Filter example”

  1. Is there some conspiracy out there to attach frightening names to simple equations? In the network traffic control code, the linux kernel calls your “Infinite Impulse Response filter” an “Exponential Weighted Moving Average”. The equation is identical, although the coefficients are tunable from user space. And since there is no floating point in the kernel, its all done fixed point arithmetic there too.

    I guess the term Exponential Weighted Moving Average does have one good point. Trying to figure out good coefficients for it is a complete pita, and so the name hinting that black magic might be involved is truth in advertising.

    1. Actually both IIR and EWMA describe what’s going on. The filter calculates a moving (continually updated) average. The average is based on a bunch of exponential weights, you can see the exponential decay in the impulse response figure above. So its like taking the exponential response above and using it to weight a bunch of recent samples to get the average.

  2. Hi David,

    I think you should refrain from using algebra on C-style ‘equations’. The latter are not equations in the algebraic sense: The right-hand side states which operations have to be executed while the left-hand side indicates where the result should be stored. You’re getting away with this because the ‘equation’ is a linear one. How would you convert ‘y = 0.875*y^2 + 0.125*x’ ?


    1. Hello Bart,

      The notation is a short hand for y(n) = 0.875*y(n-1) + x(n), the y on the RHS is the previous sample. While the notation is not correct in the algebraic sense it’s convenient as it matches how we implement the C code. In particular we often (re) use the same variable for y(n) and y(n-1). This can cause confusion in the fixed point maths. The notation I am suggesting helps model that reuse.

      I think it works for your non-linear example:

      y = 0.875*y^2 + 0.125*x
      y = 0.875*y*y + 0.125*x
      8y = 8*0.875*y*y + 8*0.125*x
      8y = (0.875/8)*8y*8y + x
      sub z=8y
      z = (0.875/8)*z*z + x
      z = (1 – 0.125)*z*z/8 + x
      z = z*z/8 – 0.125*z*z/8 + x
      z = z*z>>3 – z*z>>6 + x

      This makes sense – when we add fixed point numbers the Q formats must be the same. Multiplying two Q format numbers e.g. Q13.3 * Q13.3 gives you a Q26.6, so we need to shift right to get it back to Q13.3 for the addition.



      1. You’re right. It does work, as long as you don’t start moving terms across the equal sign. Some more remarks:
        1/ Substituting z=2^n*y makes z contain values that are n times higher. So an overflow check is always appropriate. Actually, you have to search for the highest possible n without causing an overflow to find the most accurate equation.
        2/ Individual terms can cause overflow as well. Using brackets to force a division to take precedence over a multiplication can help, e.g.: z*z/8 -> z*(z/8).
        3/ The multiply-to-shift conversion isn’t required. gcc will gladly optimize the operation (division by 2^n) for you. Plus it keeps the code more ‘algebraic’.


  3. Hi Bart,


    (1) is taken care of when the Q factor is selected to suit the known dynamic range of the signal being represented by the variable. For example if we know the maximum value is 128, then Q8.8 won’t get any overflows when we shift up.

    In (2), using z*(z/8) would lose precision, for example for z=4 we would get 4*(4>>3) = 4*(0) = 0. Usually fixed point multipliers (say in a DSP chip) have a 32 bit result with two 16 bit operands, often coupled with 32 bit accumulators. At some point the output may be scaled back down and stored in a 16 bit number.

    Re (3), I tried compiling this:

    int test(int a) {
    return a/8;


    int test(int a) {
    return a>>3;

    with “gcc -c -S test.c” to look at the assembler. You are right gcc converts the former to shifts but for some reason the code is less efficient (at least on my gcc).



  4. Thanks so much for the implementation details. I have just tracked down a bug in an existing product due to a poorly coded low pass filter giving loss of precision, and this page not only showed me what was wrong but some simple code to fix it.

    Electric car stuff is great as well!

    1. So glad I could help Tom. One of the reasons I blog is to help others who might have the same problem I experienced.

  5. 1111 1111 1111 1111
    is actually -1 in 2’s complement (which all CPUs use these days).

  6. Hi David,

    If my input (I & Q complex samples) is 16bit signed range (-32768,32767) and my coefficients are in Q1.15 format, then what scaling is required on input to perform the above IIR filter. I guess I can perform without any scaling, treating IQ also Q1.15 format. Lets say my coefficients are in 13.3, then what scaling is required. Can you please help me in understanding the scaling requirements for input side.


    1. Ben,

      Here’s the first step:

      The floating point IIR filter is:

      y = 0.875*y + 0.125*x

      If the coeffs are in Q15 we multiply both sides of the equation by 1<<15: (1<<15)y = (1<<15)0.875*y + (1<<15)0.125*x Then follow the steps above. Pls post your working, so others can share another worked example. – David

Comments are closed.