Q_rsqrt() vs 1/sqrt()

Q_rsqrt() vs 1/sqrt()

Stephen Taylor
Because the hardware environment has changed, and the tradeoffs between integer and floating-point arithmetic are different now (as the Wikipedia article says). Out-of-order execution might be skewing your measurements, too.

---------- message ----------
From: "uǝlƃ ↙↙↙" <[hidden email]>
To: FriAM <[hidden email]>
Cc:
Bcc:
Date: Thu, 7 Jan 2021 15:24:15 -0800
Subject: [FRIAM] Q_rsqrt() vs 1/sqrt()
https://en.wikipedia.org/wiki/Fast_inverse_square_root

So, why is Q_rsqrt() *slower* than 1/sqrt()?

1/sqrt() took 0.294771 s
Q_rsqrt() took 0.51579 s
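
[Editor's note] The benchmark source was not posted to the thread, so for readers who have not seen the function being timed, here is a minimal sketch of Q_rsqrt() as the linked Wikipedia article describes it. One deliberate change is an assumption of this sketch and not part of the original: memcpy() replaces the pointer cast, which avoids undefined behavior from type punning.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fast inverse square root, following the Wikipedia article.
   memcpy() stands in for the original pointer cast (editor's change). */
float Q_rsqrt(float number)
{
    assert(number > 0.0f);           /* the trick assumes a positive input */
    const float x2 = number * 0.5f;
    float y = number;
    uint32_t i;

    memcpy(&i, &y, sizeof i);        /* view the float's bits as an integer */
    i = 0x5f3759df - (i >> 1);       /* magic-constant first guess */
    memcpy(&y, &i, sizeof y);        /* back to float */
    y = y * (1.5f - x2 * y * y);     /* one Newton-Raphson iteration */
    return y;
}
```

The result is only approximate: for an input of 4.0f it returns a value close to, but not exactly, 0.5.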

--
↙↙↙ uǝlƃ






- .... . -..-. . -. -.. -..-. .. ... -..-. .... . .-. .
FRIAM Applied Complexity Group listserv
Zoom Fridays 9:30a-12p Mtn GMT-6  bit.ly/virtualfriam
un/subscribe http://redfish.com/mailman/listinfo/friam_redfish.com
archives: http://friam.471366.n2.nabble.com/
FRIAM-COMIC http://friam-comic.blogspot.com/ 

Re: Q_rsqrt() vs 1/sqrt()

gepr
Would out-of-order execution produce the same out-of-order order over, say, 10 executions?

The clock() results from GCC and TCC are similar, but the generated assembly looks fairly different. I'm still not seeing rsqrt or sqrt instructions even after using single-precision floats throughout and calling sqrtf(), with or without -O0, for whatever that's worth. But 1/sqrtf() is quite a bit faster than 1/sqrt() was.

gepr@cormac:~/lang/c$ ./gcc.out
1/sqrt() took 0.076633 s
Q_rsqrt() took 0.473007 s

gepr@cormac:~/lang/c$ ./tcc.out
1/sqrt() took 0.078259 s
Q_rsqrt() took 0.46164 s

On 1/8/21 8:46 AM, Stephen Taylor wrote:
> Because the hardware environment has changed, and the tradeoffs on integer and floating-point arithmetic are different. (Like it says in the Wikipedia article.)  Out of order execution might be messing up your measurements, too.

--
↙↙↙ uǝlƃ

uǝʃƃ ⊥ glen

Re: Q_rsqrt() vs 1/sqrt()

Marcus G. Daniels
mdaniels@daniels:~$ cat t.c
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc,const char **argv) {
  float val = atof (argv[1]);
  float ret = (1.0f/sqrtf(val));
  printf("%f\n",(double) ret);
}
mdaniels@daniels:~$ gcc -march=native -O2 -ffast-math -S t.c
mdaniels@daniels:~$ grep sqrt t.s
        vrsqrtss        %xmm0, %xmm0, %xmm1

-----Original Message-----
From: Friam <[hidden email]> On Behalf Of uǝlƃ ↙↙↙
Sent: Friday, January 8, 2021 2:49 PM
To: [hidden email]
Subject: Re: [FRIAM] Q_rsqrt() vs 1/sqrt()


Re: Q_rsqrt() vs 1/sqrt()

Marcus G. Daniels
What I mean is that you may be targeting too low a common denominator in terms of the processor. That should work for doubles too.

-----Original Message-----
From: Friam <[hidden email]> On Behalf Of Marcus Daniels
Sent: Friday, January 8, 2021 3:23 PM
To: The Friday Morning Applied Complexity Coffee Group <[hidden email]>
Subject: Re: [FRIAM] Q_rsqrt() vs 1/sqrt()


Re: Q_rsqrt() vs 1/sqrt()

gepr
Thanks. I didn't think of trying -O2. That and -O1 each give me a sqrtsd instruction. With both -O2 and -march=native, I get a vsqrtsd. And with all three of your options I get a vrsqrtss. What's hilarious is that -O1 and -O2 make Q_rsqrt() faster than 1/sqrtf(), in spite of the assembler instruction(s).

gepr@cormac:~/lang/c$ ./O1.out
1/sqrt() took 0.095175 s
Q_rsqrt() took 0.065637 s

gepr@cormac:~/lang/c$ ./O2.out
1/sqrt() took 0.052231 s
Q_rsqrt() took 0.029407 s


On 1/8/21 3:28 PM, Marcus Daniels wrote:

> I mean I think it is that you may be targeting too low of a common denominator in terms of the processor.   That should work for doubles too.


--
↙↙↙ uǝlƃ

uǝʃƃ ⊥ glen

Re: Q_rsqrt() vs 1/sqrt()

Marcus G. Daniels
I didn't notice any precision loss with the instructions.   That might be the cost.
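
[Editor's note] One possible reason no precision loss shows up: as I read the GCC documentation, under -ffast-math the compiler follows the rsqrtss estimate with a Newton-Raphson step. The sketch below imitates that by hand; it is an illustration under that assumption, not the code GCC generates. The names rsqrt_estimate and rsqrt_refined are hypothetical. On x86 with SSE the estimate comes from the _mm_rsqrt_ss intrinsic; elsewhere the sketch falls back to the library so it still compiles.

```c
#include <assert.h>
#include <math.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hardware reciprocal-square-root estimate (roughly 12 bits of
   precision on x86), or the library result on other targets. */
float rsqrt_estimate(float x)
{
    assert(x > 0.0f);
#if defined(__SSE__)
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    return 1.0f / sqrtf(x);
#endif
}

/* One Newton-Raphson step on the estimate, the same refinement that
   Q_rsqrt() applies to its magic-constant guess. */
float rsqrt_refined(float x)
{
    float y = rsqrt_estimate(x);
    return y * (1.5f - 0.5f * x * y * y);
}
```

The refinement costs a few extra multiplies, so the raw estimate is cheaper than the refined result, but the refined result is close to full single precision.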

-----Original Message-----
From: Friam <[hidden email]> On Behalf Of uǝlƃ ↙↙↙
Sent: Friday, January 8, 2021 4:04 PM
To: [hidden email]
Subject: Re: [FRIAM] Q_rsqrt() vs 1/sqrt()
