[Openbook question]

#define N 10000

__global__ void vectorAdd(float *a, float *b, float *c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N/10)
        c[idx*10] = a[idx*10] + b[idx*10];
}

There are 4 CUDA blocks, and each CUDA block has 10 threads. How can we improve the floating-point operations per byte for the above code? Choose all that apply.

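A minimal sketch of the usual fix, assuming the intended answer is memory coalescing (the name vectorAddCoalesced is illustrative, not from the original): consecutive threads access consecutive 4 B floats, so each 128 B transaction delivers 32 useful values instead of roughly 3, raising the floating-point operations per byte. Note that this variant covers all N elements rather than every 10th, so it demonstrates the access pattern, not identical semantics.

#define N 10000

// Coalesced access: thread idx touches element idx, so a warp's loads
// and stores fall in one contiguous, transaction-friendly region per array.
__global__ void vectorAddCoalesced(float *a, float *b, float *c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        c[idx] = a[idx] + b[idx];
}

Other answer choices that typically qualify for this kind of question: performing more arithmetic per loaded element, or reusing loaded data across threads via shared memory.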

[Openbook question]

#define N 10000

__global__ void vectorAdd(float *a, float *b, float *c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N/10)
        c[idx*10] = a[idx*10] + b[idx*10];
}

In the above code, what will be the floating-point operations per byte? Assume that the memory transaction size is 128 B and there is no cache. Choose the closest value.

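A worked estimate under one common reading of the question (these counts are an assumption, not the official key): each active thread performs 1 floating-point add and makes three strided 4 B accesses (load a, load b, store c). With 128 B transactions and no cache, if every access pays for a full transaction, that is 3 × 128 = 384 B per FLOP, i.e. about 1/384 ≈ 0.0026 FLOPs per byte. If adjacent threads are instead assumed to share transactions (the 40 B stride means one 128 B transaction carries about 128/40 ≈ 3.2 useful floats), the effective cost is about 3 × 40 = 120 B per FLOP, i.e. roughly 1/120 ≈ 0.008. For contrast, a fully coalesced version would move only 12 B per FLOP, about 0.083.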

[Open book]

#define N 10000

__global__ void vectorAdd(float *a, float *b, float *c) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N/10)
        c[idx*10] = a[idx*10] + b[idx*10];
}

Assuming 100 CUDA blocks, each consisting of 100 threads, with a warp width of 16 and a page size of 4 KB, what optimizations would be most helpful in reducing address translation overhead in this code?
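
A back-of-the-envelope page count under the stated assumptions (not the official key): the guard idx < N/10 leaves 1,000 active threads, each touching element idx*10, so each array's footprint spans about 1,000 × 40 B = 40,000 B ≈ 10 pages of 4 KB; three arrays touch roughly 30 distinct pages. A 16-wide warp's strided accesses span 16 × 40 B = 640 B per array. Packing the same 1,000 elements contiguously (i.e., coalescing the layout, as in the sketch above) shrinks each array's footprint to about 4,000 B ≈ 1 page, cutting the distinct translations needed by roughly 10×; allocating the three arrays in one contiguous buffer or using larger pages would reduce TLB pressure further.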