You may have noticed a reduction in blog posts of late. The cause is a GPU porting project I've undertaken for theSkyNet POGS. I guess I should post something about it, in case some of you are interested.
As you may or may not know, the guts of the main POGS application is MAGPHYS. Basically, this science application loops through library files to find a best fit of different attributes for a given image pixel. The problem with the existing client is the main section of code loops through in a brute-force sequential kind of fashion. Hence, the application by default is not very "parallel" and requires re-work to parallalise and produce a successful GPU port.
We decided that converting from FORTRAN (F77) to C/C++ initially would be the best option as most of the GPU platform frameworks uses some derivation of C99. This is not to say that frameworks such as OpenCL and CUDA don't support other languages, it was just cleaner this way. Our choice of GPU framework was OpenCL.
The actual port from FORTRAN to C was fairly straight-forward, however, there were quite a few little "gotchas". This included things like: array indexing differences - FORTRAN starts at 1; floating point problems - FORTRAN does not do nearest-even rounding; and, overall output "weirdness" attributed to FORTRAN and C differences. None of these really kept me bogged down for too long. It just meant a lot of debugging and customised functions to get around them. These are temporary since the goal is to eventually move over to the C client only.
For the GPU port, it was a bit of a learning curve being the first time implementing OpenCL kernel code. I've worked with things like threading before, however, parallelising code for processing on a GPU was new to me. After hours of reading, I dove right in and started coding.
Writing the C code to prepare the OpenCL program, kernel, devices etc... and run the kernel was fairly easy. The tricky part was re-working some of the data that was going to be buffered into device memory and read back later. I decided that batching up sets of library models for the kernel threads to crunch was the best option. This essentially meant parsing the library model arrays to the device memory once and allocating space in device memory for the kernel threads to output data. At the end of the batch I would read back memory from the device and have a C based loop consolidate those results appropriately.
At present, we get a 3 - 4 time speed increase using the OpenCL implementation on modern machines with modern cards. There is a bit more work to increase that slightly by parallelizing the post batch part that accumulates output results. I find the biggest bottleneck to be reading large quantities of memory back from the device. If I can do the post batch stuff inside the device, I can reduce the amount of memory to be read back from the device before moving onto the next batch of models.
Overall, the porting process has been a great learning exercise. I'm hoping that in the next month the application can be stabilised and released to the public to crunch POGS on their GPUs. This needs approval from the project leader of course. I know there are a few more hurdles to overcome, but I'm optimistic.
Regarding my little cruncher projects (Raspberry Pi and ODROID), I will try and get back into them. I have 4 subjects coming up in Semester 1 so I'm guessing that this will drown most of my “play time”.