You may have noticed a reduction in blog posts of late. The cause is a GPU porting project I've undertaken for theSkyNet POGS. I guess I should post something about it, in case some of you are interested.
As you may or may not know, the guts of the main POGS application is MAGPHYS. Basically, this science application loops through library files to find a best fit of different attributes for a given image pixel. The problem with the existing client is the main section of code loops through in a brute-force sequential kind of fashion. Hence, the application by default is not very "parallel" and requires re-work to parallalise and produce a successful GPU port.
We decided that converting from FORTRAN (F77) to C/C++ initially would be the best option as most of the GPU platform frameworks uses some derivation of C99. This is not to say that frameworks such as OpenCL and CUDA don't support other languages, it was just cleaner this way. Our choice of GPU framework was OpenCL.
The actual port from FORTRAN to C was fairly straight-forward, however, there were quite a few little "gotchas". This included things like: array indexing differences - FORTRAN starts at 1; floating point problems - FORTRAN does not do nearest-even rounding; and, overall output "weirdness" attributed to FORTRAN and C differences. None of these really kept me bogged down for too long. It just meant a lot of debugging and customised functions to get around them. These are temporary since the goal is to eventually move over to the C client only.
For the GPU port, it was a bit of a learning curve being the first time implementing OpenCL kernel code. I've worked with things like threading before, however, parallelising code for processing on a GPU was new to me. After hours of reading, I dove right in and started coding.
Writing the C code to prepare the OpenCL program, kernel, devices etc... and run the kernel was fairly easy. The tricky part was re-working some of the data that was going to be buffered into device memory and read back later. I decided that batching up sets of library models for the kernel threads to crunch was the best option. This essentially meant parsing the library model arrays to the device memory once and allocating space in device memory for the kernel threads to output data. At the end of the batch I would read back memory from the device and have a C based loop consolidate those results appropriately.
At present, we get a 3 - 4 time speed increase using the OpenCL implementation on modern machines with modern cards. There is a bit more work to increase that slightly by parallelizing the post batch part that accumulates output results. I find the biggest bottleneck to be reading large quantities of memory back from the device. If I can do the post batch stuff inside the device, I can reduce the amount of memory to be read back from the device before moving onto the next batch of models.
Overall, the porting process has been a great learning exercise. I'm hoping that in the next month the application can be stabilised and released to the public to crunch POGS on their GPUs. This needs approval from the project leader of course. I know there are a few more hurdles to overcome, but I'm optimistic.
Regarding my little cruncher projects (Raspberry Pi and ODROID), I will try and get back into them. I have 4 subjects coming up in Semester 1 so I'm guessing that this will drown most of my “play time”.
Hi Daniel.
ReplyDelete99.99% of this is 100% over my head ... oh how I remember Fortran, though. Read as "I failed".
One thing immediately pops into my tiny brain. Savagely cutting through this excellent thread ...
"loops through library files to find a best fit of different attributes for a given image pixel .... the biggest bottleneck to be reading large quantities of memory back from the device"
Are pixels that cannot ever provide any results dropped from the process at 1st read or even dropped from the raw data before it is presented for upload?
If I read the problem properly (and I'm happy to be told I'm an idiot) a large number of pixels will be useless for processing because they have no usable attributes. If they were to be ignored from the outset that might significanty reduce the data size being handled.
Alternatively, - like the data compression of sound/image/video - a stream of "all x00 or all xFF" bytes would be encoded to occupy a tiny space compared with tons of raw x00/xFF bytes. If you get my drift.
Someone has probably thought of that already so I'll shut up now :-)
HTH, Ray
You're spot on. There's a process (i know nothing about) that filters out bad pixels before sending out. There's also a threshold for filtering out library models as well.
ReplyDeleteThe cruncher usually gets a file of 1-20 pixel IDs containing observation of various attributes. These pixels are then matched up against a model and results fired back to the BOINC server for ingestion.
The actual memory device issue is a different part. The issue lies where the kernel threads are building up probability histogram values. Basically each thread is making its own of structure instead of writing to a shared one. I'm doing this to get around using atomic operations and locks. I can fix this by either having another set of threads that accumulate these results into one structure before copying from device or actually use semaphores inside the first set of threads. It would mean 3000x16 bytes compared to 1000000x16 bytes needing to be copied back...Big difference...Just need to get it done.
This wouldn't be an issue if I was dealing with an OpenCL CPU app as it allows direct reading/writing of of device memory whereas GPU you need to copy back. At least I think that's the case from my readings.
“The issue lies where the kernel threads are building up probability histogram values. Basically each thread is making its own structure instead of writing to a shared one. I'm doing this to get around using atomic operations and locks.”
ReplyDeleteTwo things -
a) limit the number of pixels processed to something-usable-per-unit-of-memory before the process starts (split input to smaller and more manageable units) - maybe more difficult later stitching together any significant results???
b) limit the extent of each structure built beyond which meaningful result would be worthless
FWIW I have no idea whether any of these ideas have any relevance to anything ...
Hey Ray
ReplyDelete- "a) limit the number of pixels processed to something-usable-per-unit-of-memory before the process starts (split input to smaller and more manageable units) - maybe more difficult later stitching together any significant results???"
The actual program, fit_sed, deals with one pixel at a time. It's the models it compares against that get batched up. Each batch fires off 1,000,000 threads that have the task of calculating probability against the model attributes that are indexed by its thread id. I can reduce this number but it still has to get through all possible models. It actually used to be more but I had to reduce due to running out of device memory. At the moment it uses 150MB.
- "b) limit the extent of each structure built beyond which meaningful result would be worthless"
That's definitely a possibility. I could do something dynamically but I'd need to think about it a bit more.
I really just need to do more in the device that will lead to copying less data back for each batch.
In saying all this, at the moment the OpenCL app gets 3-4x time improvement over original fortran. This additional feature to reduce memory copy back will shave another 10% or so off the time. So it's not a huge improvement I'm trying to make.
Good talk by the way. It's always good to get others thoughts on things.
DeleteHi Daniel.
DeleteI hope so ... anyway I thunk a bit. Hope I'm making sense as follows.
When you say "It's the models it compares against that get batched up" ... how are the models given? Is it any old random selection or a bet fit batch for re-processing thing? Maybe that's a question for the project team rather than a developer ...
When you say "it still has to get through all possible models" maybe that could apply to b) above ... defined by the word "possible".
Does that mean:-
bail out on 1st hit or
record all possible hits
(if no hits after all done then OK - it's more thought time)
The former suggests batches could be reduced as a hit vs the 1st batch obviates the need for other processing (or issue some kind of "cancel remaining threads as job done" command if batches cannot be reduced).
The latter would mean more thought e.g., maybe pre-processing the batches to weed out clearly useless models before attempting to process the pixel against them. Kinda like approaching the problem from a different angle.
For example, if you have a needle and you're looking for a matching needle in a haystack with one or more needles hidden inside then first get a big magnet to get all the needles out and you're only dealing with needles rather than needles & hay before you start the comparison. (It's probably way more complicated than that I guess, but I hope you get my drift.)