Interactive global illumination using NVIDIA’s CUDA

14 Feb 2010

I am currently preparing my “Diplomarbeit” (the German equivalent of a Master’s thesis) on interactive global illumination on the GPU (Graphics Processing Unit). The exact topic of the thesis is not fixed yet. I use the paper

An Efficient GPU-based Approach for Interactive Global Illumination
Rui Wang, Rui Wang, Kun Zhou, Minghao Pan, Hujun Bao
ACM Transactions on Graphics, Vol. 28, No. 3, August 2009

as a starting point. Since I’m very interested in writing and testing my own sample application, I decided to explore NVIDIA’s CUDA (Compute Unified Device Architecture) development kit. Based on my experiments with CUDA I want to define the exact topic of the thesis.

As CUDA is essentially an extension of the well-known C programming language, it wasn’t too challenging to get started. After some time I chose to implement a simple CUDA-based ray tracer as the basis of my application. I began by porting parts of a Java-based ray tracer I had written for computer graphics courses. To import scene models I currently use the Open Asset Import Library. It meets my requirements, as I currently need triangle data (vertices, normals and colors) only.

The latest version of the application performs the following tasks directly on the GPU:

  • Generation of rays (both primary and secondary rays)
  • Ray–triangle intersection tests
  • Shadow ray generation and intersection
  • Phong shading at the intersection points found

It has to be noted that there is no acceleration data structure (to speed up the intersection search) yet. Instead, a brute-force approach is taken that tests each ray against all triangles. Because of this I only work with very small scenes consisting of fewer than 100 triangles. One prominent example of such a scene is the Cornell Box. I use a simplified variant of that scene, as I only need some basic scene information.
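The core of that brute-force step, the ray–triangle test, can be sketched as follows. This is a minimal host-side C++ version of the Möller–Trumbore algorithm; the `Vec3` type and its helpers are assumptions for this sketch, not my actual CUDA types:

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b)   { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float dot(Vec3 a, Vec3 b)   { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  cross(Vec3 a, Vec3 b) {
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// Möller–Trumbore ray/triangle test. Returns true and writes the ray
// parameter t if the ray (orig + t * dir) hits the triangle (v0, v1, v2).
bool intersectTriangle(Vec3 orig, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, float* t)
{
    const float EPS = 1e-7f;
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    Vec3 p  = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < EPS) return false;   // ray parallel to triangle plane
    float inv = 1.0f / det;
    Vec3 s = sub(orig, v0);
    float u = dot(s, p) * inv;                // first barycentric coordinate
    if (u < 0.0f || u > 1.0f) return false;
    Vec3 q = cross(s, e1);
    float v = dot(dir, q) * inv;              // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f) return false;
    *t = dot(e2, q) * inv;
    return *t > EPS;                          // hit must lie in front of the origin
}
```

In the brute-force variant this test simply runs in a loop over all triangles per ray, keeping the smallest positive t.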

The following screenshot shows this Cornell Box variant rendered with my CUDA-based ray tracer at 512×512 resolution. To improve quality it was generated using simple 2×2 supersampling. I deactivated secondary rays since the model has no reflective or transmissive material properties.

On my NVIDIA GeForce GTS 250 I get about 12 fps. Before I optimize the ray tracer or add other features such as photon mapping, I want to implement an acceleration data structure, specifically the kd-tree described in

Real-Time KD-Tree Construction on Graphics Hardware
Kun Zhou, Qiming Hou, Rui Wang, Baining Guo
ACM Transactions on Graphics, Vol. 27, No. 5, December 2008

For now I will not upload any files or source code. I plan to do that when everything is a little more stable and optimized.


Please note that it's not possible to submit additional comments. The comment form did not make it into this version of my site.

By: WangPeng 12 Apr 2010

Dear Mathias:
Thank you for solving my problems about the kdtree a few days ago!
This Wednesday I will have to give a talk on “An Efficient GPU-based Approach for Interactive Global Illumination” in front of my fellow students. But I am stuck on the “illumination cut” part of that paper. I would be very grateful if you could extricate me from my confusion.

From the paper we know that the kd-tree of the photon map in Wang’s paper is different from the one in “Realistic Image Synthesis Using Photon Mapping”. A leaf node may contain 1 to 32 photons. (I think it would be better to use Jensen’s method to build a more balanced tree in the large node stage.)

The first problem is: what is the value of n_p?
n_p appears in “n_p is the normal at the node’s center, and R_p is the …”, two lines below Equation 3 in the illumination cut section. Is it the normal of the photon at the middle of the longest axis of the node’s box?

The second problem is: how do you keep the cut at 4K–8K nodes?
Wang says he picks E_min at the 12th–13th level of the tree. I think this will lead to more than 10,000 nodes in the work list, because there may be about 4K–8K nodes whose ~E_p is above E_min at levels 12–13 alone, and possibly more at lower levels. I also don’t know what “if p.children not both in worklist then worklist.add(p)” in Algorithm 3 (in the supplementary material of the paper) means. If p here means any node, the number of nodes in the work list will be about half the number of nodes in the tree at the beginning of the “while not worklist.empty()” loop. Does he mean doing this only for nodes p that have a leaf-node child?

The third problem is: what does “else worklist.remove(p); worklist.add(p.children)” at the bottom of Algorithm 3 mean?
~E_p is evaluated using a very crude method, so a difference between E_p and ~E_p is not surprising, and he stores E_p instead of ~E_p if the node is added to the cut. In my opinion the difference between them only means that ~E_p is not accurate; it doesn’t mean that the center of the node isn’t a good sample point. Why does he substitute p’s children for p? Or does this substitution lead to a better distribution of the sample points? I have printed Walter’s “Lightcuts: A Scalable Approach to Illumination”; maybe that paper will give me some inspiration. I have asked one of my fellow students, but he didn’t know either, so I turn to you for help. Maybe you understand it but find it hard to explain clearly. I would be very glad if you could give me some inspiration, and I understand if you are busy and have no time to reply.


By: Mathias 12 Apr 2010

Dear WangPeng,

I guess I’m not that far into Wang’s paper yet. I’m currently working on a KNN search based on the one from Zhou. Maybe I can answer some of your questions later today.

I skim-read your comment and noticed one thing: Zhou’s kd-tree for photon mapping may have small nodes with up to 32 photons, but those nodes are split according to the VVH cost, too. That’s done in the small node stage.

Right now I’m not using this VVH cost in my implementation, so I cannot guess what the average leaf size would be. But with my simple split cost evaluation, I get an average leaf node size of about 1.92 for the Sponza scene, that is 1.92 photons per leaf.

Yes, the balance of the kd-tree might be slightly worse than in Jensen’s method. However, I guess there isn’t a comparable GPU based implementation of it.


By: Mathias 13 Apr 2010

Well, let’s see if I can help you anyhow:

First problem: hmm, “normal at the node center” is quite vague. As the directions of the photons are already used in w_i, and photons don’t have normals, I would guess that it has something to do with the scene geometry. Probably you’d have to find a photon right in the middle of the box and take the normal of the surface where it was generated.

Second problem: well, I have no practical experience with the number of nodes for a given E_min. However, not all nodes will stay in the cut: in the second for-loop, parent nodes are removed when their children are in the work_list, too.

The third loop checks the nodes bottom-up, starting at the bottom level of the tree and going up. Therefore, in each iteration only the nodes p of one level are processed, and only those nodes p whose children are not both represented in the work_list are added to the work_list.
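If it helps, this is how I would sketch that pass in plain C++. It is my own reconstruction of the supplementary Algorithm 3, not the paper’s code, and the complete-binary-tree indexing is my assumption:

```cpp
#include <set>

// Bottom-up pass (my reading of Algorithm 3): walk the nodes from the
// last one up to the root and add a node whenever its children are not
// both already represented in the work list. Nodes are indexed as a
// complete binary tree: the children of i are 2*i+1 and 2*i+2; nodes
// whose child indices fall outside the tree are leaves.
void bottomUpPass(int nodeCount, std::set<int>& workList)
{
    for (int p = nodeCount - 1; p >= 0; --p) {
        int l = 2 * p + 1, r = 2 * p + 2;
        bool bothIn = l < nodeCount && r < nodeCount &&
                      workList.count(l) && workList.count(r);
        if (!bothIn) workList.insert(p);      // cover this subtree
    }
}
```

For a small three-level tree (7 nodes) this adds all four leaves plus the root and skips the middle level, so roughly every other level ends up represented, which seems consistent with your estimate of about half the nodes.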

Third problem: it replaces a node by its children every time its estimated ~E_p value differs too much from the accurate irradiance E_p. This should improve the accuracy of the cut, since the children’s errors are smaller than the error of the parent node: r_p^2 is used for the area in formula (3), and this overestimates the search neighborhood, as noted in the paper. For child nodes this error should be smaller than for the parent node. n_p might also have some influence, but I’m not yet sure about that.

I also have to read Walter’s paper about lightcuts. As you can see in my latest post, I’m currently working on photon mapping and will start to implement Wang’s method in the next few days. So I’m happy to discuss the details of this method with you.


By: 王鹏 13 Apr 2010

Second problem:
In the third loop, any node p whose children are not both represented in the work_list is added to the work_list. So for every three nodes p, p.lChild, p.rChild, there will be at least one node in the work_list. So for every two adjacent levels a and a+1, there will be at least n nodes in the work_list, where n is the number of nodes in level a. So about half of the nodes in the tree will be added to the work_list, and the number of nodes in the work_list will be much more than 8K.

Third problem:
Yes, the paper says “thereby improve its accuracy”, but I’m confused about this method. Accuracy here means that the center of the node is a good sample point for the node, that E_p is a good approximation for every position in the node’s box, and that the photons in the node are uniformly distributed. I think the difference between E_p and ~E_p mainly comes from the method used to evaluate ~E_p, not from a non-uniform distribution of the photons, because the box is too inaccurate compared to the sphere. A more accurate way to detect whether the photons of p have a bad distribution might be to compare the E_p of p.lChild with the E_p of p.rChild, but at a higher cost. Maybe for large nodes p whose children have the same volume, we could compare the number of photons in p’s children to detect the distribution of the photons in p’s box. I haven’t seen how he implements the VVH for the small nodes in the construction of the photon map. Maybe we could add a variable “distrib” to the node struct and store the difference between the distributions of the two child nodes in it during the build-up of the photon map. Then, in the illumination cut stage, we could evaluate distrib: if it is too large, we replace p by its children; otherwise we evaluate E_p and add p to the cut.

New problem:
You say “photons per leaf” is 1.92 in your implementation. Let’s assume that for a node’s box there are about 20 boxes surrounding it on average, and we need to find about 150 photons in the tree. Would we get a better search speed if we made the “photons per leaf” 7?

To get a better result in the final gathering stage, maybe there is a lot of work to do on the construction of a better photon-map tree and a better illumination cut. Your progress on this large project is really quick. My roommate, who is Zhou’s student, is implementing the kd-tree on the GPU using the BSGP language, but is currently stuck on a bug in the large node stage. He is also new to GPU programming, but maybe an old hand compared to me, who has never written a GPU program. At the moment I have no time to practice CUDA; maybe I will next month. I hear that it’s very difficult to debug CUDA programs. Didn’t you encounter any horrible bugs?

By: 王鹏 13 Apr 2010

Oh my god,
all the text has become bold.
I misused b> for /b>.
Forgive my foolishness!

By: 王鹏 13 Apr 2010

Maybe the bold text will save your computer some energy in the displaying.O(∩_∩)O

By: Mathias 13 Apr 2010

No problem, I edited your initial post and added the missing slashes for the closing tags. ;-)

Some time ago I printed the paper about the BSGP language, but I’ve not used it yet. Currently I write my code directly in CUDA C. I also had several problems with the kd-tree implementation; special cases like triangles lying directly in a splitting plane were the reason for that. Yeah, debugging isn’t easy. I’m not sure what NVIDIA’s new tools will bring to the table, but using the emulation mode can be annoying.

Second problem: according to the algorithmic description in the supplementary material, you seem to be correct. I’m irritated by the “and adding nodes to cover leaf nodes that do not have parents representing them in the cut” (main paper, page 5). However, the target size of 4K–8K is for the result, that is, after the last step. So either some termination criterion for the third loop is missing, or the last step gets the cut down to the final size.

Third problem: well, both E_p and ~E_p are computed using the same photon distribution, and yes, the main difference should be the method of evaluation. ~E_p uses a larger search neighborhood by taking just r_p^2, where r_p is the maximum side length of the AABB. For the standard density estimation done for E_p, we would have some query radius r around the node center and use PI*r^2 instead of r_p^2. r could be much smaller than r_p, so E_p uses a smaller search neighborhood.
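As a toy illustration of that difference (my own sketch, not Wang’s actual evaluation): gather the photon flux around a point once with the standard PI*r^2 disc area (E_p-style) and once with a larger radius and the plain r_p^2 area (~E_p-style). The `Photon` struct and the constant photon powers are assumptions for this sketch:

```cpp
#include <vector>

// Toy photon: position plus a scalar power.
struct Photon { float x, y, z; float power; };

// Sum the power of all photons within `radius` of the query point and
// divide by the given area estimate, as in simple density estimation.
float density(const std::vector<Photon>& ph, float cx, float cy, float cz,
              float radius, float area)
{
    float flux = 0.0f;
    for (const Photon& p : ph) {
        float dx = p.x - cx, dy = p.y - cy, dz = p.z - cz;
        if (dx * dx + dy * dy + dz * dz <= radius * radius)
            flux += p.power;
    }
    return flux / area;
}
```

With r_p larger than r (and r_p^2 larger than PI*r^2), the same gathered flux is divided by a larger area, so the ~E_p-style value comes out lower than the E_p-style one, which is the overestimated neighborhood the paper mentions.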

Furthermore, both r and the number of photons used to evaluate E_p might vary from node to node, depending on the standard density estimation algorithm.

Regarding your comments on the distribution: well, I have no implementation of the VVH yet either. Your idea might be an extension of the illumination cut algorithm by Wang et al. However, I’m not really that experienced with photon mapping and its algorithms yet, so I cannot estimate how well your suggestion would work.

New problem: that should depend on the speed of the traversal algorithm. This is the C_ts summand in the VVH formula in Zhou’s kd-tree paper. For a slow traversal algorithm, a larger number of photons per leaf would surely lead to better performance. I haven’t performed any experiments regarding the traversal speed of my photon map implementation yet.