Using Embree generated BVH trees for GPU raytracing

Intel released its Embree collection of raytracing kernels, with source, some time ago, and I recently had the opportunity to compare the included BVH generation library against my own implementation in terms of BVH tree quality. The quality of a scene’s BVH is critical for quick traversal during raytracing, and typically a number of techniques, such as the Surface Area Heuristic (SAH) one I am currently using, are applied during tree generation to improve it.

It appears that Embree is tuned for CPU-side traversal and by default it produces wide BVH trees, with 4 or 8 children per node (BVH4 or BVH8), suitable for SIMD acceleration. Such trees are harder to use during a GPU-based traversal. It does provide a sample (bvh_builder) though that showcases how to generate a BVH with 2 children per node (BVH2), and this is the one I based the Embree integration on. I built the Embree libraries directly from source, although Intel provides prebuilt versions if you prefer.

The process of integrating Embree into the toy engine was pretty straightforward: I pretty much followed the bvh_builder tutorial, filling a vector of RTCBuildPrimitive elements with the primitives of the scene, creating an RTCDevice and building a tree of Embree BVH nodes.

Each RTCBuildPrimitive is described by the min/max bounds of its axis-aligned bounding box, a primitive ID that points to the data of this primitive (which has to be stored externally) and a geometry ID that can be used to identify a mesh.

avector<RTCBuildPrimitive> prims;

RTCBuildPrimitive prim;
prim.lower_x = bbox.MinBounds.x;
prim.lower_y = bbox.MinBounds.y;
prim.lower_z = bbox.MinBounds.z;
prim.geomID = 0;
prim.upper_x = bbox.MaxBounds.x;
prim.upper_y = bbox.MaxBounds.y;
prim.upper_z = bbox.MaxBounds.z;
prim.primID = index;
prims.push_back(prim);
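Filling the whole array for a triangle mesh might look like the following. This is a self-contained sketch: `Triangle`, `Float3` and the `RTCBuildPrimitive` stand-in are my own illustrative definitions (the real struct lives in Embree's headers, with the same field layout), not the engine's actual types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in mirroring the field layout of Embree's RTCBuildPrimitive.
struct RTCBuildPrimitive {
    float lower_x, lower_y, lower_z;
    unsigned geomID;
    float upper_x, upper_y, upper_z;
    unsigned primID;
};

struct Float3 { float x, y, z; };
struct Triangle { Float3 v0, v1, v2; };

// Build one RTCBuildPrimitive per triangle, computing its AABB on the fly.
// The triangle data itself stays external; primID indexes back into it.
std::vector<RTCBuildPrimitive> buildPrimitives(const std::vector<Triangle>& tris)
{
    std::vector<RTCBuildPrimitive> prims;
    prims.reserve(tris.size());
    for (size_t i = 0; i < tris.size(); i++) {
        const Triangle& t = tris[i];
        RTCBuildPrimitive prim;
        prim.lower_x = std::min({t.v0.x, t.v1.x, t.v2.x});
        prim.lower_y = std::min({t.v0.y, t.v1.y, t.v2.y});
        prim.lower_z = std::min({t.v0.z, t.v1.z, t.v2.z});
        prim.upper_x = std::max({t.v0.x, t.v1.x, t.v2.x});
        prim.upper_y = std::max({t.v0.y, t.v1.y, t.v2.y});
        prim.upper_z = std::max({t.v0.z, t.v1.z, t.v2.z});
        prim.geomID = 0;                 // single mesh in this example
        prim.primID = (unsigned)i;       // index into the external triangle data
        prims.push_back(prim);
    }
    return prims;
}
```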

To create the BVH tree one needs to create an RTCBVH object, fill in a struct with the appropriate arguments and call rtcBuildBVH to build the BVH.

RTCBVH bvh = rtcNewBVH(g_device);

/* settings for BVH build */
RTCBuildArguments arguments = rtcDefaultBuildArguments();
arguments.byteSize = sizeof(arguments);
arguments.buildFlags = RTC_BUILD_FLAG_NONE;
arguments.buildQuality = quality;
arguments.maxBranchingFactor = 2;
arguments.maxDepth = 1024;
arguments.sahBlockSize = 1;
arguments.minLeafSize = 1;
arguments.maxLeafSize = 1;
arguments.traversalCost = 1.0f;
arguments.intersectionCost = 1.0f;
arguments.bvh = bvh;
arguments.primitives = prims.data();
arguments.primitiveCount = prims.size();
arguments.primitiveArrayCapacity = prims.capacity();
arguments.createNode = InnerNode::create;
arguments.setNodeChildren = InnerNode::setChildren;
arguments.setNodeBounds = InnerNode::setBounds;
arguments.createLeaf = LeafNode::create;
arguments.splitPrimitive = splitPrimitive;
arguments.buildProgress = nullptr;
arguments.userPtr = nullptr;

Node* root = (Node*) rtcBuildBVH(&arguments);

A couple of things are worth noting: maxBranchingFactor should be set to 2 to create a BVH with 2 children per node. Also, a number of callbacks need to be passed that control the creation of nodes. I based mine on the ones showcased in the bvh_builder tutorial with a small tweak: Embree’s BVH trees store the bounding boxes of the children in the parent node, while in my implementation each node is stored with its own bounding box. So I changed the setBounds callback to reflect that.

static void setBounds(void* nodePtr, const RTCBounds** bounds, unsigned int numChildren, void* userPtr)
{
	assert(numChildren == 2);
	((InnerNode*)nodePtr)->bounds = merge(*(const BBox3fa*)bounds[0], *(const BBox3fa*)bounds[1]);
}

For the same reason I also tweaked the sah() member function of the InnerNode, which calculates the node cost for the Surface Area Heuristic.

float sah()
{
	BBox3fa& bounds0 = children[0]->bounds;
	BBox3fa& bounds1 = children[1]->bounds;
	return 1.0f + (area(bounds0)*children[0]->sah() + area(bounds1)*children[1]->sah()) / area(merge(bounds0, bounds1));
}
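To make the cost formula concrete, here is a small self-contained mock of area(), merge() and the two-leaf case. BBox here is a simplified stand-in for Embree's BBox3fa, and leaves are assigned a cost of 1; this is only meant to illustrate the formula, not the engine's actual types.

```cpp
#include <algorithm>
#include <cassert>

// Simplified stand-in for an axis-aligned bounding box.
struct BBox { float lo[3], hi[3]; };

// Surface area of an AABB. (Half-area is also common; only the ratio matters in sah().)
static float area(const BBox& b)
{
    float dx = b.hi[0] - b.lo[0], dy = b.hi[1] - b.lo[1], dz = b.hi[2] - b.lo[2];
    return 2.0f * (dx*dy + dy*dz + dz*dx);
}

// Union of two AABBs.
static BBox merge(const BBox& a, const BBox& b)
{
    BBox r;
    for (int i = 0; i < 3; i++) {
        r.lo[i] = std::min(a.lo[i], b.lo[i]);
        r.hi[i] = std::max(a.hi[i], b.hi[i]);
    }
    return r;
}

// SAH cost of an inner node whose two children are leaves of cost 1:
// 1 + (area(b0)*1 + area(b1)*1) / area(parent)
static float innerSAH(const BBox& b0, const BBox& b1)
{
    return 1.0f + (area(b0)*1.0f + area(b1)*1.0f) / area(merge(b0, b1));
}
```

For two unit cubes side by side the parent box is 2x1x1, so the cost is 1 + (6 + 6) / 10 = 2.2, matching the intuition that tightly packed children are cheap relative to the parent's surface.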

The buildQuality is another important argument, as it determines the quality of the generated tree. Embree supports 3 quality levels:

  • RTC_BUILD_QUALITY_LOW: fast generation, based on Morton codes, but lower tree quality; suggested for dynamic scenes.
  • RTC_BUILD_QUALITY_MEDIUM: balanced generation time and tree quality, using binned SAH.
  • RTC_BUILD_QUALITY_HIGH: slower tree generation but highest tree quality, using SAH and spatial splits.

There is actually a 4th mode, RTC_BUILD_QUALITY_REFIT, which can refit a BVH when only the vertex buffer (positions) changes. I didn’t evaluate this mode in this instance.

The BVH tree creation is invasive and alters the source primitive data, so the bvh_builder example allocates an extra vector and makes a copy of the data.

avector<RTCBuildPrimitive> prims;
prims.reserve(prims_i.size() + extraSpace);
prims.resize(prims_i.size());

/* we recreate the prims array here, as the builders modify this array */
for (size_t j=0; j<prims_i.size(); j++) prims[j] = prims_i[j];

NB: Pay attention to the extraSpace variable. The spatial-splits SAH path of the High quality mode will attempt to store split primitives at the end of the prims vector, so we need to allocate some extra memory for this (I doubled the size of the prims vector). If you omit this step Embree will silently fall back to a Medium quality tree. The same will happen if you omit the splitPrimitive() callback in the arguments above.
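For AABB primitives the splitPrimitive() callback itself boils down to clipping the box at an axis-aligned plane, as in the bvh_builder tutorial. Here is a self-contained sketch of that clipping step; the Bounds struct with lower/upper arrays is an illustrative stand-in (the real callback receives Embree's RTCBuildPrimitive and writes two RTCBounds with named lower_x..upper_z fields).

```cpp
#include <cassert>

// Illustrative stand-in for an AABB; the real Embree structs use named
// lower_x..upper_z fields rather than arrays.
struct Bounds { float lower[3]; float upper[3]; };

// Split an AABB at 'pos' along axis 'dim' (0=x, 1=y, 2=z). Both halves keep
// the original extents on the other two axes; only the split axis is clipped.
static void splitPrimitive(const Bounds& prim, unsigned dim, float pos,
                           Bounds& lprim, Bounds& rprim)
{
    assert(dim < 3);
    assert(prim.lower[dim] <= pos && pos <= prim.upper[dim]);
    lprim = prim;
    rprim = prim;
    lprim.upper[dim] = pos;   // left half ends at the split plane
    rprim.lower[dim] = pos;   // right half starts at the split plane
}
```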

The part that converts the BVH to the serialised form to be used during GPU tracing remained exactly the same.
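For context, a common way to serialise a BVH2 for GPU traversal is to flatten it depth-first into a linear buffer, storing each node's first child immediately after it and keeping an explicit index to the second child. The sketch below shows that general layout only; BuildNode and GPUNode are illustrative types under my own assumptions, not the engine's actual serialised format.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Source tree node: a BVH2 with per-node bounds (as in my modified setBounds above).
struct BuildNode {
    float lower[3] = {0, 0, 0}, upper[3] = {0, 0, 0};
    const BuildNode* children[2] = { nullptr, nullptr };  // null for leaves
    uint32_t primID = 0;                                  // valid for leaves
    bool isLeaf() const { return children[0] == nullptr; }
};

// Flat GPU-friendly node: in depth-first order the first child always follows
// its parent, so only the second child's index needs to be stored.
struct GPUNode {
    float lower[3], upper[3];
    uint32_t rightChildOrPrim;  // second-child index for inner nodes, primID for leaves
    uint32_t isLeaf;
};

// Depth-first flattening; returns the index at which this node was emitted.
static uint32_t flatten(const BuildNode& n, std::vector<GPUNode>& out)
{
    uint32_t idx = (uint32_t)out.size();
    out.push_back(GPUNode{});
    for (int i = 0; i < 3; i++) {
        out[idx].lower[i] = n.lower[i];
        out[idx].upper[i] = n.upper[i];
    }
    if (n.isLeaf()) {
        out[idx].isLeaf = 1;
        out[idx].rightChildOrPrim = n.primID;
    } else {
        out[idx].isLeaf = 0;
        flatten(*n.children[0], out);                    // lands at idx + 1
        uint32_t right = flatten(*n.children[1], out);   // index stored explicitly
        out[idx].rightChildOrPrim = right;               // (avoid holding a reference
    }                                                    // across push_backs)
    return idx;
}
```

During GPU traversal a miss on an inner node's bounds skips to the stored second-child index, while a hit continues to idx + 1, which is why this layout is convenient for a stack-based shader loop.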

I then ran some performance tests to compare the quality of the library’s BVH trees against my own single-threaded SAH implementation, using the Sponza scene. The tests took place on my old HD4000 laptop with an i7-4510U @ 2GHz CPU, running on battery.

In terms of BVH generation time, Embree beats my Reference implementation, at all quality levels, by a wide margin especially at lower quality levels. This is expected as BVH generation with Embree is multithreaded.

In terms of the memory required to store the GPU BVH tree buffer, all methods are similar, with the High quality Embree mode taking up somewhat more.

Finally, in terms of raytracing cost for the Sponza scene, Embree performs really well, surpassing my Reference implementation at all quality modes, with especially great results at High.

I repeated the test with a more complicated scene, composed of a number of trees (I didn’t use alpha testing in this instance). This scene has smaller and more uniform triangles and is also denser, with more opportunities for bounding box overlap.

The BVH generation test returned some interesting results: the High quality Embree mode is now slightly slower than my Reference implementation. It may be that the spatial-split SAH BVH algorithm does not cope well with that particular scene.

No surprises from the memory requirements test, the High quality mode occupies slightly more space than the other modes.

The Raytracing cost test returned another set of interesting results. This time around my Reference implementation is faster than the Low quality Embree tree and much closer to the Medium and High modes.

Overall, the Embree library produced high quality BVH trees with fast generation times, which speed up traversal significantly (depending on the scene), and it has now become part of my toy engine.


Open Twitter DMs, a 2 year retrospective

It’s been two years since I opened my Twitter DMs and invited people to ask graphics related questions and seek advice about how to get into the games industry. I think it’s time for a quick retrospective.

The majority of the questions revolve around how to start learning graphics programming. Nowadays there is a large choice of graphics APIs, graphics frameworks, high quality engines freely available and advanced graphics techniques, and the visual bar in modern games is very high. It is understandable that someone trying to learn graphics programming may feel overwhelmed. The many options one has nowadays can also work to their advantage though; I have written some advice on how one can approach learning graphics programming in an older post.

Continue reading “Open Twitter DMs, a 2 year retrospective”

A Survey of Temporal Antialiasing Techniques: presentation notes

At Eurographics 2020 virtual conference, Lei Yang did a presentation of the Survey of Temporal Antialiasing Techniques report which included a good overview of TAA and temporal upsampling, its issues and future research.

I have taken some notes while watching it and I am sharing them here in case anyone finds them useful.

Continue reading “A Survey of Temporal Antialiasing Techniques: presentation notes”

Optimizing for the RDNA Architecture: presentation notes

AMD recently released a great presentation on RDNA, with a lot of details on the new GPU architecture and optimisation advice.

While watching it I took some notes (like you do in real conferences) and I am sharing them here in case anyone finds them useful. They can be used as a TLDR but I actively encourage you to watch the presentation as well, some parts won’t make much sense without it. I have added some extra notes of my own in brackets [] as well.

Continue reading “Optimizing for the RDNA Architecture: presentation notes”

GPU architecture resources

I often get asked in DMs about how GPUs work. There is a lot of information on GPU architectures online; one can start with these:

One can then refer to these for a more in-depth study:

Continue reading “GPU architecture resources”

Validating physical light units

Recently I added support for physical light units to my toy engine, based on Frostbite’s and Filament’s great guides. Switching to physical light units allows one to use “real-world” light intensities (for example in lux and lumens), camera settings (eg aperture, shutter speed and ISO) as well as mix analytical and captured light sources (HDR environment maps) correctly.

Continue reading “Validating physical light units”

Ways to speedup pixel shader execution

Catching up on my Twitter DMs I came across a question about ways to increase the execution speed of pixel/fragment shaders. This is quite a broad issue and the specifics will depend on the particularities of each GPU/platform and the game content, but I am expanding on my “brain-dump” style answer in this post in case others find it useful. This is not a comprehensive list, more like a list of high level pointers to get one started.

Continue reading “Ways to speedup pixel shader execution”

Hybrid screen-space reflections

As realtime raytracing is slowly, but steadily, gaining traction, a range of opportunities to mix rasterisation-based rendering systems with raytracing are starting to become available: hybrid raytracing, where rasterisation provides the hit points for the primary rays; hybrid shadows, where shadowmaps are combined with raytracing to achieve smooth or higher-detail shadows; hybrid antialiasing, where raytracing is used to antialias the edges only; and hybrid reflections, where raytracing fills in the areas that screenspace reflections can’t resolve due to lack of information.

Of these, I found the last one particularly interesting: how well can a limited information lighting technique like SSR be combined with a full-scene aware one like raytracing, so I set about exploring this further.

Continue reading “Hybrid screen-space reflections”

Readings on the State of the Art in Rendering

Last week at work a junior colleague asked me where I get the presentations I’ve been reading from. This made me realise that, understandably, it might not be so obvious or common knowledge for people just starting graphics programming, so I compiled a list of online resources I frequently use to study the state of the art in Rendering. Continue reading “Readings on the State of the Art in Rendering”
