We implemented real-time, parallel Catmull-Clark subdivision, handling view-adaptivity without creating any cracks or T-junctions. We programmed our project in CUDA on the Gates 5 machines, which use Nvidia GTX 480s.
Catmull-Clark subdivision is an algorithm often used in graphics applications to make polygonal surfaces smoother. It works by taking the set of faces and vertices in a mesh and creating a set of new faces within each face, refining the mesh with more and more polygons to create the illusion of a round surface. Below is an image capturing the general process.
Figure 1: A cube being subdivided into a sphere
The Catmull-Clark algorithm takes a set of vertices and faces comprising a mesh, and outputs a new set of vertices and faces corresponding to a new mesh that has many more vertices and faces, and will appear to be a smoother version of the original mesh upon rendering.
The algorithm is carried out over multiple subdivision steps. In each subdivision step we create a new mesh with more faces, vertices, and edges, which appears more smooth than the last. In our version, the user can determine how many levels to subdivide the image to.
Before we subdivide, we make some assumptions about our input:
1. The face vertices observe a consistent ordering (vertices are organized in a clockwise manner). This is important when we implement view-adaptivity, since we need to compute the surface normals; an inverted orientation will give an incorrect normal direction.
2. The mesh faces are quadrilaterals. Catmull-Clark can handle arbitrary meshes in its most general form (and after one iteration, every face will be a quadrilateral anyway), but in order to implement the algorithm with a compact, simple data-structure on a GPU, we assume a mesh composed entirely of quadrilaterals.
Within each subdivision step we must perform several operations on the current set of faces, vertices, and edges to correctly generate the new mesh. They can be summarized, in basic terms, as follows:
1. Create Face Points
For every face, create a face point: the centroid of all vertices composing the face. Then create 4 new (incomplete) faces, each filling in two of its four vertices: the face point and one original corner point.
Figure 2: A face point, centroid of the 4 quadrilateral vertices
2. Create Edge Points
For every edge, create an edge point: a vertex that is the centroid of the endpoints of the edge and the face points of the two adjacent faces.
Create 4 new edges for each edge point, connecting it to the two original endpoints of the edge, as well as the two adjoining face points.
Add the edge points to faces to complete the new faces.
Figure 3: An edge point and the 4 vertices creating it
Figure 4: With the new edge points, the new faces are now complete
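As a concrete illustration of the first two steps, here is a minimal serial sketch; the Vec3 type and function names are our own illustrative choices, not the project's exact code:

```cpp
#include <array>

struct Vec3 { float x, y, z; };

// Centroid of n points.
static Vec3 average(const Vec3* pts, int n) {
    Vec3 s{0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; ++i) {
        s.x += pts[i].x; s.y += pts[i].y; s.z += pts[i].z;
    }
    return {s.x / n, s.y / n, s.z / n};
}

// Step 1: the face point is the centroid of the face's 4 vertices.
Vec3 facePoint(const std::array<Vec3, 4>& faceVerts) {
    return average(faceVerts.data(), 4);
}

// Step 2: the edge point is the centroid of the edge's 2 endpoints
// and the face points of the 2 adjacent faces.
Vec3 edgePoint(Vec3 a, Vec3 b, Vec3 facePt1, Vec3 facePt2) {
    Vec3 pts[4] = {a, b, facePt1, facePt2};
    return average(pts, 4);
}
```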
3. Update Old Vertex Points
For each vertex that existed at the start of the subdivision step, adjust its position to accommodate the addition of the new vertices. Set each vertex to the following position: P' = (F + 2R + (n-3)P)/n, where:
- n: no. adjacent edges
- F: average of adjacent face-points
- R: average of adjacent edge mid-points
- P: original point position
Figure 5: An original vertex p has been moved to p' after the update
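The update rule translates directly into code; a small sketch, with an illustrative Vec3 type:

```cpp
struct Vec3 { float x, y, z; };

// P' = (F + 2R + (n-3)P) / n, applied component-wise, where n is the
// vertex valence, F the average of adjacent face points, R the average
// of adjacent edge midpoints, and P the original position.
Vec3 updateVertex(Vec3 P, Vec3 F, Vec3 R, int n) {
    auto blend = [n](float p, float f, float r) {
        return (f + 2.0f * r + (n - 3) * p) / n;
    };
    return {blend(P.x, F.x, R.x), blend(P.y, F.y, R.y), blend(P.z, F.z, R.z)};
}
```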
These 3 steps are repeated for as many rounds as necessary to achieve a satisfactory visual effect.
Opportunity For Parallelization
There is a lot of room for parallelization here from the outset, over the faces, edges and vertices. As we subdivide further, the opportunity for parallelism increases, since the number of faces and edges grows roughly fourfold with each level (an exponential increase), along with a considerable number of new vertices. These portions of the algorithm are essentially data-parallel.
A GPU seems to be an excellent platform for performing these computations, owing to the scope of parallelism and the sheer size of the data set. For instance, a standard mesh, like bigguy, has ~1000 faces/edges/vertices initially. After a few iterations, these numbers increase to over 100,000. A GPU has the resources to handle this immense data-set.
That being said, there is a substantial amount of data that we need to obtain to compute the new values. Specifically, we need to be able to perform queries like:
a) Given a vertex, get all its neighbors
b) Given a vertex, get all the adjacent faces
c) Given an edge, get all its adjacent faces
However, we can't use complex data-structures, such as STL maps, since they don't perform well on a GPU, owing to their complexity and size. We need something much simpler, preferably arrays.
Challenges For Parallelization
There are several parts of this process which make parallelization not as straightforward as it seems.
First, there is some need for synchronization. Updating old vertices requires some newly-generated face point and edge point positions to be added to those of the old vertices. Done in parallel, multiple faces, or multiple edges, might attempt to add to the same old vertex, which can lead to race conditions unless the addition is done atomically.
Second, SIMD divergence is an issue here. Adjacent vertices in our arrays are not necessarily adjacent in the mesh, and adjacent vertices in the mesh can be stored in very different locations in our arrays. We may not subdivide some faces/vertices/edges at all, and they may sit in the same blocks as many that will be subdivided. In other words, the threads in a CUDA block may do varying amounts of work and access scattered locations in memory.
The number of faces, edges and vertices increases exponentially with subdivision depth, increasing the amount of memory required. As a result, an increasing number of operations must be performed on the mesh, which can be computationally expensive.
From a user's perspective, only a small fraction of the faces/edges/vertices is visible at a time. At least 50% of the faces are on the back-facing side of the mesh and aren't visible. Additionally, if the user zooms in on a region of the mesh, there are even fewer visible faces. Thus it would be far more efficient to only subdivide the faces that are actually displayed on the screen.
This motivates an approach for view-adaptivity: determining which parts of the mesh are actually in the camera's view and then performing subdivision only on those parts. As can be seen in the diagram below, the right side of the mesh is facing away from the camera and not in view, and thus does not need to be subdivided – we can keep it as it is unless the user rotates the view. View-adaptivity involves both filtering out parts of the mesh that are off-screen because of zooming, and back-face culling, filtering out the side of the mesh that faces away from the camera.
Figure 7: The part of the animal facing the camera/inside the view-frustum is marked for subdivision
The Problem of Cracks/Hanging-nodes
However, a problem arises with view-adaptive subdivision. If we only subdivide the visible parts of the mesh, we get artifacts such as cracks and hanging nodes, which are undesirable. These occur at the boundary between faces that are subdivided and faces that are not. We discuss our approach for dealing with this in the following sections.
As previously mentioned, we use the Nvidia GTX 480 GPUs with the CUDA platform for the parallel portion of this project. Our initial code base, written in C++, features code for reading an input mesh, OpenGL support, definitions of vector operations, and rendering support.
Implementation of Basic Catmull-Clark Algorithm
While researching the various GPU-based algorithms we could use, we implemented a simple serial version using STL data structures such as vectors, sets and maps. We used this mainly for debugging purposes and checking our results, not for performance reasons.
After finalizing the algorithm to implement, we decided to take the following approach:
a) Implement a serial version, without view-adaptivity, using the specified data-structures. We carefully refactored our code so that we could easily replace the parallelizable portions with CUDA kernel calls and add the necessary synchronization.
b) We then implemented a CUDA version of the same algorithm. We also started investigating methods to cull faces for view-adaptivity.
c) We drew up a serial version to handle the full algorithm, with view-adaptivity and mechanisms to avoid cracks.
d) We finally implemented the full CUDA version.
Data Structure Details:
The data-structure that we decided to implement is known as a render-dynamic mesh.
We have 3 pairs of arrays, for vertices, faces and edges:
a) Vertex-Array: Each vertex contains the vertex position, normal, texture-coordinate and valence (no. adjacent edges)
b) Face-Array: Each face contains the 4 vertex indices of the vertices in the face
c) Edge-Array: Each edge contains the 2 vertex indices of the edge end-points, along with the 2 face indices of the adjacent faces
For each array type, we have an in-array and an out-array. We read in values from the in-array, then write values to the out-array. After a round of subdivision, we swap the in and out buffers, to prepare for the next round.
Figure 6: A graphical overview of the render-dynamic mesh
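A sketch of this layout in C++ (type and field names are illustrative, not our exact code):

```cpp
#include <vector>
#include <utility>

struct Vertex { float pos[3]; float normal[3]; float uv[2]; int valence; };
struct Face   { int v[4]; };            // indices of the 4 corner vertices
struct Edge   { int v[2]; int f[2]; };  // endpoint indices, adjacent-face indices

struct RenderDynamicMesh {
    std::vector<Vertex> vertsIn, vertsOut;
    std::vector<Face>   facesIn, facesOut;
    std::vector<Edge>   edgesIn, edgesOut;

    // After a round of subdivision, the out-buffers become the
    // in-buffers for the next round.
    void swapBuffers() {
        std::swap(vertsIn, vertsOut);
        std::swap(facesIn, facesOut);
        std::swap(edgesIn, edgesOut);
    }
};
```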
We still need a way to obtain the face-point and mid-point values of adjacent faces/edges when we're computing the new vertex position. To do this, whenever we compute a face-point/mid-point (in the face-point/edge-point phases), we add those values to the entries of the original vertices. However, in order to accumulate all of these results correctly, we need to use atomic addition.
Given that, on average, around 5 faces/edges share the same vertex, we don't expect too much contention (although we need atomic adds for 8 values per vertex: 3 for the position, 3 for the normal vector and 2 for the texture-coordinate).
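On the GPU these accumulations use CUDA's float atomicAdd. A CPU-side sketch of the same idea, using a compare-and-swap loop (std::atomic&lt;float&gt;::fetch_add only arrived in C++20):

```cpp
#include <atomic>

// Atomically accumulate 'value' into 'target', mirroring what a float
// atomicAdd does on the GPU. compare_exchange_weak refreshes 'expected'
// with the current value whenever another thread wins the race.
void atomicAddFloat(std::atomic<float>& target, float value) {
    float expected = target.load();
    while (!target.compare_exchange_weak(expected, expected + value)) {
        // retry with the refreshed 'expected'
    }
}
```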
Culling the faces not visible to the user requires some rather involved mathematical calculations. These details would be familiar to someone with a solid foundation in computer graphics, but we've provided them here since they are interesting and demonstrate some useful techniques in graphics.
From an implementation standpoint, this can be done in two phases:
a) Removing faces outside view-frustum:
A camera's range of view is defined by a view-frustum. We can determine the view-frustum's planes using the camera's line-of-sight vector, the distances to the camera's near/far clip planes, the field-of-view angle, etc. All of these values are obtained from the camera object that we use for rendering. A little 3-D geometry/trigonometry yields the desired vertices and planes. We orient these so that the normals point inwards; the reason for this will become clearer in the next section.
Figure 8: The view-frustum construction
We can then determine which faces lie inside the view frustum. This is done by considering the signed distance from the face's vertices to the frustum planes. A point that lies on the same side as the plane's normal is said to be a positive distance from the plane, and vice-versa for points on the opposite side.
We use the following rules for determining whether a face is inside/intersecting/outside:
a) If, for all frustum planes, the face's vertices are a positive distance away, the face is said to lie inside the frustum
b) If, for some plane, some vertices are a positive distance away and others a negative distance, the face is said to intersect the frustum
c) If there exists a frustum plane from which the vertices are all a negative distance away, the face is said to lie outside the frustum
We cull the faces that lie outside the frustum and keep the faces that lie inside or intersect it.
Figure 9: Explanation of Frustum-Culling
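These rules amount to counting signed distances per plane; a serial sketch, with illustrative Plane and vertex layouts:

```cpp
#include <array>

struct Plane { float n[3]; float d; };  // inward-pointing normal, offset

// Signed distance from point p to the plane; positive on the side the
// normal points toward (i.e. toward the frustum interior).
static float signedDist(const Plane& pl, const float p[3]) {
    return pl.n[0] * p[0] + pl.n[1] * p[1] + pl.n[2] * p[2] + pl.d;
}

enum FaceClass { Outside, Intersecting, Inside };

// Classify a quad face against the 6 frustum planes using the rules above.
FaceClass classifyFace(const std::array<Plane, 6>& frustum,
                       const float verts[4][3]) {
    bool anyNegative = false;
    for (const Plane& pl : frustum) {
        int negative = 0;
        for (int i = 0; i < 4; ++i)
            if (signedDist(pl, verts[i]) < 0.0f) ++negative;
        if (negative == 4) return Outside;  // all corners behind one plane
        if (negative > 0) anyNegative = true;
    }
    return anyNegative ? Intersecting : Inside;
}
```

Only faces classified Outside are culled; Inside and Intersecting faces are kept for subdivision.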
b) Removing back-faces
We can also remove the back-faces of the mesh. This is done by considering the dot-product of the camera's line-of-sight vector and the face's surface normal. For this to work correctly, we assume that the input mesh has consistently oriented vertices, so that we can compute the surface normals with the correct direction. This also creates further implementation work when we add vertices to the newly subdivided faces; they need to have the same orientation.
Figure 10: Explanation of Backface-culling
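The back-face test itself is a single dot product; a sketch (the Vec3 type and sign convention are illustrative, assuming 'view' points from the camera into the scene):

```cpp
struct Vec3 { float x, y, z; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// With 'view' pointing from the camera into the scene, a face whose
// normal has a positive component along 'view' faces away from the
// camera and can be culled.
bool isBackFace(Vec3 faceNormal, Vec3 view) {
    return dot(faceNormal, view) > 0.0f;
}
```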
It's worth mentioning that these computations can all be done in parallel on the GPU as well. However, while this approach minimizes the number of calculations we have to perform, it doesn't change the fact that we still launch the same number of threads as before. This increases the risk of SIMD divergence, since some threads do hardly any work at all.
Avoiding Cracks/Hanging Nodes
Simply subdividing all faces in view, however, does not account for the problem of cracks and T-junctions. We used an algorithm provided in the paper Parallel View-Dependent Tessellation of Catmull-Clark Subdivision Surfaces. In addition to subdividing visible faces, we also partially subdivide adjoining faces using two types of templates.
The algorithm can be briefly summarized as follows:
Initialize by marking every other vertex as a 'potentially active' vertex. This ensures that no two adjacent vertices are marked. This is done serially on the initial mesh with a BFS approach.
Determine the edges to be subdivided, and mark their potentially active vertices as 'active'.
For every face, if it has 0 active vertices, do not subdivide it. If it has 1 active vertex, 3-subdivide it (subdivide into 3 quadrilateral faces). If it has 2 active vertices, 4-subdivide it. It cannot have more than 2 because of the alternating assignment of active vertices.
Reassign potentially active vertices for the next round.
Figure 11: Initial alternating 'potentially active' vertex marking
Figure 12: 3-subdivision vs. 4-subdivision
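The per-face decision above reduces to counting active corners; a sketch (the flag layout is illustrative):

```cpp
#include <array>

// Returns how many child faces a face produces: 1 (left as-is),
// 3 (3-subdivide), or 4 (4-subdivide). The alternating marking
// guarantees a face never has more than 2 active corners.
int childFaceCount(const std::array<bool, 4>& activeCorner) {
    int active = 0;
    for (bool a : activeCorner) active += a ? 1 : 0;
    if (active == 0) return 1;
    if (active == 1) return 3;
    return 4;  // active == 2
}
```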
This leads to a nice gradual drop in subdivision level from the visible faces subdivided fully to the faces not subdivided at all.
Figure 13: Comparison of results with naive subdivision vs 3/4 subdivision
Implementing view-adaptivity introduced some new challenges. We could no longer assume that every face subdivides into 4 child faces; it could subdivide into 1, 3, or 4, so it is no longer possible to simply statically allocate and assign indices without wasting space. Additionally, threads subdividing faces need to determine where they can write in the out face-buffer. The same is true for edges.
To solve this problem, we maintained flags indicating whether vertices were active or potentially active, as well as which faces/edges were to be divided further and produce either 3/4 children. We then performed a prefix-sum scan over these flags to determine the correct indices to write to in the out buffers.
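For example, an exclusive prefix sum over the per-face child counts yields each face's starting write index in the out buffer; a serial sketch of the indexing (on the GPU this runs as a parallel scan):

```cpp
#include <cstddef>
#include <vector>

// Face i writes its children to the out face-buffer starting at
// offsets[i]; the final running total is the out-buffer's size.
std::vector<int> writeOffsets(const std::vector<int>& childCounts) {
    std::vector<int> offsets(childCounts.size());
    int running = 0;
    for (std::size_t i = 0; i < childCounts.size(); ++i) {
        offsets[i] = running;
        running += childCounts[i];
    }
    return offsets;
}
```

For child counts {4, 1, 3, 4}, the offsets are {0, 4, 5, 8} and the new buffer holds 12 faces.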
We have timed results for serial and parallel Catmull-Clark subdivision, tested on the Big Guy mesh.
We have measured the following:
1. Regular Catmull-Clark Subdivision, Serial vs. Parallel
2. Atomics Included vs. Non-Atomic Operations
3. Time Spent in Each of the 3 Steps: Faces, Edges, Vertices
4. Profile of Reads, Writes, Atomics, and Computation
5. Speedup Using Floats Instead of Doubles
6. View Adaptive Serial vs. Parallel Speedup
1. Serial vs. Parallel
We made two graphs for each mesh: the first shows the serial vs. parallel times in ms; the second shows the speedup of the parallel version over the serial version. These speedups don't represent the sort of performance we would expect of a GPU. We decided to profile the algorithm further, but we had some initial thoughts on why the performance was sub-optimal:
a) Cost of atomics
b) Irregular memory accesses
c) SIMD divergence
It is also worth mentioning that our serial implementation is blazingly fast (taking less than a second to subdivide Big Guy through the 6th iteration).
2. Time Spent in Atomics
We measured the percentage of time spent in atomic operations and the time spent in non-atomic operations. Noticeably, atomics are a huge source of slowdown in the parallel version and seem to consistently account for about half of the total subdivision time.
We require atomics to add the values of each face point and edge point to their respective neighboring vertices. However, this amounts to several atomic additions, because we need to add not only the position but also the normal and texture coordinates of each vertex. In addition, we store these values in vectors, which means we perform an atomic add on each element of the vector.
3. Time spent in each phase
We measured the time spent in each of the 3 stages of the algorithm in the parallel version: the faces stage, edges stage, and original vertices stage. The results are fairly close to what we predicted; as the edge stage is the most involved, it makes sense that it would take the longest.
The seemingly disproportionately short time spent in the vertices stage is, we suspect, related to the fact that the vertices stage requires no atomic operations, and also has higher locality: we only need to access memory from a small section in the vertex out buffer. The values required to be added to each vertex to adjust its position have already been added in the face and edges stage.
5. Floats Instead of Doubles
Our starter code used doubles to perform all the vector math. However, we realized that the use of doubles was quite expensive and unnecessary for our purposes, specifically with the way the GTX-480 implemented atomics. So, we switched to using floats, and got significantly better performance, with a time of under 300 ms for 6 subdivision levels. The speedup over the serial version (which we also modified to incorporate floats) improved as well.
6. View Adaptive Serial vs Parallel Speedup
Finally, our view-adaptive approach achieves even better times, in both the serial and parallel versions. Significantly, the GPU view-adaptive implementation achieves over 3x speedup over the serial version, and in the 6th iteration achieves times of under 150 ms.
We believe that greater speedup in this case can be attributed to the large amount of extra work needed to keep track of 3-subdivided and 4-subdivided faces and edges, and the calculations required to add face and edge points to the right locations. The CPU version would have to perform many of these operations over large arrays sequentially, but our kernel launches can parallelize these operations as well.
We feel that the choice of a GPU is a solid one, but the cost of atomics, which are quite expensive on a GPU, definitely limits the potential speedup. It may be interesting to see whether CPU implementations using OpenMP or Pthreads would be comparable with a GPU implementation.
References
Parallel View-Dependent Tessellation of Catmull-Clark Subdivision Surfaces. Anjul Patney, Mohamed S. Ebeida, and John D. Owens.
View Frustum Culling, from Lighthouse 3D Tutorials.
Catmull-Clark Subdivision: The Basics. Rory Driscoll, CodeItNow. http://www.rorydriscoll.com/2008/08/01/catmull-clark-subdivision-the-basics/
Models Resource: https://github.com/skaslev/catmull-clark/tree/master/objs.
List of Work done by Each Student
Arjun: Implemented initial Catmull-Clark algorithm, incorporated basic serial version of algorithm described in paper, developed view-adaptive algorithm code, prepared power-point presentation
Sally: Found initial algorithm paper, documents on view-frustum/back-face culling, implemented CUDA version, profiled performance and obtained speedup measurements, prepared writeup