László Szécsi and Balázs Benedek / Improvements on the kd-tree

Should n be smaller than $n_0$, which is possible as $n_0$ is an average value, we stick to the linear estimate:

$$
f(n) =
\begin{cases}
n & \text{if } n < n_0 \\
\left(\dfrac{n}{n_0}\right)^{1+\log_2 q} \cdot n_0 & \text{if } n \ge n_0
\end{cases}
\tag{6}
$$

3.3. Results

Although Havran pointed out the difficulties of constructing a general cost function, his investigations have also shown that for a specific scene, the time of traversal as a function of the number of patches usually deviates little from a logarithmic curve. Therefore, it is expected that if the best coefficients are found, the rendering time for a scene may be decreased. However, as explained above, the linear estimate performs better than expected, so only a small speed-up will occur.

We have summarised run times for Bi-directional Path Tracing [4] image computation for several scenes. Time is given in seconds everywhere.

Scene: House
Number of patches: 24737
Measured n_0: 3.2
Linear cost run-time: 65.42

n_0    q = 0.9    q = 0.95    q = 0.98
4      61.51      57.12       57.40
6      68.16      64.48       65.12
8      70.08      64.32       64.89

For other scenes with fewer objects we obtained less significant results. It should also be noted that the time taken by the BPT algorithm may be influenced by factors such as the actual paging coherence. Clearly, there is little improvement, and the method we used to determine q is still not general enough to be used for every scene.
Firstly, this is due to the fact that the formula for traversal time given in our previous article [3] is only true for a sufficiently large scene. Secondly, the concept of the cost reduction caused by a cut may be better than the linear one, but it is still not perfect. Havran's measurements would rather suggest a logarithmic curve. Therefore, it would be worth testing a few other forms of cost functions. However, the use of the $n_0$ value to characterise the object distribution, or tessellation, of the scene would probably be a valuable concept.

4. Tree representation, memory usage and cache coherence

4.1. Previous work

The single most important objective is to assure fast traversal. First, this means that we have to be able to find the children of a node quickly, and second, we have to retrieve the data from memory fast. In order to achieve this, we should have as much data as possible corresponding to nearby cells in the cache. That means we have to use the minimum amount of storage space per node, store nearby nodes next to each other, and still be able to find the children of a node fast enough.

A powerful solution to store a complete tree is the compact representation, where every node is an element of an array, and the start of the children-array can be found by multiplying the parent index by the number of children per node.
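As a minimal sketch of the compact representation, assuming a binary tree and 1-based array indices (so that the children of node i start at index 2i), the Python fragment below stores a degenerate, right-leaning chain in such an array. The helper names and the tuple encoding `(label, left, right)` are illustrative assumptions, not taken from the paper.

```python
CHILDREN_PER_NODE = 2  # binary tree, as in a kd-tree


def children_start(parent):
    """Index of the first child of `parent` (1-based indexing assumed)."""
    return parent * CHILDREN_PER_NODE


def compact_array_size(depth):
    """Number of slots of a complete binary tree with `depth` levels."""
    return 2 ** depth - 1


def depth_of(node):
    """Number of levels of a nested (label, left, right) tree."""
    if node is None:
        return 0
    _, left, right = node
    return 1 + max(depth_of(left), depth_of(right))


def store(node, slots, index=1):
    """Write `node` and its descendants into `slots`, without any pointers."""
    label, left, right = node
    slots[index] = label
    start = children_start(index)
    if left is not None:
        store(left, slots, start)
    if right is not None:
        store(right, slots, start + 1)


# A chain of four nodes, each having only a right child.
chain = ("a", None, ("b", None, ("c", None, ("d", None, None))))

slots = [None] * (compact_array_size(depth_of(chain)) + 1)  # slot 0 is unused
store(chain, slots)
used = sum(slot is not None for slot in slots)
```

For this chain only 4 of the 15 usable slots hold a node, which illustrates how strongly an unbalanced tree can waste a compact array.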
The compact representation does not use any pointers, and therefore it needs only the minimum amount of memory, but we need to know the number of nodes in advance. Furthermore, for an unbalanced tree we need a compact structure as deep as the deepest branch of the tree, and a large part of the array will be empty, causing a tremendous waste of memory. From here on, we will call this phenomenon fragmentation.

Another approach is to have separately allocated nodes and use pointers to find children. Two pointers per node may multiply the memory requirement, and locality is not automatically assured. The optimal solution has to use some pointers to overcome the problems of the naïve compact representation, while keeping its advantages.

Such a mid-way solution is proposed by Havran [2]. It uses small compact trees, connected by pointers. This limits the fragmentation caused by imbalance to the sub-trees, and if their memory representation fits into a cache line, cache coherence is utilised well (see Figure 1). However, a large number of pointers are still used, and dynamic allocation of sub-trees is assumed.

4.2. kd-tree in pre-allocated memory

The concept of compact sub-trees may be developed further. More pointers may be eliminated if such a sub-tree is considered a node of another compact tree, where the nodes have twice as many children as the number of nodes on the last level of the sub-tree. These super-nodes do not have to be connected by pointers, as they can also be stored in a compact structure. This way, pointers are completely eliminated and cache coherence is assured, but fragmentation is just as costly an issue as in the case of the simple compact tree.

The problem is that we have to store an unbalanced tree. If the super-nodes are connected by pointers, as Havran suggests, then the fragmentation will be limited to the sub-trees. However, the compact solution can also be improved to handle unbalanced trees. Whenever a branch terminates before the depth of the pre-allocated memory, the would-be children of the leaves become roots of free-space trees. These "holes" can later be used to contain the parts of the tree that would stretch beyond the pre-allocated space. This means that only the nodes on the very last level need to have pointers, which actually point somewhere back into the array.
This mapping is demonstrated in Figure 2. It is of course possible and desirable to use the cache-line-sized sub-tree mapping (see Figure 1) to store the resulting balanced tree for better cache performance.
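The idea of reusing holes can be sketched as follows. This is only an illustration under several assumptions that are not taken from the paper: nodes are encoded as Python tuples, indices are 1-based, a last-level slot that cannot hold its sub-tree stores a `("redirect", target)` entry pointing back into a hole, and holes are handed out in the order they are found. The names `build_preallocated`, `resolve` and `leaves` are hypothetical.

```python
from collections import deque


def build_preallocated(tree, depth):
    """Store an unbalanced tree in an array sized for `depth` complete levels."""
    size = 2 ** depth              # slot 0 unused, nodes live in 1 .. size - 1
    last_level = 2 ** (depth - 1)  # first index of the last level
    slots = [None] * size
    holes = deque()                # roots of free-space sub-trees
    work = deque([(tree, 1)])      # (sub-tree, index) pairs still to be stored
    overflow = deque()             # cuts that reached the last level
    while work or overflow:
        if not work:
            # All regular work is done, so every hole found so far is known.
            node, index = overflow.popleft()
            if not holes:
                raise RuntimeError("pre-allocated array is too small")
            target = holes.popleft()
            slots[index] = ("redirect", target)
            work.append((node, target))
            continue
        node, index = work.popleft()
        if node[0] == "leaf":
            slots[index] = node
            if 2 * index < last_level:
                # Would-be children above the last level become holes.
                holes.extend((2 * index, 2 * index + 1))
        elif index < last_level:
            _, plane, left, right = node
            slots[index] = ("cut", plane)
            work.append((left, 2 * index))
            work.append((right, 2 * index + 1))
        else:
            overflow.append((node, index))
    return slots


def resolve(slots, index):
    """Follow a redirect, if any, to the slot where the node really lives."""
    node = slots[index]
    return node[1] if node[0] == "redirect" else index


def leaves(slots, index=1):
    """Collect leaf payloads from left to right, following redirects."""
    index = resolve(slots, index)
    node = slots[index]
    if node[0] == "leaf":
        return [node[1]]
    return leaves(slots, 2 * index) + leaves(slots, 2 * index + 1)


# A five-level, right-leaning tree stored in an array with four levels.
deep = ("cut", 0.1, ("leaf", "A"),
        ("cut", 0.2, ("leaf", "B"),
         ("cut", 0.3, ("leaf", "C"),
          ("cut", 0.4, ("leaf", "D"), ("leaf", "E")))))
array = build_preallocated(deep, depth=4)
```

In this example the deepest cut does not fit on the last level, so slot 15 holds a redirect to slot 4, a hole left under leaf A, and traversal through `leaves(array)` still visits A to E in order.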
Figure 1: Mapping of a tree into an array using cache-line-sized sub-trees.

Figure 2: Mapping of an unbalanced tree into an array using pointers on the last level. Darkened nodes are leaves.

The number of these pointers could be reduced further if we make use of the fact that leaves on the last level do not have children. That way, a node on the last level is either a leaf or a reference to the actual position of the child node. This representation allows the mapping of an unbalanced tree, together with all the needed pointers, into a single, pre-allocable array. This structure, also compatible with the cache-line mapping, is depicted in Figure 3. Naturally, we need to have a good estimate of the number of nodes to be able to allocate memory in advance. Fortunately, we know that a kd-tree uses 6n splitting planes at most. This also means a maximum of 6n leaves. Adding the worst-case number of pointers, which is exactly the number of nodes on the last level, we conclude that an array with 24n elements suffices.

A node itself has to be as tiny as possible. The above structure assumes that the description of a splitting plane for a non-leaf node, a pointer to the list of objects for a leaf, and a pointer to another node for a redirect node all fit into an element of the array. As the plane is described by a not necessarily precise floating-point number, all these only take up a few bytes. We need one extra bit to distinguish between leaves and non-leaf nodes.

Figure 3: kd-tree using the minimal number of pointers. Dark nodes are leaves, hatched nodes are unused.

4.3. Estimation of the number of necessary pointers

The above figure for the memory requirement can be decreased further if we do not account for the worst case and use somewhat less memory. Should the tree exceed its predefined node count, we will have to terminate the build. As compromising the tree construction algorithm would prove very costly during traversal, a safe size should be chosen, with practically zero chance of overflow. However, if we make use of the fact that no pointers are needed for leaves, a lower figure for the number of pointers may be found. Obviously, if the lowest level of the tree contained only pointers, as the previously given upper bound suggests, every single node above would be referenced.
This is impossible for two reasons: the nodes that belong to the tree originating from the root are not referenced, and, more significantly, leaves are never referenced.

To derive an exact number let us introduce the following nomenclature. Let X be the number of pointers on the last level, and N the number of pre-allocated nodes. If the array is not full, the number of pointers is irrelevant. We are only interested in the case when every node is used as a cut, a leaf or a pointer. Therefore, the number of leaves L, the number of cuts C, and the number of pointers X add up to the size of the array.

$$L + C + X = N \tag{7}$$

The numbers of leaves and cuts are equal.

$$2C + X = N \tag{8}$$

Pointers may only reference cut nodes, and no node can be referenced more than once:

$$C \ge X \tag{9}$$

Substituting this back, we get:

$$3X \le N \tag{10}$$

Therefore, it is not half of the nodes that is needed for pointers in the worst case, only one third. This way, the upper bound for