While working on parallax mapping, somebody told me about a cool presentation: Sparse Virtual Textures. The idea is quite simple: reimplement paging in your shaders, allowing you to have virtually infinite textures while keeping GPU memory usage constant.

Goal was set: add SVT support to my renderer!

Step 1 - Hand-made paging

Paging overview

To understand how SVT works, it is useful to understand what paging is.

On most computers, data is stored in RAM. RAM is a linear buffer: its first byte is at address 0, and its last at address N.

For practical reasons, using real addresses directly is not very convenient. So some clever folks invented segmentation, which then evolved into paging.

The idea is simple: use a virtual address, which the CPU translates into the real (physical) RAM address. The whole mechanism is well explained by Intel1

This translation is possible thanks to pagetables.

Translating every address into a new, independent one would be costly and unnecessary. That’s why the address space is divided into pages. A page is a set of N contiguous bytes; on x86, for example, we often talk about 4 kB pages.

What the CPU translates are page addresses. Each page is translated as a contiguous unit, so the internal offset remains the same. This means that for N bytes, we only have to store N/page_size translations.

paging recap

Here, on the left, you have the virtual memory, divided into 4 blocks (pages). Each block is linearly mapped to an entry in the pagetable.

The mapping can be understood as follows:

  • Take your memory address.
    • address = 9416
  • Split it into a page-aligned value and the rest.
    • 9416 => 8192 + 1224.
    • aligned_address = 8192
    • rest = 1224
  • Take the aligned value, and divide it by the page size.
    • 8192 / 4096 = 2
    • index = 2
  • This result is the index in the pagetable.
  • Read the pagetable entry at this index; this is your new aligned address:
    • pagetable[2] = 20480
  • Add the rest back to this address:
    • physical_address = 20480 + 1224
  • You have your physical address.
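The steps above translate almost directly into code. Here is a minimal Python sketch (the pagetable contents are made up to match the example):

```python
PAGE_SIZE = 4096

# Hypothetical pagetable: page index -> physical page base address.
# Entry 2 matches the example above: virtual page 2 lives at 20480.
pagetable = {0: 4096, 1: 12288, 2: 20480, 3: 0}

def translate(virtual_address):
    """Translate a virtual address into its physical counterpart."""
    index = virtual_address // PAGE_SIZE  # which pagetable entry?
    rest = virtual_address % PAGE_SIZE    # offset inside the page
    return pagetable[index] + rest

print(translate(9416))  # 20480 + 1224 = 21704
```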

Adding the page concept to the shader

To implement this technique, I’ll need to:

  • find which pages to load
  • load them in the “main memory”
  • add this pagetable/translation technique.

This could be done using compute shaders and linear buffers, but why not use textures directly? This way I can just add a special rendering pass to compute visibility, and modify my pre-existing forward rendering pass to support pagetables.

The first step is to build the pagetable lookup system. This is done in GLSL:

  • take the UV coordinates
  • split them into page-aligned address, and the rest
  • compute page index in both X and Y dimensions
  • lookup a texture at the computed index (our pagetable)
  • add the rest back to the looked-up value
uv coordinates
Showing UV coordinates
page aligned UVs
Showing page-aligned UV coordinates
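On the CPU, the same lookup in UV space would look like the sketch below. The pagetable contents and sizes are made up, and I assume the main-memory texture uses the same page grid as the pagetable:

```python
PAGETABLE_SIZE = 4  # the pagetable holds 4x4 entries

# Hypothetical pagetable: virtual page (x, y) -> physical page origin,
# expressed in normalized UVs of the main-memory texture.
pagetable = {(1, 2): (0.25, 0.0)}

def lookup(u, v):
    # Split the UVs into a page index and the intra-page rest.
    px, py = int(u * PAGETABLE_SIZE), int(v * PAGETABLE_SIZE)
    rx, ry = u * PAGETABLE_SIZE - px, v * PAGETABLE_SIZE - py
    # Read the pagetable entry, then add the rest back,
    # rescaled to the extent of one physical page.
    base_u, base_v = pagetable[(px, py)]
    return (base_u + rx / PAGETABLE_SIZE, base_v + ry / PAGETABLE_SIZE)
```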

Computing visibility

The other advantage of paging is the ability to load/unload parts of memory at runtime. Instead of loading the whole file, the kernel only loads the required bits (pages), and only fetches new pages when required.

This is done using a pagefault:

  • The user tries to access a not-yet-loaded address.
  • The CPU faults, and sends a signal to the kernel (page fault).
  • The kernel determines if this access is allowed, and loads the page.
  • Once loaded, the kernel can resume the user program.

This mechanism requires hardware support: the CPU knows what a pagetable is, and has this interrupt mechanism. In GLSL/OpenGL, we don’t have such a thing. So what do we do when interrupts don’t exist? We poll!

For us, this means running an initial rendering pass, but instead of rendering the final output with lights and materials, we output the page addresses (similar to the illustration seen above).

This is done by binding a special framebuffer and doing render-to-texture. Once the pass is completed, the output texture can be read back, and we can discover which pages are visible.

For this render pass, all materials are replaced with a simple shader:

#version 420 core

/* material definition */
uniform float textureid;
/* Size of a page in pixels. */
uniform float page_size;
/* Size of the pagetable, in pixels (aka how many entries do we have). */
uniform float pagetable_size;
/* Size in pixels of the final texture to load. */
uniform float texture_size;
/* Aspect ratio difference between this pass, and the final pass. */
uniform float svt_to_final_ratio_w; // svt_size / final_size
uniform float svt_to_final_ratio_h; // svt_size / final_size

in vertex_data {
    vec2 uv;
} fs_in;

out vec4 result;

/* Determines which mipmap level the texture should be visible at.
 * uv: uv coordinates to query.
 * texture_size: size in pixels of the texture to display.
 */
float mipmap_level(vec2 uv, float texture_size)
{
    vec2 dx = dFdx(uv * texture_size) * svt_to_final_ratio_w;
    vec2 dy = dFdy(uv * texture_size) * svt_to_final_ratio_h;

    float d = max(dot(dx, dx), dot(dy, dy));
    return 0.5f * log2(d);
}

void main()
{
    /* how many mipmap levels we have for the page-table */
    float max_miplevel = log2(texture_size / page_size);

    /* what mipmap level do we need */
    float mip = floor(mipmap_level(fs_in.uv, texture_size));

    /* clamp on the max we can store using the page-table */
    mip = clamp(mip, 0.f, max_miplevel);

    vec2 requested_pixel = floor(fs_in.uv * texture_size) / exp2(mip);
    vec2 requested_page = floor(requested_pixel / page_size);

    /* Move values back into a range supported by our framebuffer. */
    result.rg = requested_page / 255.f;
    result.b = mip / 255.f;

    /* I use the alpha channel to mark "dirty" pixels.
     * On the CPU side, I first check the alpha value for > 0.5,
     * and if yes, consider this a valid page request.
     * I could also use it to store a "material" ID and support
     * multi-material single-pass SVT. */
    result.a = 1.f;
}

Once the page request list is retrieved, I can load the textures into the “main memory”.

The main memory is a simple 2D texture, and page allocation is, for now, simple: the first page requested gets the first slot, and so on until memory is full.

main memory texture
“Main memory” texture

Once a page is allocated, I need to update the corresponding pagetable entry to point to the correct physical address. This is done by updating the right pixel in the pagetable:

  • R & G channels store the physical address.
  • B is unused.
  • A marks the entry as valid (loaded) or not.
pagetable
Pagetable texture
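The allocation and pagetable update can be sketched as follows. The slot count, the 2x2 physical page grid, and the function names are all made up for illustration; only the first-come-first-served policy and the R/G/B/A entry encoding come from the text above:

```python
MEMORY_SLOTS = 4  # physical pages available in the "main memory" texture

free_slots = list(range(MEMORY_SLOTS))
pagetable = {}  # virtual page (x, y, mip) -> pagetable pixel (r, g, b, a)

def allocate(virtual_page):
    """First page requested gets the first slot, and so on."""
    if virtual_page in pagetable:
        return pagetable[virtual_page]  # already resident
    if not free_slots:
        raise MemoryError("main memory full, eviction needed")
    slot = free_slots.pop(0)
    # Encode the entry like the pagetable texture: R & G hold the
    # physical address, B is unused, A marks the entry as valid.
    entry = (slot % 2, slot // 2, 0, 1)  # assumed 2x2 physical grid
    pagetable[virtual_page] = entry
    return entry
```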

Rendering pass

The final pass is quite similar to a classic one, except that instead of binding one texture for the diffuse, I bind 2 textures: the pagetable and the memory.

  • bind the 3D model
  • bind the GLSL program
  • bind the pagetable and main-memory textures.

At this stage, I can display a texture too big to fit in RAM & VRAM.

Step 2: MipMapping

If you look at the previous video, you’ll notice two issues:

  • Red lines showing up near the screen edges.
  • Page load increase when zooming out.

The first issue is because texture loading doesn’t block the current pass. This means I might request a page and not have it ready by the time the final pass is run. I could render it as black, but I wanted to make it visible.

The second issue is because I have a 1:1 mapping between the virtual page size and the texture page size. Zooming out to show the entire plane would require loading the entire texture, a texture which doesn’t fit in my RAM.

The solution to both of these issues is mipmaps.

  • A page at mipmap level 0 covers page_size pixels.
  • A page at mipmap level 1 covers page_size * 2 pixels per axis.
  • A page at mipmap level N covers the whole texture.

Now, I can load the mipmap level N by default, and if the requested page is not available, I just go up in the mip levels until I find a valid page.

Adding mipmaps also allows me to implement a better memory eviction mechanism:
I can now replace 4 pages with one page one level above.
So if I’m low on memory, I can just downgrade some areas, and save 75% of their memory.

Finally, mipmapping reduces the bandwidth requirements: if the object is far away, why load the texture at high resolution? A low-resolution page is enough:

  • less disk load.
  • less memory usage.
  • less latency (since there are fewer pages to load).
physicaladdresses with MipMapping
Showing physical addresses with MipMapping
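The “go up in the mip levels until I find a valid page” fallback can be sketched like this (the set of resident pages is made up; only the top-level page is loaded, which is the default state described above):

```python
MAX_MIP = 3  # at this level, a single page covers the whole texture

# Hypothetical resident set of (page_x, page_y, mip) entries.
loaded = {(0, 0, 3)}

def resolve(page_x, page_y, mip=0):
    """Return the finest loaded page covering (page_x, page_y)."""
    while mip < MAX_MIP and (page_x, page_y, mip) not in loaded:
        # One level up, a page covers twice the area per axis,
        # so the page indices halve.
        page_x //= 2
        page_y //= 2
        mip += 1
    return (page_x, page_y, mip)
```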

Step 3: Complex materials

The initial renderer had PBR materials. Such a material has not only an albedo map, but also normal and roughness+metallic maps. To add these new textures, there are several options:

  • New memory textures, new pagetable texture, new pass.
    • Simple.
    • Requires an additional pass. This is not OK.

  • Same memory texture, same pagetable texture.
    • Each page in fact contains the N textures sequentially, so when one page is requested, N textures are queried and loaded.
    • Easy to implement, but I always have to load all N textures.

  • Same memory texture, multiple pagetable textures.
    • Pagetables are small (16x16 or 32x32), so the overhead is not huge.
    • I can unload some channels for distant objects (normal maps, for example).
    • The drawback is that I now have N*2 texture samples in the shader: one for each texture plus one for its associated pagetable.

Because I like the flexibility of this last option, I chose to implement it. In the final version, each object has 4 textures:

  • memory (1 mip level)
  • albedo pagetable (N mip levels)
  • roughness/metallic pagetable (N mip levels)
  • normal pagetable (N mip levels)

In the following demo, page loading is done in the main thread, but limited to 1 page per frame, making the loading process very visible.

  • Bottom-left graph shows the main memory.
  • Other graphs show the pagetables and their corresponding mip-levels.

Page requests: subsampling, randomness, and frame budget

For each frame, I need to run this initial pass to check texture visibility. Reading this framebuffer back on the CPU every frame is quite slow, and for a 4K output it is prohibitively expensive.

The good news is: I don’t need a 4K framebuffer for that! Pages cover N pixels, so we can just reduce the framebuffer size and hope our pages will still be requested!

The demo above is using a 32x32 framebuffer, which is very small. Done naïvely, this wouldn’t work: some pages would fall between 2 rendered pixels, and never be loaded.

missing pages
8x8 framebuffer, no jitter.
 

A way to solve that is to add some jitter to the initial pass: the page-request viewpoint is not exactly the camera’s position, but the camera’s position plus some random noise.

This way, we can increase coverage without increasing the framebuffer size.
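A minimal sketch of that jitter (the magnitude and function name are made up; in the real renderer, only the low-resolution request pass is jittered, so coverage accumulates over frames):

```python
import random

def jittered_position(camera_position, magnitude=0.05, rng=random):
    """Camera position + some random noise, for the page-request pass.

    The final pass keeps the true camera position untouched.
    """
    return tuple(c + rng.uniform(-magnitude, magnitude)
                 for c in camera_position)
```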

missing pages
8x8 framebuffer, jitter.
  1. See Intel Architectures Developer’s Manual: Vol. 3A, Chapter 3 



I had never experimented with machine learning or denoising. I guess having obscure matrices combined together to produce some result scared me a bit. Surprising for someone who loves computer graphics… 🙃
After failing an interview for an ML-related position (surprising?), I thought: enough is enough, time to play catch-up!

For this project, I started with the basics: Andrew Ng’s ML course. After a couple of days (and obviously having become the greatest ML expert in the world), I decided to tackle the easiest problem ever: image denoising!

The goal

Denoising is a complex field, and some very bright people are making a career out of it. Not my goal!

Here I’ll try to explore some classic denoising techniques, implement them, and, once familiar with the problems, build a custom model to improve the result.

The input:

challenge image

I believe this should be a good candidate:

  • has a flat shape to check edge preservation.
  • has some “noise” to keep (foliage).
  • has some small structured details (steel beams).
  • has smooth gradients (sky).

Step 1 - sanity check

pixel line

From Wikipedia:

noise is a general term for unwanted […] modifications that a signal may suffer

The graph above represents a line of pixels that is part of a smooth gradient. In red are 2 bad pixels. They are bad because they interrupt the smoothness of our graph, and thus are perceived as noise.

Step 2 - Naive average

How can we remove such outliers? Averaging! Each pixel value is averaged with its neighbors, which in this case helps reduce the perceptible noise.

  foreach x, y in image
    neighbors = extract_window_around(image, x, y, window_size=10)
    res = average(neighbors)
    image.set(x, y, res)

smooth, before & after
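The pseudocode above, as runnable Python (grayscale image as a list of rows, with the window clamped at the borders; the original used a 10-pixel window, a 3x3 one keeps the example small):

```python
def box_average(image, window=3):
    """Average each pixel with its neighbors (naive box filter)."""
    h, w = len(image), len(image[0])
    half = window // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp the window to the image bounds.
            ys = range(max(0, y - half), min(h, y + half + 1))
            xs = range(max(0, x - half), min(w, x + half + 1))
            values = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(values) / len(values)
    return out
```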

But in real life, that’s terrible…

real, before & after

The reason for this poor performance is that we don’t discriminate valid details from noise. We lose our edges, and all details are lost.

Step 3 - Better average - YUV vs RGB

The previous image was generated by averaging RGB values over a 10-pixel sliding window. Because it averaged RGB values, it mixed colors. As a result, edges were blurred in a very perceptible way, leading to an unpleasant result.

YUV is another color representation: instead of red, green, and blue, it splits a pixel into luminance (Y, the brightness) and chrominance (U and V, the color).

If we look at the sky, the noise doesn’t seem to alter the color much, only the brightness of the blue. So averaging with the same window, but only on the luminance component, should give better results:

yuv, smooth yuv, real
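A sketch of that luminance-only average. This is one way to approximate it: instead of a full RGB↔YUV round-trip, each pixel’s RGB is rescaled so its luma (BT.601 weights) matches the blurred luma, which preserves the chroma:

```python
def luma(rgb):
    """BT.601 luma from an (r, g, b) triple."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def blur_luma_only(image, window=3):
    """Average only the luminance, leaving the color untouched."""
    h, w = len(image), len(image[0])
    half = window // 2
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - half), min(h, y + half + 1))
            xs = range(max(0, x - half), min(w, x + half + 1))
            lumas = [luma(image[j][i]) for j in ys for i in xs]
            target = sum(lumas) / len(lumas)
            # Rescale RGB towards the blurred brightness; the ratio
            # between channels (the chroma) is preserved.
            ratio = target / max(luma(image[y][x]), 1e-6)
            out[y][x] = tuple(c * ratio for c in image[y][x])
    return out
```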

Step 4 - selective average

Using YUV instead of RGB helped: the sky looks fine, and the green edges look sharper. Sadly, the rest of the image still looks bad. The reason is that I use the same window size for the sky and the tower.

I can improve the solution using a new input: an edge-intensity map. Using the well-known Sobel operator, I can generate a map of the areas to avoid.

  edge_map = sobel(image)
  foreach x, y in image
    window_size = lerp(10, 1, edge_map.at(x, y))
    neighbors = extract_window_around(image, x, y, window_size)
    res = average(neighbors)
    image.set(x, y, res)
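The window modulation itself is just a lerp on the edge intensity (assuming the Sobel output is normalized to [0, 1]; the rounding and clamping are my own choices):

```python
def lerp(a, b, t):
    return a + (b - a) * t

def window_size(edge_intensity, max_window=10):
    """Flat area (intensity 0) -> large window; strong edge -> 1 pixel."""
    return max(1, round(lerp(max_window, 1, edge_intensity)))
```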

edge, real

  • ✅ The square edges are preserved.
  • ✅ The sky blur is gone.
  • ✅ The Eiffel Tower’s edges seem preserved.
  • ❌ Artifacts are visible in the sky (top-right).
  • ❌ The foliage texture is lost.
  • ❌ The metallic structure lost precision.
  • ❌ The grass mowing pattern is completely lost.

Step 5 - ML-based noise detection

In the previous step, I tried to discriminate between areas to blur and areas to keep as-is. The issue is my discrimination criterion: edges. I was focusing on keeping edges, but lost the good noise, like the foliage.

So now I wonder, can I split good noise from bad noise using a classification model?

  foreach x, y in image
    window = extract_window_around(image, x, y, window_size)
    bad_noise_probability = run_model(window)
    blur_window_size = lerp(1, 10, bad_noise_probability)
    res = average_pixels(image, x, y, blur_window_size)
    image.set(x, y, res)

For this model, I went with a naïve approach:

  • select a set of clean images
  • generate their noisy counterpart in an image editor
  • split these images into 16x16-pixel chunks.

model training set extraction
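The chunk extraction can be sketched like this (grayscale images as lists of rows; incomplete border chunks are simply dropped, which is my own simplification):

```python
def extract_chunks(image, size=16):
    """Split an image into non-overlapping size x size chunks."""
    h, w = len(image), len(image[0])
    chunks = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            chunks.append([row[x:x + size] for row in image[y:y + size]])
    return chunks
```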

Those represent my training & test sets (6000 items and 600 items). The goal is now, from a 16x16-pixel window, to determine whether the pixel belongs to noise, or belongs to some detail.

Then, I iterate over the pixels, extract the 16x16 window around each, run the model on it, and use the resulting probability to select my blur window. My guess is that we should now be able to differentiate foliage from sky noise.

Here is the model output: in red, the parts to clean; in black, the parts to keep.

model output

And here is the output:

final result

  • ✅ Edges are preserved.
  • ✅ Steel structure is clear in the middle.
  • ✅ Left foliage looks textured.
  • ❌ Right foliage shadows are still noisy.
  • ❌ Some areas of the steel structure are blurred.
  • ❌ Sky has artifacts.

The model’s training set is composed of only ~6000 chunks extracted from 4 images (2 clean, 2 noisy). Training the same model on a better dataset might be the first way to improve the noise classification.

This result seems better than the bilateral filtering, so I guess that’s enough for a first step into the ML world. I will stop there for now, and move on to the next project!



Some friends were registered for this CTF, and since I had some days off, I decided to work a bit on one RE exercise.

The binary is called BadVM:

[nathan@Jyn badvm]$ ./badvm-original
### BadVM 0.1 ###

Veuillez entrer le mot de passe:
toto
Ca mouline ...
Plus qu'un instant ... On avait la réponse depuis le début en faite :>
Perdu ...

It is a stripped, 64-bit PIE ELF binary. Time to start Binary Ninja. This binary has no anti-debug or packing techniques, just some calls to sleep. Once these calls are NOPed, we can start reversing the VM.

The VM is initialized in the function I called load_vm (0xde6). Then, the function at 0xd5f is called; let’s call it vm_trampoline.

This function chooses the next instruction to execute, loads its address into rax, and calls it. vm_trampoline is called at the end of each instruction; thus, each instruction adds a new entry to the backtrace.

This means that, when returning from the first call to vm_trampoline, we can read the result and return it. This takes us back to load_vm, where the result is checked.

In case of an invalid character in the password, there is an early exit. Input is checked linearly, with no hashing or anything, so instruction counting works well.

Since I was on holiday, I decided to experiment a bit with lldb, and instrument this VM using its API.

Reversing the VM

This VM uses a 0x300-byte-long buffer to run. Some points of interest:

  • 0x4: register A (rip)
  • 0x5: register B
  • 0xFF: register C (result)
  • 0x2fc: register D
  • 0x2fe: register E (instruction mask?)

  • 0x32: password buffer (30 bytes)
  • 0x2b: data buffer (xor data, 30 bytes)
  • 0x200: data start (binary’s .data is copied in this area)

Instructions are encoded as follows:

opcode

To select the instruction, the VM contains a jump-table.

jump-table

Here is one of the instructions (a ~GOTO):

instruction

Final note: each instruction/function has the following prototype:

prototype

Instrumenting using LLDB

This VM does not check its own code, so we can freely use software breakpoints. The code is not rewritten, so offsets are preserved. This allows us to simply use LLDB’s Python API to instrument and analyse the VM’s behavior.

First step, create an lldb instance:

def init():
    dbg = lldb.SBDebugger.Create()
    dbg.SetAsync(True)
    console = dbg.GetCommandInterpreter()

    error = lldb.SBError()
    target = dbg.CreateTarget('./badvm', None, None, True, error)
    # check error

    info = lldb.SBLaunchInfo(None)
    process = target.Launch(info, error)
    print("[LLDB] process launched")
    return dbg, target, process

Now, we can register our breakpoints. Since vm_trampoline is called before each instruction, we only need this one:

    target.BreakpointCreateByAddress(p_offset + VM_LOAD_BRKP_OFFSET)

Now, we can run. To interact with the binary, we can use LLDB’s events: by registering a listener, we get notified each time the process stops, or when a breakpoint is hit.

listener = dbg.GetListener()
event = lldb.SBEvent()

while True:
    if not listener.WaitForEvent(1, event):
        continue

    if event.GetType() != EVENT_STATE_CHANGED:
        # handle_event(process, program_offset, vm_memory, event)
        continue

    regs = get_gprs(get_frame(process))
    if regs['rip'] - program_offset != address:
        print("break location: 0x{:x} (0x{:x})".format(
              regs['rip'] - program_offset, regs['rip']))

To read memory or registers, we can simply do it like this:

process.ReadUnsignedFromMemory(vm_memory + 0, 1, err)

process.selected_thread.frame[frame_number].registers
# registers[0] contains general purpose registers

Now we can implement a pretty-printer to get “readable” instructions. Once everything is put together, we can dump the execution trace:

mov [0x00], 0xff
mov [0x01], 0x01
mov tmp, [0x00]  	# tmp=0xff
mov [tmp], [0x01]	# src=0x1
mov [0x00], 0x0b
mov [0x01], 0x1d
mov tmp, [0x00]  	# tmp=0xb
mov [tmp], [0x01]	# src=0x1d
mov [0x01], 0x0b
mov tmp, [0x01]  	# tmp=0xb
mov [0x00], [tmp]	# [tmp]=0x1d
mov r5, [0x00]
sub r5, [0x0a]   	# 0x1d - 0x0 = 0x1d
if r5 == 0:
    mov rip, 0x2d
mov [0x01], 0x0a
[...]

Now, we can reverse the program running in the VM:

def validate(password, xor_data):
    if len(password) != len(xor_data):
        return -1

    D = 0
    for i in range(len(xor_data)):
        tmp = (D + 0xAC) % 0x2D
        D = tmp
        if xor_data[i] != chr(ord(password[i]) ^ tmp):
            return i

    return len(xor_data)
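Since the keystream only depends on the index, the check can be inverted directly to recover the password from the xor data (the xor_data used in the test below is illustrative, not the actual bytes from the binary):

```python
def recover(xor_data):
    """Invert the VM's check: password[i] = xor_data[i] ^ keystream[i]."""
    D = 0
    password = []
    for c in xor_data:
        D = (D + 0xAC) % 0x2D
        password.append(chr(ord(c) ^ D))
    return "".join(password)
```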

And we get the flag:

SCE{1_4m_not_4n_is4_d3s1yn3r}

Conclusion

This VM has no anti-debug, packing, or anything special, but it was a fun binary to reverse. To instrument the VM, lldb is useful, but using DynamoRIO would be a more elegant method.



Working on my 3D game engine is the perfect occasion to reimplement classic algorithms. On today’s menu: self-shadowed steep parallax mapping. First step: get the classic steep parallax mapping working.

parallax final result

Here are two good links to implement this algorithm:

Steep parallax mapping allows us to get a pretty good result (10 samples):

parallax closeup 1 parallax closeup 2

But something is missing. Let’s implement self-shadows.

Self-shadows are only computed for directional lights. The algorithm is very simple:

  • convert the light direction into tangent space
  • compute steep parallax mapping
  • from the resulting coordinate, ray-march towards the light
  • if there is an intersection, reduce the exposure
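A 1-D sketch of that light-direction ray-march (the heightfield, step count, and shadow factor are all made up; the real version runs in the fragment shader, in tangent space):

```python
def self_shadow(heights, start_x, start_h, light_step_x, steps=8):
    """March from the surface point towards the light.

    At each step we move along the light direction and climb towards
    the top of the height volume; if the heightfield is above us at
    any point, the starting point is occluded.
    """
    x, h = float(start_x), start_h
    climb = (1.0 - start_h) / steps  # reach height 1.0 after `steps`
    for _ in range(steps):
        x += light_step_x
        h += climb
        xi = int(round(x))
        if 0 <= xi < len(heights) and heights[xi] > h:
            return 0.3  # intersection: reduce the exposure
    return 1.0          # unoccluded: full exposure
```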

And then, TADAA

Shader code available here

(2 steps are more than enough for this part.)

parallax final result



Virglrenderer provides OpenGL acceleration to a guest running on QEMU.

My current GSoC project is to add support for the Vulkan API.

Vulkan is drastically different from OpenGL, so this addition is not straightforward. My current idea is to add an alternative path for Vulkan: two different states are kept, one for OpenGL and one for Vulkan, and commands go either to the OpenGL or to the Vulkan front-end.

For now, only compute shaders are supported. The work is divided into two parts: a Vulkan ICD in Mesa, and a new front-end for Virgl and vtest.

If you have any feedback, do not hesitate!

This experiment can be tested using this repository. If you have an Intel driver in use, you might be able to use the provided Dockerfile.

Each part is also available independently:

