* externals: Add oaksim submodule
Used for emitting ARM64 assembly
* common: Implement aarch64 ABI
Utilize oaknut to implement a stack frame.
* tests: Allow shader-jit tests for x64 and a64
Run the shader-jit tests for both x86_64 and arm64 targets
* video_core: Initialize arm64 shader-jit backend
Passes all current unit tests!
* shader_jit_a64: protect/unprotect memory when jit-ing
Required on macOS. Memory needs to be fully unprotected before writing
and then re-protected afterwards, or there will be memory access
errors.
* shader_jit_a64: Fix ARM64-Imm overflow
These conditionals were throwing exceptions since the immediate values
were overflowing the available space in the `EOR` instructions. Instead
they are generated from `MOV` and then `EOR`-ed after.
* shader_jit_a64: Fix Geometry shader conditional
* shader_jit_a64: Replace `ADRL` with `MOVP2R`
Fixes some immediate-generation exceptions.
* common/aarch64: Fix CallFarFunction
* shader_jit_a64: Optimize `SanitizedMul`
Co-authored-by: merryhime <merryhime@users.noreply.github.com>
* shader_jit_a64: Fix address register offset behavior
Based on https://github.com/citra-emu/citra/pull/6942
Passes unit tests.
* shader_jit_a64: Fix `RET` address offset
A64 stack is 16-byte aligned rather than 8. So a direct port of the x64
code won't work. Fixes weird branches into invalid memory for any
shaders with subroutines.
* shader_jit_a64: Increase max program size
Tuned for A64 program size.
* shader_jit_a64: Use `UBFX` for extracting loop-state
Co-authored-by: JosJuice <JosJuice@users.noreply.github.com>
* shader_jit_a64: Optimize `SUB+CMP` to `SUBS`
Co-authored-by: JosJuice <JosJuice@users.noreply.github.com>
* shader_jit_a64: Optimize `CMP+B` to `CBNZ`
Co-authored-by: JosJuice <JosJuice@users.noreply.github.com>
* shader_jit_a64: Use `FMOV` for `ONE` vector
Co-authored-by: JosJuice <JosJuice@users.noreply.github.com>
* shader_jit_a64: Remove x86-specific documentation
* shader_jit_a64: Use `UBFX` to extract exponent
Co-authored-by: JosJuice <JosJuice@users.noreply.github.com>
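For readers unfamiliar with `UBFX` (unsigned bitfield extract), here is a minimal C++ model of what the instruction computes, applied to pulling the biased exponent out of a float. This is only a sketch: it assumes IEEE-754 single precision, and the helper names `Ubfx` and `BiasedExponent` are hypothetical and are not taken from the JIT source.

```cpp
#include <cstdint>
#include <cstring>

// Model of UBFX: extract `width` bits starting at bit `lsb`, zero-extended.
// Assumes 1 <= width and lsb + width <= 32.
constexpr std::uint32_t Ubfx(std::uint32_t value, unsigned lsb, unsigned width) {
    const std::uint32_t mask = width >= 32 ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (value >> lsb) & mask;
}

// Biased exponent of an IEEE-754 single: 8 bits starting at bit 23.
// (Hypothetical helper, for illustration only.)
inline std::uint32_t BiasedExponent(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits)); // well-defined way to read the bit pattern
    return Ubfx(bits, 23, 8);
}
```

A single bitfield extract does the work of a separate shift and mask, which is presumably where the saving comes from.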
* shader_jit_a64: Remove redundant MIN/MAX `SRC2`-NaN check
Special handling only needs to check SRC1 for NaN, not SRC2.
It would work as follows in the four possible cases:
No NaN: No special handling needed.
Only SRC1 is NaN: The special handling is triggered because SRC1 is NaN, and SRC2 is picked.
Only SRC2 is NaN: FMAX automatically picks SRC2 because it always picks the NaN if there is one.
Both SRC1 and SRC2 are NaN: The special handling is triggered because SRC1 is NaN, and SRC2 is picked.
Co-authored-by: JosJuice <JosJuice@users.noreply.github.com>
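The four cases above can be checked with a small C++ model. This is a sketch under stated assumptions, not the emitted code: `ReferenceMax` assumes the emulated MAX behaves like `src1 > src2 ? src1 : src2` (so any false comparison, including one involving NaN, selects SRC2), and `ModelFmax` assumes `FMAX` returns a NaN whenever either input is NaN. All three function names are hypothetical. Signed zeros are deliberately left out of scope.

```cpp
#include <cmath>

// Assumed reference semantics of the emulated MAX: a false comparison picks src2.
inline float ReferenceMax(float src1, float src2) {
    return src1 > src2 ? src1 : src2;
}

// Simplified model of FMAX: a NaN input always wins.
inline float ModelFmax(float a, float b) {
    if (std::isnan(a)) return a;
    if (std::isnan(b)) return b;
    return a > b ? a : b;
}

// MAX with special handling that only inspects SRC1.
inline float MaxWithSrc1Check(float src1, float src2) {
    if (std::isnan(src1)) {
        return src2; // covers "only SRC1 is NaN" and "both are NaN"
    }
    // No NaN, or only SRC2 is NaN: FMAX already yields the expected result.
    return ModelFmax(src1, src2);
}
```

Under these assumptions `MaxWithSrc1Check` agrees with `ReferenceMax` in all four cases, so a second check on SRC2 adds nothing.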
* shader_jit/tests: Add catch-stringifier for vec2f/vec3f
* shader_jit/tests: Add Dest Mask unit test
* shader_jit_a64: Fix Dest-Mask `BSL` operand order
Passes the dest-mask unit tests now.
* shader_jit_a64: Use `MOVI` for DestEnable mask
Accelerate certain cases of masking with MOVI as well
Co-authored-by: JosJuice <JosJuice@users.noreply.github.com>
* shader_jit/tests: Add source-swizzle unit test
This is not exhaustive. Generating all `4^4` cases seems to make Catch2
crash. So I've added some component-masking (non-reordering) tests based
on the Dest-Mask unit-test and some additional ones to test
broadcasts/splats and component re-ordering.
* shader_jit_a64: Fix swizzle index generation
This was still generating `SHUFPS` indices and not the ones that we wanted for the `TBL` instruction. Passes all unit tests now.
* shader_jit/tests: Add `ShaderSetup` constructor to `ShaderTest`
Rather than using the direct output of `CompileShaderSetup`, allow a
`ShaderSetup` object to be passed in directly. This makes it possible to
emit assembly that is not directly supported by nihstro.
* shader_jit/tests: Add `CALL` unit-test
Tests nested `CALL` instructions to eventually reach an `EX2`
instruction.
EX2 is picked in particular since it is implemented as an even deeper
dispatch and ensures subroutines are properly implemented between `CALL`
instructions and implementation-calls.
* shader_jit_a64: Fix nested `BL` subroutines
`lr` was getting overwritten by nested calls to `BL`, causing undefined
behavior with mixtures of `CALL`, `EX2`, and `LG2` instructions.
Each usage of `BL` is now protected with a stack push/pop to preserve
and restore the `lr` register to allow nested subroutines to work
properly.
* shader_jit/tests: Allocate generated tests on heap
Each of these generated shader-test objects were causing the stack to
overflow. Allocate each of the generated tests on the heap and use
unique_ptr so they only exist within the life-time of the `REQUIRE`
statement.
* shader_jit_a64: Preserve `lr` register from external function calls
`EMIT` makes an external function call, and should be preserving `lr`
* shader_jit/tests: Add `MAD` unit-test
The Inline Asm version requires an upstream fix:
https://github.com/neobrain/nihstro/issues/68
Instead, the program code is manually configured and added.
* shader_jit/tests: Fix uninitialized instructions
These `union`-type instruction-types were uninitialized, causing tests
to fail nondeterministically.
* shader_jit_a64: Remove unneeded `MOV`
Residue from the direct-port of x64 code.
* shader_jit_a64: Use `std::array` for `instr_table`
Add some type-safety and const-correctness around this type as well.
* shader_jit_a64: Avoid c-style offset casting
Add some more const-correctness to this function as well.
* video_core: Add arch preprocessor comments
* common/aarch64: Use X16 as the veneer register
https://developer.arm.com/documentation/102374/0101/Procedure-Call-Standard
* shader_jit/tests: Add uniform reading unit-test
Particularly to ensure that addresses are being properly truncated
* common/aarch64: Use `X0` as `ABI_RETURN`
`X8` is used as the indirect return result value in the case that the
result is bigger than 128-bits. Principally `X0` is the general-case
return register though.
* common/aarch64: Add veneer register note
`LR` is generally overwritten by `BLR` anyways, and would also be a safe
veneer to utilize for far-calls.
* shader_jit_a64: Remove unneeded scratch register from `SanitizedMul`
* shader_jit_a64: Fix CALLU condition
Should be `EQ` not `NE`. Fixes the regression on Kid Icarus.
No known regressions anymore!
---------
Co-authored-by: merryhime <merryhime@users.noreply.github.com>
Co-authored-by: JosJuice <JosJuice@users.noreply.github.com>
* video_core: Abstract shader generators.
* shader: Extract common generator structures and move generators to specific namespaces.
* shader: Minor fixes and clean-up.
* code: Prepare frontend for vulkan support
* citra_qt: Add vulkan options to the GUI
* vk_instance: Collect tooling info
* renderer_vulkan: Add vulkan backend
* qt: Fix fullscreen and resize issues on macOS. (#47)
* qt: Fix bugged macOS full screen transition.
* renderer/vulkan: Fix swapchain recreation destroying in-use semaphore.
* renderer/vulkan: Make gl_Position invariant. (#48)
This fixes an issue with black artifacts in Pokemon games on Apple GPUs.
If the vertex calculations differ slightly between render passes, it can
cause parts of model faces to fail depth test.
* vk_renderpass_cache: Bump pixel format count
* android: Custom driver code
* vk_instance: Set moltenvk configuration
* rasterizer_cache: Proper surface unregister
* citra_qt: Fix invalid characters
* vk_rasterizer: Correct special unbind
* android: Allow async presentation toggle
* vk_graphics_pipeline: Fix async shader compilation
* We were actually waiting for the pipelines regardless of the setting, oops
* vk_rasterizer: More robust attribute loading
* android: Move PollEvents to OpenGL window
* Vulkan does not need this and it causes problems
* vk_instance: Enable robust buffer access
* Improves stability on mali devices
* vk_renderpass_cache: Bring back renderpass flushing
* externals: Update vulkan-headers
* gl_rasterizer: Separable shaders for everyone
* vk_blit_helper: Correct depth to color conversion
* renderer_vulkan: Implement reinterpretation with copy
* Allows reinterpretation with a simple copy on AMD
* vk_graphics_pipeline: Only fast compile if no shaders are pending
* With this shaders weren't being compiled in parallel
* vk_swapchain: Ensure vsync doesn't lock framerate
* vk_present_window: Match guest swapchain size to vulkan image count
* Less latency and fixes crashes that were caused by images being deleted before free
* vk_instance: Blacklist VK_EXT_pipeline_creation_cache_control with nvidia gpus
* Resolves crashes when async shader compilation is enabled
* vk_rasterizer: Bump async threshold to 6
* Many games have fullscreen quads with 6 vertices. Fixes pokemon textures missing with async shaders
* android: More robust surface recreation
* renderer_vulkan: Fix dynamic state being lost
* vk_pipeline_cache: Skip cache save when no pipeline cache exists
* This is the case when loading a save state
* sdl: Fix surface initialization on macOS. (#49)
* sdl: Fix surface initialization on macOS.
* sdl: Fix render window events not being handled under Vulkan.
* renderer/vulkan: Fix binding/unbinding of shadow rendering buffer.
* vk_stream_buffer: Respect non coherent access alignment
* Required by nvidia GPUs on macOS
* renderer/vulkan: Support VK_EXT_fragment_shader_interlock for shadow rendering. (#51)
* renderer_vulkan: Port some recent shader fixes
* vk_pipeline_cache: Improve shadow detection
* vk_swapchain: Add missing check
* renderer_vulkan: Fix hybrid screen
* Revert "gl_rasterizer: Separable shaders for everyone"
Causes crashes on mali GPUs, will need separate PR
This reverts commit d22d556d30ff641b62dfece85738c96b7fbf7061.
* renderer_vulkan: Fix flipped screenshot
---------
Co-authored-by: Steveice10 <1269164+Steveice10@users.noreply.github.com>
* rasterizer_cache: Sentence surfaces
* gl_texture_runtime: Remove runtime side allocation cache
* rasterizer_cache: Adjust surface scale during reinterpretation
* Fixes pixelated outlines. Also allows to remove the d24s8 specific hack and is more generic in general
* rasterizer_cache: Remove Expand flag
* Begone!
* rasterizer_cache: Cache framebuffers with surface id
* rasterizer_cache: Sentence texture cubes
* renderer_opengl: Move texture mailbox to separate file
* Makes renderer_opengl cleaner overall and allows to report removal threshold from runtime instead of hardcoding. Vulkan requires this
* rasterizer_cache: Dont flush cache on layout change
* rasterizer_cache: Overhaul framebuffer management
* video_core: Remove duplicate
* rasterizer_cache: Sentence custom surfaces
* Vulkan cannot destroy images immediately so this ensures we use our garbage collector for that purpose
* common: Move dynamic library to common
* This is so that video_core can use it
* logging: Add vulkan log target
* common: Allow deferred library loading
* Also add some comments to the functions
* renderer_vulkan: Add vulkan initialization code
* renderer_vulkan: Address feedback
* rasterizer_cache: Switch to template
* Eliminates all opengl references in the rasterizer cache headers
thus completing the backend abstraction
* rasterizer_cache: Switch to page table
* Surface storage isn't particularly interval sensitive so we can use a page table to make it faster
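As a rough illustration of the idea, a page table for surface lookup can be as simple as a map from page index to the ids of the surfaces touching that page. The sketch below is not the cache's actual code: `SurfacePageTable`, `SurfaceId` and the 18-bit page size (`kPageBits`) are assumptions chosen only for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

using SurfaceId = std::uint32_t;      // hypothetical id type
constexpr unsigned kPageBits = 18;    // assumed page size of 256 KiB

class SurfacePageTable {
public:
    // Record `id` in every page that [addr, addr + size) touches.
    void Register(SurfaceId id, std::uint32_t addr, std::uint32_t size) {
        ForEachPage(addr, size, [&](std::uint64_t page) { pages_[page].push_back(id); });
    }

    // Remove `id` from every page of the given range.
    void Unregister(SurfaceId id, std::uint32_t addr, std::uint32_t size) {
        ForEachPage(addr, size, [&](std::uint64_t page) {
            const auto it = pages_.find(page);
            if (it == pages_.end()) return;
            auto& ids = it->second;
            ids.erase(std::remove(ids.begin(), ids.end(), id), ids.end());
            if (ids.empty()) pages_.erase(it);
        });
    }

    // Candidate surfaces sharing a page with the range, without duplicates.
    // Callers still need an exact overlap test on each candidate.
    std::vector<SurfaceId> Lookup(std::uint32_t addr, std::uint32_t size) const {
        std::vector<SurfaceId> found;
        ForEachPage(addr, size, [&](std::uint64_t page) {
            const auto it = pages_.find(page);
            if (it == pages_.end()) return;
            for (const SurfaceId id : it->second) {
                if (std::find(found.begin(), found.end(), id) == found.end()) {
                    found.push_back(id);
                }
            }
        });
        return found;
    }

private:
    template <typename Func>
    static void ForEachPage(std::uint32_t addr, std::uint32_t size, Func&& func) {
        if (size == 0) return;
        const std::uint64_t first = std::uint64_t{addr} >> kPageBits;
        const std::uint64_t last = (std::uint64_t{addr} + size - 1) >> kPageBits;
        for (std::uint64_t page = first; page <= last; ++page) func(page);
    }

    std::unordered_map<std::uint64_t, std::vector<SurfaceId>> pages_;
};
```

The trade-off compared with an interval container is coarser granularity in exchange for lookups that cost a shift and a hash probe per page.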
* rasterizer_cache: Move sampler management out of rasterizer cache
* rasterizer_cache: Remove shared_ptr usage
* Switches to yuzu's slot vector for improved memory locality.
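For context, the general idea behind a slot vector is sketched below: objects live in one contiguous array and are referred to by small integer ids, with freed ids recycled. This is a simplified illustration only; it is not yuzu's implementation, and the class shape and method names here are assumptions.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

template <typename T>
class SlotVector {
public:
    using Id = std::uint32_t;

    // Store `value`, reusing a freed slot when one is available.
    Id Insert(T value) {
        if (!free_ids_.empty()) {
            const Id id = free_ids_.back();
            free_ids_.pop_back();
            slots_[id].emplace(std::move(value));
            return id;
        }
        slots_.emplace_back(std::move(value));
        return static_cast<Id>(slots_.size() - 1);
    }

    // Destroy the object in slot `id` and make the id reusable.
    void Erase(Id id) {
        if (Contains(id)) {
            slots_[id].reset();
            free_ids_.push_back(id);
        }
    }

    bool Contains(Id id) const {
        return id < slots_.size() && slots_[id].has_value();
    }

    // Precondition: Contains(id).
    T& operator[](Id id) { return *slots_[id]; }
    const T& operator[](Id id) const { return *slots_[id]; }

private:
    std::vector<std::optional<T>> slots_; // contiguous storage
    std::vector<Id> free_ids_;            // ids available for reuse
};
```

Compared with holding each surface in a `shared_ptr`, ids are cheap to copy and the objects sit next to each other in memory, at the cost of the caller having to manage lifetime explicitly.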
* rasterizer_cache: Rework reinterpretation lookup
* citra_qt: Per game texture filter
* rasterizer_cache: Log additional settings
* gl_texture_runtime: Resolve shadow map comment
* rasterizer_cache: Don't use float for viewport
* gl_texture_runtime: Fix custom allocation recycling
* rasterizer_cache: Minor cleanups
* Cleanup texture cubes when all the faces have been unregistered from the cache
* custom_tex_manager: Allow multiple hash mappings per texture
* code: Move slot vector to common
* rasterizer_cache: Prevent texture cube crashes
* rasterizer_cache: Improve mipmap validation
* CanSubRect now works properly when validating multi-level surfaces, for example Dark Moon validates a 4 level surface from a 3 level one and it works
* gl_blit_handler: Unbind sampler on reinterpretation
* common: Add thread pool from yuzu
* Is really useful for asynchronous operations like shader compilation and custom textures, will be used in following PRs
* core: Improve ImageInterface
* Provide a default implementation so frontends don't have to duplicate code registering the lodepng version
* Add a dds version too which we will use in the next commit
* rasterizer_cache: Rewrite custom textures
* There's just too much to talk about here, look at the PR description for more details
* rasterizer_cache: Implement basic pack configuration file
* custom_tex_manager: Flip dumped textures
* custom_tex_manager: Optimize custom texture hashing
* If no conversions are needed then we can hash the decoded data directly, removing the need for a duplicate decode
* custom_tex_manager: Implement asynchronous texture loading
* The file loading and decoding is offloaded into worker threads, while the upload itself still occurs in the main thread to avoid having to manage shared contexts
* Address review comments
* custom_tex_manager: Introduce custom material support
* video_core: Move custom textures to separate directory
* Also split the files to make the code cleaner
* gl_texture_runtime: Generate mipmaps for material
* custom_tex_manager: Prevent memory overflow when preloading
* externals: Add dds-ktx as submodule
* string_util: Return vector from SplitString
* No code benefits from passing it as an argument
* custom_textures: Use json config file
* gl_rasterizer: Only bind material for unit 0
* Address review comments
* rasterizer_cache: Remove custom texture code
* It's a hacky buggy mess, will be reimplemented later when the cache is in a better state
* rasterizer_cache: Refactor surface upload/download
* Switch to the texture_codec header which was written as part of the vulkan backend by steveice and me
* Move most of the upload logic to the rasterizer cache and out of the surface object
* Scaled uploads/downloads have been disabled for now since they require more runtime infrastructure
* rasterizer_cache: Refactor runtime interface
* Remove aspect enum which is the same as SurfaceType
* Replace Subresource with specific structures for each operation (blit/copy/clear). This mimics modern APIs like Vulkan much better
* Pass the surface to the runtime instead of the texture
* Implement CopyTextures with glCopyImageSubData which is available on 4.3 and gles.
This function also has an overload for cubes which will be removed later.
* rasterizer_cache: Move texture allocation to the runtime
* renderer_opengl: Remove TextureDownloaderES
* It's overly complicated and unused at the moment. Will be replaced with a simple compute shader in a later commit
* rasterizer_cache: Split CachedSurface
* This commit splits CachedSurface into two classes, SurfaceBase which contains the backend agnostic functions and Surface which is the opengl specific part
* For now the cache uses the opengl surface directly and there are a few ugly casts with watchers, those will be taken care of when the template conversion and watcher removal are added respectively
* rasterizer_cache: Move reinterpreters to the runtime
* rasterizer_cache: Move some pixel format function to the cpp file
* rasterizer_cache: Common texture acceleration functions
* They don't contain any backend specific code so they shouldn't be duplicated
* rasterizer_cache: Remove BlitSurfaces
* It's better to prefer copy/blit in the caller anyway
* rasterizer_cache: Only allocate needed levels
* rasterizer_cache: Move texture runtime out of common dir
* Also shorten the util header filename
* surface_params: Cleanup code
* Add more comments, organize it a bit etc
* rasterizer_cache: Move texture filtering to the runtime
* rasterizer_cache: Move to VideoCore
* renderer_opengl: Reimplement scaled uploads/downloads
* Instead of looking up for temporary textures, each allocation now contains both a scaled and unscaled handle
This allows the scale operations to be done inside the surface object itself and improves performance in general
* In particular the scaled download code has been expanded to use ARB_get_texture_sub_image when possible
which is faster and more convenient than glReadPixels. The latter is still relevant for OpenGLES though.
* Finally allocations are now given a handy debug name that can be viewed from renderdoc.
* rasterizer_cache: Remove global state
* gl_rasterizer: Abstract common draw operations to Framebuffer
* This also allows to cache framebuffer objects instead of always swapping the textures, something that particularly benefits mali gpus
* rasterizer_cache: Implement multi-level surfaces
* With this commit the cache can now directly upload and use mipmaps
without needing to sync them with watchers. By using native mipmaps
directly this also adds support for mipmaps on texture cubes
* Texture cubes have also been updated to drop the watcher requirement
* host_shaders: Add CMake integration for string shaders
* Improves build time shader generation making it much less prone to errors.
Also moves the presentation shaders here to avoid embedding them to the cpp file.
* Texture filter shaders now make explicit use of uniform bindings for better vulkan compatibility
* renderer_opengl: Emulate lod bias in the shader
* This way opengles can emulate it correctly
* gl_rasterizer: Respect GL_MAX_TEXTURE_BUFFER_SIZE
* Older Bifrost Mali GPUs only support up to 64kb texture buffers. Citra would try to allocate a much larger buffer the first 64kb of which would work fine but after that the driver starts misbehaving and showing various graphical glitches
* rasterizer_cache: Cleanup CopySurface
* renderer_opengl: Keep frames synchronized when using a GPU debugger
* rasterizer_cache: Rename Surface to SurfaceRef
* Makes it clear that surface is a shared_ptr and not an object
* rasterizer_cache: Cleanup
* Move constructor to the top of the file
* Move FindMatch to the top as well and remove the Invalid flag which was redundant;
all FindMatch calls used it except for MatchFlags::Copy which ignores it anyway
* gl_texture_runtime: Make driver const
* gl_texture_runtime: Fix RGB8 format handling
* The texture_codec header, being written with vulkan in mind converts RGB8 to RGBA8. The backend wasn't adjusted to account for this though and treated the data as RGB8.
* Also remove D16 conversions, both opengl and vulkan are required to support this format so these are not needed
* gl_texture_runtime: Reduce state switches during FBO blits
* glBlitFramebuffer is only affected by the scissor rectangle so just disable scissor testing instead of resetting our entire state
* surface_params: Prevent texcopy that spans multiple levels
* It would have failed before as well, with multi-level surfaces it triggers the assert though
* renderer_opengl: Centralize texture filters
* A lot of code is shared between the filters, thus it makes sense to centralize them
* Also fix an issue with partial texture uploads
* Address review comments
* rasterizer_cache: Use leading return types
* rasterizer_cache: Cleanup null checks
* renderer_opengl: Add additional logging
* externals: Actually downgrade glad
* For some reason I missed adding the files to git
* surface_params: Do not check for levels in exact match
* Some games will try to use the base level of a multi level surface. Checking for levels forces another surface to be created and a copy to be made which is both unnecessary and breaks custom textures
---------
Co-authored-by: bunnei <bunneidev@gmail.com>
* externals: Update dynarmic
* settings: Introduce GraphicsAPI enum
* For now it's OpenGL only but will be expanded upon later
* citra_qt: Introduce backend agnostic context management
* Mostly a direct port from yuzu
* core: Simplify context acquire
* settings: Add option to create debug contexts
* renderer_opengl: Abstract initialization to Driver
* This commit also updates glad and adds some useful extensions which we will use in part 2
* Rasterizer construction is moved to the specific renderer instead of RendererBase.
Software rendering has been disabled to achieve this but will be brought back in the next commit.
* video_core: Remove Init/Shutdown methods from renderer
* The constructor and destructor can do the same job
* In addition move opengl function loading to Qt since SDL already does this. Also remove ErrorVideoCore which is never reached
* citra_qt: Decouple software renderer from opengl part 1
* citra: Decouple software renderer from opengl part 2
* android: Decouple software renderer from opengl part 3
* swrasterizer: Decouple software renderer from opengl part 4
* This commit simply enforces the renderer naming conventions in the software renderer
* video_core: Move RendererBase to VideoCore
* video_core: De-globalize screenshot state
* video_core: Pass system to the renderers
* video_core: Commonize shader uniform data
* video_core: Abstract backend agnostic rasterizer operations
* bootmanager: Remove references to OpenGL for macOS
OpenGL macOS headers definitions clash heavily with each other
* citra_qt: Proper title for api settings
* video_core: Reduce boost usage
* bootmanager: Fix hide mouse option
Remove event handlers from RenderWidget for events that are
already handled by the parent GRenderWindow.
Also enable mouse tracking on the RenderWidget.
* android: Remove software from graphics api list
* code: Address review comments
* citra: Port per-game settings read
* Having to update the default value for all backends is a pain so lets centralize it
* android: Rename to OpenGLES
---------
Co-authored-by: MerryMage <MerryMage@users.noreply.github.com>
Co-authored-by: Vitor Kiguchi <vitor-kiguchi@hotmail.com>
* This commit aims to both continue the rasterizer cache cleanup by
separating CachedSurface into a dedicated header and to start weeding
out the raw OpenGL code from the cache.
* The latter is achieved by abstracting most texture operations in a new
class called TextureRuntime. This has many benefits such as making it easier
to port the functionality to other graphics APIs and the removal of the need
to pass (read/draw) framebuffer handles everywhere. The filterer and
reinterpreter get their own sets of FBOs due to this, something that
might be a performance win since it reduces the state switching
overhead on the runtime FBOs.
video_core: disable depth/stencil texture download on OpenGL ES
Disable depth stencil shader in texture_downloader_es for now
enable_depth_stencil
DepthStencil
remove GL_DEBUG_OUTPUT_SYNCHRONOUS
* video_core/renderer_opengl/gl_rasterizer_cache: Create Format Reinterpretation Framework
Adds RGBA4 -> RGB5A1 reinterpretation commonly used by virtual console
If no matching surface can be found, ValidateSurface checks for a surface in the cache which is reinterpretable to the requested format.
If that fails, the cache is checked for any surface with a matching bit-width. If one is found, the region is flushed.
If not, the region is checked against dirty_regions to see if it was created entirely on the GPU.
If not, then the surface is flushed.
Co-Authored-By: James Rowe <jroweboy@users.noreply.github.com>
Co-Authored-By: Ben <b3n30@users.noreply.github.com>
temporary change to avoid merge conflicts with video dumping
* re-add D24S8->RGBA8 res_scale hack
* address review comments
* fix dirty region check
* check for surfaces with invalid pixel format, and break logic into separate functions
* video_core/renderer_opengl: Move SurfaceParams into its own file
Some of its enums are needed outside of the rasterizer cache
and trying to use it caused circular dependencies.
* video_core/renderer_opengl: Overhaul the texture filter framework
This should make it less intrusive.
Now texture filtering doesn't have any mutable global state.
The texture filters now always upscale to the internal rendering resolution.
This simplifies the logic in UploadGLTexture and it simply takes the role of BlitTextures at the end of the function.
This also prevents extra blitting required when uploading to a framebuffer surface with a mismatched size.
* video_core/renderer_opengl: Use generated mipmaps for filtered textures
The filtered guest mipmaps often looked terrible.
* core/settings: Remove texture filter factor
* sdl/config: Remove texture filter factor
* qt/config: Remove texture filter factor
This uses the mailbox model to move pixel downloading to its own thread, eliminating Nvidia's warnings and (possibly) making use of GPU copy engine.
To achieve this, we created a new mailbox type that is different from the presentation mailbox in that it never discards a rendered frame.
Also, I tweaked the projection matrix thing so that it can just draw the frame upside down instead of having the CPU flip it.
* Add Anaglyph 3D
Change 3D slider in-game
Change shaders while game is running
Move shader loading into function
Disable 3D slider setting when stereoscopy is off
The rest of the shaders
Address review issues
Documentation and minor fixups
Forgot clang-format
Fix shader release on SDL2-software rendering
Remove unnecessary state changes
Respect 3D factor setting regardless of stereoscopic rendering
Improve shader resolution passing
Minor setting-related improvements
Add option to toggle texture filtering
Rebase fixes
* One final clang-format
* Fix OpenGL problems
video_core: shorten GetGLSLVersionString
video_core: make GLES version and extensions consistent
video_core: move some logic to LoadShader
video_core: deduplicate fragment shader precision specifier
Those implementations are quite costly, so there is no need to inline them to the caller.
Resource deletion is often a performance bug, so keeping these functions out of line also makes it possible to set breakpoints on them.