Linux: io_uring #60

Closed
opened 2023-05-03 10:36:33 +00:00 by reece · 10 comments
Owner

Implement support for io_uring in place of the current aio if the kernel version is over 5.19

Author
Owner

5.5* min
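the gate itself is trivial. a minimal sketch of what i mean (names like `IoBackend`/`SelectIoBackend` are made up for illustration, not existing runtime symbols), assuming we read the major/minor out of uname(2) and stay on aio for anything below 5.5:

```cpp
#include <sys/utsname.h>
#include <cstdio>

enum class IoBackend { Aio, IoUring };

// pick io_uring only when the running kernel meets the 5.5 floor mentioned above
static IoBackend SelectIoBackend()
{
    utsname u {};
    if (uname(&u) != 0)
        return IoBackend::Aio;                       // can't identify the kernel, stay on aio

    int major = 0, minor = 0;
    if (std::sscanf(u.release, "%d.%d", &major, &minor) != 2)
        return IoBackend::Aio;

    // io_uring itself landed in 5.1, but gate on 5.5 per the note above
    if (major > 5 || (major == 5 && minor >= 5))
        return IoBackend::IoUring;
    return IoBackend::Aio;
}

int main()
{
    std::printf("%s\n", SelectIoBackend() == IoBackend::IoUring ? "io_uring" : "aio");
}
```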

Author
Owner

or linux performance can eat shit and we can go back to looking into freebsd support like i wanted to
![image](/attachments/e84eb101-8379-45e2-8c1c-8e1275ad15ca)
![image](/attachments/9087738f-3016-4ab6-a1f1-ad52fa3a968f)
![image](/attachments/d54af428-fefa-44f4-adf4-e934511c39e4)

reece locked as Resolved and limited conversation to collaborators 2023-05-04 19:45:24 +00:00
Author
Owner

What actual schizos
> says my name
> "ANONYMOUSLY TROLLING" from a unique email
ok. Will block Linux development until these retards write an IO subsystem worth half a damn

Author
Owner

https://twitter.com/axboe/status/1654211976191700992
rent-free / removing linux support (?).

imagine being almost 50 and subtweeting people you called 14 by email because they called your interface a regression. ironically, aurora runtime and other components will always be lengthier than any of his linux kernel contributions. talk about productivity copium coming from a dude who works at a company that only values CMakeLists contributions

Author
Owner

![image](/attachments/66a33cd5-e714-4c29-97d8-70ab5cda9977)
![image](/attachments/5f0b0941-6239-4c4f-8a89-ba1d5d91d46f)
![image](/attachments/cbbcf4d5-92cb-4ff8-93a0-971baf27ff95)
![image](/attachments/2ec7a4a6-571d-4020-8e13-fef043533276)
similar quirks exist under windows https://learn.microsoft.com/en-GB/troubleshoot/windows/win32/asynchronous-disk-io-synchronous#asynchronous-io-still-appears-to-be-synchronous
shared quirks include: blocking on metadata updates, and unsupported file systems or file system features

all i see is "muh syscall overhead" copium, when in reality linux programs, if not optimized, hammer syscalls all the god damn time. as if a few calls into the kernel per io loop are that much of a concern. i fail to see how using ring buffers is even an optimization. it's not like physical hardware just went "lol here's a ring buffer" and now we no longer need to fire msi-x interrupts and deal with other fencing mechanisms. we still need those submit/forced read barriers. as far as optimizing the 'work is already done, you know this, plz no expensive sync' case is concerned, we don't need to enter the kernel to spin-poll or try-poll for completion events under either subsystem. not to mention io_uring's write-head sync is no different from our batched submit. this "hurrhurr you need a faster cpu to saturate the IO device" line just sounds like theoretical copium from people naively assuming "io_uring = no/little syscalls = more time in userland = faster." maybe i'm being too kind calling them "naive" when they think a syscall is a context switch in the traditional sense of storing a context for a lengthy yield, as opposed to pushing a few registers, either in hardware or in software, to do a quick task in a privileged state under a foreign ABI
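for anyone following along, this is roughly the liburing shape being argued about: one io_uring_submit() syscall per batch of SQEs, and completions reaped straight off the shared CQ ring without entering the kernel. a sketch only, not runtime code (placeholder path, error handling trimmed, link with -luring):

```cpp
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>

int main()
{
    io_uring ring;
    if (io_uring_queue_init(64, &ring, 0) != 0)
        return 1;

    int fd = open("/etc/hostname", O_RDONLY);        // placeholder file
    char buf[4096];

    // queue a single read SQE; a real loop would queue a whole batch first
    io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

    io_uring_submit(&ring);                          // one syscall for the whole batch

    // completions come straight off the shared CQ ring; no syscall required
    io_uring_cqe *cqe = nullptr;
    while (io_uring_peek_cqe(&ring, &cqe) == -EAGAIN)
        ;                                            // the spin/try-poll described above
    std::printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```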

the last remaining point is, "hurdur why can't i do arbitrary buffered io in the kernel." go map a file and let the page faults, swap lock interface, and paging configuration deal with it. done. i'm surprised arbitrary file access isn't a userland concept, given a crt or runtime abstraction over mmap/file sections, or unbuffered direct io in the worst case. it cannot be overstated how many limitations there are in every kernel's async/overlapped io subsystem. async truly is a concept reserved for seeking in large blocked database structures and sequential piping of data in the best case; it's not a replacement for the glibc hack of "threads go brrr" or libuv's philosophy of, well, "threads go brr"
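"go map a file" in code form, a plain-mmap sketch under the usual caveats (placeholder path, file assumed non-empty, error handling trimmed, not runtime code):

```cpp
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("/etc/hostname", O_RDONLY);        // placeholder file
    struct stat st {};
    if (fd < 0 || fstat(fd, &st) != 0 || st.st_size == 0)
        return 1;

    // map the whole file; reads fault pages in on demand and the page cache does the buffering
    void *map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    madvise(map, st.st_size, MADV_SEQUENTIAL);       // optional hint for sequential scans

    // from here on, "file io" is just memory access
    std::fwrite(static_cast<const char *>(map), 1, st.st_size, stdout);

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```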

on that note, why don't we have a configurable buffered direct-io/mmap file stream with callbacks and configurable cache pages? why not share that interface with a seekable reader adapter while we're at it? seems like a useful feature to have. it'd beat what other runtimes consider a "buffered" [input/output] stream
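roughly the shape i mean; every name in this sketch is hypothetical and invented for illustration, nothing in it exists in the runtime today:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <string_view>
#include <utility>

namespace sketch
{
    struct FileStreamConfig
    {
        std::size_t cachePages  = 64;        // configurable page-cache budget
        std::size_t pageSize    = 4096;
        bool        useDirectIo = false;     // fall back to an mmap-backed path when false
    };

    // callback-driven stream over direct io or a file mapping
    struct IFileStream
    {
        virtual ~IFileStream() = default;
        virtual void ReadAsync(std::uint64_t offset, std::size_t length,
                               std::function<void(std::string_view)> onData) = 0;
        virtual std::uint64_t Length() const = 0;
    };

    // seekable reader adapter sharing the same backing stream
    struct SeekableReader
    {
        explicit SeekableReader(std::shared_ptr<IFileStream> s) : stream_(std::move(s)) {}

        void Seek(std::uint64_t offset) { cursor_ = offset; }

        void Read(std::size_t length, std::function<void(std::string_view)> onData)
        {
            stream_->ReadAsync(cursor_, length, std::move(onData));
            cursor_ += length;
        }

    private:
        std::shared_ptr<IFileStream> stream_;
        std::uint64_t cursor_ = 0;
    };
}
```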

more on topic: if we can already do all the io ops we care about asynchronously on a single thread (non-blocking socket connect, non-blocking socket acceptance, polled read, polled write, file block read, file block write; see the libaio sketch below this list), then why the hell should we even care for linux's...
lets see
~~select~~
~~epoll~~
~~aio/io_submit~~
io_uring

**linux's 4th non-blocking** mechanism intended to increase io performance... 4th attempt and linux is still not as good as the BSD and NT families of operating systems. NT got it on their first attempt, and BSD on their first real implementation of kevent/kqueue alongside POSIX SID: Issue 5.
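for contrast with the io_uring sketch above, the batched aio path we're already on looks roughly like this with libaio (a sketch, not the runtime's actual loop source; placeholder path, error handling trimmed, link with -laio):

```cpp
#include <libaio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    io_context_t ctx = nullptr;
    if (io_setup(64, &ctx) != 0)                     // kernel aio context, 64 in-flight ops
        return 1;

    int fd = open("/etc/hostname", O_RDONLY);        // placeholder file
    char buf[4096];

    // prepare and submit a batch (one entry here, many per submit in practice)
    iocb cb {};
    iocb *batch[1] = { &cb };
    io_prep_pread(&cb, fd, buf, sizeof(buf), 0);
    io_submit(ctx, 1, batch);                        // one syscall per batch, like io_uring's submit

    // reap completions; passing min_nr=0 instead would turn this into a try-poll
    io_event events[1];
    if (io_getevents(ctx, 1, 1, events, nullptr) == 1)
        std::printf("read %ld bytes\n", static_cast<long>(events[0].res));

    close(fd);
    io_destroy(ctx);
    return 0;
}
```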

any performance improvement in io_uring is probably placebo, or the work of microbenchmarkers trying to prove themselves in a particular niche. maybe they intentionally crippled aio and have no intention of improving it now that they have a new shiny toy to shill. idk. either way, it's not enough to justify refactoring away our (sometimes shared) loop sources for io transactions.

Author
Owner

some more objective data: https://atlarge-research.com/pdfs/2022-systor-apis.pdf

you have to configure the kernel to employ the ideology of "threads go brrrr" (the one thing i can't fucking stand) just to get close to the performance of a highly optimized driver and software development kit centered around NVMe drives. we may as well employ a hidden spdk abstraction, should the need arise for what i suppose would be a highly niche academic use case.

overall, we're talking like 133k ops/s under aio as opposed to 171k ops/s @ 64 batches/submission

latency is basically neck and neck
![image](/attachments/79a50904-ea8d-4cd6-818f-78d1168c4816)

to go any faster, you need to start thinking about tweaking the queue, spamming kernel threads, and other bullshit inappropriate for our layer of abstraction. what are we to do, start spamming kernel threads for each singular userland thread with a TLS IO context just to get close to an SDK that's actually optimized for NVMe I/O? why the hell can't they just make an optimized io subsystem and interface that just werks like NT?

everything points towards: just use spdk if you care about io performance, for, i don't know, particle and fluid simulation in userland via simd optimizations?

casual use cases are neck and neck, if not equal
![image](/attachments/fae234f3-4402-44d1-b0ff-f08d7ce0d63c)

aio is fine. use spdk's interfaces directly if you actually need to optimize for niche high-performance disk io. linshit interfaces will never be good enough for high performance io. or, in the words of the aforementioned paper,
"but ultimately it cannot match the throughput and scalability of SPDK (Figure 5). Finally, iou is consistently the worst-performing configuration of io_uring, suggesting that polling is one of the key ingredients to unleashing the full potential of io_uring"

i doubt any aurt user will get close to hitting these constraints.

Author
Owner

> casual use cases are neck and neck, if not equal
> ![image](/attachments/fae234f3-4402-44d1-b0ff-f08d7ce0d63c)
>
> i doubt any aurt user will get close to hitting these constraints.

also note there are diminishing returns with spamming the io queue. bandwidth != io/ops. 6 (~1=QD) * 64 ~= 300. not even QD=128 gets this close.
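to put "bandwidth != io/ops" in numbers (my own back-of-envelope, assuming the 4 KiB requests these IOPS benchmarks typically use): 171k ops/s * 4 KiB ≈ 0.7 GB/s, so an interface can win the IOPS chart while barely moving the bandwidth chart.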

i also couldn't think of a reason why we would need hundreds of thousands of disk io operations per second. sure, i could comprehend a really shitty database needing to serve unindexed data at a high rate, but even then i cannot mentally comprehend more than tens of thousands of operations per second. aio, per thread, doesn't fall off until an order of magnitude ahead of that. i don't even think this runtime is optimized for such a high volume of requests; i care more about higher bandwidth client applications. there's really no use case where these marginal differences between apis matter to us... maybe it matters to the one dude trying to prove pi while squatting in an NSA cray supercomputing basement, but not to us

Author
Owner

> > casual use cases are neck and neck, if not equal
> > ![image](/attachments/fae234f3-4402-44d1-b0ff-f08d7ce0d63c)
> >
> > i doubt any aurt user will get close to hitting these constraints.
>
> also note there are diminishing returns with spamming the io queue. bandwidth != io/ops. 6 (~1=QD) * 64 ~= 300. not even QD=128 gets this close.
>
> i also couldn't think of a reason why we would need thousands of disk io operations per second. sure, i could comprehend a database needing to serve indexed data across the disk at that rate, but even then i'm mentally comprehending tens of thousands of operations per second. aio, per thread, doesn't fall off until an order of magnitude ahead of that. i don't even think this runtime is optimized for such a high volume of requests; i care more about higher bandwidth client applications

now that i say this, 6k ops/s per thread * n threads... now what these redditors were saying makes sense. aio might be able to outperform io_uring without a ton of kernel threads doing blocking polls behind the scenes

6k ops/s per thread scales up to QD=2/2 batched ops, to peak at around 12.71GB/s of throughput
hammering io_uring with QD=32/32 [!!!] batched ops per tick barely gets past 12.795GB/s
![image](/attachments/5e2589c7-a2df-454a-982e-19c8a8f2f58b)
https://kb.blockbridge.com/technote/proxmox-aio-vs-iouring/
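for scale, from the two figures above: (12.795 - 12.71) / 12.71 ≈ 0.7% more throughput for 16x the queue depth (my arithmetic, not the article's).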

exactly what i predicted... it literally does not matter outside of "LOOOOOK MOM I WROTE A BENCHMARK." it only matters if you're a brainlet ex-oracle developer spamming the disk with an unnecessary amount of arbitrary and tiny db access operations

reece closed this issue 2023-05-05 10:54:02 +00:00
Author
Owner

sperging aside, block sizes and worker pool (or multiple overlapped submission) sizes matter. the underlying interface does not. only spdk scales. everything else is in the same ballpark. if anything, io_uring does not scale to our use case.

Author
Owner

![image_2023-07-20_19-32-35](/attachments/a690fe44-31bd-4bbc-84fd-d0ded7655a03)

Reece, [20/07/2023 19:33] how is an io interface responsible for any vulnerability

Reece, [20/07/2023 19:33] but still

Reece, [20/07/2023 19:34] imagine an io interface costing so many developer resources and bug bounty payouts that it's considered unusable

Reference: AuroraSupport/AuroraRuntime#60