/***
    Copyright (C) 2023 J Reece Wilson (a/k/a "Reece"). All rights reserved.

    File: WakeOnAddress.hpp
    Date: 2023-3-11
    Author: Reece
    Note:
    This API can be configured to run in one of two modes - Emulation Mode and Wrapper Mode.

    In Emulation Mode:
     1: Wakes occur in FIFO order, so long as the thread is in the kernel
     2: uWordSize can be any length not exceeding 32 bytes
    otherwise, in Wrapper Mode:
     1: Wakes are orderless
     2: uWordSize must be less than or equal to 8 bytes (todo: no?)
     3: only the least significant 32 bits are guaranteed to be used as wake signals
     4: The special EWaitMethod variants will suffer a performance hit
    In either mode:
     1: WaitOnAddress[...] can wake at any time the wakeup method is successful
     2: WaitOnAddress[...] can drop any wakeup if the wakeup method would fail
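
    For illustration, a minimal wait/wake sketch under those semantics (hypothetical uFlag word; the
    waiter loops and re-checks the value because a wake may arrive early or be dropped entirely):

        AuUInt32 uFlag {};                                  // shared word, published to by another thread
        const AuUInt32 uUnset { 0 };

        // waiter: default eNotEqual semantics - sleep while *pTargetAddress still equals *pCompareAddress
        while (uFlag == uUnset)
        {
            WaitOnAddress(&uFlag, &uUnset, sizeof(uFlag), 0 /* 0 = indefinite */);
        }

        // signaller: publish the new value, then wake every sleeper on that address
        uFlag = 1;                                          // in real code, prefer an AuAtomicXXX store
        WakeAllOnAddress(&uFlag);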

    By default:
    Windows XP - Windows 7 => Emulation Mode
    Windows 10+            => Wrapper Mode
    Linux                  => Emulation Mode; however, Wrapper Mode is available
    **************************************************************************************
    All platforms : ThreadingConfig::bPreferEmulatedWakeOnAddress = !AuBuild::kIsNtDerived
    **************************************************************************************

    Also note: Alongside Wrapper Mode, there is an internal set of APIs that allow for 32-bit word WoA support for
    AuThread primitives. These are only used if the operating system has a futex interface available at
    runtime. MacOS, iOS, and <= Windows 7 support requires these paths to be disabled. In other cases,
    the internal wrapper and Wrapper Mode should use this path to quickly yield to the kernel.

    Generally speaking, AuThreadPrimitives will use the futex layer or some OS specific mechanism to
    bail out into the kernel's thread scheduler as quickly as possible.
    In any mode, AuThreadPrimitives will go from: Primitive -> kernel/platform; or
                                                  Primitive -> WoA Internal Wrapper -> kernel/platform
    In ThreadingConfig::bPreferEmulatedWakeOnAddress mode, AuThreading::WaitOnAddress -> Emulation Mode.
    In !ThreadingConfig::bPreferEmulatedWakeOnAddress mode, AuThreading::WaitOnAddress -> Wrapper Mode -> [...]
    [...] -> Internal Wrapper -> kernel/platform
    In any mode, the futex reference primitives, including AuBarrier, AuInitOnce, AuFutexMutex, etc,
    will always go from: inlined header template definition -> relinked symbol -> AuThreading::WaitOnAddress
    -> [...].

    Note that some edge case platforms can follow AuThreadPrimitives *.Generic -> Internal Wrapper -> [...]
    [...] -> AuThreading::WaitOnAddress -> Emulation Mode.
    This is only the case when we lack OS specific wait paths for our primitives and lack a native
    wait on address interface to develop the internal wrapper. Fortunately, only more esoteric UNIX machines
    require these. Further platform support can be added with this; only a semaphore or condition-variable/mutex
    pair is required to bootstrap this path.

    Memory note: Weakly ordered memory architecture is an alien concept. AuAtomicXXX operations ensure all previous stores
                 are visible across all cores (useful for semaphore increment and mutex-unlock operations), and that loads
                 are evaluated in order. For all intents and purposes, you should treat the au ecosystem like any
                 other strongly ordered processor and program pair. For memeworthy lockless algorithms, you can use
                 spec-of-the-year atomic word containers and related methods; we don't care about optimizing some midwit's
                 weakly-ordered CAS-spinning, ABA-hell container that's genuinely believed to be the best thing ever.
                 Sincerely, you are doing something wrong if you're write-locking a container for any notable length of
                 time, and more often than not, lock-free algorithms are bloated to all hell just to end up losing to
                 read/write mutex guarded algorithms in most real world use cases - using an atomic pointer over lock bits
                 makes no difference besides the number of bugs you can expect to creep into your less flexible code.

    tldr: Don't worry about memory ordering or ABA. Use the provided locks, AuAtomic ops, and thread primitives as expected.
          (you'll be fine. trust me bro.)

    Configuration reminder:
    NT 6.2+ platforms may be optimized for the expected de facto case of EWaitMethod::eNotEqual / no "-Special".
    If you're implementing special primitives or using AuFutexSemaphore with timeline acquisitions, remember to
    set ThreadingConfig::bPreferEmulatedWakeOnAddress = true at Aurora::RuntimeStart.
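
    A hedged sketch of that reminder (ThreadingConfig::bPreferEmulatedWakeOnAddress and Aurora::RuntimeStart
    come from this note; how the ThreadingConfig structure is constructed and handed to the runtime is an
    assumption and may differ):

        ThreadingConfig threading {};
        threading.bPreferEmulatedWakeOnAddress = true;  // force Emulation Mode for -Special / timeline-heavy users
        // ...pass `threading` along with the rest of the configuration consumed by Aurora::RuntimeStart(...)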

    Compilation / WOA_STRICTER_FIFO:
    Stricter FIFO guarantees are available when AuWakeOnAddress.cpp is compiled with WOA_STRICTER_FIFO.
    Note that this will disable TryWaitOnAddress-like APIs and worsen the expected average case.

    You will never be First In, First Out 100% of the time under this flawed API design:
    Due to the nature of an atomic comparison within locked signal/wait paths - what amounts to a hidden
    condvar pattern - the target address must be read under an internal lock to verify the initial sleep condition.
    Every futex-like API does this. You cannot meet the API contract otherwise - it's the inherent nature of futex/
    waitonaddress-like APIs as they exist in Windows, Linux, BSD-likes, and NaCl - they're just storageless condition
    variables. You simply cannot have signal/waits where there's no ordering whatsoever between signal and wake,
    and where the pCompare condition isn't tested before each yield. You *need* atomicity or a lock to ensure wakes
    are paired with signals; and for each sleep, you *need* to test what amounts to the condition you'd find under
    a traditional condition-mutex while loop. Should that condition pass early, every futex impl bails out early.
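
    To illustrate, a sketch of the hidden pattern every such wait path reduces to (hypothetical helper
    names; not the actual implementation):

        bool WaitSketch(const void *pAddress, const void *pCompare, EWaitMethod eMethod)
        {
            auto &bucket = GetBucket(pAddress);                 // hypothetical per-address wait bucket
            bucket.Lock();
            if (!SleepConditionHolds(pAddress, pCompare, eMethod))
            {
                bucket.Unlock();                                // early bail: the wake condition already holds,
                return true;                                    // so no queue position is ever taken
            }
            bucket.Enqueue(GetCurrentWaiter());
            bucket.Unlock();
            return BlockUntilSignaledOrTimeout();               // a later thread may bail above while we sleep,
        }                                                       // effectively jumping the queue
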
    Regardless of what Linux and NaCl developers will tell you, no futex API is ever truly first in, first out, due
    to this inherent design flaw for real time programming. Anybody who claims otherwise is selling you a toll bridge.
    The only valid workaround for this is to develop your own competing API that does away with the comparison and
    relies solely on bucketed semaphores of per-address ownership, under an incompatible set of APIs as follows:
    { Wait(pAddress, uTimeout); Signal(pAddress); and Release(pAddress); }
    Emphasis on the release operation and the missing comparison parameter.
    Requires: (1) an IOU counter under Signal, (2) Wait to stall for acks of in-order head dequeues, and (3) a
    no-fast-path mandate.
    (1) needs to be paired with a Release in order to not leak active semaphores.

***/
#pragma once

namespace Aurora::Threading
{
    // Specifies when to break a thread context yield: once (volatile pTargetAddress) [... EWaitMethod operation ...] (constant pCompareAddress) holds
    AUE_DEFINE(EWaitMethod, (
        eNotEqual, eEqual, eLessThanCompare, eGreaterThanCompare, eLessThanOrEqualsCompare, eGreaterThanOrEqualsCompare, eAnd, eNotAnd
    ))
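
    // For instance, the -Special variants declared below break the yield once (*pTargetAddress) [eMethod] (*pCompareAddress)
    // holds. A hedged sketch of waiting on a hypothetical state word until it equals 2:
    //
    //      AuUInt32 uState {};
    //      const AuUInt32 uWanted { 2 };
    //      while (uState != uWanted)
    //      {
    //          WaitOnAddressSpecial(EWaitMethod::eEqual, &uState, &uWanted, sizeof(uState), 0 /* indefinite */);
    //      }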

    AUKN_SYM void WakeAllOnAddress(const void *pTargetAddress);

    AUKN_SYM void WakeOnAddress(const void *pTargetAddress);

    // WakeAllOnAddress with a uNMaximumThreads limit, which may or may not be respected
    AUKN_SYM void WakeNOnAddress(const void *pTargetAddress,
                                 AuUInt8 uNMaximumThreads);

    // On systems with processors of shared execution pipelines, these try-series of operations will spin (eg: mm_pause) for a configurable
    // amount of time, or enter a low power mode, so long as the process-wide state isn't overly contested. This means you can use these
    // arbitrarily without worrying about an accidental thundering mm_pause herd. If you wish to call WaitOnAddress[...] afterwards, you should
    // report that you already spun via optAlreadySpun. If the application is configured to spin later on, this hint may be used to prevent a double spin.
    AUKN_SYM bool TryWaitOnAddress(const void *pTargetAddress,
                                   const void *pCompareAddress,
                                   AuUInt8 uWordSize);

    AUKN_SYM bool TryWaitOnAddressSpecial(EWaitMethod eMethod,
                                          const void *pTargetAddress,
                                          const void *pCompareAddress,
                                          AuUInt8 uWordSize);

    // On systems with processors of shared execution pipelines, these try-series of operations will spin (eg: mm_pause) for a configurable
    // amount of time, or enter a low power mode, so long as the process-wide state isn't overly contested. This means you can use these
    // arbitrarily without worrying about an accidental thundering mm_pause herd. If you wish to call WaitOnAddress[...] afterwards, you should
    // report that you already spun via optAlreadySpun. If the application is configured to spin later on, this hint may be used to prevent a double spin.
    // In the case of a pTargetAddress != pCompareAddress condition, the optional check parameter is used to verify the wake condition.
    // Otherwise, spinning will continue.
    AUKN_SYM bool TryWaitOnAddressEx(const void *pTargetAddress,
                                     const void *pCompareAddress,
                                     AuUInt8 uWordSize,
                                     const AuFunction<bool(const void *, const void *, AuUInt8)> &check);
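
    // A hedged example of the check callback above (hypothetical 32-bit counter; assumes AuFunction accepts a
    // lambda of the matching signature). Per the comment above, the callback is consulted once the words no
    // longer compare equal; returning false keeps spinning:
    //
    //      AuUInt32 uCounter {};
    //      const AuUInt32 uOld { 0 };
    //      bool bSatisfied = TryWaitOnAddressEx(&uCounter, &uOld, sizeof(uCounter),
    //                                           [](const void *pTarget, const void *pCompare, AuUInt8 uSize) -> bool
    //                                           {
    //                                               // accept the wake only once the counter has moved past the compared value
    //                                               return *(const AuUInt32 *)pTarget > *(const AuUInt32 *)pCompare;
    //                                           });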

    // See: TryWaitOnAddressEx
    AUKN_SYM bool TryWaitOnAddressSpecialEx(EWaitMethod eMethod,
                                            const void *pTargetAddress,
                                            const void *pCompareAddress,
                                            AuUInt8 uWordSize,
                                            const AuFunction<bool(const void *, const void *, AuUInt8)> &check);

    // Relative timeout variant of nanosecond resolution eNotEqual WoA. 0 = indefinite.
    // In Wrapper Mode, it is possible to bypass the WoA implementation and bail straight into the kernel.
    // For improved ordering and EWaitMethod support, do not use Wrapper Mode.
    AUKN_SYM bool WaitOnAddress(const void *pTargetAddress,
                                const void *pCompareAddress,
                                AuUInt8 uWordSize,
                                AuUInt64 qwNanoseconds,
                                AuOptional<bool> optAlreadySpun = {} /*hint: do not spin before switching. subject to global config.*/);
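
    // A hedged sketch of the spin-then-wait pattern these APIs pair up for (hypothetical uFutex word; assumes a
    // true return from TryWaitOnAddress means the wake condition was observed while spinning):
    //
    //      AuUInt32 uFutex {};
    //      const AuUInt32 uExpected { 0 };
    //      while (uFutex == uExpected)
    //      {
    //          if (TryWaitOnAddress(&uFutex, &uExpected, sizeof(uFutex)))
    //          {
    //              break;                                                       // value changed while spinning
    //          }
    //          WaitOnAddress(&uFutex, &uExpected, sizeof(uFutex), 0, true /* optAlreadySpun: skip the second spin */);
    //      }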

    // Relative timeout variant of nanosecond resolution WoA. 0 = indefinite
    // Emulation Mode over Wrapper Mode is recommended for applications that heavily depend on these wait functions.
    AUKN_SYM bool WaitOnAddressSpecial(EWaitMethod eMethod,
                                       const void *pTargetAddress,
                                       const void *pCompareAddress,
                                       AuUInt8 uWordSize,
                                       AuUInt64 qwNanoseconds,
                                       AuOptional<bool> optAlreadySpun = {} /*hint: do not spin before switching. subject to global config.*/);

    // Absolute timeout variant of nanosecond resolution eNotEqual WoA. Nanoseconds are in steady clock time. 0 = indefinite
    // In Wrapper Mode, it is possible to bypass the WoA implementation and bail straight into the kernel.
    // For improved ordering and EWaitMethod support, do not use Wrapper Mode.
    AUKN_SYM bool WaitOnAddressSteady(const void *pTargetAddress,
                                      const void *pCompareAddress,
                                      AuUInt8 uWordSize,
                                      AuUInt64 qwNanoseconds,
                                      AuOptional<bool> optAlreadySpun = {} /*hint: do not spin before switching. subject to global config.*/);

    // Absolute timeout variant of nanosecond resolution WoA. Nanoseconds are in steady clock time. 0 = indefinite
    // Emulation Mode over Wrapper Mode is recommended for applications that heavily depend on these wait functions.
    AUKN_SYM bool WaitOnAddressSpecialSteady(EWaitMethod eMethod,
                                             const void *pTargetAddress,
                                             const void *pCompareAddress,
                                             AuUInt8 uWordSize,
                                             AuUInt64 qwNanoseconds,
                                             AuOptional<bool> optAlreadySpun = {} /*hint: do not spin before switching. subject to global config.*/);

    // C++ doesn't allow volatile-qualified pointers to be implicitly cast down to their non-volatile counterparts.
    // The following stubs unify the above APIs for volatile and non-volatile marked atomic containers.
    // Whether the underlying data of "pTargetAddress" is thread-locally-volatile or not is up to the chosen compiler intrinsic used to load/store and/or whether you upcast to volatile later on.
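
    // For example (a sketch, assuming a volatile-qualified shared word; the overloads below absorb the cast):
    //
    //      volatile AuUInt32 uWord {};
    //      // ...
    //      uWord = 1;                      // or an AuAtomicXXX store, per the memory note above
    //      WakeAllOnAddress(&uWord);       // resolves to the (const volatile void *) overload below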

    inline void WakeAllOnAddress(const volatile void *pTargetAddress)
    {
        return WakeAllOnAddress((const void *)pTargetAddress);
    }

    inline void WakeOnAddress(const volatile void *pTargetAddress)
    {
        return WakeOnAddress((const void *)pTargetAddress);
    }

    inline void WakeNOnAddress(const volatile void *pTargetAddress,
                               AuUInt8 uNMaximumThreads)
    {
        return WakeNOnAddress((const void *)pTargetAddress, uNMaximumThreads);
    }

    inline bool TryWaitOnAddress(const volatile void *pTargetAddress,
                                 const void *pCompareAddress,
                                 AuUInt8 uWordSize)
    {
        return TryWaitOnAddress((const void *)pTargetAddress, pCompareAddress, uWordSize);
    }

    inline bool TryWaitOnAddressSpecial(EWaitMethod eMethod,
                                        const volatile void *pTargetAddress,
                                        const void *pCompareAddress,
                                        AuUInt8 uWordSize)
    {
        return TryWaitOnAddressSpecial(eMethod, (const void *)pTargetAddress, pCompareAddress, uWordSize);
    }

    inline bool TryWaitOnAddressEx(const volatile void *pTargetAddress,
                                   const void *pCompareAddress,
                                   AuUInt8 uWordSize,
                                   const AuFunction<bool(const void *, const void *, AuUInt8)> &check)
    {
        return TryWaitOnAddressEx((const void *)pTargetAddress, pCompareAddress, uWordSize, check);
    }

    inline bool TryWaitOnAddressSpecialEx(EWaitMethod eMethod,
                                          const volatile void *pTargetAddress,
                                          const void *pCompareAddress,
                                          AuUInt8 uWordSize,
                                          const AuFunction<bool(const void *, const void *, AuUInt8)> &check)
    {
        return TryWaitOnAddressSpecialEx(eMethod, (const void *)pTargetAddress, pCompareAddress, uWordSize, check);
    }

    inline bool WaitOnAddress(const volatile void *pTargetAddress,
                              const void *pCompareAddress,
                              AuUInt8 uWordSize,
                              AuUInt64 qwNanoseconds,
                              AuOptional<bool> optAlreadySpun = {})
    {
        return WaitOnAddress((const void *)pTargetAddress, pCompareAddress, uWordSize, qwNanoseconds, optAlreadySpun);
    }

    inline bool WaitOnAddressSpecial(EWaitMethod eMethod,
                                     const volatile void *pTargetAddress,
                                     const void *pCompareAddress,
                                     AuUInt8 uWordSize,
                                     AuUInt64 qwNanoseconds,
                                     AuOptional<bool> optAlreadySpun = {})
    {
        return WaitOnAddressSpecial(eMethod, (const void *)pTargetAddress, pCompareAddress, uWordSize, qwNanoseconds, optAlreadySpun);
    }

    inline bool WaitOnAddressSteady(const volatile void *pTargetAddress,
                                    const void *pCompareAddress,
                                    AuUInt8 uWordSize,
                                    AuUInt64 qwNanoseconds,
                                    AuOptional<bool> optAlreadySpun = {})
    {
        return WaitOnAddressSteady((const void *)pTargetAddress, pCompareAddress, uWordSize, qwNanoseconds, optAlreadySpun);
    }

    inline bool WaitOnAddressSpecialSteady(EWaitMethod eMethod,
                                           const volatile void *pTargetAddress,
                                           const void *pCompareAddress,
                                           AuUInt8 uWordSize,
                                           AuUInt64 qwNanoseconds,
                                           AuOptional<bool> optAlreadySpun = {})
    {
        return WaitOnAddressSpecialSteady(eMethod, (const void *)pTargetAddress, pCompareAddress, uWordSize, qwNanoseconds, optAlreadySpun);
    }
}