Recent advances in multi-dimensional, or stacked, memory devices have led to a significant resurgence in research exploring more expressive memory operations to improve application throughput. The goal of these efforts is to provide memory operations in the logic layer of a stacked device that offer pseudo processing-near-memory capabilities, reducing the bandwidth required to perform common operations across concurrent applications. One such area of concern is the ability to provide high-performance, low-latency mutexes and associated barrier synchronization techniques. Previous attempts at cache-based mutex optimization and tiered barrier synchronization yield some degree of application speedup, but still induce sub-optimal behavior such as cache-line contention and heavy message traffic. Several previous architectures, however, have presented techniques that extend the core physical address storage with additional, more expressive bit storage in order to provide fine-grained concurrency mechanisms in hardware. This work presents a novel methodology and associated implementation for providing in-situ extended memory operations in an HMC Gen2 device. The methodology provides a single lock, or tag, bit for every 64-bit word in memory using the in-situ storage. Further, we present an address inversion technique that enables the tag-bit operations to execute their respective read-arbitrate-commit sequences concurrently, with statistically few collisions between the tag-bit storage and the data storage. We conclude this work with results from utilizing the commands to implement a traditional multi-threaded mutex algorithm as well as a multi-threaded static tree barrier, both of which exhibit sub-linear scaling.