Current microprocessor architectures rely upon multi-level data caches and low degrees of concurrency to serve a wide range of applications. These architectures are well suited to efficiently executing applications whose memory access patterns exhibit spatial and/or temporal locality. However, data-intensive applications often access memory in an irregular manner that prevents optimal use of the memory hierarchy. In this work, we introduce GoblinCore-64 (GC64), a novel architecture that supports large-scale, data-intensive high-performance computing workloads using a unique memory hierarchy coupled to a latency-hiding microarchitecture. The GC64 infrastructure is a hierarchical set of modules designed to support concurrency and latency hiding. The memory hierarchy is constructed using an on-chip scratchpad and Hybrid Memory Cube 3D memories. The RISC-V-based instruction set includes support for scatter/gather memory operations, task concurrency and task management. We evaluate GC64 using standard benchmarks that include NAS, HPCG, BOTS and the GAP Benchmark Suite. We find that GC64 accelerates these workloads by up to 14X per core and improves bandwidth by 3.5X.