跳转至

ARM 公版核微架构演进

背景

ARM 公版核微架构的演进频繁,型号又比较多,相关信息散落在各种地方,为了方便查阅,在这里做一个收集。

2025 年

C1-Ultra

  • Inside Arm's New C1‑Ultra CPU: Double‑Digit IPC Gains Again!
    • C1-Ultra: successor to Cortex X925
    • Branch prediction: Additional tracking for local/per-PC history
    • 33% increase in L1 I-Cache available bandwidth
    • Out of order window size growth: Up to 25% growth, Up to ~2K instruction in flight
    • 2x L1 data cache capacity (128KB)
    • Data prefetchers: array-indexing coverage
  • Arm® C1-Ultra Core Technical Reference Manual
    • Implementation of the Scalable Vector Extension (SVE) with a 128-bit vector length and Scalable Vector Extension 2 (SVE2)
    • Implementation of the Scalable Matrix Extension (SME) and Scalable Matrix Extension 2 (SME2), and support for the C1-SME2 unit
    • configure the L2 cache to be 2048KB or 3072KB
    • A 64KB, 4-way set associative L1 instruction cache with 64-byte cache lines
    • A fully associative L1 instruction Translation Lookaside Buffer (TLB) with native support for 4KB, 16KB, 64KB, and 2MB page sizes
    • A 128KB, 4-way set associative cache with 64-byte cache lines
    • A fully associative L1 data TLB with native support for 4KB, 16KB, and 64KB page sizes and 2MB and 512MB block sizes
    • L2 cache is private to the core and can be configured to be 2MB 8-way set associative or 3MB 12-way set associative
    • L1 instruction TLB, Fully associative, 128 entries
    • L1 data TLB, Fully associative, 96 entries
    • L1 Statistical Profiling Extension (SPE) TLB, Located in the SPE block, VA to PA translations of any page and block size, 1 entry
    • L1 TRace Buffer Extension (TRBE) TLB, VA to PA translations of any page and block size, 1 entry
    • L2 TLB, Shared by instructions and data, 8-way set associative, 2048 entries
    • L1 instruction cache, 64KB, 4-way set associative, Virtually Indexed, Physically Tagged (VIPT) behaving as Physically Indexed, Physically Tagged (PIPT), Pseudo-Least Recently Used (LRU) cache replacement policy for L1, 32 bytes per cycle interface with L2
    • L1 data cache, 128KB, 4-way set associative, Virtually Indexed, Physically Tagged (VIPT) behaving as Physically Indexed, Physically Tagged (PIPT), Re-Reference Interval Prediction (RRIP) replacement policy, 4×64-bit read paths and 4×64-bit write paths for the integer execute pipeline, 4×128-bit read paths and 4×128-bit write paths for the vector execute pipeline
    • L2 cache, 2MB 8-way set associative with 4 banks or 3MB 12-way set associative with 4 banks, Physically Indexed, Physically Tagged (PIPT), Dynamic biased cache replacement policy, One CHI Issue E compliant interfaces with 256-bit read and write DAT channel widths

C1-Pro

  • Arm Lumex C1-Pro CPU Core: What You Need to Know
    • C1-Pro: successor to Cortex-A725
    • Larger direction predictor and branch history
    • 2x capacity 0-cycle BTB
    • 16x capacity 1-cycle BTB
    • 50% more L1 Instruction TLB capacity
    • Increase effective L1D cache bandwidth
    • Lower latency L2 TLB hit
    • New indirect prefetcher

2024 年

Cortex X925

  • Arm Unveils 2024 CPU Core Designs, Cortex X925, A725 and A520: Arm v9.2 Redefined For 3nm Archive
    • Decode & Dispatch: 10-wide
    • SIMD/FP execution: 6x 128b
    • Integer ALU pipelines: 1- and 2-cycle operations
    • Integer multiply execution: 4x versus Cortex-X4
    • FP compare execution: 2x versus Cortex-X4
    • >2x increase in SIMD/FP issue queues
    • 2x increase in max instruction-window capacity(注:Cortex X4 是 384x2,推测 Cortex X925 是 768x2)
    • Sign-extension instruction elimination
    • Branch prediction: 2x instruction window size
    • Instruction Fetch: 2x increase in L1 I$ available bandwidth, 2x increase in L1 iTLB size, Fold out unconditional direct branches
    • 3 -> 4 load pipelines
    • 2x increase in L1 D$ available bandwidth
    • 25-40% in back-end OoO growth
  • Arm® Cortex-X925 Core Technical Reference Manual
    • Implementation of the Scalable Vector Extension (SVE) with a 128-bit vector length and Scalable Vector Extension 2 (SVE2)
    • configure the L2 cache to be 2048KB or 3072KB
    • A 64KB, 4-way set associative L1 instruction cache with 64-byte cache lines
    • A fully associative L1 instruction Translation Lookaside Buffer (TLB) with native support for 4KB, 16KB, 64KB, and 2MB page sizes
    • A 64KB, 4-way set associative cache with 64-byte cache lines
    • A fully associative L1 data TLB with native support for 4KB, 16KB and 64KB page sizes and 2MB and 512MB block sizes
    • L2 cache is private to the core and can be configured to be 2MB 8-way set associative or 3MB 12-way set associative
    • L1 instruction TLB, Caches entries at the 4KB, 16KB, 64KB, or 2MB granularity of Virtual Address (VA) to Physical Address (PA) mapping only, Fully associative, 128 entries
    • L1 data TLB, Caches entries at the 4KB, 16KB, 64KB, 2MB, or 512MB granularity of VA to PA mappings only, Fully associative, 96 entries
    • L2 TLB, Shared by instructions and data, VA to PA mappings for 4KB, 16KB, 64KB, 2MB, 32MB, 512MB, and 1GB block sizes, Intermediate Physical Address (IPA) to PA mappings for: 2MB and 1GB block sizes in a 4KB translation granule, 32MB block size in a 16KB translation granule, 512MB block size in a 64KB granule; Intermediate PAs (descriptor PAs) obtained during a translation table walk, 8-way set associative, 2048 entries
    • L1 instruction cache, 64KB, 4-way set associative, Virtually Indexed, Physically Tagged (VIPT) behaving as Physically Indexed, Physically Tagged (PIPT)
    • The Cortex®-X925 core supports the AArch64 prefetch memory instructions, PRFM PLI, into the L1 instruction cache or L2 cache
    • L1 data cache, 64KB, 4-way set associative, Virtually Indexed, Physically Tagged (VIPT) behaving as Physically Indexed, Physically Tagged (PIPT), Re-Reference Interval Prediction (RRIP) replacement policy, 4×64-bit read paths and 4×64-bit write paths for the integer execute pipeline, 4×128-bit read paths and 4×128-bit write paths for the vector execute pipeline
    • L2 cache, 2MB 8-way set associative with 4 banks or 3MB 12-way set associative with 4 banks, Physically Indexed, Physically Tagged (PIPT)
  • Arm® Cortex-X925 Core Software Optimization Guide

2023 年

Cortex X4

Cortex A720

2022 年

Cortex X3

  • Arm Unveils Next-Gen Flagship Core: Cortex-X3
    • 50% larger L1 + L2 (new) BTB capacity
    • 10x larger L0 BTB capacity
    • New predictor dedicated for indirect branches
    • Double return-stack capacity (32 entries)
    • Mop cache 50% capacity (1.5K entries)
    • Removed 1 pipeline stage in Mop Cache fetch, 10->9 cycles for a branch mispredict
    • Increase decode bandwidth: 5->6
    • Integer ALUs increase 4->6: 2->4 single-cycle (SX), 2 single-/multi-cycle (MX)
    • ROB/MCQ: 288x2 -> 320x2
    • Integer load bandwdith: 24B -> 32B
    • Additional data prefetch engines: Spatial, Pointer/Indirect
  • Arm® Cortex‑X3 Core Technical Reference Manual

Neoverse V2

  • Arm Neoverse V2 platform: Leadership Performance and Power Efficiency for Next-Generation Cloud Computing, ML and HPC Workloads
    • 6-wide/8-wide front-end
    • 64KB ICache
    • 320+ OoO window
    • 8-wide dispatch
    • 8-wide retire
    • 2 LS + 1 LD / cycle
    • 64KB DCache
    • 6-ALU + 2-branch
    • Quad 128-bit low latency SIMD datapath
    • L2 10-cycle load-to-use, 128B/cycle, private L2 cache 1 or 2 MB
    • Two predicted branches per cycle
    • Predictor acts as ICache prefetcher
    • 64kB, 4-way set-associative L1 instruction cache
    • Two-level Branch Target Buffer
    • 8 table TAGE direction predictor with staged output
    • 10x larger nanoBTB
    • Split main BTB into two levels with 50% more entries
    • TAGE: 2x larger tables with 2-way associativity, Longer history
    • Indirect branches: Dedicated predictor
    • Fetch bandwidth: Doubled instruction TLB and cache BW
    • Fetch Queue: Doubled from 16 to 32 entries
    • Fill Buffer: Increased size from 12 to 16 entries
    • Decode bandwidth: Increased decoder lanes from 5 to 6, Increased Decode Queue from 16 to 24 entries
    • Rename checkpoints: Increased from 5 to 6 total checkpoints, Increased from 3 to 5 vector checkpoints
    • Late read of physical register file – no data in IQs
    • Result caches with lazy writeback
    • Added two more single-cycle ALUs
    • Larger Issue Queues, SX/MX: Increased from 20 to 22 entries, VX: Increased from 20 to 28 entries
    • Predicate operations: Doubled predicate bandwidth
    • Zero latency MOV; Subset of register-register and immediate move operations execute with zero latency
    • Instruction fusion: More fusion cases, including CMP + CSEL/CSET
    • Two load/store pipes + one load pipe
    • 4 x 8B result busses (integer)
    • 3 x 16B result busses (FP, SVE, Neon)
    • ST to LD forwarding at L1 hit latency
    • RST and MB to reduce tag and data accesses
    • Fully-associative L1 DTLB with multiple page sizes
    • 64kB 4-way set associative Dcache
    • TLB Increased from 40 to 48 entries
    • Replacement policy Changed from PLRU to dynamic RRIP
    • Larger Queues: Store Buffer, ReadAfterRead, ReadAfterWrite
    • Efficiency: VA hash based store to load forwarding
    • Multiple prefetching engines training on L1 and L2 accesses: Spatial Memory Streaming, Best Offset, Stride, Correlated Miss Cache, Page
    • New PF engines: Global SMS – larger offsets than SMS, Sampling Indirect Prefetch – pointer dereference, TableWalk – Page Table Entrie
    • Private unified Level 2 cache, 8-way SA, 4 independent banks
    • 64B read or write per 2 cycles per bank = 128B/cycle total
    • 96-entry Transaction Queue
    • Inclusive with L1 caches for efficient data and instruction coherency
    • AMBA CHI interface with 256b DAT channels
    • Capacity 2MB/8-way with latency of 1MB (10-cycle ld-to-use)
    • Replacement policy 6-state RRIP (up from 4)
  • Hot Chips 2023: Arm’s Neoverse V2
  • Arm® Neoverse™ V2 Core Technical Reference Manual
  • Arm Neoverse V2 Software Optimization Guide

2021 年

Cortex X2

2020 年

Neoverse N2

  • Arm Neoverse N2: Arm’s 2nd generation high performance infrastructure CPUs and system IPs
    • Branch Prediction, 2x 8 instrs (up to 2 taken per cycle), 2x improvement
    • Nano BTB (0 cyc taken-branch bubble), 64 entry, 4x improvement
    • Conditional branch direction state, 1.5x improvement
    • Main BTB, 8K entry, 1.33x improvement
    • Alt-Path Branch Prediction
    • 64KB Instruction cache
    • 1.5K entry Mop Cache
    • 16-entry Fetch Queue, 1.33x improvement
    • Fetch Width: 4 instr from i$, 5 instr from MOP$, Up to 1.5x improvement
    • Early branch redirect: uncond + cond
    • Decode width: 4 (I-cache) or 5 (Mop cache), Up to 1.25x improvement
    • Branch predict up to 16-inst/cycle, 2-taken/cycle
    • New Macro-op (MOP) cache with 1.5k entries
    • 50% larger branch direction predicton
    • 33% larger BTB with shorter average latency
    • Early re-steering for conditional branches that miss the BTB
    • Rename width: 5 instrs, 1.2x improvement
    • Rename Checkpointing: Yes
    • ROB size: 160+, 1.25x improvement
    • ALUs: 4, 1.33x improvement
    • Branch resolution: 2 per cycle, 2x improvement
    • Overall Pipeline Depth: 10 cycles, 1.1x improvement
    • 64KB L1 Data cache
    • Private 512KB/1MB L2 Cache
    • AGU: 2-LD/ST + 1 LD, 1.5x improvement
    • L1 LD Hit bandwidth: 3x 16B/cycle, 1.5x improvement
    • Store data B/W: 32B/cycle, 2x improvement
    • L2 bandwidth: 64B read + 64B write, 2x improvement
    • L2 transactions: 64, 1.3x improvement
    • Data Prefetch Engines: Stride, spatial/region, stream, temporal
    • Correlated Miss Caching (CMC) prefetching

Neoverse V1

  • SW defined cars: HPC, from the cloud to the dashboard for an amazing driver experience
    • Faster run-ahead for prefetching into the I$ (2x32B bandwidth)
    • 33% larger BTBs (8K entry)
    • 6x nano BTB (96 entry), zero-cycle bubble
    • 2x number of concurrent code regions tracked in front-end
    • Introduction of Mop Cache, L0 decoded instruction cache (3K entry)
    • high dispatch bandwidth, 8-instrs per cycle, 2x increase, I$ decode bandwidth increased from 4x to 5x
    • Lower latency decode pipeline by 1 stage
    • OoO window size, 2x+ ROB (256 entry + compression)
    • Increase superscalar integer execution bandwidth, 1->2 Branch Execution, 3->4 ALU
    • 2x vector/fp bandwidth, 2x256b – SVE (new), 4x128b – Neon/FP
    • 3rd LD AGU/pipe (50% incr), LS LS LD
    • LD/ST data bandwidth, LD: 2x16B -> 3x16B, LD (SVE): 2x32B, ST: 16B -> 32B (2x), broken out into separate issue pipes
    • Number of outstanding external memory transactions (48->96)
    • MMU capacity 1.2K->2K entry (67% incr)
    • L2 latency reduced by 1 cycle for 1M (now 10cyc load to use)
    • 11+ stage accordion pipeline
    • 8-wide front-end / 15-wide issue
    • Four 64-bit integer ALUs + two dedicated Branch units
    • 2x 256-bit SVE datapaths
    • 4x 128-bit Neon/FP datapaths
    • 3x load / store addr
    • 3x load data & 2x store data pipeline
    • 8-wide Instruction fetch
    • 5-8 wide decode / rename
    • pipeline: P1 P2 F1 F2 DE1 RR RD I0 I1 I2 ...

Cortex X1

Cortex A78

2019 年

Neoverse N1

  • The Arm Neoverse N1 Platform: Building Blocks for the Next-Gen Cloud-to-Edge Infrastructure SoC
    • 4-wide front-end
    • dispatching/committing up to 8 instructions per cycle
    • three ALUs, a branch execution unit, two Advanced SIMD units, and two load/store execution units
    • minimum misprediction penalty is 11-cycle
    • fetch up to 4 instructions per cycle
    • large 6K-entry main branch target buffer with 3-cycle access latency
    • 64-entry micro-BTB and a 16-entry nano-BTB
    • 12-entry fetch queue
    • fully associative 48-entry instruction TLB
    • 4-way set-associative 64KB I-cache
    • I-cache can deliver up to 16B of instructions per cycle
    • up to 8 outstanding I-cache refill requests
    • 4-wide decoder
    • renaming unit can receive up to 4 macro-ops per cycle
    • up to 8 micro-operations can be dispatched into the out-of-order engine each cycle
    • The commit queue can track up to 128 micro operations
    • up to 8 micro-ops can be committed per cycle
    • a distributed issue queue with more than 100 micro-operations
    • 4 integer execution pipelines, 2 load/store pipelines, and 2 Advanced SIMD pipelines
    • 64kB 4-way set associative L1 data cache, 4-cycle load to use latency and a bandwidth of 32 bytes/cycle
    • The core-private 8-way set associative L2 cache is up to 1MB in size and has a load-to-use latency of 11 cycles
    • can also be configured with smaller L2 cache sizes of 256kB and 512kB with a load-to-use latency of 9 cycles
    • L2 cache connects to the system via an AMBA 5 CHI interface with 16-byte data channels
    • L3 cluster cache can be up to 2MB, with a load-to-use latency ranging between 28 and 33 cycles
    • up to 256MB of shared system-level cache

评论