ECE 652 / CPS 650

Advanced Computer Architecture II

Spring 2025

Professor Daniel J. Sorin

                  

>> Objectives <<

 

The objective of this course is to provide students with an understanding of parallel computer architectures. Students will read research papers, lead in-class discussions of papers, carry out a research project, and present their projects in both written and oral formats.

 

The course focuses on both the design and evaluation of multicore processors. The main themes of this course are: parallel programming, shared memory multiprocessors, memory consistency models, interconnection networks, high availability systems, interactions with current microprocessor and I/O technology, novel architectures, and emerging technologies.  The evaluation portion of this course will focus on metrics, modeling, simulation, and workloads for benchmarking.

 

Prerequisites: ECE 552, CPS 550, or consent of instructor.

 

>> Class Location and Hours <<

 

Class meets Monday/Wednesday/Friday from 10:20-11:10am.

Location: Languages 320

 

>> Instructor <<

 

Professor Daniel J. Sorin

Office: 403 Wilkinson

Office Hours: TBD

Email: sorin@ee.duke.edu

 

>> Textbooks <<

 

The emphasis of the class will be on discussions of research papers, but we will also use the following two textbooks (free PDF downloads from a Duke IP address):

1)      Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood. "A Primer on Memory Consistency and Cache Coherence." 2nd edition.
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2020.

2)      Natalie Enright Jerger, Tushar Krishna, and Li-Shiuan Peh. "On-Chip Networks." 2nd edition.
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2017.

>> Assignments and Grading <<

 

This is a graduate-level class that will not require "busy work." This class will, however, require that students learn the reading material and learn how to present research in both written and oral formats (see Hill and Patterson for useful advice on presentations). Communication is very important in this class. Students who struggle with reading and writing are encouraged to take this course but should expect to work hard and to improve their communication skills in the process.

 

 

 

 

Students are responsible for:

You may NOT use AI to help you do this.  Doing so will be considered academic misconduct and will result in your class grade being lowered by a full letter grade (e.g., from A to B or from B+ to C+).


 

 

The project is a semester-long assignment that should reflect the goal of being no more than "a stone's throw" away from a research paper.  As such, the project will require:

 

 

Deadlines will be enforced except under extreme circumstances.  I would prefer that you turn in something not quite done on the due date rather than waiting until after the deadline to try to finish it.  Any project that is late by less than 24 hours will lose 50%. 

Any project that is more than 24 hours late will receive a zero.

 

Academic Misconduct: I will not tolerate academically dishonest work. This includes using AI to write paper reports or presentations, cheating on the final exam, and plagiarizing on the project.

 

Be careful on the project to cite prior work and to give proper credit to others' research.

>> Topics and Readings <<

 

This list could change!

 

Theme

Topic

Readings

Recommended Optional Readings

Intro to Multiprocessing

Parallelism, Goals, & Challenges

"Limits of Instruction-Level Parallelism"  (Wall, ASPLOS 1991)

 

Programming Models & Parallel Programming

"The PARSEC Benchmark Suite: Characterization and Architectural Implications" (Bienia et al., PACT 2008)

"The SPLASH-2 Programs: Characterization and Methodological Considerations" (Woo et al., ISCA 1995)

"The Problem with Threads" (Lee, Computer 2006)

Execution Models

(none)

 

Memory Consistency

Shared Memory & Coherence Definitions

Sorin et al. Textbook: Chapters 1-3

 

Consistency Models

Sorin et al. Textbook: Chapters 3-5

 

“Heterogeneous-race-free memory models” (Hower et al., ASPLOS 2014)

 

Consistency Optimizations

"Is SC + ILP = RC?" (Gniady et al., ISCA 1999)

"InvisiFence: Performance-transparent Memory Ordering in Conventional Multiprocessors" (Blundell et al., ISCA 2009)

 

"Two Techniques to Enhance the Performance of Memory Consistency Models" (Gharachorloo et al., ICPP 1991)

 

Cache Coherence

Coherence Basics

Sorin et al. Textbook, Chapters 1-2 (already covered) & 6-9

 

Snooping Cache Coherence

"Timestamp Snooping: An Approach for Extending SMPs" (Martin et al., ASPLOS 2000)

"Starfire: Extending the SMP Envelope" (Charlesworth, IEEE Micro 1998)

"Multicast Snooping: A New Coherence Method Using a Multicast Address Network" (Bilir et al., ISCA 1999)

Directory Cache Coherence

"Architecture and Design of AlphaServer GS320" (Gharachorloo et al., ASPLOS 2000)

 

“Cuckoo Directory: A Scalable Directory for Many-Core Systems” (Ferdman et al., HPCA 2011)

"Why On-Chip Cache Coherence is Here to Stay" (Martin et al., CACM 2012)

"The Stanford DASH Multiprocessor" (Lenoski et al., Computer 1992)

 

Advanced Topics in Coherence

"Token Coherence: Decoupling Performance and Correctness" (Martin et al., ISCA 2003)


"DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism" (Choi et al., PACT 2011)

"TSO-CC: Consistency directed cache coherence for TSO" (Elver and Nagarajan, HPCA 2014)

“Cache Coherence for GPU Architectures” (Singh et al., HPCA 2013)

"Fractal Coherence: Scalably Verifiable Cache Coherence" (Zhang, Lebeck, and Sorin, MICRO 2010)

 

Synchronization & Transactional Memory

Synchronization Optimizations

"Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution" (Rajwar and Goodman, MICRO 2001)

“Efficient Synchronization: Let Them Eat QOLB” (Kagi, Burger, and Goodman)

Hardware TM, TM Software

"LogTM: Log-based Transactional Memory" (Moore et al., HPCA 2006)

Transactional Memory, 2nd edition (Harris, Larus, Rajwar)

Interconnection Networks

Interconnection Network Basics

Enright Jerger et al. textbook

"Flattened Butterfly Topology for On-Chip Networks" (Kim et al., MICRO 2007)

“Modular Routing Design for Chiplet-Based Systems” (Yin et al., ISCA 2018)

 

Deadlock Avoidance

Enright Jerger et al. textbook

“Drain: Deadlock Removal for Arbitrary Irregular Networks” (Parasar et al., HPCA 2020)

"Virtual Channel Flow Control" (Dally, IEEE TPDS 1992)

"A Survey of Wormhole Routing Techniques in Direct Networks"

Evaluation Tools & Methodology

Evaluation: Metrics & Modeling

“Roofline: An Insightful Visual Performance Model for Multicore Architectures” (Williams, Waterman, and Patterson, CACM 2009)

"Cost-Effective Parallel Computing" (Wood and Hill, Computer 1995)

"Analytic Evaluation of Shared-Memory Parallel Systems with ILP Processors" (Sorin et al., ISCA 1998)

“Computer Architecture Performance Evaluation Methods” (Eeckhout)

Evaluation: Simulation

“FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud” (Karandikar et al., ISCA 2018)

“The gem5 Simulator” (Binkert et al., CAN 2011)

Evaluation: Workloads

"Simulating a $2M Commercial Server on a $2K PC" (Alameldeen et al., Computer 2003)

"Memory System Characterization of Commercial Workloads" (Barroso et al., ISCA 1998)

Other Architectures

Vector Machines

"Tarantula: A Vector Extension to the Alpha Architecture" (Espasa et al., ISCA 2002)

"Introduction to the Cell Multiprocessor" (Kahle et al., IBM J. R&D 2005)

"The Cray-1 Computer System" (Russell, CACM 1978)

 

GPU/GPGPU

“Ultra-Performance Pascal GPU and NVLink Interconnect” (Foley and Danskin, IEEE Micro 2017)

 

Scalable, Non-Coherent Multiprocessors

"Synchronization and Communication in the Cray T3E Multiprocessor" (Scott et al., ASPLOS 1996)

"The Network Architecture of the Connection Machine CM-5" (Leiserson et al., SPAA 1992)

ML/AI Accelerators

“Efficiently Scaling Transformer Inference” (Pope et al., MLSys 2023)

 

"TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings" (Jouppi et al., ISCA 2023)

 

Dataflow

"Executing a Program on the MIT Tagged-Token Dataflow Architecture" (Arvind and Nikhil, IEEE Trans. on Computers 1990)

 

"WaveScalar" (Swanson et al., MICRO 2003)

Tiled Architectures

"Baring It All to Software: Raw Machines" (Waingold et al., Computer 1997)

“Scaling to the End of Silicon with EDGE Architectures” (Burger, Keckler, and McKinley, IEEE Computer 2004)

Tilera Tile64 [website only, not a technical paper]

“Power and Energy Characterization of an Open Source 25-core Manycore Processor” (McKeown et al., HPCA 2018)

Supercomputing, aka High Performance Computing (HPC)

“Anton 3: twenty microseconds of molecular dynamics simulation before lunch” (Shaw et al., SC 2021)

"Blue Gene: A Vision for Protein Science Using a Petaflop Supercomputer" (Allen et al., ISJ 2001)