ECE 652 / CPS 650
Advanced Computer Architecture II
Spring 2025
Professor Daniel J. Sorin
>> Objectives <<
The objective of this course is to provide students with an understanding of
parallel computer architectures. Students will read research papers, lead
in-class discussions of papers, perform a research project, and present their
research projects in both written and oral formats.
The course focuses on both the design and evaluation of multicore processors.
The main themes of this course are: parallel programming, shared memory
multiprocessors, memory consistency models, interconnection networks, high
availability systems, interactions with current microprocessor and I/O
technology, novel architectures, and emerging technologies. The evaluation
portion of this course will focus on metrics, modeling, simulation, and
workloads for benchmarking.
Prerequisites: ECE 552, CPS 550, or consent of
instructor.
>> Class Location and Hours <<
Class meets Monday/Wednesday/Friday from 10:20-11:10am.
Location: Languages 320
>> Instructor <<
Professor Daniel J. Sorin
Office: 403 Wilkinson
Office Hours: TBD
Email: sorin@ee.duke.edu
>> Textbooks <<
The emphasis of the class will be on discussions of research papers, but we
will also use the following two textbooks (free PDF downloads from a Duke IP
address):
1) Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood. "A Primer on Memory Consistency and Cache Coherence," 2nd edition. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2020.
2) Natalie Enright Jerger, Tushar Krishna, and Li-Shiuan Peh. "On-Chip Networks," 2nd edition. Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2017.
>> Assignments and Grading <<
This is a graduate-level class that will not require "busy work." This class
will, however, require that students learn the reading material and learn how
to present research in both written and oral formats (see Hill and Patterson
for useful advice for presentations). Communication is very important in this
class. Students who struggle with reading and writing are encouraged to take
this course but should expect to work hard and to improve their communication
skills in the process.
Students are responsible for:
- Writing paper "summaries" - 10% of grade [these are not just summaries; please see the first set of slides for details]
  You may NOT use AI to help you do this. Doing so will be considered academic misconduct and will result in your class grade being lowered by a full letter grade (e.g., from A to B or from B+ to C+).
- Presenting two papers - 10% of grade
  You may NOT use AI to help you do this. Doing so will be considered academic misconduct and will result in your class grade being lowered by a full letter grade (e.g., from A to B or from B+ to C+).
- Midterm exam, WEDNESDAY, MARCH 4 - 15% of grade
- Final exam - 25% of grade
- Individual or group project - 40% of grade
The project is a semester-long assignment that should
reflect the goal of being no more than "a stone's throw" away from a
research paper. As such, the project will require:
- written proposal (no more than 3 pages), due Monday, March 2
- written progress report (no more than 3 pages), due Wednesday, March 25
- final document in conference/journal format (no more than 10 pages), hardcopy due in class on Wednesday, April 15
- final presentations in class, date TBD
Deadlines will be enforced except under extreme circumstances. I would prefer
that you turn in something not quite done on the due date rather than waiting
until after the deadline to try to finish it. Any project that is late by less
than 24 hours will lose 50% of its credit. Any project that is more than 24
hours late will receive a zero.
Academic Misconduct: I will not tolerate academically dishonest work. This
includes using AI to write paper reports or presentations, cheating on the
final exam, and plagiarism on the project.
Be careful on the project to cite
prior work and to give proper credit to others' research.
>> Topics and Readings <<
This list could change!
| Theme | Topic | Readings | Recommended Optional Readings |
| Intro to Multiprocessing | Parallelism, Goals, & Challenges | "Limits of Instruction-Level Parallelism" (Wall, ASPLOS 1991) | |
| | Programming Models & Parallel Programming | | "The PARSEC Benchmark Suite: Characterization and Architectural Implications" (Bienia et al., PACT 2008); "The SPLASH-2 Programs: Characterization and Methodological Considerations" (Woo et al., ISCA 1995); "The Problem with Threads" (Lee, Computer 2006) |
| | Execution Models | (none) | |
| Memory Consistency | Shared Memory & Coherence Definitions | Sorin et al. Textbook: Chapters 1-3 | |
| | Consistency Models | Sorin et al. Textbook: Chapters 3-5; "Heterogeneous-race-free memory models" (Hower et al., ASPLOS 2014) | |
| | Consistency Optimizations | "Is SC + ILP = RC?" (Gniady et al., ISCA 1999); "InvisiFence: Performance-transparent Memory Ordering in Conventional Multiprocessors" (Blundell et al., ISCA 2009) | "Two Techniques to Enhance the Performance of Memory Consistency Models" (Gharachorloo et al., ICPP 1991) |
| Cache Coherence | Coherence Basics | Sorin et al. Textbook: Chapters 1-2 (already covered) & 6-9 | |
| | Snooping Cache Coherence | "Timestamp Snooping: An Approach for Extending SMPs" (Martin et al., ASPLOS 2000) | "Starfire: Extending the SMP Envelope" (Charlesworth, IEEE Micro 1998); "Multicast Snooping: A New Coherence Method Using a Multicast Address Network" (Bilir et al., ISCA 1999) |
| | Directory Cache Coherence | "Architecture and Design of AlphaServer GS320" (Gharachorloo et al., ASPLOS 2000); "Cuckoo Directory: A Scalable Directory for Many-Core Systems" (Ferdman et al., HPCA 2011); "Why On-Chip Cache Coherence is Here to Stay" (Martin et al., CACM 2012) | "The Stanford DASH Multiprocessor" (Lenoski et al., Computer 1992) |
| | Advanced Topics in Coherence | "Token Coherence: Decoupling Performance and Correctness" (Martin et al., ISCA 2003); "DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism" (Choi et al., PACT 2011); "TSO-CC: Consistency directed cache coherence for TSO" (Elver and Nagarajan, HPCA 2014); "Cache Coherence for GPU Architectures" (Singh et al., HPCA 2013); "Fractal Coherence: Scalably Verifiable Cache Coherence" (Zhang, Lebeck, and Sorin, MICRO 2010) | |
| Synchronization & Transactional Memory | Synchronization Optimizations | "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution" (Rajwar and Goodman, MICRO 2001) | "Efficient Synchronization: Let Them Eat QOLB" (Kagi, Burger, and Goodman) |
| | Hardware TM, TM Software | "LogTM: Log-based Transactional Memory" (Moore et al., HPCA 2006) | Transactional Memory, 2nd edition (Harris, Larus, Rajwar) |
| Interconnection Networks | Interconnection Network Basics | Enright Jerger et al. textbook; "Flattened Butterfly Topology for On-Chip Networks" (Kim et al., MICRO 2007); "Modular Routing Design for Chiplet-Based Systems" (Yin et al., ISCA 2018) | |
| | Deadlock Avoidance | Enright Jerger et al. textbook; "Drain: Deadlock Removal for Arbitrary Irregular Networks" (Parasar et al., HPCA 2020) | "Virtual Channel Flow Control" (Dally, IEEE TPDS 1992); "A Survey of Wormhole Routing Techniques in Direct Networks" |
| Evaluation Tools & Methodology | Evaluation: Metrics & Modeling | "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (Williams, Waterman, and Patterson, CACM 2009) | "Cost-Effective Parallel Computing" (Wood and Hill, Computer 1995); "Analytic Evaluation of Shared-Memory Parallel Systems with ILP Processors" (Sorin et al., ISCA 1998); "Computer Architecture Performance Evaluation Methods" (Eeckhout) |
| | Evaluation: Simulation | "FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud" (Karandikar et al., ISCA 2018) | "The gem5 Simulator" (Binkert et al., CAN 2011) |
| | Evaluation: Workloads | "Simulating a $2M Commercial Server on a $2K PC" (Alameldeen et al., Computer 2003) | "Memory System Characterization of Commercial Workloads" (Barroso et al., ISCA 1998) |
| Other Architectures | Vector Machines | "Tarantula: A Vector Extension to the Alpha Architecture" (Espasa et al., ISCA 2002); "Introduction to the Cell Multiprocessor" (Kahle et al., IBM J. R&D 2005) | "The Cray-1 Computer System" (Russell, CACM 1978) |
| | GPU/GPGPU | "Ultra-Performance Pascal GPU and NVLink Interconnect" (Foley and Danskin, IEEE Micro 2017) | |
| | Scalable, Non-Coherent Multiprocessors | "Synchronization and Communication in the Cray T3E Multiprocessor" (Scott et al., ASPLOS 1996) | "The Network Architecture of the Connection Machine CM-5" (Leiserson et al., SPAA 1992) |
| | ML/AI Accelerators | "Efficiently Scaling Transformer Inference" (Pope et al., MLSYS 2023); "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings" (Jouppi et al., ISCA 2023) | |
| | Dataflow | "Executing a Program on the MIT Tagged-Token Dataflow Architecture" (Arvind and Nikhil, IEEE Trans. on Computers 1990) | "WaveScalar" (Swanson et al., MICRO 2003) |
| | Tiled Architectures | "Baring It All to Software: Raw Machines" (Waingold et al., Computer 1997); "Scaling to the End of Silicon with EDGE Architectures" (Burger, Keckler, and McKinley, IEEE Computer 2004) | Tilera Tile64 [website only, not a technical paper]; "Power and Energy Characterization of an Open Source 25-core Manycore Processor" (McKeown et al., HPCA 2018) |
| | Supercomputing, aka High Performance Computing (HPC) | "Anton 3: twenty microseconds of molecular dynamics simulation before lunch" (Shaw et al., SC 2021) | "Blue Gene: A Vision for Protein Science Using a Petaflop Supercomputer" (Allen et al., ISJ 2001) |