>> Objectives <<

The objective of this course is to provide students with an understanding of parallel computer architectures. Students will read research papers, lead in-class discussions of papers, carry out a research project, and present their projects in both written and oral formats.

The course focuses on both the design and evaluation of multicore processors. The main themes of this course are: parallel programming, shared memory multiprocessors, memory consistency models, interconnection networks, high availability systems, interactions with current microprocessor and I/O technology, novel architectures, and emerging technologies. The evaluation portion of this course will focus on metrics, modeling, simulation, and workloads for benchmarking.

Prerequisites: ECE 552, CPS 550, or consent of instructor.

>> Class Location and Hours <<

Class meets Monday/Wednesday/Friday from 10:20-11:10am.

Location: Languages 320

>> Instructor <<

Professor Daniel J. Sorin

Office: 403 Wilkinson

Office Hours: TBD

Email: sorin@ee.duke.edu

>> Textbooks <<

The emphasis of the class will be discussions of research papers, but we will also use the following two textbooks (free PDF downloads from a Duke IP address):

1) Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, and David A. Wood. "A Primer on Memory Consistency and Cache Coherence," 2nd edition.
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2020.

2) Natalie Enright Jerger, Tushar Krishna, and Li-Shiuan Peh. "On-Chip Networks," 2nd edition.
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2017.

>> Assignments and Grading <<

This is a graduate-level class that will not require "busy work." This class will, however, require that students learn the reading material and learn how to present research in both written and oral formats (see Hill and Patterson for useful advice on presentations). Communication is very important in this class. Students who struggle with reading and writing are encouraged to take this course but should expect to work hard and to improve their communication skills in the process.

Students are responsible for:

Writing paper “summaries” - 10% of grade [these are not just summaries – please see first set of slides for details]

You may NOT use AI to help you do this. Doing so will be considered academic misconduct and will result in your class grade being lowered by a full letter grade (e.g., from A to B or from B+ to C+).

Presenting two papers - 10% of grade

You may NOT use AI to help you do this. Doing so will be considered academic misconduct and will result in your class grade being lowered by a full letter grade (e.g., from A to B or from B+ to C+).

Midterm exam, WEDNESDAY, MARCH 4 - 15% of grade
Final exam - 25% of grade
Individual or group project - 40% of grade

The project is a semester-long assignment that should reflect the goal of being no more than "a stone's throw" away from a research paper. As such, the project will require:

written proposal (no more than 3 pages), due Monday, March 2
written progress report (no more than 3 pages), due Wednesday, March 25
final document in conference/journal format (no more than 10 pages), hardcopy due in class on Wednesday, April 15
final presentations (in class), date TBD

Deadlines will be enforced except under extreme circumstances. I would prefer that you turn in something not quite done on the due date rather than waiting until after the deadline to try to finish it. Any project that is late by less than 24 hours will lose 50% of its credit.

Any project that is more than 24 hours late will receive a zero.

Academic Misconduct: I will not tolerate academically dishonest work. This includes using AI to write paper reports or presentations, cheating on the final exam, and plagiarism on the project.

Be careful on the project to cite prior work and to give proper credit to others' research.

>> Topics and Readings <<

This list could change!

Theme	Topic	Readings	Recommended Optional Readings
Intro to Multiprocessing	Parallelism, Goals, & Challenges	"Limits of Instruction-Level Parallelism" (Wall, ASPLOS 1991)
	Programming Models & Parallel Programming		"The PARSEC Benchmark Suite: Characterization and Architectural Implications" (Bienia et al., PACT 2008) "The SPLASH-2 Programs: Characterization and Methodological Considerations" (Woo et al., ISCA 1995) "The Problem with Threads" (Lee, Computer 2006)
	Execution Models	(none)
Memory Consistency	Shared Memory & Coherence Definitions	Sorin et al. Textbook: Chapters 1-3
	Consistency Models	Sorin et al. Textbook: Chapters 3-5 “Heterogeneous-race-free memory models” (Hower et al., ASPLOS 2014)
	Consistency Optimizations	"Is SC + ILP = RC?" (Gniady et al., ISCA 1999) "InvisiFence: Performance-transparent Memory Ordering in Conventional Multiprocessors" (Blundell et al., ISCA 2009)	"Two Techniques to Enhance the Performance of Memory Consistency Models" (Gharachorloo et al., ICPP 1991)
Cache Coherence	Coherence Basics	Sorin et al. Textbook, Chapters 1-2 (already covered) & 6-9
	Snooping Cache Coherence	"Timestamp Snooping: An Approach for Extending SMPs" (Martin et al., ASPLOS 2000)	"Starfire: Extending the SMP Envelope" (Charlesworth, IEEE Micro 1998) "Multicast Snooping: A New Coherence Method Using a Multicast Address Network" (Bilir et al., ISCA 1999)
	Directory Cache Coherence	"Architecture and Design of AlphaServer GS320" (Gharachorloo et al., ASPLOS 2000) “Cuckoo Directory: A Scalable Directory for Many-Core Systems” (Ferdman et al., HPCA 2011) "Why On-Chip Cache Coherence is Here to Stay" (Martin et al., CACM 2012)	"The Stanford DASH Multiprocessor" (Lenoski et al., Computer 1992)
	Advanced Topics in Coherence	"Token Coherence: Decoupling Performance and Correctness" (Martin et al., ISCA 2003) "DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism" (Choi et al., PACT 2011) "TSO-CC: Consistency directed cache coherence for TSO" (Elver and Nagarajan, HPCA 2014) “Cache Coherence for GPU Architectures” (Singh et al., HPCA 2013) "Fractal Coherence: Scalably Verifiable Cache Coherence" (Zhang, Lebeck, and Sorin, MICRO 2010)
Synchronization & Transactional Memory	Synchronization Optimizations	"Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution" (Rajwar and Goodman, MICRO 2001)	“Efficient Synchronization: Let Them Eat QOLB” (Kagi, Burger, and Goodman)
Synchronization & Transactional Memory	Hardware TM, TM Software	"LogTM: Log-based Transactional Memory" (Moore et al., HPCA 2006)	Transactional Memory, 2nd edition (Harris, Larus, Rajwar)
Interconnection Networks	Interconnection Network Basics	Enright Jerger et al. textbook "Flattened Butterfly Topology for On-Chip Networks" (Kim et al., MICRO 2007) “Modular Routing Design for Chiplet-Based Systems” (Yin et al., ISCA 2018)
Interconnection Networks	Deadlock Avoidance	Enright Jerger et al. textbook “Drain: Deadlock Removal for Arbitrary Irregular Networks” (Parasar et al., HPCA 2020)	"Virtual Channel Flow Control" (Dally, IEEE TPDS 1992) "A Survey of Wormhole Routing Techniques in Direct Networks"
Evaluation Tools & Methodology	Evaluation: Metrics & Modeling	“Roofline: An Insightful Visual Performance Model for Multicore Architectures” (Williams, Waterman, and Patterson, CACM 2009)	"Cost-Effective Parallel Computing" (Wood and Hill, Computer 1995) "Analytic Evaluation of Shared-Memory Parallel Systems with ILP Processors" (Sorin et al., ISCA 1998) “Computer Architecture Performance Evaluation Methods” (Eeckhout)
	Evaluation: Simulation	“FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud” (Karandikar et al., ISCA 2018)	“The gem5 Simulator” (Binkert et al., CAN 2011)
	Evaluation: Workloads	"Simulating a $2M Commercial Server on a $2K PC" (Alameldeen et al., Computer 2003)	"Memory System Characterization of Commercial Workloads" (Barroso et al., ISCA 1998)
Other Architectures	Vector Machines	"Tarantula: A Vector Extension to the Alpha Architecture" (Espasa et al., ISCA 2002) "Introduction to the Cell Multiprocessor" (Kahle et al., IBM J. R&D 2005)	"The Cray-1 Computer System" (Russell, CACM 1978)
	GPU/GPGPU	“Ultra-Performance Pascal GPU and NVLink Interconnect” (Foley and Danskin, IEEE Micro 2017)
	Scalable, Non-Coherent Multiprocessors	"Synchronization and Communication in the Cray T3E Multiprocessor" (Scott et al., ASPLOS 1996)	"The Network Architecture of the Connection Machine CM-5" (Leiserson et al., SPAA 1992)
	ML/AI Accelerators	“Efficiently Scaling Transformer Inference” (Pope et al., MLSYS 2023) “TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings” (Jouppi et al., ISCA 2023)
	Dataflow	"Executing a Program on the MIT Tagged-Token Dataflow Architecture" (Arvind and Nikhil, IEEE Trans. on Computers 1990)	“WaveScalar” (Swanson et al., MICRO 2003)
	Tiled Architectures	"Baring It All to Software: Raw Machines" (Waingold et al., Computer 1997) “Scaling to the End of Silicon with EDGE Architectures” (Burger, Keckler, and McKinley, IEEE Computer 2004)	Tilera Tile64 [website only, not a technical paper] “Power and Energy Characterization of an Open Source 25-core Manycore Processor” (McKeown et al., HPCA 2018)
	Supercomputing, aka High Performance Computing (HPC)	“Anton 3: twenty microseconds of molecular dynamics simulation before lunch” (Shaw et al., SC 2021)	"Blue Gene: A Vision for Protein Science Using a Petaflop Supercomputer" (Allen et al., ISJ 2001)