ECE 652 / CPS 650

Advanced Computer Architecture II

Spring 2014

Professor Daniel J. Sorin




The objective of this course is to provide students with an understanding of parallel computer architectures.  Students will read research papers, lead in-class discussions of papers, perform a research project, and present their research projects in both written and oral formats.

The course focuses on both the design and evaluation of multiprocessor systems. The main design themes of this course are: parallel programming, system organizations, shared memory multiprocessors, memory consistency models, interconnection networks, high availability systems, interactions with current microprocessor and I/O technology, novel architectures, and emerging technologies.  The evaluation portion of this course will focus on metrics, modeling, simulation, and workloads for benchmarking.

Prerequisites: ECE 552, CPS 550, or consent of instructor.


Class Location and Hours


Class meets Monday/Wednesday/Friday from 8:45am - 9:35am.

Location: Hudson Hall 115A



Professor Daniel J. Sorin

Office: 209C Hudson Hall

Office Hours: Monday 9:30-10:30, Wednesday 2:00-3:00

Email: Email Address of Daniel Sorin



The emphasis of the class will be discussions of research papers, but we will also use the following textbook (free PDF download from a Duke IP address):

Daniel J. Sorin, Mark D. Hill, and David A. Wood. "A Primer on Memory Consistency and Cache Coherence." Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, May 2011.


Assignments and Grading


This is a graduate level class that will not require "busy work."  This class will, however, require that students learn the reading material and learn how to present research in both written and oral formats (see Hill and Patterson for useful advice for presentations).  Communication is very important in this class.  Students who struggle with reading and writing are encouraged to take this course but should expect to work hard and to improve their communication skills in the process.

Students are responsible for:

The project is a semester-long assignment that should reflect the goal of being no more than "a stone's throw" away from a research paper.  As such, the project will require:

Deadlines will be enforced except under extreme circumstances.  I would prefer that you turn in something not quite done on the due date rather than waiting until after the deadline to try to finish it.  Any project that is late by less than 24 hours will lose 50%.  Any project that is more than 24 hours late will receive a zero.

Academic Misconduct: I will not tolerate academically dishonest work.  This includes cheating on the final exam and plagiarism on the project.  Be careful on the project to cite prior work and to give proper credit to others' research.

Topics and Readings

Readings in italics are optional material.   This list of readings is subject to change (with sufficient warning).




Introduction to Multiprocessing


Parallelism, Goals, & Challenges

parallelism, limits, Amdahl’s Law

"Limits of Instruction-Level Parallelism"  (Wall, ASPLOS 1991)

Programming Models & Parallel Programming

shared memory, threads, tasks, PRAM, etc.

synchronization basics: locks, barriers, etc.


"The Problem with Threads" (Lee, Computer 2006)

"Parallel Programming Must Be Deterministic by Default" (Bocchino et al., HotPar 2009)

"The PARSEC Benchmark Suite: Characterization and Architectural Implications" (Bienia et al., PACT 2008)

"The SPLASH-2 Programs: Characterization and Methodological Considerations" (Woo et al., ISCA 1995)

Execution Models



Shared Memory: Memory Consistency

Coherence Basics + Consistency Basics

Textbook: Chapters 1-3

Consistency Models: SC, TSO/x86, XC

Textbook: Chapters 3-5

Consistency Optimizations

speculation, Scheurich's optimization

"Two Techniques to Enhance the Performance of Memory Consistency Models" (Gharachorloo et al., ICPP 1991)

"Is SC + ILP = RC?" (Gniady et al., ISCA 1999)

"InvisiFence: Performance-transparent Memory Ordering in Conventional Multiprocessors" (Blundell et al., ISCA 2009)

Shared Memory: Cache Coherence

Coherence Basics

Textbook: Chapters 1-2 (already covered) & 6-9

Snooping Cache Coherence

"Starfire: Extending the SMP Envelope" (Charlesworth, IEEE Micro 1998)

"Multicast Snooping: A New Coherence Method Using a Multicast Address Network" (Bilir et al., ISCA 1999)

"Timestamp Snooping: An Approach for Extending SMPs" (Martin et al., ASPLOS 2000)

Directory Cache Coherence

"The Stanford DASH Multiprocessor" (Lenoski et al., Computer 1992)

"Architecture and Design of AlphaServer GS320" (Gharachorloo et al., ASPLOS 2000)

"An Evaluation of Directory Schemes for Cache Coherence" (Agarwal et al., ISCA 1988)

Coherence in the Age of Multicores

"Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor" (Conway et al., IEEE Micro 2010)

"Why On-Chip Cache Coherence is Here to Stay" (Martin, Hill, and Sorin, CACM 2012)

Advanced Topics in Coherence

token coherence, COMA, coherence domains


"Token Coherence: Decoupling Performance and Correctness" (Martin et al., ISCA 2003)

"Virtual Hierarchies to Support Server Consolidation" (Marty and Hill, ISCA 2007)

"Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA"  (Falsafi et al., ISCA 1997)

"Fractal Coherence: Scalably Verifiable Cache Coherence" (Zhang, Lebeck, and Sorin, MICRO 2010)

"A New Perspective for Efficient Virtual-Cache Coherence" (Kaxiras and Ros, ISCA 2013)

Synchronization Optimizations & Transactional Memory

Synchronization Optimizations



"Efficient Synchronization: Let Them Eat QOLB" (Kagi et al., ISCA 1997)

"Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution" (Rajwar and Goodman, MICRO 2001)

Hardware TM, TM Software

"LogTM: Log-based Transactional Memory" (Moore et al., HPCA 2006)

"STAMP: Stanford Transactional Applications for Multi-Processing" (Minh et al., IISWC 2008)

Transactional Memory, 2nd edition [very nice short book]

Interconnection Networks

Interconnection Network Basics

topology, routing, flow control

"The Alpha 21364 Network Architecture" (Mukherjee et al., Hot Interconnects 2001)

"Flattened Butterfly Topology for On-Chip Networks" (Kim et al., MICRO 2007)

On-Chip Networks [very nice short book]

Deadlock Avoidance

virtual channels, turn model, hot-potato routing

"Virtual Channel Flow Control" (Dally, IEEE TPDS 1992)

"A Survey of Wormhole Routing Techniques in Direct Networks" [includes "Turn Model" concept]

Evaluation Tools and Methodology

Evaluation: Metrics & Modeling

scalability, throughput, why not IPC?

mathematical modeling of performance

"Cost-Effective Parallel Computing" (Wood and Hill, Computer 1995)

"Analytic Evaluation of Shared-Memory Parallel Systems with ILP Processors" (Sorin et al., ISCA 1998)

Evaluation: Simulation

precision vs. performance

full-system, parallel host

"Simics: A Full System Simulation Platform" (Magnusson et al., Computer 2002)

"RAMP: A Research Accelerator for Multiple Processors" (Wawrzynek et al., Tech Report 2006)

"The Wisconsin Wind Tunnel: Virtual Prototyping of Parallel Computers" (Reinhardt et al., SIGMETRICS 1993)

Evaluation: Workloads

scientific vs. commercial, TLP, importance of benchmark selection

"Memory System Characterization of Commercial Workloads" (Barroso et al., ISCA 1998)

"Simulating a $2M Commercial Server on a $2K PC" (Alameldeen et al., Computer 2003)

Reliability and Availability

Fault Tolerant Computers



"IBM S/390 Parallel Enterprise Server G5 Fault Tolerance: A Historical Perspective" (Spainhower et al., IBM J. R&D 1999)

"Dynamic Verification of Sequential Consistency" (Meixner and Sorin, ISCA 2005)

"Fault-Tolerant Systems in Commercial Applications" [survey of classic FT systems]

"SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery" (Sorin et al., ISCA 2002)

Other Architectures




Vector Machines

"The Cray-1 Computer System" (Russell, CACM 1978)

"Tarantula: A Vector Extension to the Alpha Architecture" (Espasa et al., ISCA 2002)

"Introduction to the Cell Multiprocessor" (Kahle et al., IBM J. R&D 2005)


"Larrabee: A Many-Core x86 Architecture for Visual Computing" (Seiler et al., SIGGRAPH 2008)

Scalable, Non-Coherent Multiprocessors

message passing: Paragon, CM5, active messages

shared physical memory: Cray T3E

"The Network Architecture of the Connection Machine CM-5" (Leiserson et al., SPAA 1992)

"Synchronization and Communication in the Cray T3E Multiprocessor" (Scott et al., ASPLOS 1996)


"Executing a Program on the MIT Tagged-Token Dataflow Architecture" (Arvind and Nikhil, IEEE Trans. on Computers 1990)

Tiled Architectures

"Baring It All to Software: Raw Machines" (Waingold et al., Computer 1997)

Tilera Tile64 [website only, not a technical paper]


"Anton, A Special-Purpose Machine for Molecular Dynamics Simulation" (Shaw et al., ISCA 2007)

"Blue Gene: A Vision for Protein Science Using a Petaflop Supercomputer" (Allen et al., ISJ 2001)

Interactions with Processors and I/O

Microarchitectural Effects

parallelism: ILP, MLP, TLP

"An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors" (Pai et al., ASPLOS 1996)


"Making Network Interfaces Less Peripheral" (Mukherjee et al., Computer 1998)


Quantum Computing

"A Practical Architecture for Reliable Quantum Computers" (Oskin et al., Computer 2002)


"Circuit and System Architecture for DNA-Guided Self-Assembly of Nanoelectronics" (Patwardhan et al., FNANO 2004)