Introduction to Parallel Computing: A practical guide with examples in C

Wesley Petersen and Peter Arbenz

Print publication date: 2004

Print ISBN-13: 9780198515760

Published to Oxford Scholarship Online: November 2020

DOI: 10.1093/oso/9780198515760.001.0001

4 Shared Memory Parallelism
Oxford University Press

Shared memory machines typically have relatively few processors, say 2–128. An intrinsic characteristic of these machines is a strategy for memory coherence and a fast, tightly coupled network for distributing data from a commonly accessible memory system. Our test examples were run on two HP Superdome clusters: Stardust is a production machine with 64 PA-8700 processors, and Pegasus is a 32-CPU machine with the same kind of processors. The HP9000 is grouped into cells, each with 4 CPUs and a common memory per cell, connected to a CCNUMA crossbar network. The network consists of sets of 4×4 crossbars and is shown in Figure 4.2.

An effective bandwidth test, the EFF_BW benchmark [116], groups processors into two equally sized sets. Arbitrary pairings are made between elements from each group (Figure 4.3), and the cross-sectional bandwidth of the network is measured for a fixed number of processors and varying message sizes. The results from the HP9000 machine Stardust are shown in Figure 4.4. It is clear from this figure that the cross-sectional bandwidth of the network is quite high. Although not apparent from Figure 4.4, the latency for this test (the intercept near Message Size = 0) is not high. Because of the low incremental resolution of MPI_Wtime, multiple test runs must be done to quantify the latency. Dr Byrde’s tests show that the minimum latency is ≳ 1.5 μs.

A clearer example of a shared memory architecture is the Cray X1 machine, shown in Figures 4.5 and 4.6. In Figure 4.6, the shared memory design is obvious. Each multi-streaming processor (MSP) shown in Figure 4.5 has 4 processors (custom-designed processor chips forged by IBM) and 4 corresponding caches. Although not clear from available diagrams, vector memory access apparently permits cache bypass; hence the term streaming in MSP. That is, vector registers are loaded directly from memory: see, for example, Figure 3.4. On each board (called a node) are 4 such MSPs and 16 memory modules, which share a common (coherent) memory view. Coherence is maintained only within each board, not across multi-board systems.
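
The reading of latency as the intercept near Message Size = 0 in the EFF_BW discussion can be illustrated with a small fit. The following is only a sketch under a stated assumption: that the time to move a message of n bytes is roughly linear, t(n) = latency + n/bandwidth. The benchmark itself is not reproduced here, and the names `link_fit` and `fit_latency_bandwidth` are hypothetical, introduced for illustration. Each timing sample passed in is assumed to be an average over many repetitions already, since a single short transfer may fall below the resolution of MPI_Wtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: parameters of t(n) = latency + n / bandwidth. */
struct link_fit {
    double latency_s;      /* intercept at message size 0, in seconds */
    double bandwidth_Bps;  /* reciprocal of the slope, in bytes per second */
};

/*
 * Ordinary least-squares line through the points (bytes[i], seconds[i]).
 * Assumption: each seconds[i] is already averaged over many repetitions.
 * Returns 0 on success, or -1 if fewer than two samples are given, all
 * message sizes are equal, or the fitted slope is not positive.
 */
int fit_latency_bandwidth(const double *bytes, const double *seconds,
                          size_t n, struct link_fit *out)
{
    assert(bytes != NULL && seconds != NULL && out != NULL);
    if (n < 2)
        return -1;

    /* Means of message size and of measured time. */
    double mean_x = 0.0, mean_y = 0.0;
    for (size_t i = 0; i < n; i++) {
        mean_x += bytes[i];
        mean_y += seconds[i];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    /* Centered sums used for the slope estimate. */
    double sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = bytes[i] - mean_x;
        sxx += dx * dx;
        sxy += dx * (seconds[i] - mean_y);
    }
    if (sxx == 0.0)
        return -1;              /* sizes do not vary: slope undefined */

    double slope = sxy / sxx;   /* seconds per byte */
    if (slope <= 0.0)
        return -1;              /* time did not grow with message size */

    out->latency_s = mean_y - slope * mean_x;
    out->bandwidth_Bps = 1.0 / slope;
    return 0;
}
```

A caller would fill `bytes` and `seconds` from its own measurements and read `latency_s` from the result. With real data the linear model holds only approximately, so the fitted intercept is an estimate rather than a measured minimum latency.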

Keywords: BLAS, LAPACK, LINPACK, OpenMP, libraries, pragma, pthreads, shared memory, speedup
