Center for Adaptive Supercomputing - Multithreaded Architectures

Cray XMT Advanced Application Programming Course

This course is for advanced users only. The website also allows you to enroll in the beginner's course.

To register, go to the registration website and, under the Cray XMT Series section, choose "Cray XMT Advanced Application Programming" on February 5th-6th. The class is held in Columbia, Maryland.

XMTAP Course Description

  • Designed for people who are familiar with applications support or software development for Cray XMT series computer systems
  • Trains students on hardware and software architectures, programming models, and optimization of user codes that run on Cray XMT series systems
  • Lecture material is reinforced by lab exercises that use practical code examples
  • Students have opportunities to optimize or debug the codes

Course Outline

  • Introduction
    • Brief review of XMT's performance tools
    • Characteristics and limiting factors of the XMT architecture
  • Breadth-first search
    • Original version
      • Parallelization approach
      • Work scheduling approach
      • Measuring its performance with Apprentice, Dashboard and hardware counters
    • Modifying BFS for better performance
      • Loop structure based on number of available threads
    • Performance/scalability achieved
    • Pros/cons: performance vs. elegance
    • A recursive parallel implementation of BFS
      • Using loop futures
  • SSCA #2
    • Overview of the benchmark
    • Kernel 0, graph generation
      • Using a hash function to remove duplicates
      • Use of Snapshot/Restore to save time while performance tuning
    • Kernel 4, betweenness centrality
      • Fundamentals of the algorithm
      • Parallelizing multilevel loop nesting
      • Automatic and manual loop collapse
      • Monitoring the memory subsystem for saturation
      • Using explicit futures
      • Using loop futures
  • Inverse-mapping words to documents using a hash function
    • Tuning for scalability
    • Collecting statistics on the input data to diagnose performance
  • The Toy_better benchmark
    • Original implementation
    • Diagnosing performance using Apprentice 2
    • Work scheduling
    • Looking at compiler-generated assembly code using the debugger
  • PNNL's "Triadics" code
    • Original implementation's performance issues
    • Implementing with loop futures

