Using Parallel Architectures for Business Research
Parallel vs. Serial Code
- Performance
- Why Parallelization?
- Technological limitations on how fast a single process can run
- Growth in CPU clock speed has largely stalled
- Pentium 4 (2004): 3.8 GHz
- Core i7 (2011): 3.5 GHz (lower clock, but much faster overall!)
- Many physical limitations (power, heating, …)
- Progress in memory speed has slowed down more
- Progress in disk speed has slowed down more yet
- Solution: parallelization
- Parallel vs. Serial code
- What to Expect (Amdahl's Law)
- \(\text{SPEEDUP} = \frac{1}{\frac{\text{\%PARALLEL}}{\text{NUM\_CPU}}+\text{\%SEQUENTIAL}}\)
- 8 CPUs, 90\% parallel \(\Rightarrow\) speedup = 4.7
- 1000 CPUs, 90\% parallel \(\Rightarrow\) speedup = 9.9
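As a quick check of the figures above, here is a minimal Python sketch of Amdahl's Law. The function name `amdahl_speedup` and its arguments are illustrative and not taken from any library; the parallel share is passed as a fraction (0.9 for 90%). Note the ceiling this implies: as the number of CPUs grows, the speedup approaches 1/%SEQUENTIAL, which is 10 for a code that is 90% parallel.

```python
def amdahl_speedup(parallel_fraction: float, num_cpu: int) -> float:
    """Speedup predicted by Amdahl's Law.

    parallel_fraction: share of the run time that can be parallelized (0..1).
    num_cpu: number of CPUs the parallel part is spread over.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if num_cpu < 1:
        raise ValueError("num_cpu must be at least 1")
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / num_cpu + sequential_fraction)


if __name__ == "__main__":
    # Tabulate the speedup for a code that is 90% parallel.
    for n in (1, 8, 64, 1000):
        print(f"{n:5d} CPUs, 90% parallel: speedup = {amdahl_speedup(0.9, n):.1f}")
```

Rounded to one decimal, the 8-CPU and 1000-CPU rows reproduce the 4.7 and 9.9 quoted above.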
Facilities at Northwestern
- Quest: 7056 Intel cores, ~40GB RAM
- SSCC: 132 AMD cores, up to 64GB RAM
- all of the above + lots of statistical software!
- Multi-core PCs
- skew4: 8 cores + 64GB RAM
- skew5: 24 cores + 256GB RAM
- Your desktop!
- GPUs
What is inside a typical PC (node)?
- CPUs
- Cores
- Cache (L1, L2, L3, …)
- Control logic
- RAM
- I/O: disk, network
- GPU (graphics card)
Types of Parallelism
- By physical location
- Parallelism within a CPU core: vectorization
- Multiple cores in a CPU: shared memory
- Multiple CPUs in a node (PC): shared memory
- Multiple nodes in a cluster: distributed memory
- Data parallelism (aka domain decomposition, SPMD)
- Run the same analysis for different stocks/days
- Task parallelism
- Monte-Carlo: run the same simulation with different random sequences
- Solve the same model with different parameters
- Partially parallelizable:
- Solve a problem iteratively on a grid
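The Monte-Carlo case can be sketched in Python with the standard library's `concurrent.futures`. This is an illustration under assumptions, not a recommended production setup: the toy task (estimating pi), the function names, and the choice of one integer seed per task are all hypothetical, and serious work needs more care about the independence of the random streams.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def estimate_pi(seed: int, draws: int = 100_000) -> float:
    """One independent task: estimate pi from `draws` random points."""
    rng = random.Random(seed)  # private generator, so tasks share no state
    inside = 0
    for _ in range(draws):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / draws


def run_serial(seeds):
    """Run every task one after another in the current process."""
    return [estimate_pi(seed) for seed in seeds]


def run_parallel(seeds, workers: int = 4):
    """Run the same tasks in separate worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(estimate_pi, seeds))


if __name__ == "__main__":
    seeds = list(range(8))
    estimates = run_parallel(seeds)
    # Same seeds give the same numbers, whether run serially or in parallel.
    assert estimates == run_serial(seeds)
    print(sum(estimates) / len(estimates))
```

Because each task depends only on its own seed, the tasks can run in any order on any core, which is what makes this kind of problem easy to parallelize.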
Software for Computational Economics
- Statistical languages (SAS, Stata, R)
- Regressions, statistical analysis, statistical graphics…
- Matrix languages (Matlab, Ox, Gauss)
- Simulations, signal processing…
- Symbolic software (Maple, Mathematica)
- Computing closed-form solutions
- General-purpose interpreters (Python/numpy/scipy)
- General-purpose compilers (Fortran, C, C++)
- usually the fastest and most flexible
- Fortran 90 is very similar in syntax to Matlab!
- No "comprehensive" toolbox system as in Matlab
- Most libraries are not as well documented
- A few semi-automated code conversion tools exist
Different ways to implement parallelization
- Vectorization
- Usually already done for you; writing vectorized code helps.
- Automatic parallelization (Stata/MP, SAS, Matlab)
- Specify the parallel resources available, and the computer does
everything else for you
- Ideal case! But only if it works…
- Parallelized library functions (Gauss, Matlab, cuBLAS, ScaLAPACK)
- Likely to be highly optimized
- Might not be available, or not the most efficient for your problem
- Job-level parallelization
- Run multiple copies of the code on different cores or nodes
- Easy for data-parallel tasks (but I/O matters!)
- We have a set of functions to simplify this in Matlab
- Guided parallelization (OpenMP, MPI, Matlab Parallel Computing Toolbox, PGI directives)
- Explicitly tell software how to parallelize the code
- Can be as easy as adding a comment or replacing a keyword
- Gets tricky in complicated cases
- Low-level parallelization: do everything manually
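Job-level parallelization from the list above can be sketched as follows: start several independent copies of a program, each with its own job index, and collect the results when they finish. This is a minimal Python sketch using only the standard library. The worker here is a hypothetical one-line stand-in (it squares its job index); in practice it would be a batch call to Stata, SAS, Matlab, or a compiled program, and on a cluster the scheduler would do the launching.

```python
import subprocess
import sys

# Hypothetical stand-in for "your program": prints the square of the job
# index it receives on the command line.
WORKER_CODE = "import sys; i = int(sys.argv[1]); print(i * i)"


def launch_jobs(job_ids):
    """Start one independent OS process per job, then wait for all of them."""
    job_ids = list(job_ids)
    # Start everything first so the jobs run concurrently.
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_CODE, str(job_id)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for job_id in job_ids
    ]
    results = {}
    for job_id, proc in zip(job_ids, procs):
        out, _ = proc.communicate()  # wait for this job and read its output
        if proc.returncode != 0:
            raise RuntimeError(f"job {job_id} failed with code {proc.returncode}")
        results[job_id] = int(out.strip())
    return results


if __name__ == "__main__":
    print(launch_jobs(range(4)))
```

The jobs never talk to each other, so the only shared resource is I/O: if every copy reads the same large file at once, the disk can become the bottleneck.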
Package-specific details
- What is available varies a lot
- Lowest-level languages (C, Fortran) have the most options
- Higher-level languages sometimes offer easy facilities
- Combining languages is often optimal
- It is also very easy if the job can be split into parts
- All languages have easy functionality to:
- save a dataset in a text format
- execute a program in any other language
- load results from a file
- Sometimes you can directly read a file from another program
- If saving/loading is not an option, things get trickier…
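The save/exec/load pattern above can be sketched in Python as follows. This is a hedged illustration, not a prescribed workflow: the file names, the column names `x` and `x_squared`, and the "external program" (a short inline script that squares each value) are all hypothetical stand-ins for whatever package does the heavy lifting.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical external program: reads a CSV with column x, writes x squared.
EXTERNAL_CODE = """
import csv, sys
with open(sys.argv[1], newline="") as src, open(sys.argv[2], "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["x", "x_squared"])
    for row in csv.DictReader(src):
        x = float(row["x"])
        writer.writerow([x, x * x])
"""


def save_exec_load(values):
    """Round trip: save data as text, run another program, load its results."""
    with tempfile.TemporaryDirectory() as tmp:
        in_path = Path(tmp) / "input.csv"
        out_path = Path(tmp) / "output.csv"

        # 1. Save the dataset in a text format.
        with open(in_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["x"])
            writer.writerows([v] for v in values)

        # 2. Execute the other program and wait for it to finish.
        subprocess.run(
            [sys.executable, "-c", EXTERNAL_CODE, str(in_path), str(out_path)],
            check=True,
        )

        # 3. Load the results from the file it wrote.
        with open(out_path, newline="") as f:
            return [float(row["x_squared"]) for row in csv.DictReader(f)]


if __name__ == "__main__":
    print(save_exec_load([1, 2, 3]))
```

The cost of this approach is the file round trip, so it pays off when the external step is expensive and is called rarely rather than inside a tight loop.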
High-level languages (SAS, Stata, Matlab, etc.)
Stata/MP
- Most functions are parallelized
- You have to pay (a lot) more for parallel licenses!
- Max. 8 cores
- Combining with other languages:
- Directly calling C code (via plugins) is very easy
- But inefficient when called many times (e.g., nl/mle)
- save/exec/load is easy and works with any Stata
SAS
- Some procedures are parallelized across multiple cores
- sometimes the gain is small
- specify "options threads cpucount=actual;"
- Procedures: SORT, SUMMARY, MEANS, REPORT, TABULATE, SQL, GLM,
LOESS, REG, ROBUSTREG
- Procedures in other languages:
- possible but quite tricky
- save/exec/load is typically preferable
Matlab
- Requires the Matlab Parallel Computing Toolbox
- Available in SSCC, Quest, Kellogg desktop installations
- Limited to max. 8 cores
- can do more, but with a (very) expensive server
- Invoke "matlabpool open" to enable parallel processing
- Some operations (matrix multiplication, optimization routines) are
partially parallelized
- Implements several approaches to parallelization (parfor, codistributed arrays)
- But, somewhat inefficient and sometimes tricky
- Also has GPU functionality
Low-level languages (C, Fortran)
- Trivial: a compiler option that asks the compiler to try parallelizing loops automatically
- Usually fails: resulting parallel code can be slower than sequential
- Easy: OpenMP
- Tell the compiler which loops to parallelize
- Requires very minimal code modification
- Works well, but limited to a single machine
- Less easy: Parallelized libraries
- Replace matrix multiplications etc. in your code with library calls
- MKL (Intel PC), ACML (AMD PC, AMD GPU), ScaLAPACK (clusters), cuBLAS (Nvidia GPU), etc.
- May also help in sequential code
- Sometimes tricky to install or link
- gfortran has an option to auto-generate BLAS calls
- Intel MKL link line advisor:
http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/
- Trickier: MPI
- Scales to large clusters
- Relies on process communication
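The message-passing style that MPI relies on can be sketched in Python, using the standard library's `multiprocessing` as a stand-in. This is not the MPI API: the function names and the toy problem (a sum of squares) are hypothetical, and all processes here live on one machine. The structure is the point: scatter the data, let each process work on its own private copy, then gather the partial results through explicit messages.

```python
import multiprocessing as mp


def partial_sum(chunk):
    """Work done by one process on its own piece of the data."""
    return sum(x * x for x in chunk)


def split(data, parts):
    """Deal the data out round-robin into `parts` chunks."""
    return [data[i::parts] for i in range(parts)]


def worker(conn):
    """Receive a chunk as a message, compute, and send the result back."""
    chunk = conn.recv()
    conn.send(partial_sum(chunk))
    conn.close()


def distributed_sum_of_squares(data, num_workers=4):
    links = []
    # Scatter: one process and one two-way pipe per chunk.
    for chunk in split(data, num_workers):
        parent_end, child_end = mp.Pipe()
        proc = mp.Process(target=worker, args=(child_end,))
        proc.start()
        parent_end.send(chunk)
        links.append((parent_end, proc))
    # Gather: collect one partial result from each process.
    total = 0
    for parent_end, proc in links:
        total += parent_end.recv()
        proc.join()
    return total


if __name__ == "__main__":
    data = list(range(1000))
    print(distributed_sum_of_squares(data) == partial_sum(data))
```

Nothing is shared between the processes, so every piece of data a worker needs has to be sent to it. That communication cost is what makes MPI trickier than OpenMP, and also what lets it scale beyond a single machine.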
Online library directories and collections
GPU computing
- Designed for graphics processing in video games
- Has a lot of cores and very fast memory
- Each core is somewhat primitive
- Works best for applying the same function to a large array
- Speedups of 200x vs. CPU have been claimed (but treat such claims with care!)
- Major manufacturers: Nvidia (GeForce/Tesla) and AMD, formerly ATI (Radeon)
- Nvidia is far more popular for general-purpose computing
- Programming GPUs efficiently is very tricky!
- CUDA C or PGI Fortran
- A lot of different GPU models
- Code optimized for one can fail or work slowly on another
- High-level (easier) options:
- Matlab Parallel Computing Toolbox
- Matlab/Jacket
- C/Fortran: PGI directives
- Third-party packages and libraries (R, cuBLAS, etc.)
Parallelization Methods: Summary
| Method | Single machine | Cluster | GPU |
|---|---|---|---|
| Automatic parallelization | | | |
| - SAS, Matlab, Stata/MP | limited | - | - |
| - C, Fortran | works poorly | - | - |
| Parallel libraries | x | some | some |
| Job-level parallelization | x | x | - |
| Manual parallelization | | | |
| - Matlab | max. 8 cores | Matlab Server | x |
| - Matlab/PBS script | - | x | - |
| - Gauss | x | - | - |
| - R | packages | packages | packages |
| - Mathematica | x | - | plugin |
| - OpenMP (C, Fortran) | x | - | - |
| - PGI directives (C, Fortran) | - | - | x |
| - MPI (C, Fortran) | x | x | - |