Complexity and parallelization

Dr. Alexander Fisher

Duke University

Computational complexity

Measuring efficiency

flops or floating point operations measure the efficiency of an algorithm. flops consist of the binary floating point operations: addition, subtraction, multiplication, division and comparison. Individual floating point operations are performed by a single “core” of your computer’s CPU or GPU.

We use “big O” \(\mathcal{O}(n)\) notation to denote the complexity of an algorithm. For example,

matrix-vector multiplication A %*% b, where \(A\) is \(m \times n\) and \(b\) is \(n \times 1\) takes \(2mn\) or \(\mathcal{O}(mn)\) flops.
matrix-matrix multiplication A %*% B, where \(A\) is \(m \times n\) and \(B\) is \(n \times p\) takes \(2mnp\) or \(\mathcal{O}(mnp)\) flops.

Notice that in reporting complexity of each example we drop the leading constant “2”.

A hierarchy of computational complexity (let \(n\) be the problem size):

exponential order: \(\mathcal{O}(b^n)\)	NP-hard (horrible)
polynomial order: \(\mathcal{O}(n^q)\)	doable
\(\mathcal{O}(n \log n)\)	fast
linear order: \(\mathcal{O}(n)\)	fast
log order: \(\mathcal{O}(\log n)\)	super fast

Note

Some references count multiplication followed by an addition (fused multiply-add, FMA) as one flop.

{This slide adapted from notes by Dr. Hua Zhou}

Measuring efficiency (example)

Suppose you wish to calculate a likelihood \(L(x|\theta)\) for \(n\) iid observations: \(x = \{x_i\}; i \in \{1, \ldots, n \}\). The likelihood looks like

\[ L(x|\theta) = \prod_i^n f(x_i | \theta) \] where \(f\) is some density function dependent on parameters \(\theta\).

\(L(x|\theta)\) has \(\mathcal{O}(n)\) complexity, i.e. scales linearly with the number of data points.

Benchmarking

`bench`

d = tibble(
  x = runif(10000),
  y=runif(10000)
)
(b = bench::mark(
  d[d$x > 0.5, ],
  d[which(d$x > 0.5), ],
  subset(d, x > 0.5),
  filter(d, x > 0.5)
))

# A tibble: 4 × 6
  expression                 min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 d[d$x > 0.5, ]        102.92µs 116.71µs     7998.  252.16KB     21.7
2 d[which(d$x > 0.5), ]  88.08µs     99µs     9353.   271.9KB     51.1
3 subset(d, x > 0.5)    145.17µs 160.38µs     5938.   288.2KB     35.0
4 filter(d, x > 0.5)      2.05ms   2.15ms      454.    2.01MB     12.7

`bench` - relative results

summary(b, relative = TRUE)

# A tibble: 4 × 6
  expression              min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>            <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
1 d[d$x > 0.5, ]         1.17   1.18      17.6      1        1.71
2 d[which(d$x > 0.5), ]  1      1         20.6      1.08     4.03
3 subset(d, x > 0.5)     1.65   1.62      13.1      1.14     2.76
4 filter(d, x > 0.5)    23.2   21.7        1        8.17     1

Parallelization

What is parallelization?

“parallelization” or “parallel computing” means deploying an algorithm’s calculations across several cores of a computer to perform computation at the same time

Terminology

CPU: central processing unit, primary component of a computer that processes instructions
Core: an individual processor within a CPU, more cores can improve performance and efficiency
Forking: a copy of the current R session is moved to new cores.
- Not available on Windows
- Less overhead and easy to implement
Sockets: a new R session is launched on each core.
- Available on all systems
- Each process on each core is unique

Package `parallel`

base R package
tools for the forking of R processes (some functions do not work on Windows)
Core functions:
- detectCores
- pvec
- mclapply
- mcparallel & mccollect

pvec

Parallelization of a vectorized function call. Forking takes time.

library(parallel)
system.time(pvec(1:1e7, sqrt, mc.cores = 1))

## user  system elapsed 
## 0.031   0.032   0.063 

system.time(pvec(1:1e7, sqrt, mc.cores = 4))

## user  system elapsed 
## 0.249   0.197   0.323 

system.time(pvec(1:1e7, sqrt, mc.cores = 8))

## user  system elapsed 
## 0.156   0.225   0.204

?proc.time for info
User CPU time: the CPU time spent by the current process, in our case, the R session
System CPU time: the CPU time spent by the OS on behalf of the current running process

Note that the wall time may be the less than the sum total (user + system) since parallelized processes accumulate user/system time at the same time.

pvec - `bench::system_time`

library(bench)
system_time(pvec(1:1e7, sqrt, mc.cores = 1))

## process    real 
## 83.5ms  83.2ms 

system_time(pvec(1:1e7, sqrt, mc.cores = 4))

## process    real 
## 266ms   312ms 

system_time(pvec(1:1e7, sqrt, mc.cores = 8))

## process    real 
## 249ms   262ms

`mclapply`

Parallelized lapply

system.time(rnorm(1e6))

## user  system elapsed 
## 0.047   0.004   0.051 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))

## user  system elapsed 
## 0.055   0.032   0.049 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

## user  system elapsed 
## 0.058   0.039   0.036

`mclapply`

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))

## user  system elapsed 
## 0.064   0.068   0.039 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))

## user  system elapsed 
## 0.068   0.084   0.046 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))

## user  system elapsed 
## 0.067   0.078   0.045

`mcparallel`

Asynchronously evaluation of an R expression in a separate process

m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6,1,1))
o = mcparallel(rgamma(1e6,1,1))

str(m)

List of 2
 $ pid: int 12106
 $ fd : int [1:2] 4 7
 - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"

str(n)

List of 2
 $ pid: int 12107
 $ fd : int [1:2] 5 9
 - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"

`mccollect`

Checks mcparallel objects for completion

str(mccollect(list(m,n,o)))

List of 3
 $ 12106: num [1:1000000] 1.62 0.904 -1.865 -2.384 -0.16 ...
 $ 12107: num [1:1000000] 0.224 0.241 0.733 0.532 0.129 ...
 $ 12108: num [1:1000000] 0.2199 0.0562 0.3705 0.998 6.7013 ...

mccollect - waiting

p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10) # will retrieve the result (since it's fast)

$`12109`
[1] 0.001003654

mccollect(p, wait = FALSE) # will signal the job as terminating

Warning in selectChildren(jobs, timeout): cannot wait for child 12109 as it does
not exist

NULL

doMC & foreach

Packages by Revolution Analytics that provides the foreach function which is a parallelizable for loop.

Package doMC is a parallel backend for the foreach package - a package that allows you to execute for loops in parallel.

library(doMC)

Loading required package: foreach


Attaching package: 'foreach'

The following objects are masked from 'package:purrr':

    accumulate, when

Loading required package: iterators

Core functions:

doMC::registerDoMC sets the number of cores for the parallel backend to be used with foreach
foreach, %dopar%, %do%

doMC serves as an interface between foreach and parallel Since parallel only works with systems that support forking, these functions will not work properly on Windows.

Set workers

To get started, set the number of cores with registerDoMC().

# check cores set up
getDoParWorkers()

[1] 1

# set 4 cores
registerDoMC(4)
getDoParWorkers()

[1] 4

Serial and parallel with foreach()

Sequential

%do% single execution

foreach(i = 1:4) %do% 
  sort(runif(n = 1e7, max = i))[1]

[[1]]
[1] 1.653098e-08

[[2]]
[1] 1.634471e-07

[[3]]
[1] 2.214219e-07

[[4]]
[1] 2.998859e-07

times(2) %do%
  sort(runif(n = 1e7))[1]

[1] 1.515727e-07 8.847564e-09

Parallel

%dopar% multicore execution

foreach(i = 1:4) %dopar%
  sort(runif(n = 1e7, max = i))[1]

[[1]]
[1] 1.001172e-08

[[2]]
[1] 1.653098e-07

[[3]]
[1] 5.587935e-09

[[4]]
[1] 1.529232e-06

times(2) %dopar%
  sort(runif(n = 1e7))[1]

[1] 2.444722e-08 1.024455e-08

Time comparison

Sequential

system.time({
foreach(i = 1:4) %do% 
  sort(runif(n = 1e7, max = i))[1]

times(2) %do%
  sort(runif(n = 1e7))[1]
})

##  user  system elapsed 
## 3.371   0.143   3.507

Parallel

system.time({
foreach(i = 1:4) %dopar% 
  sort(runif(n = 1e7, max = i))[1]

times(2) %dopar%
  sort(runif(n = 1e7))[1]
})

##  user  system elapsed 
## 2.917   0.603   1.376

Things to know about `foreach`

foreach can iterate across more than one value, but it doesn’t do length coercion.

Note: foreach is iterating over both simultaneously. This is not a nested for loop.

foreach(i = 1:3, j = 1:3) %do% {
  sqrt(i^2+j^2)   
}

[[1]]
[1] 1.414214

[[2]]
[1] 2.828427

[[3]]
[1] 4.242641

Note: foreach coerces the longer vector to be shorter!

foreach(i = 1:5, j = 1:2) %do% {
  sqrt(i^2+j^2)   
}

[[1]]
[1] 1.414214

[[2]]
[1] 2.828427

`foreach` bookkeeping

foreach does some bookkeeping for you and returns a list by default. Compare this to the traditional for loop that does no bookkeeping.
you can easily customize the bookkeeping.

foreach(i = 1:5, .combine='c') %do% {
  sqrt(i)
}

[1] 1.000000 1.414214 1.732051 2.000000 2.236068

foreach(i = 1:5, .combine='cbind') %do% {
  sqrt(i)
}

     result.1 result.2 result.3 result.4 result.5
[1,]        1 1.414214 1.732051        2 2.236068

foreach(i = 1:5, .combine='+') %do% {
  sqrt(i)
}

[1] 8.382332

nested `foreach`

The %:% operator is the nesting operator, used for creating nested foreach loops.

foreach(i = 1:4, .combine = "c") %:% 
  foreach(j = 0:1, .combine = "c") %dopar% 
    {i ^ j}

[1] 1 1 1 2 1 3 1 4

foreach(i = 1:4, .combine = "data.frame") %:% 
  foreach(j = 0:1, .combine = "c") %dopar% 
    {i ^ j}

  result.1 result.2 result.3 result.4
1        1        1        1        1
2        1        2        3        4

foreach(i = 1:4, .combine = "c") %:% 
  foreach(j = 0:1, .combine = "+") %dopar% 
    {i ^ j}

[1] 2 3 4 5

furrr

`future`

A “future” is an abstraction for a value that may be available at some point in the future.

The purpose of the future package is to provide a very simple and uniform way of evaluating R expressions asynchronously using various resources available to the user.

See the CRAN future documentation for further reading.

furrr and purrr

furrr functions are just like purrr functions but begin with future_

Example

library(furrr)
library(tidyverse)

map_dbl(mtcars, mean)

       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500

future_map_dbl(mtcars, mean)

       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500

plan(multisession, workers = 8)
future_map_dbl(mtcars, mean)

       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500

Not sure we are running in parallel?

system.time({map_dbl(mtcars, ~ {Sys.sleep(.1); mean(.x)})})

   user  system elapsed 
  0.004   0.000   1.138

system.time({future_map_dbl(mtcars, ~ {Sys.sleep(.1); mean(.x)})})

   user  system elapsed 
  0.043   0.002   0.258

plan(sequential)

Example

How could you parallelize the text mining of lab 4?
Demo

Complexity and parallelization

Computational complexity

Measuring efficiency

Measuring efficiency (example)

Benchmarking

bench

bench - relative results

Parallelization

What is parallelization?

Terminology

Package parallel

pvec

pvec - bench::system_time

mclapply

mclapply

mcparallel

mccollect

mccollect - waiting

doMC & foreach

doMC & foreach

Set workers

Serial and parallel with foreach()

Sequential

Parallel

Time comparison

Sequential

Parallel

Things to know about foreach

foreach bookkeeping

nested foreach

furrr

future

furrr and purrr

Example

Example

`bench`

`bench` - relative results

Package `parallel`

pvec - `bench::system_time`

`mclapply`

`mclapply`

`mcparallel`

`mccollect`

Things to know about `foreach`

`foreach` bookkeeping

nested `foreach`

`future`