Complexity and parallelization

Dr. Alexander Fisher

Duke University

Computational complexity

Measuring efficiency

flops or floating point operations measure the efficiency of an algorithm. flops consist of the binary floating point operations: addition, subtraction, multiplication, division and comparison. Individual floating point operations are performed by a single “core” of your computer’s CPU or GPU.

We use “big O” \(\mathcal{O}(n)\) notation to denote the complexity of an algorithm. For example,

  • matrix-vector multiplication A %*% b, where \(A\) is \(m \times n\) and \(b\) is \(n \times 1\) takes \(2mn\) or \(\mathcal{O}(mn)\) flops.

  • matrix-matrix multiplication A %*% B, where \(A\) is \(m \times n\) and \(B\) is \(n \times p\) takes \(2mnp\) or \(\mathcal{O}(mnp)\) flops.

Notice that in reporting complexity of each example we drop the leading constant “2”.

A hierarchy of computational complexity (let \(n\) be the problem size):

exponential order: \(\mathcal{O}(b^n)\) NP-hard (horrible)
polynomial order: \(\mathcal{O}(n^q)\) doable
\(\mathcal{O}(n \log n)\) fast
linear order: \(\mathcal{O}(n)\) fast
log order: \(\mathcal{O}(\log n)\) super fast

Note

Some references count multiplication followed by an addition (fused multiply-add, FMA) as one flop.

{This slide adapted from notes by Dr. Hua Zhou}

Measuring efficiency (example)

Suppose you wish to calculate a likelihood \(L(x|\theta)\) for \(n\) iid observations: \(x = \{x_i\}; i \in \{1, \ldots, n \}\). The likelihood looks like

\[ L(x|\theta) = \prod_i^n f(x_i | \theta) \] where \(f\) is some density function dependent on parameters \(\theta\).

\(L(x|\theta)\) has \(\mathcal{O}(n)\) complexity, i.e. scales linearly with the number of data points.

Benchmarking

bench

d = tibble(
  x = runif(10000),
  y=runif(10000)
)
(b = bench::mark(
  d[d$x > 0.5, ],
  d[which(d$x > 0.5), ],
  subset(d, x > 0.5),
  filter(d, x > 0.5)
))
# A tibble: 4 × 6
  expression                 min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 d[d$x > 0.5, ]        102.92µs 116.71µs     7998.  252.16KB     21.7
2 d[which(d$x > 0.5), ]  88.08µs     99µs     9353.   271.9KB     51.1
3 subset(d, x > 0.5)    145.17µs 160.38µs     5938.   288.2KB     35.0
4 filter(d, x > 0.5)      2.05ms   2.15ms      454.    2.01MB     12.7

bench - relative results

summary(b, relative = TRUE)
# A tibble: 4 × 6
  expression              min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>            <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
1 d[d$x > 0.5, ]         1.17   1.18      17.6      1        1.71
2 d[which(d$x > 0.5), ]  1      1         20.6      1.08     4.03
3 subset(d, x > 0.5)     1.65   1.62      13.1      1.14     2.76
4 filter(d, x > 0.5)    23.2   21.7        1        8.17     1   

Parallelization

What is parallelization?

  • “parallelization” or “parallel computing” means deploying an algorithm’s calculations across several cores of a computer to perform computation at the same time

Terminology

  • CPU: central processing unit, primary component of a computer that processes instructions

  • Core: an individual processor within a CPU, more cores can improve performance and efficiency

  • Forking: a copy of the current R session is moved to new cores.

    • Not available on Windows
    • Less overhead and easy to implement
  • Sockets: a new R session is launched on each core.

    • Available on all systems
    • Each process on each core is unique

Package parallel

  • base R package

  • tools for the forking of R processes (some functions do not work on Windows)

  • Core functions:

    • detectCores
    • pvec
    • mclapply
    • mcparallel & mccollect

pvec

Parallelization of a vectorized function call. Forking takes time.

library(parallel)
system.time(pvec(1:1e7, sqrt, mc.cores = 1))

## user  system elapsed 
## 0.031   0.032   0.063 

system.time(pvec(1:1e7, sqrt, mc.cores = 4))

## user  system elapsed 
## 0.249   0.197   0.323 

system.time(pvec(1:1e7, sqrt, mc.cores = 8))

## user  system elapsed 
## 0.156   0.225   0.204 
  • ?proc.time for info

  • User CPU time: the CPU time spent by the current process, in our case, the R session

  • System CPU time: the CPU time spent by the OS on behalf of the current running process

Note that the wall time may be the less than the sum total (user + system) since parallelized processes accumulate user/system time at the same time.

pvec - bench::system_time

library(bench)
system_time(pvec(1:1e7, sqrt, mc.cores = 1))

## process    real 
## 83.5ms  83.2ms 

system_time(pvec(1:1e7, sqrt, mc.cores = 4))

## process    real 
## 266ms   312ms 

system_time(pvec(1:1e7, sqrt, mc.cores = 8))

## process    real 
## 249ms   262ms

mclapply

  • Parallelized lapply
system.time(rnorm(1e6))

## user  system elapsed 
## 0.047   0.004   0.051 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))

## user  system elapsed 
## 0.055   0.032   0.049 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))

## user  system elapsed 
## 0.058   0.039   0.036 

mclapply

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))

## user  system elapsed 
## 0.064   0.068   0.039 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))

## user  system elapsed 
## 0.068   0.084   0.046 

system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))

## user  system elapsed 
## 0.067   0.078   0.045 

mcparallel

Asynchronously evaluation of an R expression in a separate process

m = mcparallel(rnorm(1e6))
n = mcparallel(rbeta(1e6,1,1))
o = mcparallel(rgamma(1e6,1,1))

str(m)
List of 2
 $ pid: int 12106
 $ fd : int [1:2] 4 7
 - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
str(n)
List of 2
 $ pid: int 12107
 $ fd : int [1:2] 5 9
 - attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"

mccollect

Checks mcparallel objects for completion

str(mccollect(list(m,n,o)))
List of 3
 $ 12106: num [1:1000000] 1.62 0.904 -1.865 -2.384 -0.16 ...
 $ 12107: num [1:1000000] 0.224 0.241 0.733 0.532 0.129 ...
 $ 12108: num [1:1000000] 0.2199 0.0562 0.3705 0.998 6.7013 ...

mccollect - waiting

p = mcparallel(mean(rnorm(1e5)))
mccollect(p, wait = FALSE, 10) # will retrieve the result (since it's fast)
$`12109`
[1] 0.001003654
mccollect(p, wait = FALSE) # will signal the job as terminating
Warning in selectChildren(jobs, timeout): cannot wait for child 12109 as it does
not exist
NULL

doMC & foreach

doMC & foreach

Packages by Revolution Analytics that provides the foreach function which is a parallelizable for loop.

Package doMC is a parallel backend for the foreach package - a package that allows you to execute for loops in parallel.

library(doMC)
Loading required package: foreach

Attaching package: 'foreach'
The following objects are masked from 'package:purrr':

    accumulate, when
Loading required package: iterators

Core functions:

  • doMC::registerDoMC sets the number of cores for the parallel backend to be used with foreach

  • foreach, %dopar%, %do%

doMC serves as an interface between foreach and parallel Since parallel only works with systems that support forking, these functions will not work properly on Windows.

Set workers

To get started, set the number of cores with registerDoMC().

# check cores set up
getDoParWorkers()
[1] 1
# set 4 cores
registerDoMC(4)
getDoParWorkers()
[1] 4

Serial and parallel with foreach()

Sequential

  • %do% single execution
foreach(i = 1:4) %do% 
  sort(runif(n = 1e7, max = i))[1]
[[1]]
[1] 1.653098e-08

[[2]]
[1] 1.634471e-07

[[3]]
[1] 2.214219e-07

[[4]]
[1] 2.998859e-07
times(2) %do%
  sort(runif(n = 1e7))[1]
[1] 1.515727e-07 8.847564e-09

Parallel

  • %dopar% multicore execution
foreach(i = 1:4) %dopar%
  sort(runif(n = 1e7, max = i))[1]
[[1]]
[1] 1.001172e-08

[[2]]
[1] 1.653098e-07

[[3]]
[1] 5.587935e-09

[[4]]
[1] 1.529232e-06
times(2) %dopar%
  sort(runif(n = 1e7))[1]
[1] 2.444722e-08 1.024455e-08

Time comparison

Sequential

system.time({
foreach(i = 1:4) %do% 
  sort(runif(n = 1e7, max = i))[1]

times(2) %do%
  sort(runif(n = 1e7))[1]
})

##  user  system elapsed 
## 3.371   0.143   3.507 

Parallel

system.time({
foreach(i = 1:4) %dopar% 
  sort(runif(n = 1e7, max = i))[1]

times(2) %dopar%
  sort(runif(n = 1e7))[1]
})

##  user  system elapsed 
## 2.917   0.603   1.376 

Things to know about foreach

foreach can iterate across more than one value, but it doesn’t do length coercion.

Note: foreach is iterating over both simultaneously. This is not a nested for loop.

foreach(i = 1:3, j = 1:3) %do% {
  sqrt(i^2+j^2)   
}
[[1]]
[1] 1.414214

[[2]]
[1] 2.828427

[[3]]
[1] 4.242641

Note: foreach coerces the longer vector to be shorter!

foreach(i = 1:5, j = 1:2) %do% {
  sqrt(i^2+j^2)   
}
[[1]]
[1] 1.414214

[[2]]
[1] 2.828427

foreach bookkeeping

  • foreach does some bookkeeping for you and returns a list by default. Compare this to the traditional for loop that does no bookkeeping.

  • you can easily customize the bookkeeping.

foreach(i = 1:5, .combine='c') %do% {
  sqrt(i)
}
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
foreach(i = 1:5, .combine='cbind') %do% {
  sqrt(i)
}
     result.1 result.2 result.3 result.4 result.5
[1,]        1 1.414214 1.732051        2 2.236068
foreach(i = 1:5, .combine='+') %do% {
  sqrt(i)
}
[1] 8.382332

nested foreach

  • The %:% operator is the nesting operator, used for creating nested foreach loops.
foreach(i = 1:4, .combine = "c") %:% 
  foreach(j = 0:1, .combine = "c") %dopar% 
    {i ^ j}
[1] 1 1 1 2 1 3 1 4
foreach(i = 1:4, .combine = "data.frame") %:% 
  foreach(j = 0:1, .combine = "c") %dopar% 
    {i ^ j}
  result.1 result.2 result.3 result.4
1        1        1        1        1
2        1        2        3        4
foreach(i = 1:4, .combine = "c") %:% 
  foreach(j = 0:1, .combine = "+") %dopar% 
    {i ^ j}
[1] 2 3 4 5

furrr

future

A “future” is an abstraction for a value that may be available at some point in the future.

The purpose of the future package is to provide a very simple and uniform way of evaluating R expressions asynchronously using various resources available to the user.

furrr and purrr

  • furrr functions are just like purrr functions but begin with future_

Example

library(furrr)
library(tidyverse)
map_dbl(mtcars, mean)
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 
future_map_dbl(mtcars, mean)
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

plan(multisession, workers = 8)
future_map_dbl(mtcars, mean)
       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500 

Not sure we are running in parallel?

system.time({map_dbl(mtcars, ~ {Sys.sleep(.1); mean(.x)})})
   user  system elapsed 
  0.004   0.000   1.138 
system.time({future_map_dbl(mtcars, ~ {Sys.sleep(.1); mean(.x)})})
   user  system elapsed 
  0.043   0.002   0.258 
plan(sequential)

Example

  • How could you parallelize the text mining of lab 4?

  • Demo