Dr. Alexander Fisher
Duke University
flops, or floating point operations, measure the computational cost of an algorithm. flops consist of the binary floating point operations: addition, subtraction, multiplication, division, and comparison. Individual floating point operations are performed by a single "core" of your computer's CPU or GPU.
We use "big O" \(\mathcal{O}(n)\) notation to denote the complexity of an algorithm. For example,

- matrix-vector multiplication `A %*% b`, where \(A\) is \(m \times n\) and \(b\) is \(n \times 1\), takes \(2mn\) or \(\mathcal{O}(mn)\) flops;
- matrix-matrix multiplication `A %*% B`, where \(A\) is \(m \times n\) and \(B\) is \(n \times p\), takes \(2mnp\) or \(\mathcal{O}(mnp)\) flops.
Notice that in reporting the complexity of each example we drop the leading constant "2".
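To see the scaling in practice, here is a rough timing sketch (the matrix sizes are arbitrary choices, not from the original slides):

```r
# Square matrix products: doubling the dimension multiplies
# the flop count 2*m*n*p by 8 (m, n, and p each double)
A <- matrix(rnorm(500^2), 500, 500)
B <- matrix(rnorm(500^2), 500, 500)
system.time(A %*% B)

A2 <- matrix(rnorm(1000^2), 1000, 1000)
B2 <- matrix(rnorm(1000^2), 1000, 1000)
system.time(A2 %*% B2)  # roughly 8x the work of the 500 x 500 case
```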
A hierarchy of computational complexity (let \(n\) be the problem size):

| order       | complexity                 | assessment         |
|-------------|----------------------------|--------------------|
| exponential | \(\mathcal{O}(b^n)\)       | NP-hard (horrible) |
| polynomial  | \(\mathcal{O}(n^q)\)       | doable             |
|             | \(\mathcal{O}(n \log n)\)  | fast               |
| linear      | \(\mathcal{O}(n)\)         | fast               |
| log         | \(\mathcal{O}(\log n)\)    | super fast         |
(This slide adapted from notes by Dr. Hua Zhou.)
Suppose you wish to calculate a likelihood \(L(x|\theta)\) for \(n\) iid observations: \(x = \{x_i\}; i \in \{1, \ldots, n \}\). The likelihood looks like
\[ L(x|\theta) = \prod_{i=1}^{n} f(x_i | \theta) \] where \(f\) is some density function dependent on parameters \(\theta\).
\(L(x|\theta)\) has \(\mathcal{O}(n)\) complexity, i.e. scales linearly with the number of data points.
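A minimal sketch of the linear scaling, assuming a standard normal density (the distribution is chosen purely for illustration):

```r
# Log-likelihood of n iid observations: one density evaluation per point,
# so the cost is O(n)
loglik <- function(x, mu = 0, sigma = 1) {
  sum(dnorm(x, mu, sigma, log = TRUE))
}

x <- rnorm(1e6)
system.time(loglik(x))        # n = 1e6
system.time(loglik(c(x, x)))  # n = 2e6, roughly twice the time
```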
bench
library(dplyr)

d <- tibble(
  x = runif(10000),
  y = runif(10000)
)
(b = bench::mark(
d[d$x > 0.5, ],
d[which(d$x > 0.5), ],
subset(d, x > 0.5),
filter(d, x > 0.5)
))
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 d[d$x > 0.5, ] 102.92µs 116.71µs 7998. 252.16KB 21.7
2 d[which(d$x > 0.5), ] 88.08µs 99µs 9353. 271.9KB 51.1
3 subset(d, x > 0.5) 145.17µs 160.38µs 5938. 288.2KB 35.0
4 filter(d, x > 0.5) 2.05ms 2.15ms 454. 2.01MB 12.7
bench - relative results
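Timings relative to the fastest expression come from calling `summary()` on the bench_mark object:

```r
summary(b, relative = TRUE)
```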
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 d[d$x > 0.5, ] 1.17 1.18 17.6 1 1.71
2 d[which(d$x > 0.5), ] 1 1 20.6 1.08 4.03
3 subset(d, x > 0.5) 1.65 1.62 13.1 1.14 2.76
4 filter(d, x > 0.5) 23.2 21.7 1 8.17 1
CPU: central processing unit, primary component of a computer that processes instructions
Core: an individual processor within a CPU; more cores can improve performance and efficiency
Forking: a copy of the current R session is moved to new cores.
Sockets: a new R session is launched on each core.
parallel

- base R package
- tools for the forking of R processes (some functions do not work on Windows)
- Core functions: detectCores, pvec, mclapply, mcparallel & mccollect
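For example, detectCores reports how many cores are available:

```r
library(parallel)
detectCores()  # number of cores R can see on this machine
```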
pvec: parallelization of a vectorized function call. Keep in mind that forking takes time, so it only pays off when the computation itself is expensive.
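A minimal pvec sketch (the vector size and core count are assumptions):

```r
library(parallel)
x <- runif(1e7)
system.time(sqrt(x))                      # vectorized, single process
system.time(pvec(x, sqrt, mc.cores = 4))  # split across 4 forked processes
```

For a cheap function like sqrt, the forking overhead can easily exceed the savings.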
See ?proc.time for more info.
User CPU time: the CPU time spent by the current process, in our case, the R session
System CPU time: the CPU time spent by the OS on behalf of the current running process
Note that the wall (elapsed) time may be less than the total (user + system) time, since parallelized processes accumulate user/system time simultaneously.
bench::system_time is an alternative to system.time from the bench package.

mclapply - a parallelized version of lapply
system.time(rnorm(1e6))
## user system elapsed
## 0.047 0.004 0.051
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 2)))
## user system elapsed
## 0.055 0.032 0.049
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 4)))
## user system elapsed
## 0.058 0.039 0.036
mclapply
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 8)))
## user system elapsed
## 0.064 0.068 0.039
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 10)))
## user system elapsed
## 0.068 0.084 0.046
system.time(unlist(mclapply(1:10, function(x) rnorm(1e5), mc.cores = 12)))
## user system elapsed
## 0.067 0.078 0.045

Note the diminishing returns: past a handful of cores the elapsed time stops improving, while user/system time grows with the forking overhead.
mcparallel

Asynchronous evaluation of an R expression in a separate process.
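The job objects below were presumably created along these lines (a sketch; the expressions and variable names are assumptions):

```r
library(parallel)
m1 <- mcparallel(rnorm(1e6))  # returns immediately; rnorm runs in a forked child
m2 <- mcparallel(rnorm(1e6))
str(m1)
str(m2)
```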
List of 2
$ pid: int 12106
$ fd : int [1:2] 4 7
- attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
List of 2
$ pid: int 12107
$ fd : int [1:2] 5 9
- attr(*, "class")= chr [1:3] "parallelJob" "childProcess" "process"
mccollect

mccollect checks mcparallel objects for completion and collects their results.
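Continuing the sketch above (the third job is an assumption, added to match the three results below):

```r
m3 <- mcparallel(rnorm(1e6))
str(mccollect(list(m1, m2, m3)))  # blocks until all three jobs finish
```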
List of 3
$ 12106: num [1:1000000] 1.62 0.904 -1.865 -2.384 -0.16 ...
$ 12107: num [1:1000000] 0.224 0.241 0.733 0.532 0.129 ...
$ 12108: num [1:1000000] 0.2199 0.0562 0.3705 0.998 6.7013 ...
Packages by Revolution Analytics that provide the foreach function, a parallelizable for loop.

Package doMC is a parallel backend for the foreach package, allowing you to execute for loops in parallel.
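The startup messages below come from attaching the package:

```r
library(doMC)
```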
Loading required package: foreach
Attaching package: 'foreach'
The following objects are masked from 'package:purrr':
accumulate, when
Loading required package: iterators
Core functions:

- doMC::registerDoMC - sets the number of cores for the parallel backend to be used with foreach
- foreach, %dopar%, %do%
doMC serves as an interface between foreach and parallel. Since parallel only works with systems that support forking, these functions will not work properly on Windows.

To get started, set the number of cores with registerDoMC().
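A minimal setup sketch (the core count is an assumption):

```r
library(doMC)
registerDoMC(cores = 4)  # register a 4-core backend
getDoParWorkers()        # foreach reports the registered worker count
```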
%do% - single execution

foreach

foreach can iterate across more than one value, but it doesn't do length coercion.
Note: foreach is iterating over both simultaneously. This is not a nested for loop.
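A sketch illustrating both points (the index values are assumptions):

```r
library(foreach)
foreach(i = 1:3, j = c(10, 20, 30)) %do% (i + j)
# list(11, 22, 33): i and j advance together, not as a nested loop
foreach(i = 1:5, j = 1:2) %do% (i + j)
# stops after 2 iterations: no length coercion
```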
foreach - bookkeeping

foreach does some bookkeeping for you and returns a list by default. Compare this to the traditional for loop, which does no bookkeeping.
You can easily customize the bookkeeping, e.g. with the .combine argument.
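A short sketch of .combine (the expressions are illustrative assumptions):

```r
library(foreach)
foreach(i = 1:4, .combine = c) %do% i^2    # numeric vector: 1 4 9 16
foreach(i = 1:4, .combine = "+") %do% i^2  # results reduced with +: 30
```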
foreach - nesting

The %:% operator is the nesting operator, used for creating nested foreach loops.
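The output residue on the original slide appears to list the pairs (1, 1) through (1, 4); a sketch that produces output of that shape (the exact indices are an assumption):

```r
library(foreach)
foreach(i = 1:2) %:% foreach(j = 1:4) %do% c(i, j)
# a nested list: one inner list per value of i,
# each element holding the pair c(i, j)
```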
future
A “future” is an abstraction for a value that may be available at some point in the future.
The purpose of the future package is to provide a very simple and uniform way of evaluating R expressions asynchronously using various resources available to the user.
See the future documentation for further reading.

furrr

furrr functions are just like purrr functions but begin with future_.
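The column means of mtcars below were presumably produced by mapping mean over the columns; a minimal sketch (the plan and worker count are assumptions):

```r
library(purrr)
library(future)
library(furrr)
plan(multisession, workers = 4)  # assumption: 4 workers

map_dbl(mtcars, mean)         # sequential purrr
future_map_dbl(mtcars, mean)  # parallel furrr equivalent, same result
```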
mpg cyl disp hp drat wt qsec
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
vs am gear carb
0.437500 0.406250 3.687500 2.812500
Not sure we are running in parallel?
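One quick check (a sketch, not the original demo): compare process IDs across workers.

```r
library(future)
library(furrr)
plan(multisession, workers = 2)  # assumption: 2 workers

# More than one PID means the work really ran in separate processes
unique(future_map_chr(1:4, ~ as.character(Sys.getpid())))
```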
How could you parallelize the text mining of lab 4?
Demo