NAs, lists, NULL and attributes

Dr. Alexander Fisher

Duke University

January 20, 2023

NAs (missing values)

Missing values

R uses NA to represent missing values in its data structures. NA is a logical type. What may not be obvious is that NA may be treated as a different type thanks to coercion.

typeof(NA)
[1] "logical"
typeof(NA + 1)
[1] "double"
typeof(NA + 1L)
[1] "integer"
typeof(c(NA, ""))
[1] "character"
typeof(NA_character_)
[1] "character"
typeof(NA_real_)
[1] "double"
typeof(NA_integer_)
[1] "integer"
typeof(NA_complex_)
[1] "complex"
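
As a small sketch of the coercion described above (the vectors are made up for illustration), NA takes on the type of whatever it is combined with, while a typed constant such as NA_character_ fixes the type on its own:

```r
# NA adopts the type of the elements it is combined with
typeof(c(1L, NA))     # "integer"
typeof(c(1.5, NA))    # "double"
typeof(c("a", NA))    # "character"
typeof(c(TRUE, NA))   # "logical"

# A typed NA determines the type even when nothing else does
typeof(c(NA, NA_character_))  # "character"
```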


NA stickiness

Because NAs represent missing values it makes sense that any calculation using them should also be missing.

1 + NA
[1] NA
1 / NA
[1] NA
NA * 5
[1] NA
sqrt(NA)
[1] NA
3 ^ NA
[1] NA
sum(c(1, 2, 3, NA))
[1] NA

Summarizing functions (e.g. sum(), mean(), sd()) often have an na.rm argument that drops missing values before the summary is computed.

mean(c(1, 2, NA), na.rm = TRUE)
[1] 1.5
sum(c(1, 2, NA), na.rm = TRUE)
[1] 3

NAs are not always sticky

A useful mental model for NAs is to consider them as an unknown value that could take any of the possible values for that type.

For numbers or characters this isn’t very helpful, but for a logical value we know that the value must either be TRUE or FALSE and we can use that when deciding what value to return.

If the value of NA affects the logical outcome, it is indeterminate and the operation will return NA. If the value of NA does not affect the logical outcome, the operation will return the outcome.

TRUE & NA
[1] NA
FALSE & NA
[1] FALSE
TRUE | NA
[1] TRUE
FALSE | NA
[1] NA
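
The mental model above can be sketched in code. The helper resolve_na() below is hypothetical (it is not part of base R): it tries both values the NA could stand for and returns NA only when the two outcomes disagree.

```r
# Hypothetical helper: evaluate f(a, NA) by trying both values
# that the unknown logical could take
resolve_na = function(f, a) {
  results = c(f(a, TRUE), f(a, FALSE))
  if (results[1] == results[2]) results[1] else NA
}

resolve_na(`&`, TRUE)   # NA: the outcome depends on the unknown value
resolve_na(`&`, FALSE)  # FALSE: the outcome is the same either way
resolve_na(`|`, TRUE)   # TRUE
resolve_na(`|`, FALSE)  # NA

# The same reasoning applies to any() and all()
any(c(TRUE, NA))   # TRUE
all(c(FALSE, NA))  # FALSE
all(c(TRUE, NA))   # NA
```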

Testing for NA

Because NA could take any value, the result of, for example, 2 != NA or 1 == NA is inconclusive and returns NA.

Examples

2 != NA
[1] NA
1 == NA
[1] NA

Who’s to say two missing values are equal?

NA == NA
[1] NA

We should instead test for missingness with is.na():

!is.na(2)
[1] TRUE
is.na(NA)
[1] TRUE
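
A short sketch of how this plays out on a vector (the data are made up): comparing with == never identifies the missing entries, while is.na() can be used to count or drop them.

```r
x = c(3, NA, 7, NA, 1)   # made-up data

x == NA          # NA NA NA NA NA: never useful as a test
is.na(x)         # FALSE TRUE FALSE TRUE FALSE
sum(is.na(x))    # 2 missing values
x[!is.na(x)]     # 3 7 1

# Dropping NAs by hand agrees with na.rm = TRUE
mean(x[!is.na(x)]) == mean(x, na.rm = TRUE)  # TRUE
```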

Other Special values (double)

These are defined as part of the IEEE floating point standard (not unique to R)

  • NaN - Not a number
  • Inf - Positive infinity
  • -Inf - Negative infinity

pi / 0
[1] Inf
0 / 0
[1] NaN
1/0 + 1/0
[1] Inf
1/0 - 1/0
[1] NaN
NaN / NA
[1] NaN
Inf - Inf
[1] NaN

Note

IEEE (Institute of Electrical and Electronics Engineers) develops global standards for a broad range of industries. IEEE 754 is its standard for floating-point arithmetic.

Testing for Inf and NaN

is.finite(Inf)
[1] FALSE
is.infinite(-Inf)
[1] TRUE
is.nan(Inf)
[1] FALSE
Inf > 1
[1] TRUE
-Inf > 1
[1] FALSE
is.finite(NaN)
[1] FALSE
is.infinite(NaN)
[1] FALSE
is.nan(NaN)
[1] TRUE
is.finite(NA)
[1] FALSE
is.nan(NA)
[1] FALSE
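
The tests above can be compared side by side on one vector (made up for illustration). One point not shown on the slide: is.na() returns TRUE for NaN as well as NA, whereas is.nan() is TRUE only for NaN.

```r
x = c(1, NA, NaN, Inf, -Inf)   # made-up data

is.na(x)        # FALSE  TRUE  TRUE FALSE FALSE  (NaN counts as missing)
is.nan(x)       # FALSE FALSE  TRUE FALSE FALSE
is.infinite(x)  # FALSE FALSE FALSE  TRUE  TRUE
is.finite(x)    #  TRUE FALSE FALSE FALSE FALSE
```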

Forced coercion

You can coerce one type to another with the as.*() family of functions, e.g. as.integer().

is.integer(2.0)
[1] FALSE
as.integer("2.0")
[1] 2
is.integer(as.integer(2.0))
[1] TRUE
is.integer(Inf)
[1] FALSE

Inf and NaN are doubles; however, their coercion behavior is not the same as for other doubles: both become NA when coerced to integer.

is.double(Inf)
[1] TRUE
is.double(NaN)
[1] TRUE
is.integer(as.integer(Inf))
Warning: NAs introduced by coercion to integer range
[1] TRUE
is.integer(as.integer(NaN))
[1] TRUE
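
A few more coercions, as an illustrative sketch with made-up inputs. Note that as.integer() truncates toward zero rather than rounding, and that values with no integer representation become NA.

```r
as.integer(2.9)      # 2: doubles are truncated toward zero, not rounded
as.integer(-2.9)     # -2
as.integer(TRUE)     # 1
as.double("1e3")     # 1000
as.character(1.5)    # "1.5"

# Values that cannot be represented become NA (usually with a warning,
# silenced here with suppressWarnings())
suppressWarnings(as.integer("two"))  # NA
suppressWarnings(as.integer(Inf))    # NA
as.integer(NaN)                      # NA
```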

Exercise 1

Write a function that takes vector input x and returns the smallest and largest non-infinite value. Test your function on

x = c(1, Inf, 100, 10, -Inf)

Lists

Generic vectors (lists)

There are two types of vectors in R: atomic vectors (elements are all the same type) and generic vectors, aka lists (heterogeneous collections of elements). For example, a list can contain atomic vectors, functions, other lists, etc.

list("A", (1:4)/2, list(TRUE, 1), function(x) x^2)
[[1]]
[1] "A"

[[2]]
[1] 0.5 1.0 1.5 2.0

[[3]]
[[3]][[1]]
[1] TRUE

[[3]][[2]]
[1] 1


[[4]]
function(x) x^2

List structure

We can view the contents of a list and a brief description of the contents compactly with the structure function str()

str(list("A", (1:4)/2, list(TRUE, 1), function(x) x^2))
List of 4
 $ : chr "A"
 $ : num [1:4] 0.5 1 1.5 2
 $ :List of 2
  ..$ : logi TRUE
  ..$ : num 1
 $ :function (x)  
  ..- attr(*, "srcref")= 'srcref' int [1:8] 1 39 1 53 39 53 1 1
  .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fd5caeb0d58> 
str(1:100)
 int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
str(list(c(1,2), c(TRUE, FALSE)))
List of 2
 $ : num [1:2] 1 2
 $ : logi [1:2] TRUE FALSE

Recursive lists

str(list(list(list(list())))) # recursive list
List of 1
 $ :List of 1
  ..$ :List of 1
  .. ..$ : list()
str(list(1, list(2), list(3, 2))) # recursive list
List of 3
 $ : num 1
 $ :List of 1
  ..$ : num 2
 $ :List of 2
  ..$ : num 3
  ..$ : num 2

Because lists can contain other lists, they are the most natural way of representing tree-like structures within R.
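
As a minimal sketch of a tree stored in nested lists: the tree and the helper count_nodes() below are made up for illustration, and the value/children layout is only one of many possible conventions.

```r
# A made-up tree: each node is a list with a value and a list of children
tree = list(
  value = "root",
  children = list(
    list(value = "a", children = list(
      list(value = "a1", children = list())
    )),
    list(value = "b", children = list())
  )
)

# Hypothetical helper: count the nodes by recursing into the children
count_nodes = function(node) {
  n = 1
  for (child in node$children) {
    n = n + count_nodes(child)
  }
  n
}

count_nodes(tree)  # 4
```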

List coercion

By default a vector will be coerced to a list (as a list is more general) if needed

str( c(1, list(4, list(6, 7))) )
List of 3
 $ : num 1
 $ : num 4
 $ :List of 2
  ..$ : num 6
  ..$ : num 7

We can coerce a list into an atomic vector using unlist - the usual type coercion rules then apply to determine the final type.

unlist(list(1:3, list(4:5, 6)))
[1] 1 2 3 4 5 6
unlist( list(1, list(2, list(3, "Hello"))) )
[1] "1"     "2"     "3"     "Hello"

as.integer and similar functions can be used, but only if the list is flat (i.e. no lists inside your base list)
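
A short sketch of the difference, using made-up lists. For a nested list, one option is to flatten with unlist() first and then coerce.

```r
flat = list(1, 2, 3)          # made-up flat list
as.integer(flat)              # 1 2 3

nested = list(1, list(2, 3))  # made-up nested list
# as.integer(nested) fails here, so flatten first
as.integer(unlist(nested))    # 1 2 3
```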

Named lists

Because of their more complex structure we often want to name the elements of a list (we can also do this with atomic vectors).

This can make accessing list elements more straightforward.

str(list(A = 1, B = list(C = 2, D = 3)))
List of 2
 $ A: num 1
 $ B:List of 2
  ..$ C: num 2
  ..$ D: num 3

More complex names (e.g. names containing spaces) need to be quoted:

list("knock knock" = "who's there?")
$`knock knock`
[1] "who's there?"
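
To illustrate how names help with access, here is a small sketch with a made-up list. Elements can be reached with $ or [[ ]], and names that need quoting are wrapped in backticks when used with $.

```r
# Made-up example list
person = list(name = "Ada", "favorite number" = 7, tags = list(a = 1, b = 2))

person$name                  # "Ada"
person[["name"]]             # "Ada"
person$`favorite number`     # 7
person[["favorite number"]]  # 7
person$tags$b                # 2
names(person)                # "name" "favorite number" "tags"
```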

Exercise 2

Represent the following JSON (JavaScript Object Notation) data as a list in R.

{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": 
  {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": 10021
  },
  "phoneNumber": 
  [
    {
      "type": "home",
      "number": "212 555-1239"
    },
    {
      "type": "fax",
      "number": "646 555-4567"
    }
  ]
}

NULL values

The NULL type

NULL is a special value within R that represents nothing - it always has length zero and type “NULL” and cannot have any attributes.

NULL
NULL
typeof(NULL)
[1] "NULL"
length(NULL)
[1] 0
c()
NULL
c(NULL)
NULL
c(1, NULL, 2)
[1] 1 2
c(NULL, TRUE, "A")
[1] "TRUE" "A"   

When combined in a vector, it disappears.

0-length coercion

Previously we saw that in multi-vector operations, short vectors get re-used until the length of the long vector is matched.

0-length coercion is a special case of length coercion that applies when one of the arguments has length 0. In this case the longer vector’s length is not used and the result will have length 0.

integer() + 1
numeric(0)
log(numeric())
numeric(0)
logical() | TRUE
logical(0)
character() > "M"
logical(0)
NULL + 1
numeric(0)
log(NULL)
Error in log(NULL): non-numeric argument to mathematical function
NULL | TRUE
logical(0)
NULL > "M"
logical(0)

As NULL values always have length 0, this coercion rule will apply (note that type coercion is also occurring here).
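
A quick sketch contrasting ordinary recycling with the 0-length case, using made-up vectors and checking only lengths and types:

```r
length(1:6 + 1:2)        # 6: the shorter vector is recycled
length(1:6 + integer())  # 0: a 0-length argument gives a 0-length result
length(NULL + 1:6)       # 0: NULL behaves like a 0-length vector

# Type coercion still happens even though the result is empty
typeof(integer() + 1L)   # "integer"
typeof(integer() + 1)    # "double"
```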

NULL and comparison

Given the previous issue, comparisons and conditionals involving NULL can be problematic.

x = NULL

if (x > 0)
  print("Hello")
Error in if (x > 0) print("Hello"): argument is of length zero
if (!is.null(x) & (x > 0))
  print("Hello")
Error in if (!is.null(x) & (x > 0)) print("Hello"): argument is of length zero
if (!is.null(x) && (x > 0))
  print("Hello")

The last version works because of short-circuit evaluation, which occurs with && and || but not with & or |: since !is.null(x) is FALSE, x > 0 is never evaluated.
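
Short-circuiting can be made visible with a function that reports when it runs. Both noisy_true() and is_positive() below are hypothetical helpers written for this sketch, not base R functions.

```r
# Hypothetical helper that announces when it is evaluated
noisy_true = function() {
  message("evaluated")
  TRUE
}

FALSE && noisy_true()  # FALSE, no message: the right side is skipped
TRUE  || noisy_true()  # TRUE, no message
FALSE &  noisy_true()  # FALSE, but "evaluated" is printed

# A guarded check built on the same idea: each condition protects the next
is_positive = function(x) {
  !is.null(x) && length(x) == 1 && !is.na(x) && x > 0
}
is_positive(NULL)  # FALSE
is_positive(5)     # TRUE
is_positive(NA)    # FALSE
```

The extra length and is.na() guards are a design choice of this sketch. They are there because if() also fails on NA and on conditions of length greater than one.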

Attributes

Attributes

Attributes are named lists that can be attached to objects in R. Attributes contain metadata about an object, e.g. the object’s names, dim, class, levels etc.

Attributes can be interacted with via the attr() and attributes() functions.

(x = c(L=1,M=2,N=3))
L M N 
1 2 3 
attributes(x)
$names
[1] "L" "M" "N"
str(attributes(x))
List of 1
 $ names: chr [1:3] "L" "M" "N"
attr(x, "names")
[1] "L" "M" "N"
attr(x, "something")
NULL

Assigning attributes

x = c(1, 2, 3)
x
[1] 1 2 3
names(x) = c("Z","Y","X") # helper function
x
Z Y X 
1 2 3 
names(x)
[1] "Z" "Y" "X"
attr(x, "names") = c("A","B","C")
x
A B C 
1 2 3 
names(x)
[1] "A" "B" "C"
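
Attributes are not limited to the built-in ones. In the sketch below, "units" is a made-up attribute name chosen for illustration. Assigning NULL to an attribute removes it.

```r
x = c(a = 1, b = 2, c = 3)
attr(x, "units") = "cm"     # "units" is a made-up attribute name
str(attributes(x))          # a list with $names and $units

attributes(x)$units         # "cm"
attr(x, "units") = NULL     # assigning NULL removes the attribute
names(attributes(x))        # "names"
```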

Factors

Factor objects are how R represents categorical data (e.g. a variable where there is a fixed number of possible outcomes).

(x = factor(c("Sunny", "Cloudy", "Rainy", "Cloudy", "Cloudy")))
[1] Sunny  Cloudy Rainy  Cloudy Cloudy
Levels: Cloudy Rainy Sunny
str(x)
 Factor w/ 3 levels "Cloudy","Rainy",..: 3 1 2 1 1
typeof(x)
[1] "integer"

What’s really going on?

attributes(x)
$levels
[1] "Cloudy" "Rainy"  "Sunny" 

$class
[1] "factor"

A factor is just an integer vector with two attributes: class and levels.
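
One way to see this directly, as a small sketch reusing the weather factor from above: unclass() strips the class and exposes the integer codes, and indexing the levels by those codes recovers the labels.

```r
x = factor(c("Sunny", "Cloudy", "Rainy", "Cloudy", "Cloudy"))

unclass(x)      # integer codes 3 1 2 1 1, with the levels attribute kept
levels(x)[x]    # "Sunny" "Cloudy" "Rainy" "Cloudy" "Cloudy"
identical(levels(x)[x], as.character(x))  # TRUE
```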

Building objects

We can build our own factor from scratch using,

y = c(3L, 1L, 2L, 1L, 1L)
attr(y, "levels") = c("Cloudy", "Rainy", "Sunny")
attr(y, "class") = "factor"
y
[1] Sunny  Cloudy Rainy  Cloudy Cloudy
Levels: Cloudy Rainy Sunny

The approach we just used is a bit clunky - generally the preferred method for constructing an object with attributes from scratch is to use the structure function.

y = structure(
  c(3L, 1L, 2L, 1L, 1L), # data
  levels = c("Cloudy", "Rainy", "Sunny"),
  class = "factor"
)
y
[1] Sunny  Cloudy Rainy  Cloudy Cloudy
Levels: Cloudy Rainy Sunny
class(y)
[1] "factor"
is.factor(y)
[1] TRUE

Knowing that factors are stored as integers helps explain some of their more interesting behaviors:

x
[1] Sunny  Cloudy Rainy  Cloudy Cloudy
Levels: Cloudy Rainy Sunny
x+1
Warning in Ops.factor(x, 1): '+' not meaningful for factors
[1] NA NA NA NA NA
is.integer(x)
[1] FALSE
as.integer(x)
[1] 3 1 2 1 1
as.character(x)
[1] "Sunny"  "Cloudy" "Rainy"  "Cloudy" "Cloudy"
as.logical(x)
[1] NA NA NA NA NA
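
A common consequence, sketched with made-up data: when a factor's labels look like numbers, as.integer() returns the underlying codes rather than the values, so the labels have to be converted instead.

```r
f = factor(c("10", "20", "10", "5"))   # made-up data

levels(f)                    # "10" "20" "5" (sorted as strings)
as.integer(f)                # 1 2 1 3: the codes, not the values
as.numeric(as.character(f))  # 10 20 10 5
as.numeric(levels(f))[f]     # 10 20 10 5
```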

Exercise 3

Create a factor vector based on the vector of airport codes below.

airports = c("RDU", "ABE", "DTW", "GRR", "RDU", "GRR", "GNV",
             "JFK", "JFK", "SFO", "DTW")

All of the possible levels are

c("RDU", "ABE", "DTW", "GRR", "GNV", "JFK", "SFO")