About missing values: regular NAs, tagged NAs and user NAs
Joseph Larmarange
Source:vignettes/missing_values.Rmd
In base R, missing values are indicated using the specific value NA. Regular NAs can be used with any type of vector (double, integer, character, factor, Date, etc.).
Other statistical software packages have implemented ways to differentiate several types of missing values.
Stata and SAS have a system of tagged NAs, where NA values are tagged with a letter (from a to z). SPSS allows users to indicate that certain non-missing values should be treated as missing in some analyses (user NAs). The haven package implements tagged NAs and user NAs in order to keep this information when importing files from Stata, SAS or SPSS.
Tagged NAs
Creation and tests
Tagged NAs are proper NA values with a tag attached to them. They can be created with tagged_na(). The attached tag should be a single letter, lowercase (a-z) or uppercase (A-Z).
Most R functions treat tagged NAs as regular NAs. By default, they are printed like any other regular NA.
x <- c(1:5, tagged_na("a"), tagged_na("z"), NA)
x
## [1] 1 2 3 4 5 NA NA NA
is.na(x)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
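As a small illustration of this behaviour, the sketch below uses a hypothetical vector w (not part of the original example) to show that base functions count and drop tagged NAs exactly like regular ones.

```r
library(labelled)

# `w` is a hypothetical vector used only for this illustration
w <- c(1, 2, 3, tagged_na("a"), tagged_na("z"))

# Base R sees two missing values, whatever their tags
sum(is.na(w))

# na.rm = TRUE removes tagged NAs like any other NA
mean(w, na.rm = TRUE)
```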
To show or print their tags, use na_tag(), print_tagged_na() or format_tagged_na().
na_tag(x)
## [1] NA NA NA NA NA "a" "z" NA
print_tagged_na(x)
## [1] 1 2 3 4 5 NA(a) NA(z) NA
format_tagged_na(x)
## [1] " 1" " 2" " 3" " 4" " 5" "NA(a)" "NA(z)" " NA"
To test whether a given NA is a regular NA or a tagged NA, use is_regular_na() or is_tagged_na().
is.na(x)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
is_tagged_na(x)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
# You can test for specific tagged NAs with the second argument
is_tagged_na(x, "a")
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
is_regular_na(x)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Tagged NAs can be defined only for double vectors. If you add a tagged NA to a character vector, it will be converted into a regular NA. If you add a tagged NA to an integer vector, the vector will be converted into a double vector.
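y <- c("a", "b", tagged_na("z"))
y
## [1] "a" "b" NA
is_tagged_na(y)
## [1] FALSE FALSE FALSE
format_tagged_na(y)
## Error: `x` must be a double vector
z <- c(1L, 2L, tagged_na("a"))
typeof(z)
## [1] "double"
format_tagged_na(z)
## [1] " 1" " 2" "NA(a)"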
Unique values, duplicates and sorting with tagged NAs
By default, functions such as base::unique(), base::duplicated(), base::order() or base::sort() treat tagged NAs as the same thing as a regular NA. You can use unique_tagged_na(), duplicated_tagged_na(), order_tagged_na() and sort_tagged_na() as alternatives that treat two tagged NAs with different tags as separate values.
x <- c(1, 2, tagged_na("a"), 1, tagged_na("z"), 2, tagged_na("a"), NA)
x %>% print_tagged_na()
## [1] 1 2 NA(a) 1 NA(z) 2 NA(a) NA
unique(x) %>% print_tagged_na()
## [1] 1 2 NA(a)
unique_tagged_na(x) %>% print_tagged_na()
## [1] 1 2 NA(a) NA(z) NA
duplicated(x)
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
duplicated_tagged_na(x)
## [1] FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
sort(x, na.last = TRUE) %>% print_tagged_na()
## [1] 1 1 2 2 NA(a) NA(z) NA(a) NA
sort_tagged_na(x) %>% print_tagged_na()
## [1] 1 1 2 2 NA(a) NA(a) NA(z) NA
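order_tagged_na() is listed above but not demonstrated. The following is a minimal sketch, assuming the default arguments and a hypothetical vector w: indexing by the returned permutation is expected to give the same arrangement as sort_tagged_na().

```r
library(labelled)

# `w` is a hypothetical vector used only for this illustration
w <- c(1, 2, tagged_na("a"), 1, tagged_na("z"), 2, tagged_na("a"), NA)

# Permutation that sorts `w` while distinguishing tags
idx <- order_tagged_na(w)

# Reorder by `idx` and print with tags visible
print_tagged_na(w[idx])

# Compare with sort_tagged_na(), formatted so that tags are taken into account
identical(format_tagged_na(w[idx]), format_tagged_na(sort_tagged_na(w)))
```

This is useful when the same ordering has to be applied to several columns, for example the rows of a data frame.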
Tagged NAs and value labels
It is possible to define value labels for tagged NAs.
x <- c(1, 0, 1, tagged_na("r"), 0, tagged_na("d"), tagged_na("z"), NA)
val_labels(x) <- c(
  no = 0, yes = 1,
  "don't know" = tagged_na("d"),
  refusal = tagged_na("r")
)
x
## <labelled<double>[8]>
## [1] 1 0 1 NA(r) 0 NA(d) NA(z) NA
##
## Labels:
## value label
## 0 no
## 1 yes
## NA(d) don't know
## NA(r) refusal
When converting such a labelled vector into a factor, tagged NAs are, by default, converted into regular NAs (it is not possible to define tagged NAs with factors).
to_factor(x)
## [1] yes no yes <NA> no <NA> <NA> <NA>
## Levels: no yes
However, the explicit_tagged_na option of to_factor() allows you to transform tagged NAs into explicit factor levels.
to_factor(x, explicit_tagged_na = TRUE)
## [1] yes no yes refusal no don't know NA(z)
## [8] <NA>
## Levels: no yes don't know refusal NA(z)
to_factor(x, levels = "prefixed", explicit_tagged_na = TRUE)
## [1] [1] yes [0] no [1] yes [NA(r)] refusal
## [5] [0] no [NA(d)] don't know [NA(z)] NA(z) <NA>
## Levels: [0] no [1] yes [NA(d)] don't know [NA(r)] refusal [NA(z)] NA(z)
Conversion into user NAs
Tagged NAs can be converted into user NAs with tagged_na_to_user_na().
tagged_na_to_user_na(x)
## <labelled_spss<double>[8]>
## [1] 1 0 1 3 0 2 4 NA
## Missing range: [2, 4]
##
## Labels:
## value label
## 0 no
## 1 yes
## 2 don't know
## 3 refusal
## 4 NA(z)
tagged_na_to_user_na(x, user_na_start = 10)
## <labelled_spss<double>[8]>
## [1] 1 0 1 11 0 10 12 NA
## Missing range: [10, 12]
##
## Labels:
## value label
## 0 no
## 1 yes
## 10 don't know
## 11 refusal
## 12 NA(z)
Use tagged_na_to_regular_na() to convert tagged NAs into regular NAs.
tagged_na_to_regular_na(x)
## <labelled<double>[8]>
## [1] 1 0 1 NA 0 NA NA NA
##
## Labels:
## value label
## 0 no
## 1 yes
tagged_na_to_regular_na(x) %>% is_tagged_na()
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
User NAs
haven introduced a haven_labelled_spss class to deal with user-defined missing values in a similar way to SPSS. In this case, additional attributes are used to indicate which values should be considered as missing, but such values are not stored as internal NA values. Note that most R functions will not take this information into account. Therefore, you will have to convert missing values into NA if required before analysis. These defined missing values can co-exist with internal NA values.
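To make the storage point concrete, here is a minimal sketch with a hypothetical vector u. It assumes that removing the class with unclass() exposes the underlying double values: the user NA is still stored as 9, and only the class-aware is.na() reports it as missing.

```r
library(labelled)

# `u` is a hypothetical vector: 9 is declared as a user NA, the last value is a regular NA
u <- labelled_spss(c(1, 9, NA), na_values = 9)

# The declaration lives in an attribute; the data still contain 9
na_values(u)
unclass(u)

# Class-aware test: the user NA and the regular NA are both reported
is.na(u)

# On the underlying values, only the regular NA is missing
is.na(unclass(u))
```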
Creation
User NAs can be created directly with labelled_spss(). You can also manipulate them with na_values() and na_range().
v <- labelled(c(1, 2, 3, 9, 1, 3, 2, NA), c(yes = 1, no = 3, "don't know" = 9))
v
## <labelled<double>[8]>
## [1] 1 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 no
## 9 don't know
na_values(v) <- 9
v
## <labelled_spss<double>[8]>
## [1] 1 2 3 9 1 3 2 NA
## Missing values: 9
##
## Labels:
## value label
## 1 yes
## 3 no
## 9 don't know
na_values(v) <- NULL
v
## <labelled<double>[8]>
## [1] 1 2 3 9 1 3 2 NA
##
## Labels:
## value label
## 1 yes
## 3 no
## 9 don't know
na_range(v) <- c(5, Inf)
na_range(v)
## [1] 5 Inf
v
## <labelled_spss<double>[8]>
## [1] 1 2 3 9 1 3 2 NA
## Missing range: [5, Inf]
##
## Labels:
## value label
## 1 yes
## 3 no
## 9 don't know
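The examples above set user NAs on an existing labelled vector. For direct creation with labelled_spss(), a minimal sketch could look like the following; v2 is a hypothetical name, and the values and labels are copied from v.

```r
library(labelled)

# `v2` is a hypothetical name; 9 is declared as a user NA at creation time
v2 <- labelled_spss(
  c(1, 2, 3, 9, 1, 3, 2, NA),
  labels = c(yes = 1, no = 3, "don't know" = 9),
  na_values = 9
)

# The declaration can be read back with the same accessors
na_values(v2)
is_user_na(v2)
```

A range can be declared the same way through the na_range argument, for example na_range = c(5, Inf).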
NB: you can also use set_na_range() and set_na_values() for a dplyr-like syntax.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# setting value labels and user NAs
df <- tibble(s1 = c("M", "M", "F", "F"), s2 = c(1, 1, 2, 9)) %>%
  set_value_labels(s2 = c(yes = 1, no = 2)) %>%
  set_na_values(s2 = 9)
df$s2
## <labelled_spss<double>[4]>
## [1] 1 1 2 9
## Missing values: 9
##
## Labels:
## value label
## 1 yes
## 2 no
# removing user NAs
df <- df %>% set_na_values(s2 = NULL)
df$s2
## <labelled<double>[4]>
## [1] 1 1 2 9
##
## Labels:
## value label
## 1 yes
## 2 no
Tests
Note that is.na() will return TRUE for user NAs. Use is_user_na() to test whether a specific value is a user NA and is_regular_na() to test whether it is a regular NA.
v
## <labelled_spss<double>[8]>
## [1] 1 2 3 9 1 3 2 NA
## Missing range: [5, Inf]
##
## Labels:
## value label
## 1 yes
## 3 no
## 9 don't know
is.na(v)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
is_user_na(v)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
is_regular_na(v)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Conversion
For most R functions, user NA values are still regular values.
x <- c(1:5, 11:15)
na_range(x) <- c(10, Inf)
val_labels(x) <- c("dk" = 11, "refused" = 15)
x
## <labelled_spss<integer>[10]>
## [1] 1 2 3 4 5 11 12 13 14 15
## Missing range: [10, Inf]
##
## Labels:
## value label
## 11 dk
## 15 refused
mean(x)
## [1] 8
You can convert user NAs into regular NAs with user_na_to_na() or user_na_to_regular_na() (both functions are identical).
user_na_to_na(x)
## <labelled<integer>[10]>
## [1] 1 2 3 4 5 NA NA NA NA NA
mean(user_na_to_na(x), na.rm = TRUE)
## [1] 3
Alternatively, if the vector is numeric, you can convert user NAs into tagged NAs with user_na_to_tagged_na().
user_na_to_tagged_na(x)
## 'x' has been converted into a double vector.
## <labelled<double>[10]>
## [1] 1 2 3 4 5 NA(a) NA(b) NA(c) NA(d) NA(e)
##
## Labels:
## value label
## NA(a) dk
## NA(e) refused
mean(user_na_to_tagged_na(x), na.rm = TRUE)
## 'x' has been converted into a double vector.
## [1] 3
Finally, you can also remove the user NA definitions without converting these values to NA, using remove_user_na().
remove_user_na(x)
## <labelled<integer>[10]>
## [1] 1 2 3 4 5 11 12 13 14 15
##
## Labels:
## value label
## 11 dk
## 15 refused
mean(remove_user_na(x))
## [1] 8