Outlier detection

Detect outliers within a dataset, or test whether a new (novel) observation is an outlier.

fit_outlier(
  A,
  adj,
  z = NULL,
  alpha = 0.05,
  nsim = 10000,
  ncores = 1,
  validate = TRUE
)

Arguments

A	Character matrix or data.frame. All values must be limited to a single character.
adj	Adjacency list or `gengraph` object of a decomposable graph. See package `ess` for `gengraph` objects.
z	Named vector (same names as `colnames(A)`) or `NULL`. See details. Values must be limited to a single character.
alpha	Significance level
nsim	Number of simulations
ncores	Number of cores to use in parallelization
validate	Logical. If `TRUE`, checks whether `A` contains only single-character values and converts it if not.

Value

An outlier_model object with either novelty or outlier as a child class. The two classes serve different purposes; see the details.

Details

If the goal is to detect outliers within A, set z to NULL; this procedure is most often referred to simply as outlier detection. Once fit_outlier has been called in this situation, one can use the outliers function to identify which observations in A are outliers. See the examples.

On the other hand, if the goal is to test whether a new, unseen observation z is an outlier in A, supply a named vector to z.
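The role of alpha and nsim in both cases can be illustrated with a minimal base-R sketch. It assumes that the critical value is the empirical (1 - alpha) quantile of the simulated test statistics and that a p-value is the share of simulated values at least as large as the observed one; the printed output in the examples is consistent with this, but this page does not state the rule explicitly. The names sim_devs, obs_devs and emp_pval are hypothetical, and the chi-square draw merely stands in for the simulations that fit_outlier performs.

```r
set.seed(1)  # reproducible illustration

# Hypothetical stand-in for nsim = 500 simulated test statistics;
# fit_outlier obtains these from the fitted model, not from a chi-square.
sim_devs <- rchisq(500, df = 26)

alpha <- 0.05
# Assumed rule: critical value = empirical (1 - alpha) quantile
crit <- unname(quantile(sim_devs, probs = 1 - alpha))

# Hypothetical observed test statistics for three observations
obs_devs <- c(a = 20, b = 30, c = 60)

# Empirical p-value: share of simulated values at least as large
emp_pval <- function(dev, sims) mean(sims >= dev)
pvals <- vapply(obs_devs, emp_pval, numeric(1), sims = sim_devs)

# An observation is flagged when its statistic reaches the critical value
is_outlier <- obs_devs >= crit
which(is_outlier)
```

With few simulations the empirical quantile and the p-values are coarse, which is one reason to keep nsim large outside of quick demonstrations.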

All values must have a single-character representation; if they do not, the function converts them internally to such a representation. The reason is faster runtime performance. One can also apply the exported function to_chars to A in advance and set validate to FALSE.
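To make the idea of a single-character representation concrete, here is a base-R sketch. The helper to_single_chars is hypothetical and is not the package's to_chars; it only shows one possible recoding in which every distinct value of a column is mapped to one character. The data frame raw and its columns are likewise made up for illustration.

```r
# Hypothetical helper (not the package's to_chars): recode each column
# so that every distinct value becomes one single character.
to_single_chars <- function(df) {
  symbols <- c(letters, LETTERS, 0:9)  # 62 available one-character codes
  out <- lapply(df, function(col) {
    lev <- unique(as.character(col))
    stopifnot(length(lev) <= length(symbols))
    unname(symbols[match(as.character(col), lev)])
  })
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Made-up data with multi-character values
raw <- data.frame(
  severity = c("none", "mild", "severe", "mild"),
  age_band = c("10-19", "20-29", "10-19", "30-39"),
  stringsAsFactors = FALSE
)

recoded <- to_single_chars(raw)

# Every cell now holds exactly one character
all(nchar(as.matrix(recoded)) == 1)
```

Recoding of this kind changes only the labels, not which observations share a value, so the structure of the data is preserved.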

The adj object is typically obtained with fit_graph from the ess package, but the user can instead supply an adjacency list (simply a named list) of their own choice.
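A hand-written adjacency list might look like the sketch below. It assumes the usual convention that each list element is named after a variable and holds the names of that variable's neighbours; the variable names v1 to v4 are hypothetical and would need to correspond to the columns of A. The graph shown (a triangle with one extra edge) is decomposable.

```r
# Hypothetical variables v1..v4; each element lists that variable's neighbours.
adj <- list(
  v1 = c("v2", "v3"),
  v2 = c("v1", "v3"),
  v3 = c("v1", "v2", "v4"),
  v4 = "v3"
)

# Sanity check: in an undirected graph the neighbour relation is symmetric,
# so v must appear among the neighbours of each of its own neighbours.
is_symmetric <- all(vapply(names(adj), function(v) {
  all(vapply(adj[[v]], function(w) v %in% adj[[w]], logical(1)))
}, logical(1)))
is_symmetric
```

The symmetry check does not verify decomposability; it only catches edges that were entered in one direction.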

Examples


library(dplyr)
library(ess)  # For the fit_graph function
set.seed(7)   # For reproducibility

# Psoriasis patients
d <- derma %>%
  filter(ES == "psoriasis") %>%
  select(1:20) %>% # only a subset of data is used to exemplify
  as_tibble()

# Fitting the interaction graph
# see package ess for details
g <- fit_graph(d, trace = FALSE)
plot(g)

# -----------------------------------------------------------
#                        EXAMPLE 1
#    Testing which observations within d are outliers
# -----------------------------------------------------------

# Only 500 simulations are used here for illustration
# The default number of simulations is 10,000
m1 <- fit_outlier(d, g, nsim = 500)
print(m1)
#> 
#>  -------------------------------- 
#>   Simulations: 500 
#>   Variables: 20 
#>   Observations: 111 
#>   Estimated mean: 26.3 
#>   Estimated variance: 23.7 
#>  --------------------------------
#>   Critical value: 35.35006 
#>   Alpha: 0.05 
#>   <outlier, outlier_model, list> 
#>  --------------------------------
outs  <- outliers(m1)
douts <- d[which(outs), ]
douts
#> # A tibble: 12 x 20
#>    c1    c2    c3    c4    c5    c6    c7    c8    c9    c10   c11   h12   h13  
#>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 2     1     2     2     2     0     0     0     0     0     0     0     1    
#>  2 2     2     2     3     3     0     0     0     0     2     0     0     1    
#>  3 3     3     2     2     1     0     0     0     0     1     0     0     2    
#>  4 1     1     1     1     1     0     0     0     1     1     0     0     0    
#>  5 1     1     1     1     1     0     0     0     2     2     0     0     0    
#>  6 1     1     1     1     1     0     1     0     2     3     0     0     0    
#>  7 2     3     1     2     1     0     0     0     0     0     0     0     0    
#>  8 3     3     2     2     0     0     0     0     2     0     0     0     0    
#>  9 1     3     1     0     0     0     0     0     0     0     0     0     0    
#> 10 3     2     3     0     1     0     0     0     1     2     0     0     0    
#> 11 3     2     3     2     0     0     0     0     0     2     1     0     0    
#> 12 2     2     1     1     0     0     0     0     0     2     1     0     0    
#> # … with 7 more variables: h14 <chr>, h15 <chr>, h16 <chr>, h17 <chr>,
#> #   h18 <chr>, h19 <chr>, h20 <chr>

# Notice that m1 is of class 'outlier'. This means that the procedure has tested which
# observations _within_ the data are outliers. This method is most often just referred to
# as outlier detection. The following plot is the distribution of the test statistic. Think
# of a simple t-test, where the distribution of the test statistic is a t-distribution.
# To reach a conclusion about the hypothesis, one finds the critical value and checks
# whether the test statistic is greater or less than it.

# Retrieving the test statistic for individual observations
x1 <- douts[1, ] %>% unlist()
x2 <- d[1, ] %>% unlist()
dev1 <- deviance(m1, x1) # falls within the critical region in the plot (the red area)
dev2 <- deviance(m1, x2) # falls within the acceptable region in the plot

dev1
#> [1] 37.71912
dev2
#> [1] 34.03639

# Retrieving the p-values
pval(m1, dev1)
#> [1] 0.01
pval(m1, dev2)
#> [1] 0.076

# -----------------------------------------------------------
#                        EXAMPLE 2
#         Testing if a new observation is an outlier
# -----------------------------------------------------------

# An observation from class "chronic dermatitis"
z <- derma %>%
  filter(ES == "chronic dermatitis") %>%
  select(1:20) %>%
  slice(1) %>%
  unlist()

# Test if z is an outlier in class "psoriasis"
# Only 500 simulations are used here for illustration
# The default number of simulations is 10,000
m2 <- fit_outlier(d, g, z, nsim = 500)
print(m2)
#> 
#>  -------------------------------- 
#>   Simulations: 500 
#>   Variables: 20 
#>   Observations: 112 
#>   Estimated mean: 26.79 
#>   Estimated variance: 25.16 
#>  --------------------------------
#>   Critical value: 36.70118 
#>   Deviance: 55.14169 
#>   P-value: 0 
#>   Alpha: 0.05 
#>   <novelty, outlier_model, list> 
#>  --------------------------------
plot(m2) # Try using more simulations and the complete derma data

# Notice that m2 is of class 'novelty'. The term novelty detection
# is sometimes used in the literature when the goal is to verify
# whether a new, unseen observation is an outlier in a homogeneous dataset.

# Retrieving the test statistic and p-value for z
dz <- deviance(m2, z)
pval(m2, dz)
#> [1] 0
