Explorer - User’s Guide

Standard (Free) Edition

©NXG Logic, LLC.

October 7, 2019

PIC

©Copyright 2019 by NXG Logic, LLC.

All rights reserved. This book or any portion thereof may not be reproduced or used in any manner whatsoever without the express written permission of the publisher except for the use of brief quotations in a user guide review.

Printed in the United States of America

ISBN: 978-1-5323-6817-2
United States Library of Congress - US Copyright Office Registration: 1-6268467801
NXG Logic, LLC.
Houston, Texas 77030

www.nxglogic.com

CONTENTS

1 PREFERENCES AND DEFAULT SETTINGS
 1.1 Feature Types
 1.2 General Default Settings
  1.2.1 Missing Data Code
  1.2.2 Output Options
  1.2.3 Replace Missing Continuous Data
  1.2.4 Replace Missing Binary Data
  1.2.5 Replace Missing Categorical Data
  1.2.6 Default image (save) format
  1.2.7 Input Data Size Control
  1.2.8 Data Transformations
  1.2.9 Statistical Analysis
  1.2.10 Script editor
  1.2.11 String Manipulation
  1.2.12 Class Discovery Runs
  1.2.13 Class Prediction Runs
  1.2.14 Monte Carlo Uncertainty Analysis
 1.3 Default Table Fonts and Colors
 1.4 Color and Graphics Defaults
  1.4.1 Graphics
  1.4.2 Histograms
  1.4.3 Correlation, matrix plots, cluster images
  1.4.4 SOM Map Colors
  1.4.5 Cluster Heat Map Colors
  1.4.6 Series colors
  1.4.7 Class discovery score plot markers.
  1.4.8 Multichart panel.
  1.4.9 Histogram, line, and multipanel graph fonts.
  1.4.10 Options for YYYY and XYYY plots.
 1.5 Changing Feature Types
 1.6 Copying HTML output
 1.7 Copying images
 1.8 File Command
  1.8.1 Open Project
  1.8.2 Save Project
  1.8.3 Save Project As
  1.8.4 Import Data
  1.8.5 Export
2 SUMMARIZE
 2.1 Data Arrangement
 2.2 Normal Distribution
 2.3 Central Tendency and Dispersion
3 LABELS, TRANSFORMS, FILTERING, AND GROUPING
 3.1 Editor
  3.1.1 Labels for levels of categorical factors
  3.1.2 Mathematical transforms
 3.2 Importing Category Labels from Text File
 3.3 Filtering records
 3.4 Grouping
 3.5 Limitations of Filtering and Grouped Analysis
4 PRESET TRANSFORMATIONS
 4.1 Normalization
 4.2 Mean-Zero Standardization
 4.3 Logarithmic
 4.4 Quantile
 4.5 Rank
 4.6 Percentile
 4.7 van der Waerden Scores
 4.8 Nominal-to-Binary
 4.9 Fuzzification
 4.10 Continuous to Categorical
 4.11 Text to Categorical (converts all text features)
 4.12 Text to Categorical (select which text feature)
 4.13 Assign Categorical Features w/labels to Class Feature
 4.14 Transformations During Cross-Validation for Supervised Class Prediction
5 INDEPENDENCE
 5.1 2 Unrelated Samples
  5.1.1 Student’s T-test
  5.1.2 Homogeneity of Variance Test for Two Samples
  5.1.3 Mann-Whitney U Test
 5.2 Geometric and Harmonic Means
 5.3 k Unrelated Samples
  5.3.1 Analysis of Variance (ANOVA)
  5.3.2 Welch ANOVA Test for Unequal Variances
  5.3.3 Kruskal-Wallis Test
 5.4 Related Samples
  5.4.1 T-test for Paired Samples
 5.5 Equality Tests for 2 Independent Proportions
 5.6 Chi-Squared Contingency Table Analysis
 5.7 Two-way Contingency Tables
 5.8 Fisher’s Exact Test
 5.9 Multiway Contingency Tables
 5.10 Exact Tests for Multiway Tables
 5.11 McNemar Test for Paired Binary Outcomes
6 ASSOCIATION
 6.1 Covariance
 6.2 Parametric Correlation: Pearson Product Moment
 6.3 Non-parametric Correlation: Spearman Rank
 6.4 Force Plots from Minimum Spanning Trees
 6.5 Multivariate Forms of Covariance and Correlation
 6.6 Euclidean Distance
 6.7 Matrix Formulation of Covariance and Correlation
7 DEPENDENCY
 7.1 Linear Regression: Single Predictor
 7.2 Multiple Linear Regression
  7.2.1 Regression Diagnostics
  7.2.2 Partial F-tests
 7.3 Multivariate Linear Regression
 7.4 Logistic Regression
  7.4.1 Unconditional Logistic Regression
  7.4.2 Polytomous Logistic Regression
 7.5 Additive and Multiplicative Poisson Regression
  7.5.1 Poisson Assumption
  7.5.2 Residuals and Regression Diagnostics
  7.5.3 Goodness-of-Fit Statistics
8 SURVIVAL ANALYSIS
 8.1 Kaplan-Meier Analysis and Logrank Test
  8.1.1 Product Limit Estimate of Survivorship Function
  8.1.2 Log-rank Test
 8.2 Cox Proportional Hazards Regression
  8.2.1 Regression Diagnostics
9 DATASETS USED
 9.1 Plasma Retinol
 9.2 Low Birth Weight
 9.3 Small Round Blue-Cell Tumors
 9.4 British Doctors Cardiovascular and Smoking Data
 9.5 Primary Biliary Cirrhosis

Chapter 1
PREFERENCES AND DEFAULT SETTINGS

1.1 Feature Types

There are four types of features used in NXG Explorer: continuous, binary, categorical, and text. Feature type is determined when a dataset is input. To distinguish the three numeric feature types, NXG Explorer enumerates the number of unique values identified for each feature during data input. The number of unique values is compared with a default setting called the categorical threshold (default=10). If the number of unique feature values exceeds the categorical threshold, the feature is specified as continuous. If the number of unique values of a feature and its maximum value are both less than the categorical threshold, the feature is specified as categorical. However, if there are exactly two unique values with min=0 and max=1, the feature is specified as binary. Any time text is identified within an input feature’s values, the feature is specified as a text feature.
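To make the rule concrete, the following minimal Python sketch (illustrative only; the function name and details are assumptions, not NXG Explorer code) expresses the typing logic described above:

def infer_feature_type(values, categorical_threshold=10):
    """Sketch of the feature-typing rule: text, continuous, binary, or categorical."""
    non_missing = [v for v in values if v not in ("", None)]
    try:
        nums = [float(v) for v in non_missing]
    except ValueError:
        return "text"                       # any non-numeric entry -> text feature
    if any("." in str(v) for v in non_missing):
        return "continuous"                 # a decimal point forces the continuous type
    unique = sorted(set(nums))
    if unique == [0.0, 1.0]:
        return "binary"                     # exactly two values, coded 0 and 1
    if len(unique) <= categorical_threshold and max(unique) < categorical_threshold:
        return "categorical"                # few unique values with a small maximum
    return "continuous"

print(infer_feature_type(["1", "2", "3", "2", "1"]))   # categorical
print(infer_feature_type(["0", "1", "1", "0"]))        # binary
print(infer_feature_type(["12.5", "3.1", "44.0"]))     # continuous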

The table below illustrates the feature types along with the graphic icon used during feature selection for all analyses.

PIC

NXG Explorer searches for a decimal in the input values, and if found, the feature is set to the continuous type.

An example of the feature codes for the Retinol dataset in Explorer is shown below:

PIC

1.2 General Default Settings

From time to time, users will want to modify some of the default settings in NXG Explorer, such as heat map colors, text and line colors on charts, and text formatting during data input. To change default settings, select Preferences→ General default settings as shown below:

PIC

and the following popup window will appear:

PIC

The options available in the popup window are described below.

1.2.1 Missing Data Code

Remove spaces and special symbols on input. This filter provides a useful way to remove common errors in input files, such as empty spaces, dashes, Excel error codes, or special symbols. The empty data cells identified will be set to null (empty); no periods or codes are used for replacement.

Missing data code. If a preset missing data code such as NA or 9999 was used during coding, the missing data code can be entered in this textbox. NXG Explorer will replace any table cells containing the missing code with a null value.

Remove empty rows in Excel. This option will remove empty rows from the input dataset.

Required proportion of records with string entries for text features. This option helps guard against the input of messy features containing, for example, half text and half numeric values. If string characters are found in a feature’s values, this proportion is used as a threshold for determining which features are text-based. For example, when set to 0.8, if fewer than 80% of the input records for a feature contain text, the feature will not be identified as text-based, and any records with text will be set to null (empty). On the other hand, if more than 80% of the input values for the feature contain text, the entire feature will be input as a text feature. Use of this threshold is a good way to ensure that true text-based features do indeed contain mostly text.

1.2.2 Output Options

Empty output folder at program startup. Due to Windows security restrictions, especially in Windows 10, there are limitations on the folders which installed software can freely use for output. By default, at run-time, NXG Explorer will create an output folder called “NXG Explorer” in the user’s Documents folder, which will contain all output files. When this option is selected, the entire folder will be deleted before each run, preventing the accumulation of old files from previous runs. We have used this option most of the time and have never run into any problems, knowing that if an output is to be kept, we specifically save it to the working folders used for report-writing or manuscript preparation. Otherwise, if this option is not selected, the output files from all runs will be placed in this folder without being purged.

Remove logos. This option will remove NXG Logic logos from various output files.

1.2.3 Replace Missing Continuous Data

Apply. When checked, will apply the following selected option (only one selection allowed).

With zeroes. Missing continuous values will be replaced with zeroes.

With feature means. Missing continuous values will be replaced with the average of the valid input values.

With feature medians. Missing continuous values will be replaced with the median of the valid input values.

1.2.4 Replace Missing Binary Data

Apply. When checked, will apply the following selected option (only one selection allowed).

With zeroes. Missing binary values will be replaced with zeroes.

With ones. Missing binary values will be replaced with ones.

1.2.5 Replace Missing Categorical Data

Apply. When checked, will apply the following selected option (only one selection allowed).

With zeroes. Missing categorical values will be replaced with the lowest feature category value (level of the factor).

With ones. Missing categorical values will be replaced with the greatest feature category value (level of the factor).

1.2.6 Default image (save) format

Available formats. NXG Explorer available default image formats are png, wmf, jpg, gif, tif, bmp, emf, exi, ico, and memorybmp.

1.2.7 Input Data Size Control

Max limit for display of input data points. For large datasets, limitations can be set on the maximum size of data which can be displayed in the visible datagrid.

1.2.8 Data Transformations

Recode text categories into numeric codes on input. This option (true, false) will prompt the user with text-to-numeric conversion options during data input, to assign numeric class membership to the various levels of a categorical text field. If text is found in a feature’s values when set to true, NXG Explorer will tabulate the number of unique text strings found among the feature’s data and prompt the user to decide whether the unique text words should be transformed into numeric class values. Therefore, the command will result in text-to-numeric class assignments when specified by the user. When set to false, and text is found in features upon input, the features will remain as text-based features.

Default level of categorical features: 10. This default value, known as the categorical threshold, forms the basis for determining which features are specified as categorical and which as continuous. The section above on “Feature Types” describes how this threshold is used. In short, if the number of unique feature values exceeds this default setting, the feature will be specified as continuous, and categorical otherwise.

Input binary and categorical as continuous. When checked, this option will change the feature type from binary or categorical to continuous.

1.2.9 Statistical Analysis

Minimum sample size for grouped tests: n = 10. The default value for the minimum sample size for grouped tests is n = 10. Below a total sample size of 10, which would equate to n = 5 per group on average for a 2-sample test, the robustness and reliability of a hypothesis test become degraded as sample size decreases. Users can set this criterion to whatever value they choose.

Report reduced test results table. For any 2- or k-sample test of equality of means for continuous features, this option will cause NXG Explorer to output a reduced table of results based on the choices listed below. This option is particularly useful for presentations and/or publications, when not all of the results are used in published tables. The more detailed tables generated when this option is not specified are more amenable to research and exploration, when more detail is needed.

  1. Report only parametric test results. When the reduced test result table option is specified, only parametric test results will be listed for continuous features.
  2. Report only non-parametric test results. When the reduced test result table option is specified, only non-parametric test results will be listed for continuous features.

Regression: Minimum n-to-p ratio. This option determines the minimum ratio of sample size (n) to the number of model features (p). If the ratio of the sample size to model-based features is less than the specified ratio, regression runs will be disabled. An n:p ratio of 10 is recommended, which follows the typical recommendation of having 10 records for every regression parameter used.

Model building strategy screen p-value. During regression model building, a feature whose univariate regression p-value is less than the specified value will be included in the subsequent multi-feature regression model.

Model building strategy table p-value. Regression covariates whose p-values are less than the specified value will be listed in the multi-feature regression output.

Iterations for R × C categorical tables - randomization tests (Fisher’s Exact). This value determines the number of iterations used during an empirical p-value test for chi-squared contingency table analysis, in order to mimic the Fisher exact test. A default value of 100,000 is recommended; however, values of e.g. 10,000 should still yield reliable results, based on Monte Carlo statistics.
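To illustrate what such a randomization test does, the following hedged Python sketch (using scipy; not NXG Explorer's own implementation, and the function name is an assumption) builds an empirical p-value for an R × C table by repeatedly shuffling one variable's labels while preserving the table margins:

import numpy as np
from scipy.stats import chi2_contingency

def empirical_chi2_pvalue(table, iterations=100_000, seed=None):
    """Monte Carlo (randomization) p-value for an R x C contingency table."""
    rng = np.random.default_rng(seed)
    table = np.asarray(table)
    observed = chi2_contingency(table, correction=False)[0]
    # Expand the table into one row label and one column label per subject.
    rows = np.repeat(np.arange(table.shape[0]), table.sum(axis=1))
    cols = np.repeat(np.arange(table.shape[1]), table.sum(axis=0))
    exceed = 0
    for _ in range(iterations):
        sim = np.zeros_like(table)
        np.add.at(sim, (rows, rng.permutation(cols)), 1)   # random table, same margins
        if chi2_contingency(sim, correction=False)[0] >= observed:
            exceed += 1
    return (exceed + 1) / (iterations + 1)

print(empirical_chi2_pvalue([[12, 5], [3, 10]], iterations=10_000, seed=1))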

1.2.10 Script editor

Change font. This default setting changes the font typeface for the label and transform editor.

Font size. This default setting changes the font size for the label and transform editor.

1.2.11 String Manipulation

Create unique feature names on input. If there are long names among the input feature names, using this option can be beneficial, since many long feature names can be drastically shortened. By default, this option is turned off.

Input feature names as lower case. This option allows users to reduce any upper case characters among input feature names to lower case. Many software packages now work with lower case feature names as an attempt to get away from use of case-sensitive feature names.

Drop special characters from input feature names. When this option is selected, special characters such as #, $, %, etc. will be filtered out of input feature names.

1.2.12 Class Discovery Runs

Dimension reduction only. Class discovery runs include non-linear manifold learning (dimension reduction) techniques, which have varying output. When this option is specified, only the reduced 2D feature values are written to an Excel file, along with a 2D scatter plot of the scores.

Full output. This option will result in the full output from the class discovery runs, which can include a variety of different charts and Excel files, depending on the algorithm used.

Use corr matrix for CMF output (Cov when unchecked). During PCA, component subtraction, and Monte Carlo analysis, the correlation matrix will be used when this option is checked, versus the covariance matrix when unchecked.

1.2.13 Class Prediction Runs

Transform features during CV. When checked, this option will derive feature transformations from all objects within the training folds, and apply the derived parameters to objects in the testing fold.

Mean-zero standardization. Specifies that feature value mean-zero standardization will be used for feature transformation. Feature mean and s.d. will be derived from objects in the training folds, and will be applied to objects in the testing fold.

Normalize into range [0,1]. Specifies that feature value normalization to the range [0,1] will be used for feature transformation. Feature min x and max x will be derived from objects in the training folds, and will be applied to objects in the testing fold.
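The idea behind these two options, deriving the transformation parameters from the training folds only and then applying them to the testing fold, can be sketched in Python as follows (illustrative only; the function name and arguments are assumptions, not part of NXG Explorer):

import numpy as np

def transform_train_test(x_train, x_test, method="zscore"):
    """Derive transform parameters from the training fold(s), apply them to the test fold."""
    x_train = np.asarray(x_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    if method == "zscore":                       # mean-zero standardization
        mu, sd = x_train.mean(), x_train.std(ddof=1)
        return (x_train - mu) / sd, (x_test - mu) / sd
    if method == "minmax":                       # normalization into the range [0,1]
        lo, hi = x_train.min(), x_train.max()
        return (x_train - lo) / (hi - lo), (x_test - lo) / (hi - lo)
    raise ValueError("unknown method")

tr, te = transform_train_test([5.0, 7.0, 9.0, 11.0], [6.0, 12.0], method="minmax")
print(tr, te)   # note: test-fold values may fall outside [0,1] by design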

1.2.14 Monte Carlo Uncertainty Analysis

Discrete features are zero-based. When checked, this option will establish a zero base for any simulated discrete features.

Add distribution parameters to feature names. When checked, this option will append feature names with an abbreviated name of the probability distribution used as well as the parameter values.

1.3 Default Table Fonts and Colors

This section applies to the format of the table with summary statistics (command: Analysis→ Summarize) and hypothesis test results (command: Analysis→ Independence→ Unpaired 2- and k-sample tests). To specify the default table fonts and colors, select Preferences→ Default table fonts and colors, as shown below:

PIC

and you will see the following popup window:

PIC

The following descriptions describe the various options in the popup window.

Font. The font used for the entire table.

Font size. The font size used for the entire table.

Header font color. Color of the header text.

Row font color. Color of the text in rows.

Table border width. Width of the border.

Table border color. Color of the border.

Header border width. Width of the header’s border.

Header border color. Color of the header’s border.

Cell border width. Width of cell border.

Cell border color. Color of the cell border.

Header background color. Background color of the header.

Odd row background color. Background color for odd rows. Note: row numbering starts with row 0 below the header row.

Even row background color. Background color for even rows.

Make header background same as odd row background. This option will set the background of the header to be the same as the background for odd rows.

Auto-color picker (base color). This color option allows the user to set the base color for the entire table.

1.4 Color and Graphics Defaults

To change color and graphics default values, select Preferences→ Color and graphic defaults, shown below:

PIC

and the following popup window will appear:

PIC

1.4.1 Graphics

Low resolution. This option results in low resolution graphics near 96 dpi.

High resolution. This option results in higher resolution graphics at 300 dpi.

1.4.2 Histograms

Histogram bar color. This default color option specifies the color choice for histogram bars.

Histogram spline color. The default color of the line when spline fitting is specified for histograms.

Histogram bins. The default value of the number of histogram bins.

Datagrid selection color. This default option specifies the color of either the alternating row colors in the datagrid, or selected rows after record filtering has been invoked.

1.4.3 Correlation, matrix plots, cluster images

Gradient 1st color. This gradient color defines the color starting at the top of the cluster image.

Gradient 2nd color. This gradient color defines the color at the bottom of the cluster image.

Line color. This color specifies the color of the dendrograms for objects and features.

Font color. This color specifies the color of all object and feature names, as well as header information regarding the distance metric used and scale for heat map colors. Note that if “Alternate object name colors” is checked, then those alternating colors override the default text font color.

Alternate object name colors. This option only applies when the true classes are input with the dataset. When checked, the color of object names within classes will vary in the order blue, green, red, and black. Modulo 4 is used for the object’s class numbers. If no classes are input with the data, then the default font color will be applied.

Missing values. This color specifies the heat map cell colors for missing values.

1.4.4 SOM Map Colors

This default setting specifies the default color gradient used for the U-matrix and feature-specific component plots, which are generated when self-organizing maps are used.

1.4.5 Cluster Heat Map Colors

This default setting specifies the default color combination used for heat maps in cluster images.

Some example cluster heat map results are shown as follows:

PIC

PIC

PIC

Default image (save) format. This default setting specifies the file format for all graphic output files. The choices are png, wmf, jpg, gif, tif, bmp, emf, exi, ico, and memorybmp.

1.4.6 Series colors

Windows color palettes. Numerous Windows-based color palettes are available for use with 2D, 3D, line, and X-Y scatter plots. If object classes are specified at input, then classes 1,2,...,k will be assigned the series colors selected as the default. Otherwise, in YYYY and XYYY line or scatter plots, the series-specific colors will be applied to the features or dimensions used.

Custom color palette. You can specify which default series colors are to be used by clicking on each series color, and then selecting the desired color. After clicking on Apply, the default custom colors will be applied, as long as the option above the custom colors is selected.

1.4.7 Class discovery score plot markers.

You can specify which default series colors are to be used for class-specific colors in 2D and 3D plots generated during unsupervised class discovery.

1.4.8 Multichart panel.

Default series colors are also available for object class in the multichart plot generated at the end of unsupervised class discovery.

1.4.9 Histogram, line, and multipanel graph fonts.

Default fonts can be selected for class discovery plots, as well as line and scatter plots (YYYY, XYYY), and multipanel plots.

1.4.10 Options for YYYY and XYYY plots.

Line only. This selection will result in lines between data points in YYYY and XYYY line (scatter) plots.

Points only. This selection will result in points being plotted in YYYY and XYYY line (scatter) plots.

Line and points. This selection will result in lines and points being plotted in YYYY and XYYY line (scatter) plots.

X- and Y-axis thickness. This selection specifies the X- and Y-axis line thickness in YYYY and XYYY line (scatter) plots.

1.5 Changing Feature Types

Changing categorical features to continuous. On occasion, you may need to change categorical features to continuous. As an example, suppose you have integer-based scores on a Likert scale for which the item response is 1-Strongly disagree, 2-Disagree, 3-Neutral, 4-Agree, 5-Strongly agree. By default, this particular feature will be read in as the categorical type; however, for the analysis you may want to use means and standard deviations, perhaps for t-tests, ANOVA, or regression. There are two ways to change categorical features to continuous:

  1. To change during input: select Preferences→ General default settings, and then set the maximum level used for determining categorical features during data input. For example, if there are many integer-based categorical features with a maximum value of 10, and you would like all of these features to be read in as continuous features, then simply change the value of the maximum level for determining categorical features to e.g. 3, as shown below.

    PIC

    Then click on Apply. You will need to reload the input data, and NXG Explorer will then code the affected features as continuous.

  2. To change after data input: select the “Feature scale preferences” menu command in the Preferences menu, shown as:

    PIC

    The following popup window will appear:

    PIC

    Simply click on continuous for the categorical features you want changed.

Changing text features to categorical. You cannot use the popup window above to change text features to categorical. Rather, this has to be done in one of two ways:

  1. To change during data input: select Preferences→ General default settings, and then click the check box under Data Transformations:

    PIC

    Then click on Apply. You will need to reload the input data, and NXG Explorer will then identify the number of unique categories based on the text values and prompt you to convert each text feature found in the data to categorical. (If you have a lot of text features, this method is not recommended; see the next option.)

  2. To change after data input: select the following command from the Transformations menu, as follows:

    PIC

    Then select the text feature you want converted to categorical and NXG Explorer will identify the number of unique values among the text values and convert the feature to categorical.

1.6 Copying HTML output

Right-clicking on HTML output. Whenever the Output tab is activated after clicking on a tree node that was generated during a run, a region of interest can be selected for copying and pasting by right-clicking anywhere in the HTML output and then dragging the rectangular “rubber band” around the region of interest. As soon as the mouse button is released, the selected ROI will be copied into memory and can be pasted into any software that accepts Windows objects (Word, PowerPoint, etc.).

Shift-Right-clicking on HTML output. If the Shift key is pressed and then the mouse pointer is right-clicked on an HTML output, the Windows printer options will appear.

1.7 Copying images

Right-clicking on any output image will raise a popup window, which will show the following choices:

If Region of interest is selected for copying, click the left mouse button and drag the rectangular “rubber band” around the region of interest. As soon as the left mouse button is released, the selected ROI will be copied to the Windows clipboard and can be pasted into any software that accepts Windows objects (Word, PowerPoint, etc.). If Copy is selected, the entire image will be copied to the Windows clipboard, which can then be pasted into other Windows applications that accept Windows objects. Otherwise, if Save is selected, several image format options will appear before saving the image.

1.8 File Command

1.8.1 Open Project

This option opens an already-saved Explorer project, for which the filename extension is .nxgp.

1.8.2 Save Project

This option saves the currently open Explorer project, for which the filename extension is .nxgp.

1.8.3 Save Project As

This option saves a dataset in memory to an Explorer project, for which the filename extension is .nxgp.

1.8.4 Import Data

Excel. This option imports either an Excel file (.xlsx or .xls), a comma-delimited file (.csv), or a tab-delimited text file (.txt).

Category codes. This option allows the import of a tab-delimited text file (.txt) to affix category labels to levels of the factors for which labels are provided. The command can be invoked (used) before or after data input. See the catcodes_Retinol.txt file for an example of category label formats for the three categorical factors in the Retinol dataset (sex, smoking status, and vitamin use).

Category code map. This option allows the import of a tab-delimited text file (.txt) with categorical feature names followed by a mapped term which is used in the dataset. For example, consider the following example categorical features and their counterpart mapped terms which are used in a dataset:




Factor           Feature name   Mapped name
Arm              arm            armcode
Sex              sex            sexcode
Smoking status   smokstat       smokstatcode



In the case of the Arm factor, the dataset contains the mapped term armcode, however, the user wants to use a feature name arm in all the analyses. Use of this command option will allow the user to translate a mapped term to any desired feature name.

Merge Excel. Use this option for merging multiple files together. Since 1-to-1 and 1-to-many merges cannot be specified, you should import the largest file (with a primary key) first, followed by smaller files (with primary keys). In this fashion, many-to-fewer merges can be applied successively to achieve the desired final merged file.

1.8.5 Export

Comma-delimited (.csv). Selecting this option will export data in memory to a comma-delimited .csv file, for which the name must be provided.

Tab-delimited text (.txt). Selecting this option will export data in memory to a tab-delimited text (.txt) file, for which the name must be provided.

Chapter 2
SUMMARIZE

This chapter covers NXG Explorer’s Summarize methods for summarizing the variables in a dataset. Construction of a typical dataset is discussed first, followed by measures of central tendency and dispersion for continuously-scaled variables, and then a brief discussion of categorical data analyses.

2.1 Data Arrangement


Table 2.1: Arrangement of records and variables for a set of continuously-scaled observations.

                          Variable
Record      1       2     ...      j     ...      p
   1       x11     x12    ...     x1j    ...     x1p
   2       x21     x22    ...     x2j    ...     x2p
   .        .       .              .              .
   i       xi1     xi2    ...     xij    ...     xip
   .        .       .              .              .
   n       xn1     xn2    ...     xnj    ...     xnp



A set of continuously-scaled data are often arranged into individual records and variables. A record represents an individual measurement or observation made from one unit of study, such as a patient, mouse, cell, gene, or protein. Variables, on the other hand, represent the collection of observations for all the records. When arranging the data in tabular form, records are arranged in rows, where the index i is used to denote each record, with range i = 1,2,...,n . Variables are arranged in columns where the index j is used to denote each variable, with range j = 1,2,...,p . Let us consider the random variable xij  for representing a single observation, where the subscript i denotes the i th record and subscript j denotes the j th variable. A collection of multiple xij  ’s for many observations can be arranged in tabular form according to the arrangement in Table 2.1. The symmetric arrangement of data in Table 2.1 can also be represented by use of a matrix. For n records and p variables, we introduce the matrix X in the form:

\underset{n \times p}{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}.
(2.1)

Each row in X represents a record, and each column represents a variable. A matrix is characterized, among other things, by the dimension. The dimension of X for our example is n × p , called “n by p ,” representing the n records and p variables. Each element denoted xij  , represents the observation for the i th record and the j th variable in the data set. Because the matrix notation xij  can be used to represent any record or variable in the data set, matrices occupy a very powerful area of statistics based on what is known as matrix algebra. As we shall see in the remaining chapters, matrix algebra is fundamental to many areas of statistics, including analysis of variance, maximum likelihood modeling, etc.

However, before we address more advanced topics, let’s stick with methods used for describing the distribution of a random variable like xij  . When considering a set of data, independent of study design or analysis, a fundamental question among all statisticians is: “How are the data distributed?” With regard to scale of measurement, are the data categorical and measured on the nominal scale, ordinal scale, or are the data continuous and measured on the ratio scale? In terms of probability distributions, when the data are arranged into a histogram, are they normally distributed with a bell-curve shape, distributed according to the binomial distribution, Poisson distribution, or multinomial distribution? A frequency histogram (distribution) of most data used in statistics can be characterized by a specific probability distribution that captures information related to the scale of measurement, observation response, and type of variable.

2.2 Normal Distribution

There are many kinds of continuous distributions. The one we will be most concerned with is called the normal distribution. Normal distributions are also known as Gaussian distributions, and consist of a sample of normal variates. A frequency histogram of normal variates is typically bell-shaped, with the bulk of the data lying in the middle of the curve and smaller, less frequent values residing in vanishing tails. The particular shape of a histogram is characterized by its central tendency and dispersion.

2.3 Central Tendency and Dispersion

For normal distributions, two parameters used to distinguish central tendency and dispersion are the mean, or average, and standard deviation. Let x1j,x2j,...,xnj  be a random sample of size n from the distribution of random variable Xj  . The mean of the n variates is denoted ¯xj  and is a measure of central tendency that reflects the average value of xij  :

\bar{x}_j = \frac{\sum_{i=1}^{n} x_{ij}}{n}.
(2.2)

The mean is also equal to the expected value of x , in the form

\mu_x = E[x] = \sum_i p_i x_i = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_i x_i}{n}.
(2.3)

Terminology used for the mean is μ for a population and x̄ for a sample. Typically x̄ is used, since one usually never knows the true population mean, but rather estimates it with the sample mean obtained through some kind of sampling strategy. The sample variance of variable j is a measure of dispersion reflecting the overall width of the distribution, and takes the form

s_j^2 = \frac{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}{n - 1}.
(2.4)

and the standard deviation of variable j is simply the square root of the variance, functionally composed as

s_j = \sqrt{\frac{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}{n - 1}}.
(2.5)

The variance can also be determined by using the expected value of x^2. The expected value of the random variable x^2 is

E[x^2] = \sum_i p_i x_i^2,
(2.6)

and the variance of x becomes

\sigma_x^2 = \mathrm{Var}(x) = E[x^2] - E[x]^2
           = \frac{\sum_i (x_i - \mu_x)^2}{n}
           = \sum_i p_i x_i^2 - \Big(\sum_i p_i x_i\Big)^2.
(2.7)

Note that there was a shift in the notation used for several of the parameters introduced above. Various forms of notation are used interchangeably. These are described as follows.

  Sample notation:

              x̄     = Sample mean
              s_x   = Sample standard deviation
              s_x²  = Sample variance
              s_xy  = Sample covariance between x and y
              r_xy  = Sample correlation between x and y

  Population notation:

              μ_x   = Population mean
              σ_x   = Population standard deviation
              σ_x²  = Population variance
              σ_xy  = Population covariance between x and y
              ρ_xy  = Population correlation between x and y
(2.8)

There are several other important summary statistics for the normal distribution, which are the range, median, mode, skewness, and kurtosis. The range can be simply determined by use of the relationship

\mathrm{range}_j = \max(x_{ij}) - \min(x_{ij}).
(2.9)

The range of a distribution is informative about how far apart the most extreme data points are, which is useful when comparing variables with one another. For example, if max(x_{i1}) and min(x_{i1}) for variable j = 1 are 2.53 and 0.4, the range is 2.13, whereas if for variable j = 2 the values of max(x_{i2}) and min(x_{i2}) are 79.6 and 21.2, the range is 58.4. A large difference in range is also known as a large change in scale, affecting the mean difference and covariance between two variables. Another statistic of interest is the median, which reflects the value of a variable at the midpoint of the distribution. The median can easily be found by sorting the data in a distribution in ascending order and identifying the midpoint. Leaving the variable subscript j aside, let x_1, x_2, ..., x_n be a set of normal variates and x_{(1)} ≤ x_{(2)} ≤ ⋅⋅⋅ ≤ x_{(n)} be their rank-ordered counterpart. The median is equal to the rank-ordered variate x_{(n/2)}. Definition of the median introduces a new concept called the p-th quantile. A quantile is any value among the rank-ordered variates x_{(1)} ≤ x_{(2)} ≤ ⋅⋅⋅ ≤ x_{(n)}. Consider 100 rank-ordered normal variates, such that n = 100. The lower quartile is the 25th percentile, which is equal to x_{(25)}. As mentioned previously, the median is the 50th percentile and for this example would be equal to x_{(50)}. The upper quartile is the 75th percentile and is equal to x_{(75)}.

In cases where the most frequently occurring value is sought, the mode is determined. The mode occurs at the peak of the distribution, since the x-value at this location is the most frequent.

We now introduce the r-th moment about the mean of a distribution. The r-th moment about the mean is defined as

\kappa_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r.
(2.10)

The coefficient of skewness is

\gamma_1 = \frac{\kappa_3}{(\kappa_2)^{3/2}},
(2.11)

and the coefficient of kurtosis is

\gamma_2 = \frac{\kappa_4}{(\kappa_2)^2} - 3.
(2.12)
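As a small illustration of Eqs. (2.2)-(2.12), the following Python sketch (illustrative only, not NXG Explorer code) computes the mean, standard deviation, median, range, skewness, and kurtosis of a sample:

import numpy as np

def summarize(x):
    """Summary measures following Eqs. (2.2)-(2.12)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mean = x.sum() / n                                           # Eq. (2.2)
    var = ((x - mean) ** 2).sum() / (n - 1)                      # Eq. (2.4)
    k2, k3, k4 = [((x - mean) ** r).mean() for r in (2, 3, 4)]   # Eq. (2.10)
    return {
        "mean": mean,
        "sd": np.sqrt(var),                                      # Eq. (2.5)
        "median": np.median(x),
        "range": x.max() - x.min(),                              # Eq. (2.9)
        "skewness": k3 / k2 ** 1.5,                              # Eq. (2.11)
        "kurtosis": k4 / k2 ** 2 - 3,                            # Eq. (2.12)
    }

print(summarize([5, 7, 9, 10, 11, 17, 23, 27, 31, 62]))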

Shapiro-Wilk Normality Test. NXG Logic Explorer also performs a normality test on each continuously-scaled feature to determine if the feature is normally distributed. The Shapiro-Wilk test [1] first sorts the x-values for each feature in ascending order: x_{(1)} < x_{(2)} < \cdots < x_{(n)}. Next, a standard normal variate is generated for each x_{(i)}, based on m_i = \Phi^{-1}[(i - 0.375)/(n + 0.25)], and these are then squared and summed to give

m = \sum_{i=1}^{n} m_i^2.
(2.13)

Next, calculate u = 1/\sqrt{n}, and determine a series of a_i values using the relationships:

a_n = -2.706056u^5 + 4.434685u^4 - 2.07119u^3 - 0.147981u^2 + 0.221157u + m_n m^{-0.5}
(2.14)

a_{n-1} = -3.582633u^5 + 5.682633u^4 - 1.752461u^3 - 0.293762u^2 + 0.042981u + m_{n-1} m^{-0.5}
(2.15)

and finally, for i = 3, 4, \ldots, n-2, we use

a_i = m_i/\sqrt{\epsilon},
(2.16)

where

\epsilon = \frac{m - 2m_n^2 - 2m_{n-1}^2}{1 - 2a_n^2 - 2a_{n-1}^2}.
(2.17)

The values at the lower end of the scale are a_1 = -a_n and a_2 = -a_{n-1}. The W statistic is calculated as

W = \frac{\big(\sum_{i=1}^{n} a_i x_i\big)^2}{\sum_i (x_i - \bar{x})^2}.
(2.18)

The Shapiro-Wilk test statistic is standard normal distributed with

Z = \frac{\log(1 - W) - \mu}{\sigma},
(2.19)

where

\mu = 0.0038915\log(n)^3 - 0.083751\log(n)^2 - 0.31082\log(n) - 1.5861,
(2.20)

and

\sigma = \exp\left[0.0030302\log(n)^2 - 0.082676\log(n) - 0.4803\right].
(2.21)

Values of |Z| > 1.96  indicate departure of the data being tested from the normal distribution. Hence, if the p-value of the Shapiro-Wilk test is less than 0.05, the test result indicates that the feature being tested is not normally-distributed.
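A rough Python sketch of the W and Z calculations, following the steps above literally, is shown below for illustration only; for real analyses a vetted routine (e.g. scipy.stats.shapiro, or NXG Explorer itself) should be used, since this Royston-style approximation omits small-sample refinements:

import numpy as np
from scipy.stats import norm

def shapiro_wilk_z(x):
    """Sketch of the Shapiro-Wilk W and Z statistics as described in Eqs. (2.13)-(2.21)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    m_i = norm.ppf((i - 0.375) / (n + 0.25))       # expected normal order scores
    m = np.sum(m_i ** 2)                           # Eq. (2.13)
    u = 1.0 / np.sqrt(n)
    a = np.zeros(n)
    a[-1] = (-2.706056*u**5 + 4.434685*u**4 - 2.07119*u**3
             - 0.147981*u**2 + 0.221157*u + m_i[-1]*m**-0.5)                 # Eq. (2.14)
    a[-2] = (-3.582633*u**5 + 5.682633*u**4 - 1.752461*u**3
             - 0.293762*u**2 + 0.042981*u + m_i[-2]*m**-0.5)                 # Eq. (2.15)
    eps = (m - 2*m_i[-1]**2 - 2*m_i[-2]**2) / (1 - 2*a[-1]**2 - 2*a[-2]**2)  # Eq. (2.17)
    a[2:-2] = m_i[2:-2] / np.sqrt(eps)             # Eq. (2.16), i = 3,...,n-2
    a[0], a[1] = -a[-1], -a[-2]                    # lower-tail weights
    W = np.sum(a * x) ** 2 / np.sum((x - x.mean()) ** 2)                     # Eq. (2.18)
    ln_n = np.log(n)
    mu = 0.0038915*ln_n**3 - 0.083751*ln_n**2 - 0.31082*ln_n - 1.5861        # Eq. (2.20)
    sigma = np.exp(0.0030302*ln_n**2 - 0.082676*ln_n - 0.4803)               # Eq. (2.21)
    return W, (np.log(1 - W) - mu) / sigma         # Eq. (2.19)

W, Z = shapiro_wilk_z(np.random.default_rng(1).normal(size=50))
print(round(W, 4), round(Z, 2))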

Example 1 - Summary Statistics. This example involves importing a Microsoft Excel file into NXG Explorer and then generating summary statistics for the variables in the file. Open the Retinol dataset, which includes n = 314 records for research subjects who participated in a nutritional investigation of the antioxidants retinol and beta-carotene.

First, import the retinol.xlsx file in the Data directory where NXG Explorer is installed. This is done by selecting File → Import → Excel, and then selecting the Retinol.xlsx file.

PIC

Another popup window will appear which allows the user to specify the input format of the Excel workbook being imported. Accept the default format, then click Apply:

PIC

After the Retinol.xlsx filename is selected and double-clicked, the data tab will show the data that were input (see image below).

PIC

Next, select the Summarize command in the Analysis pull-down menu:

PIC

You will now see a list of variables,

PIC

Click on Select all:

PIC

You will then notice check marks placed in all of the checkboxes,

PIC

Next, click on the Apply button, and the summary statistics will be generated, revealing a tree-node in the treeview on the left:

PIC

Click on the Summary statistics treenode, and the results will be shown in the Output tab:

PIC

Click on the Category frequencies icon, and a listing of categorical summary statistics will be shown for the categorical features:

PIC

Next, click on the Histogram icon, and you will notice a histogram plot that was generated for each feature:

PIC

To manually generate histograms for all the continuous variables, click on the Histogram command of the Graphics pull-down menu:

PIC

Next, uncheck the categorical variables representing sex, smoking status (smokstat), and vitamin use (vituse):

PIC

Click on Apply in the next popup window:

PIC

Select the betadiet variable (treenode) from the treeview:

PIC

The following plot shows the histogram for the betadiet variable, revealing a log-normally distributed variable with right-tail skewness.

PIC

Bibliography

[1]   Royston, P. An extension of Shapiro and Wilk's W test for normality to large samples. Applied Statistics 31: 115–124; 1982.

Chapter 3
LABELS, TRANSFORMS, FILTERING, AND GROUPING

3.1 Editor

NXG Explorer provides a label and mathematical transform editor which can be employed to assign category labels to the various levels of a nominal or ordinal categorical feature. The editor can also be used for creating new features based on mathematical transforms of existing features, or dummy indicator (binary) features with values (0,1).

The editor is invoked by selecting the Labels and transforms command of the Preferences pull-down menu.

PIC

The image below shows the editor after it opens.

PIC

Next, navigate to the C:\Program Files\NXG Logic\NXG Logic Explorer\Scripts folder and open the script file called retinol.nxgs, as shown by

PIC

In the image shown below, after opening the script file, you will notice any existing script within the file.

PIC

The scripts are divided into two sections that begin with either labels or trans, followed by the relevant syntax used in the specific section.

PIC

3.1.1 Labels for levels of categorical factors

Recall, the categorical features in the Retinol dataset are as follows:



Variable   Explanation (Unit)
sex        Sex (1=Male, 2=Female)
smokstat   Smoking status (1=Never, 2=Former, 3=Current Smoker)
vituse     Vitamin use (1=Yes, 2=Sometimes, 3=No)


Labels for various levels of categorical factors (features) can be assigned quite simply by entering the level of the factor and listing the name of the category (level) immediately after, as shown below:



Number of categories   Command
2                      catvarname^1 cat1label^2 cat2label
3                      catvarname^1 cat1label^2 cat2label^3 cat3label
4                      catvarname^1 cat1label^2 cat2label^3 cat3label^4 cat4label
5                      catvarname^1 cat1label^2 cat2label^3 cat3label^4 cat4label^5 cat5label
k                      catvarname^1 cat1label^2 ... ^k catklabel


An example for the categorical factors in the Retinol dataset for smoking status, vitamin use, and gender, would be:

labels sex^1 Male^2 Female;  
       smokstat^1 Never^2 Former^3 Current;  
       vituse^1 Yes^2 Sometimes^3 No@

To invoke (implement) the labels listed above, open the Retinol dataset, open the script file, and then select Run.

The labeling results can be observed in the following image which reveals the category labels in an ANOVA test of several continuously-scaled and categorical variables in the Retinol dataset on the 3-category feature vituse.

PIC

Warning! You must ensure that each set of labels assigned to each categorical feature (factor) ends with a semicolon “;”. In addition, the group of label commands for several categorical features must end with an “@” symbol.

3.1.2 Mathematical transforms

Examples of mathematical transformations available in NXG Explorer are listed in the following table:





Type          Command   Example                      Result
Binary        ==        male=sex==1;                 males=1, females=0
              ==        female=sex==2;               females=1, males=0
              ==        smok1=smokstat==1;           smokstat(1)=1, other=0
              ==        smok2=smokstat==2;           smokstat(2)=1, other=0
              ==        smok3=smokstat==3;           smokstat(3)=1, other=0
              >         agegt70=age>70;              age>70=1, age<=70=0
              >=        agege70=age>=70;             age>=70=1, age<70=0
              <         agelt70=age<70;              age<70=1, age>=70=0
              <=        agele70=age<=70;             age<=70=1, age>70=0
Math          +         ageplus5=age+5;              age+5
              -         ageminus5=age-5;             age-5
              *         agetimes5=age*5;             age × 5
              /         ageover5=age/5;              age/5
              ^2        agesq=age^2;                 age × age
              sqrt      agesqrt=sqrt(age);           √age
              log_e     logage=log(age);             log_e(age)
              log10     log10age=log10(age);         log10(age)
              exp       expage=exp(age);             exp(age)
Combination   *         femagegt70=female*agegt70;   females w/age>70=1, other=0
              *         malesmok3=male*smok3;        males w/smokstat(3)=1, other=0




Examples of new features are:

tran one=1 ; zero=0 ;  
     agegt30=age>30;  
     agege40=age>=40;  
     agege50=age>=50;  
     agelt50=age<50;  
     agegt50=age>50;  
     lage40 = log(age/40) ; lage40sq = lage40^2 ;  
     lage70 = log(age/70) ; lage70sq = lage70^2 ;  
     agegt70=age>70;  
     agege70=age>=70;  
 
     sex1 = sex==1;  
     sex2 = sex==2;  
     sex1lage70 = sex1 * lage70;  
     sex2lage70 = sex2 * lage70;  
     quettimes5 = quetelet * 5@

Results of several of the mathematical feature transformations are listed in the figure below.

PIC

Warning! You must ensure that each set of mathematical transformations ends with a semicolon “;”. In addition, the group of transformation commands for several features must end with an “@” symbol.

3.2 Importing Category Labels from Text File

Another method for inputting and affixing category labels is to import them from a text file directly, instead of using the Editor.

Example 1 - Assign category labels to the Retinol dataset. First, import the retinol.xlsx file in the Data folder where Explorer is installed. Next, import the category-label text file distributed with Explorer, using the pull-down menu command File→ Import Data→ Category codes (labels), and then select the catcodes_Retinol.txt file. Once the file is imported, the category labels will be assigned to the levels of the categorical features (sex, smoking status, and vitamin intake). You will now be able to see the categorical labels in most of the analyses which use categorical features.

3.3 Filtering records

Consider the case where only the female records in a dataset are to be selected and used for an analysis. This is done by filtering on a gender-related feature. Let’s use the Retinol dataset, for which the gender feature is called sex.

To get started, open the Filter form by selecting Filter on the Analysis pull-down menu:

PIC

Next, specify the sex categorical feature and level 2, then click on Apply.

PIC

The filtered records will then be highlighted in the datagrid, and also will be selected in memory for future use.

PIC

Summary statistics for female records in the Retinol data set are shown below:

PIC

3.4 Grouping

NXG Explorer can perform grouped analysis of most of the analytic modules. Let’s look at an example involving summary statistics of the Retinol dataset, grouped on smoking status. For this example, we will also label the smoking categories using methods outlined earlier in this chapter.

Example 2 - Summary statistics of Retinol dataset, grouped by smoking status.

1. To begin, import the Retinol dataset.

2. Apply the labels provided in the retinol.nxgs file (see earlier labeling example).

3. Select Analysis→ Grouped Analysis, as shown in the image below:

PIC

4. Next, select the smokstat feature, shown as:

PIC

5. Click on Apply, and you will notice that a red background is now shown in the warning textbox to the upper right of the user screen, shown as:

PIC

6. In the next popup window that appears, select the features age, quetelet, calories, fat, and fiber.

7. When the run is complete, you will notice three red icons representing the three smoking status groups, with the accompanying group names and categorical codes listed in parentheses next to each red icon:

PIC

Click on the Summary statistics icon beneath the red icon with label “Summarize(smokstat=3),” and the summary statistics table for current smokers will be shown as follows:

PIC

3.5 Limitations of Filtering and Grouped Analysis

When using Explorer, filtering and grouping do not work together. Filtering records should only be followed by non-grouped analyses, such as summary statistics, correlation, or regression, after the specified records have been selected. Grouped analysis, on the other hand, can be used for 2- and k-group hypothesis testing, summary statistics, regression, and Kaplan-Meier grouped analysis.

In summary: use the following sequence:

Filtering (gender=males): can run summary statistics, correlation, regression, but not independence tests (T-tests, MW, ANOVA, KW) or KM grouped survival analysis.

Grouped analysis (gender): can run summary statistics, independence (T-tests, MW, ANOVA, KW), regression, correlation, and KM survival analysis.

For example, if you want to run ANOVA with only male data, then perform a grouped analysis using gender as the grouping variable, and then run ANOVA using e.g. treatment arm as the categorical factor, and age, BMI, and protein expression as continuous features.

Warning! Filtering or grouping may result in program errors or sample sizes of “NaN” (not a number) in table outputs, depending on the analysis employed. For example, if a treatment arm in a clinical trial does not have any male subjects, grouped analysis is employed for treatment arms, and 2- or k-group T-tests (ANOVA) are performed on gender as a factor feature, the arm with no males will report a sample size of NaN.

Chapter 4
PRESET TRANSFORMATIONS

The range and scale of a feature can have a large impact on the results of an analysis, especially if the range and scale vary widely among the input features. NXG Explorer can transform features (variables) using various preset methods to remove skewness or to change the scale and range.

Once a dataset is opened, the preset transformations can be accessed from the Transformations command of the Analysis pull-down menu, as shown below.

PIC

The following sections describe the various preset transformations provided in NXG Explorer.

4.1 Normalization

Normalization re-scales the values of a continuous feature to be in the range [0,1] using the relationship

x_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}},
(4.1)

where x_i is the i-th value of a feature, x_min is the minimum value of the feature, and x_max is the maximum.

Note: During normalization, NXG Explorer adds “n_” to the leading edge of the original variable’s name.

4.2 Mean-Zero Standardization

Normal distributions can vary by central location and dispersion based on the mean and variance. A frequency histogram for a group of patients having a mean weight of 130 pounds and standard deviation 12, will be centrally located 70 units (pounds) to the left of a histogram for patients with mean weight 200 pounds and standard deviation 12, when plotted on the same graph. Distributions with varying means and standard deviations are called scale and range dependent, since their location and spread on the same histogram will be different. A powerful transformation which can be applied to any normal distribution to remove the effects of scale and range is called the Z-transformation.

Let ¯x and σx  be the mean and standard deviation of a normal distribution of sample size n =1,000. The Z -score for each of the n normal variates is given as

Z_i = \frac{x_i - \bar{x}}{\sigma_x},
(4.2)

where xi  represents a normal variate, and Zi  is a newly derived standard normal variate. The process of subtracting the mean from each normal variate and dividing by the standard deviation is called standardization, which results in a standard normal distribution. A common property of all standard normal distributions is that they all have a mean equal to zero and a standard deviation of one, with variance equal to one (12 = 1  ). In addition, the Z-score equates to the number of standard deviations each normal variate lies away from the mean. Large values of Zi  above 2 are said to be “2-sigma” away from the mean, and are commonly considered as extreme values. For a tail probability of α = 0.05  , the value of Z in the negative tail is -1.65, whereas in the positive tail the value of Z =1.65. When the tail probability is divided by 2, that is α∕2  , the values of Z in the negative and positive tails become -1.96 and 1.96, respectively. A probabilistic definition of tail probabilities is that they represent the probability that a given value of Z is greater than or equal to reference value of Zα  . For example, P(Z ≤ − 1.96) = 0.025  , P (Z ≤ − 1.65) = 0.05  . Analogously, P(Z ≥ 1.65) = 0.05  , P (Z ≥ 1.96) = 0.025  .
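As a small illustration (Python, not part of NXG Explorer; the sample values are made up), the following sketch standardizes a sample and reproduces the tail probabilities quoted above:

import numpy as np
from scipy.stats import norm

# Mean-zero standardization of a small, made-up sample of weights (pounds).
x = np.array([120.0, 126.0, 130.0, 134.0, 140.0])
z = (x - x.mean()) / x.std(ddof=1)
print(np.round(z, 2))                    # standardized values (mean 0, sd 1)
print(round(norm.cdf(-1.96), 3))         # P(Z <= -1.96) = 0.025
print(round(1 - norm.cdf(1.65), 3))      # P(Z >= 1.65) ~ 0.05 (prints 0.049)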

Standard normal distributions also form the basis of many areas in statistics.

It is important to note that Z-transformations do not remove skewness from an input distribution. If the feature(s) being transformed is skewed, then the resulting Z-scores will also contain the same degree of skewness. To remove skewness, the log-transform is typically used for log-normally-distributed data, or the van der Waerden transform is used, which always removes skewness.

Warning! Z-transformations do not remove skewness from an input distribution. If the feature(s) being transformed is skewed, the resulting Z-scores will also result in the same degree of skewness.

4.3 Logarithmic

The logarithmic transform is one of the better transforms for rescaling a skewed dataset with a right tail so that it becomes more normally-distributed. As an example, in the Retinol dataset, the distribution of RETDIET is as follows:

PIC

whereas the log-transform (base e) of the RETDIET is illustrated as follows:

PIC

and it can be seen that the transformed version of RETDIET is more normally-distributed, since the large outlier values in the right tail of the untransformed RETDIET are essentially removed after taking the log_e transform.

Note: For base e , NXG Explorer adds “loge_” to the leading edge of the original variable’s name. For base 10  , “log10_” is added to the leading edge of the original variable’s name. For use of a custom base, for example, 2, the custom base is added to the leading edge in the form “log2_”.

4.4 Quantile

The quantile transform will first determine the percentile value for each observation based on each value’s rank divided by (n + 1)  , and then generate a new variable whose codes represent the various quantile category the percentile falls into. For example, if 4 quartiles (default) are specified, then any original percentile values less than or equal to 25% will be assigned a 1, percentiles greater than 25% and less than or equal to 50% will be coded as a 2, greater than 50% and less than or equal to 75% will be coded as a 3, and the remaining percentiles greater than 75% will be coded as a 4.

Note: NXG Explorer adds “q_” to the leading edge of the original variable’s name.
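A minimal Python sketch of this quantile coding (illustrative only; ties are not handled and the function name is an assumption) is:

import numpy as np

def quantile_codes(x, n_groups=4):
    """Percentile = rank/(n+1); code 1..n_groups by quantile interval."""
    x = np.asarray(x, dtype=float)
    ranks = x.argsort().argsort() + 1              # 1-based ranks
    pct = ranks / (x.size + 1)
    cuts = np.linspace(0, 1, n_groups + 1)[1:-1]   # e.g. 0.25, 0.50, 0.75 for quartiles
    return 1 + np.searchsorted(cuts, pct, side="left")

x = np.array([5, 7, 9, 10, 11, 17, 23, 27, 31, 62])
print(quantile_codes(x))   # quartile codes 1-4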

4.5 Rank

The rank transform will arrange the values for a variable in ascending order (smallest first) and then assign the record number corresponding to the sort position. For example, if there are 100 values for a variable, then the ranks for these original variable values will take on ranks 1,2,...,100  . A common approach for displaying values of x that are sorted in ascending order is to use the notation x(1),x(2),...,x(n)  , where the subscript () denotes ranking. Another common notation to reflect ranked variate values is x1 < x2 < ...< xn  .

The table below lists the ranks for 10 x-values sorted in ascending order.



x Rank(x )


5 1
7 2
9 3
10 4
11 5
17 6
23 7
27 8
31 9
62 10


Note: NXG Explorer adds “r_” to the leading edge of the original variable’s name.

4.6 Percentile

The percentile value for each observation is equal to the value’s rank divided by (n+1). Percentiles reveal the location of each value on a scale of 0 to 1. Therefore, the p5 is the 5th percentile, p10 is the 10th percentile, p25 is the 25th percentile, p50 is the median or 50th percentile, p75 the 75th percentile, p90 the 90th percentile, and p95 the 95th percentile. The table below lists the percentiles for 10 x-values sorted in ascending order.




x Rank(x ) pct



5 1 0.091
7 2 0.182
9 3 0.273
10 4 0.364
11 5 0.455
17 6 0.545
23 7 0.636
27 8 0.727
31 9 0.818
62 10 0.909



Note: NXG Explorer adds “pct_” to the leading edge of the original variable’s name.

4.7 van der Waerden Scores

van der Waerden (vdw) scores transform a variable’s values into skew-zero standard normal variates for which the mean is zero and standard deviation is unity (one). Recall that the normal cdf function Φ (Z )  converts an input Z-score into a tail probability using the normal cdf, whereas the inverse cdf function maps the probability back to the Z-score, as listed in the following table:




Function Notation Conversion



CDF Φ(Z)  Z-score to tail probability (P )
Inverse CDF Φ−1(P)  Tail probability (P ) to Z-score



The steps involved for calculating the vdw score for each value of a variable are:

  1. Sort values in ascending order.
  2. Assign ranks corresponding to the sort order, starting with 1 for the smallest value and n for the largest value.
  3. Divide each rank by n + 1 to obtain the value's percentile, that is, pct_i = R(x_i)/(n + 1).
  4. Input each pct value into the inverse CDF function, Φ−1(pct), to obtain the Z-score for the given CDF probability.

The table below lists the VDW scores for 10 x-values sorted in ascending order.





x Rank(x ) pct Φ−1(pct)




5 1 0.091 -1.335
7 2 0.182 -0.908
9 3 0.273 -0.605
10 4 0.364 -0.349
11 5 0.455 -0.114
17 6 0.545 0.114
23 7 0.636 0.349
27 8 0.727 0.605
31 9 0.818 0.908
62 10 0.909 1.335




The VDW transform is a skew-zero transform, since it always removes skewness that may exist in the input distribution.
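The four steps above can be sketched in a few lines of Python (NumPy and SciPy assumed; the function name vdw_scores is illustrative, and ties are not handled):

    import numpy as np
    from scipy.stats import norm

    def vdw_scores(x):
        # rank, convert to percentiles with rank/(n+1), then apply the inverse normal CDF
        x = np.asarray(x, dtype=float)
        n = len(x)
        ranks = np.argsort(np.argsort(x)) + 1     # ranks 1..n (no tie handling)
        pct = ranks / (n + 1.0)                   # percentiles strictly inside (0, 1)
        return norm.ppf(pct)                      # Phi^{-1}(pct)

    x = [5, 7, 9, 10, 11, 17, 23, 27, 31, 62]
    print(np.round(vdw_scores(x), 3))             # matches the table above, e.g. -1.335 for x = 5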

Note: NXG Explorer adds “vdw_” to the leading edge of the original variable’s name.

4.8 Nominal-to-Binary

The nominal-to-binary transform creates a new binary yes/no variable for each level of a nominally-scaled categorical variable. For example, if there are 3 levels of a category, 3 new variables will be generated, and for each record, only one of the three new variables will be set to 1 (0 otherwise), corresponding to the category of the particular record. The table below lists the nominal-to-binary coding results for a fictitious 3-level categorical variable called “catvar.”





catvar   b_catvar_1   b_catvar_2   b_catvar_3




3 0 0 1
1 1 0 0
2 0 1 0
2 0 1 0
1 1 0 0
1 1 0 0
2 0 1 0
3 0 0 1
1 1 0 0
2 0 1 0




Note: NXG Explorer adds a “b_” to the leading edge of the categorical variable’s name, and adds a trailing 1,2,...,k to the end of the original variable name to denote which category the binary variable represents.
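The coding in the table above amounts to one-hot encoding, which can be sketched as follows (Python/NumPy assumed; the function name nominal_to_binary is illustrative):

    import numpy as np

    def nominal_to_binary(catvar, prefix="b_catvar_"):
        # one 0/1 column per observed level of the categorical variable
        catvar = np.asarray(catvar)
        levels = np.unique(catvar)                      # e.g. [1, 2, 3]
        return {f"{prefix}{k}": (catvar == level).astype(int)
                for k, level in enumerate(levels, start=1)}

    for name, column in nominal_to_binary([3, 1, 2, 2, 1, 1, 2, 3, 1, 2]).items():
        print(name, column)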

4.9 Fuzzification

Fuzzy logic provides a mixture of methods for flexible information processing of ambiguous data [1, 2, 3]. Fuzzy transformations are used to map the original values of an input feature into 3 fuzzy sets representing linguistic membership functions, in order to facilitate the semantic interpretation of each fuzzy set. The fuzzy sets low, medium, and high exploit uncertainty among the original feature values, reducing the information in order to obtain a robust, less-expensive, and tractable solution. To fuzzify a continuous feature, first determine $x_{\min}$ and $x_{\max}$ as the minimum and maximum values of $x_{ij}$ for feature j over all input samples, and $q_1$ and $q_2$ as the quantile values of $x_{ij}$ at the 33rd and 66th percentiles. Then, calculate the averages $Avg_1 = (x_{\min} + q_1)/2$, $Avg_2 = (q_1 + q_2)/2$, and $Avg_3 = (q_2 + x_{\max})/2$. Next, translate each value of $x_{ij}$ for feature j into 3 fuzzy membership values in the range [0,1] as $\mu_{low,i,j}$, $\mu_{mid,i,j}$, and $\mu_{high,i,j}$ using the relationships

$$\mu_{low,i,j} = \begin{cases} 1 & x < Avg_1 \\ \dfrac{q_2 - x}{q_2 - Avg_1} & Avg_1 \le x < q_2 \\ 0 & x \ge q_2 \end{cases}$$
(4.3)

$$\mu_{med,i,j} = \begin{cases} 0 & x < q_1 \\ \dfrac{x - q_1}{Avg_2 - q_1} & q_1 \le x < Avg_2 \\ \dfrac{q_2 - x}{q_2 - Avg_2} & Avg_2 \le x < q_2 \\ 0 & x \ge q_2 \end{cases}$$
(4.4)

$$\mu_{high,i,j} = \begin{cases} 0 & x < q_1 \\ \dfrac{x - q_1}{Avg_3 - q_1} & q_1 \le x < Avg_3 \\ 1 & x \ge Avg_3 \end{cases}$$
(4.5)

The above computations result in 3 fuzzy sets (vectors) $\mu_{low,j}$, $\mu_{med,j}$, and $\mu_{high,j}$ of length n, which replace the original input feature. Figure 4.1 illustrates the values of the membership functions as a function of $x_{ij}$. When features are fuzzified using the methods described above and then input into class prediction analysis, the classification analysis is called “fuzzy,” whereas without feature fuzzification, the classification analysis is called “crisp.”


Figure 4.1: The 3 fuzzy membership functions μlow,j  , μmed,j  and μhigh,j  , which were used to replace expression values of feature (gene) j during fuzzy classification.

Note: NXG Explorer adds an “f_” to the leading edge of the original variable’s name, and adds a trailing _1, _2, or _3 to the end of the original variable name to denote the low, medium, and high fuzzy sets.
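A minimal sketch of the fuzzification in Eqs. (4.3)-(4.5) is shown below (Python/NumPy assumed; the function name fuzzify is illustrative, and np.percentile is used for the 33rd and 66th percentiles, which may differ slightly from NXG Explorer's quantile estimator):

    import numpy as np

    def fuzzify(x):
        # returns (mu_low, mu_med, mu_high) for each value of a continuous feature
        x = np.asarray(x, dtype=float)
        q1, q2 = np.percentile(x, [33, 66])
        xmin, xmax = x.min(), x.max()
        avg1, avg2, avg3 = (xmin + q1) / 2, (q1 + q2) / 2, (q2 + xmax) / 2
        low = np.where(x < avg1, 1.0,
              np.where(x < q2, (q2 - x) / (q2 - avg1), 0.0))
        med = np.where(x < q1, 0.0,
              np.where(x < avg2, (x - q1) / (avg2 - q1),
              np.where(x < q2, (q2 - x) / (q2 - avg2), 0.0)))
        high = np.where(x < q1, 0.0,
               np.where(x < avg3, (x - q1) / (avg3 - q1), 1.0))
        return low, med, high

    mu_low, mu_med, mu_high = fuzzify(np.random.default_rng(0).normal(size=100))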

4.10 Continuous to Categorical

This option in the feature transformation module allows one to convert continuously-scaled features to categorical using cutpoints and lower and upper boundary limits. Below is a snapshot of the popup window that appears when performing the continuous to categorical transformations:

PIC

The various options are described below.

Cutpoints are lower bounds. For example, if age is the feature, cutpoints 20,30,40 will result in category codes 1 for ≤ 19, 2 for 20-29, 3 for 30-39, and 4 for 40+.

Cutpoints are upper bounds. For example, if age is the feature, cutpoints 20,30,40 will result in category codes 1 for ≤ 20, 2 for 21-30, 3 for 31-40 and 4 for 41+.

Apply lower limit (30-39, .... This option will enforce a lower bound to the categorization.

Apply upper limit ....51-60). This option will enforce an upper bound to the categorization.

Use real-valued cutpoints (integer is default). This option will replace the standard mathematical notation such as 30 ≤ X < 40  with e.g. 30− 39  , and replace X ≥ 60  with “60+.”
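The difference between lower-bound and upper-bound cutpoints can be sketched with NumPy's digitize (a Python illustration only, not NXG Explorer code; the age values and the cutpoints 20, 30, 40 follow the examples above):

    import numpy as np

    age = np.array([18, 20, 25, 30, 39, 40, 65])
    cutpoints = [20, 30, 40]

    # cutpoints as lower bounds: <20 -> 1, 20-29 -> 2, 30-39 -> 3, 40+ -> 4
    lower = np.digitize(age, cutpoints, right=False) + 1
    # cutpoints as upper bounds: <=20 -> 1, 21-30 -> 2, 31-40 -> 3, 41+ -> 4
    upper = np.digitize(age, cutpoints, right=True) + 1

    print(lower)   # [1 2 2 3 3 4 4]
    print(upper)   # [1 1 2 2 3 3 4]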

4.11 Text to Categorical (converts all text features)

This feature transformation option will prompt the user to convert all text features in the input dataset. As an example, import the Fisher_Iris_wo_class.xlsx file, and then select Transformations→ Text-to-Categorical (converts all text features) from the command pull-down menu as shown in the following image:

PIC

Since the class feature is a text feature, its elements will be loaded into a form showing all words found along with their frequencies.

PIC

Next, click on the Set as Categorical button, and the categorical levels of the class text feature will be transformed into class categorical values of 1, 2, and 3. NXG Logic Explorer will also automatically copy the original text value for each category into memory, so that the labels can be output during analyses. As an example, the image below shows results of running ANOVA on the 4 continuous features in the Fisher Iris dataset, with the category labels shown in the column headers.

PIC

Default settings: NXG Explorer can be set to convert all input text features to categorical features by selecting Preferences→ General default settings and then checking the box to the left of Recode text categories to numeric on input, in the Data Transformations group of commands.

4.12 Text to Categorical (select which text feature)

This transformation will also perform a text-to-categorical transformation for any text feature, however, the user must select the specific feature to be transformed.

4.13 Assign Categorical Features w/labels to Class Feature

Any categorical feature can be set to be a class feature (for class prediction analysis) as long as the labels have been input. This can be accomplished by either importing a file with class values and their labels, or by transforming text-to-categorical using the above transformations, or by assigning categorical labels using the Editor. The numeric class feature values will then be used for assigning symbol colors in all class discovery runs.

4.14 Transformations During Cross-Validation for Supervised Class Prediction

Explorer will transform feature values over objects inside CV folds used for training and, separately, for objects used for testing, based on parameters obtained from objects within training folds. For example, if 10 folds are used for CV, feature transformation occurs over all objects within partitions (folds) 𝒟2, ..., 𝒟10, and then feature values for testing objects in fold 𝒟1 are transformed using parameters from training folds 𝒟2, ..., 𝒟10. When normalization is specified, the values of min(x) and max(x) are based on objects in the training folds and applied to objects in the testing fold. Boundary constraints for the range [0,1] are maintained by setting negative feature values for objects in the testing fold to zero, and values greater than 1 are set to 1. For mean-zero standardization, the feature mean and s.d. are obtained from all objects in the training folds, and then applied to objects in the testing fold.

Default options for feature transformations during CV for class prediction runs can be set in the general default settings popup window, which are shown in the image below:

PIC

Bibliography

[1]   G.J. Klir, B. Yuan. Fuzzy Sets and Fuzzy Logic, Upper Saddle River(NJ), Prentice-Hall, 1995.

[2]   D. Dubois, H. Prade. Fundamentals of Fuzzy Sets, Boston(MA), Kluwer, 2000.

[3]   S.K. Pal, P. Mitra. Pattern Recognition Algorithms for Data Mining: Scalability, Knowledge Discovery and Soft Granular Computing. Boca Raton(FL), Chapman & Hall, 2004.

Chapter 5
INDEPENDENCE

5.1 2 Unrelated Samples

5.1.1 Student’s T-test

Applies To

Hypothesis testing to determine if means are equal in 2 groups

Origin

Parametric statistics (based on means, variances, etc.)

Null Hypothesis

H0 : μ1 = μ2  , both means are equal

Alternative Hypothesis

Ha : μ1 ⁄= μ2  , Both means are significantly different

Variables

One normally distributed continuously-scaled variable, one grouping variable

Variants

Welch test, Behrens-Fisher problem

The t-test is one of the most commonly used tests for determining whether two means are the same. Naturally, since the t-test is based on the average and standard deviation of each sample, the test is of the parametric type. In addition, the t-test assumes normality among the data being tested, so the data should also be continuously-scaled. Specific t-tests have been developed for cases in which the variances among variables being tested are assumed to be equal or unequal, the latter of which is known as the Behrens-Fisher problem. The null and alternative hypotheses for the t-test are:


$$H_0 : \mu_1 = \mu_2$$
$$H_a : \mu_1 \neq \mu_2.$$

5.1.2 Homogeneity of Variance Test for Two Samples

Before conducting a t-test, it is instructive to know whether the variances for the two samples are equal or unequal. A simple set of hypotheses can be constructed for this scenario, where the null and the alternative hypotheses are


     2    2        2
H0 : σ1 = σ 2 = ⋅⋅⋅ = σk
H  : σ2 ⁄= σ2.
  a   i   j

Tests for equal variance among groups are also known as homogeneity of variance tests. Equal variance is known as homoscedasticity, whereas unequal variance is known as heteroscedasticity. Bartlett’s test [1] is used to test for equality of variances between 2 or more normally-distributed group samples. The test statistic is calculated as

$$\chi^2_{c-1} = \frac{(n-c)\log\!\left(\frac{\sum_j^c (n_j-1)s_j^2}{n-c}\right) - \sum_j^c (n_j-1)\log(s_j^2)}{1 + \frac{1}{3(c-1)}\left(\sum_j^c \frac{1}{n_j-1} - \frac{1}{n-c}\right)},$$
(5.1)

where n is the total sample size, c is the number of independent samples, n_j (j = 1,2,...,c) is the sample size of the j th group, and s_j² is the sample variance for the j th group. The test statistic is χ² distributed with c − 1 degrees of freedom. If the calculated statistic exceeds the critical value of χ²_{c−1, 0.95}, the p-value is less than α = 0.05 and the variances are considered to be unequal. Otherwise, if the p-value exceeds 0.05, the variances are considered equal (not significantly different).
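As a sketch of Eq. (5.1) from raw group data (Python with NumPy/SciPy assumed; the function name bartlett_stat and the two small samples are hypothetical, and the result can be checked against scipy.stats.bartlett):

    import numpy as np
    from scipy.stats import chi2

    def bartlett_stat(*groups):
        # Bartlett's statistic of Eq. (5.1) and its chi-squared p-value
        c = len(groups)
        nj = np.array([len(g) for g in groups], dtype=float)
        s2 = np.array([np.var(g, ddof=1) for g in groups])
        n = nj.sum()
        pooled = ((nj - 1) * s2).sum() / (n - c)
        num = (n - c) * np.log(pooled) - ((nj - 1) * np.log(s2)).sum()
        den = 1 + (1 / (3 * (c - 1))) * ((1 / (nj - 1)).sum() - 1 / (n - c))
        stat = num / den
        return stat, chi2.sf(stat, df=c - 1)

    g1 = [8.2, 9.1, 10.4, 11.0, 7.9]           # hypothetical data
    g2 = [12.5, 15.8, 9.9, 20.1, 14.2]
    print(bartlett_stat(g1, g2))                # compare with scipy.stats.bartlett(g1, g2)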

Assume that two samples of size n1 and n2 were drawn from two infinitely large populations. Calculate the sample means x̄1 and x̄2, replace the sample data, resample, recalculate means, resample, etc. The mean differences d = x̄1 − x̄2 obtained from multiple samplings would be normally distributed with variance σ². A typical question is: is the difference (d = x̄1 − x̄2) between the two sample means significant? The hypotheses are as follows:

$$H_0 : \bar{x}_1 = \bar{x}_2$$
$$H_a : \bar{x}_1 \neq \bar{x}_2.$$
(5.2)

Because we are limited to only samples from the infinitely large populations, we obtain samples of size n1 and n2, which have averages x̄1 and x̄2, with sample variances s₁² and s₂². For the two independent samples, the test statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
(5.3)

follows a t distribution with ν degrees of freedom. When sampling during a study or experiment, a priori values of the population variances can either be known or unknown. If they are known, then they can either be equal or unequal. If they are unknown, it is a common practice to assume they are unequal, since this would be the most conservative approach. In the remaining sections, we address the formulation of the t-test when the sample variances are equal or unequal.

Understanding two-tailed and one-tailed tests. There are often times when the belief of an investigator is not simply that the two means are different, but rather that one mean is significantly greater than or significantly less than the other. In such cases, one-tailed tests are employed. Recall, the criterion for whether a test is called two-tailed or one-tailed is based on the form of the alternative hypothesis, i.e., the belief formed by the investigator.

Let us look at some example descriptions in the table below:






Hypothesis    Null                 Alternative           # ways alternative is true            Terminology
Equality      H0 : x̄1 = x̄2        Ha : x̄1 ≠ x̄2         2 (x̄1 < x̄2 or x̄1 > x̄2)             “Two-tailed”
Inequality    H0 : x̄1 ≤ x̄2        Ha : x̄1 > x̄2         1 (x̄1 > x̄2)                          “One-tailed”
Inequality    H0 : x̄1 ≥ x̄2        Ha : x̄1 < x̄2         1 (x̄1 < x̄2)                          “One-tailed”





The first alternative hypothesis is based on the belief that the means are significantly different, which can occur two ways: first ¯x1 < ¯x2  , and second ¯x1 > ¯x2  . A test performed under this alternative hypothesis is called a two-tailed test. On the other hand, when the belief is only that ¯x1 < x¯2  , the test is called a one-tailed test. Analogously, if the only belief is that ¯x1 > ¯x2  , then this is also called a one tailed test. Quite basically, the way to remember whether a one-tailed or two-tailed test is employed is to think of the number of ways the alternative hypothesis can be true. For a two-tailed test, there are two ways the two means can be significantly different; however, under the one-tailed test, there is only one way the alternative hypothesis can be true.

Formally, there is a set of hypotheses (null and alternative) for each of the alternative hypotheses shown in the table above. Thus, for two means there can only be three sets of hypotheses, and these are:

For the two-tailed test:

$$H_0 : \bar{x}_1 = \bar{x}_2, \qquad H_a : \bar{x}_1 \neq \bar{x}_2$$

For the first one-tailed test:

$$H_0 : \bar{x}_1 \le \bar{x}_2, \qquad H_a : \bar{x}_1 > \bar{x}_2$$

For the second one-tailed test:

$$H_0 : \bar{x}_1 \ge \bar{x}_2, \qquad H_a : \bar{x}_1 < \bar{x}_2$$

Variances assumed equal ($s_1^2 = s_2^2$). When the population variances are assumed to be equal, it makes sense to combine the variance from each sample. This is known as the common pooled variance. To calculate the common pooled variance, we weight each sample variance by its sample size in the form

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$
(5.4)

Going back to the general equation for the t-test in Eq. (5.3), a substitution is made replacing the variance for each sample with the common pooled variance, so that the test statistic now becomes

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}.$$
(5.5)

Significance of the test statistic is compared with tail probabilities for Student’s distribution at α for ν = n1 + n2 − 2  degrees of freedom.

Worked problem. A researcher was interested in determining if the mean concentration of a protein in tumor tissue was significantly different from the mean expression in normal tissue. The mean expression in tumors was ¯x1 = 132.85  , with standard deviation s1 = 15.34  , and sample size n1 = 8  , while the mean expression in normal tissue was ¯x2 = 127.44  , with standard deviation s2 = 18.23  , and sample size n2 = 21  . Are the two means significantly different when assuming equal variances?

For the assumption of equal variances, the pooled variance is:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$
(5.6)

Substituting in the data, we have:

$$s_p^2 = \frac{(8-1)(15.34)^2 + (21-1)(18.23)^2}{8 + 21 - 2} = \frac{(7)(235.3156) + (20)(332.3329)}{27} = \frac{1647.2092 + 6646.658}{27} = \frac{8293.87}{27} = 307.18$$
(5.7)

Given $s_p^2 = 307.18$, we now find that $s_p = 17.53$.

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{132.85 - 127.44}{(17.53)\sqrt{\frac{1}{8} + \frac{1}{21}}} = \frac{5.41}{(17.53)\sqrt{0.173}} = \frac{5.41}{7.28} = 0.74$$
(5.8)

Significance of the test statistic is compared with cumulative probabilities for Student’s distribution at 1− α ∕2  for ν = n1 + n2 − 2  degrees of freedom. The tabled critical value, t(0.975;27)  , is equal to 2.05 , and since |t| ≤ t(0.975;27)  we conclude that the two means are not significantly different. In fact, the p-value is 0.4639, which is not less than α = 0.05  .
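The worked result can be reproduced from the summary statistics alone (a sketch using SciPy, which is not part of NXG Explorer):

    from scipy.stats import ttest_ind_from_stats

    # equal-variance (pooled) t-test from the summary statistics in the worked problem
    t, p = ttest_ind_from_stats(mean1=132.85, std1=15.34, nobs1=8,
                                mean2=127.44, std2=18.23, nobs2=21,
                                equal_var=True)
    print(round(t, 2), round(p, 3))   # roughly t = 0.74, p = 0.46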

Variances assumed unequal ($s_1^2 \neq s_2^2$). When the sample variances for the populations are unequal, we keep the sample variances intact in the general equation for the t-statistic, given as

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}},$$
(5.9)

however, the value of t no longer represents the t distribution. This problem was first recognized by Behrens  [2], who introduced one of the first solutions. Behrens’ approach was later verified by Fisher  [3]. We implement the approach introduced by Dixon and Massey  [4], which involves a correction to the degrees of freedom, ν , in the form

$$\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}.$$
(5.10)

The computed value of ν is rounded to the nearest integer. The significance of the test statistic is compared with tail probabilities for Student’s t distribution at values of α and ν degrees of freedom.

Worked problem. A researcher was interested in determining if the mean concentration levels of a drug in serum plasma of placebo patients (untreated) was significantly different from the mean serum plasma concentrations in treated patients. The mean concentration in placebo patients was ¯x1 = 207.3  , with standard deviation s1 = 35.6  , and sample size n1 = 100  , while the mean concentration in treated patients was ¯x2 = 193.4  , with standard deviation s2 = 17.3  , and sample size n2 = 74  . Are the two means significantly different when assuming unequal variances?

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{207.3 - 193.4}{\sqrt{\frac{35.6^2}{100} + \frac{17.3^2}{74}}} = \frac{13.90}{\sqrt{\frac{1267.36}{100} + \frac{299.29}{74}}} = \frac{13.90}{\sqrt{12.67 + 4.04}} = \frac{13.90}{\sqrt{16.72}} = \frac{13.90}{4.09} = 3.40$$
(5.11)

Significance of the test statistic is compared with cumulative probabilities for Student's distribution at 1 − α/2 using the adjusted degrees of freedom ν from Eq. (5.10), which for these data is ν ≈ 151. The tabled critical value, t(0.975;151), is approximately 1.98, and since |t| > t(0.975;151) we conclude that the two means are significantly different. In fact, the p-value is less than 0.001, which is less than α = 0.05.
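The same SciPy helper reproduces the unequal-variance result, applying a Welch-type degrees-of-freedom correction internally (a sketch, not NXG Explorer output):

    from scipy.stats import ttest_ind_from_stats

    # unequal-variance (Welch) t-test from the summary statistics in the worked problem
    t, p = ttest_ind_from_stats(mean1=207.3, std1=35.6, nobs1=100,
                                mean2=193.4, std2=17.3, nobs2=74,
                                equal_var=False)
    print(round(t, 2), p)   # roughly t = 3.40, p < 0.001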

5.1.3 Mann-Whitney U Test

Applies To

Hypothesis testing to determine if two samples are from the same distribution

Origin

Non-parametric statistics (based on ranks)

Null Hypothesis

H0 : x1,x2 ∼ 𝒟 , both samples are from the same distribution

Alternative Hypothesis

Ha : x1,x2 ⁄∼ 𝒟 , both samples are not from the same distribution

Variables

One test variable that is at least ordinally distributed, and one categorical variable

Variants

Wilcoxon rank sum test

The Mann-Whitney U test is a non-parametric 2-sample test to determine whether two independent samples are from the same distribution  [5]. Unlike the t-test and F-test which assume an underlying normal distribution of the data, the Mann-Whitney requires that the data be at least ordinal and not necessarily normally distributed. The asymptotic relative efficiency (A.R.E.) of the Mann-Whitney test approaches 95.5% as n increases when applied to data for which parametric tests are appropriate [6]. However, in cases where parametric tests are inappropriate, the Mann-Whitney test will have greater power to reject H0  than the t-test. The Mann-Whitney test should be applied when there is little known about the distributional properties for both samples, or the assumptions of normality do not hold.

The goal of the Mann-Whitney test is to compare two sets of data to determine if they are from the same distribution. By “same distribution” we mean that values of one distribution do not tend to be less than or greater than values of the other one. If values of one distribution tend to be less than or greater than values of the other distribution, then the two distributions are different. Stated formally, the null (Ho) and alternative (Ha) hypotheses of the Mann-Whitney test are

Ho  : Both samples from the same distribution (null hypothesis)
Ha  : Both samples from different distributions (alternative hypothesis)

Given the hypotheses, the goal is to discredit or reject the null hypothesis, that is, to disprove that values of the two samples come from the same distribution. We begin calculations for the Mann-Whitney U test by first determining the U statistic for each group in the form:

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2,$$
(5.12)

where n1  and n2  are the sample sizes for each of the two groups, and R1  and R2  are the sum of the ranks within each group. The U statistic is taken as the lowest value among U1  and U2  :

$$U = \min(U_1, U_2).$$
(5.13)

The test statistic for the Mann-Whitney test is standard normal distributed as

Z = U-−-μ
      σ
    ---U-−-n12n2---
  = ∘ n1n2(n1+n2+1).
           12
(5.14)

Worked problem. Let’s consider an example that tests whether expression of a single protein in two independent groups of tissues are from the same distribution. A laboratory assay was carried out in which fluorescent intensity levels were generated for expression of protein A in two samples of mice: wild-type (group 1) and knock-out (group 2). The data are shown in the table below:








ID   Protein A   R(xi)   Group   R(x ∈ grp1)   R(x ∈ grp2)
1    -0.017      5       1       5
2    -0.349      1       1       1
3     0.136      8       1       8
4     0.111      7       1       7
5    -0.149      4       2                     4
6    -0.181      3       2                     3
7    -0.307      2       2                     2
8     0.071      6       2                     6
                                 R1 = 21       R2 = 15







Ranks are determined by first sorting the data for both samples in ascending order, and then assigning ranks. Tied expression values are handled by averaging their original rank assignments. Substituting the results into the formulation gives us

$$U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$$
(5.15)

$$U_1 = (4)(4) + \frac{4(4 + 1)}{2} - 21 = 5.00, \qquad U_2 = (4)(4) + \frac{4(4 + 1)}{2} - 15 = 11.00$$
(5.16)

$$U = \min(U_1, U_2) = \min(5.00, 11.00) = 5.00$$
(5.17)

$$Z = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2}{12}(n_1 + n_2 + 1)}} = \frac{5.00 - \frac{(4)(4)}{2}}{\sqrt{\frac{(4)(4)}{12}(4 + 4 + 1)}} = \frac{5.00 - 8}{\sqrt{(1.33)(9)}} = \frac{-3.00}{3.46} = -0.87.$$
(5.18)

Since |− 0.87| is less than 1.96, the test is not significant, and the conclusion is that the two distributions are not significantly different.
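A short sketch of Eqs. (5.12)-(5.14) applied to the protein A data (Python/NumPy assumed; the function name mann_whitney_z is illustrative and ties are not handled):

    import numpy as np

    def mann_whitney_z(x1, x2):
        # U statistics of Eq. (5.12) and the normal approximation Z of Eq. (5.14)
        pooled = np.concatenate([x1, x2])
        ranks = np.argsort(np.argsort(pooled)) + 1     # ranks 1..n (no tie correction)
        n1, n2 = len(x1), len(x2)
        r1 = ranks[:n1].sum()
        u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
        u2 = n1 * n2 + n2 * (n2 + 1) / 2 - ranks[n1:].sum()
        u = min(u1, u2)
        z = (u - n1 * n2 / 2) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
        return u1, u2, u, z

    grp1 = np.array([-0.017, -0.349, 0.136, 0.111])
    grp2 = np.array([-0.149, -0.181, -0.307, 0.071])
    print(mann_whitney_z(grp1, grp2))   # U1 = 5, U2 = 11, U = 5, Z approx -0.87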

Wilcoxon Rank Sum Test. We now show the similarity between the Wilcoxon rank sum test and the Mann-Whitney test. Assume the following data:








ID   Protein A   R(xi)   Group   R(x ∈ grp1)   R(x ∈ grp2)
1    -0.062      6       1       6
2     0.052      7       1       7
3    -0.096      4       1       4
4    -0.175      2       1       2
5    -0.166      3       2                     3
6     0.089      8       2                     8
7    -0.092      5       2                     5
8    -0.293      1       2                     1
                                 R1 = 19       R2 = 17







While both the Mann-Whitney and Wilcoxon rank tests use the standard normal Z -variate as the test statistic, the Wilcoxon rank sum test uses a slightly different form of the numerator of Z (the denominator is the same). In spite of the similarity between these two non-parametric tests, NXG Explorer employs the Mann-Whitney test for all 2-sample non-parametric tests. Computation of the Wilcoxon rank sum test goes as follows:

$$Z = \frac{R_1 - \frac{n_1(n_1 + n_2 + 1)}{2}}{\sqrt{\frac{n_1 n_2}{12}(n_1 + n_2 + 1)}} = \frac{19 - \frac{4(4 + 4 + 1)}{2}}{\sqrt{\frac{(4)(4)}{12}(4 + 4 + 1)}} = \frac{19 - 18}{\sqrt{(1.33)(9)}} = \frac{1.00}{3.46} = 0.29$$
(5.19)

Example 1 - Two-tailed t-tests (Ha : ¯x1 ⁄= ¯x2  ). First, select Analysis → Independence → Unpaired 2 and k-sample tests shown as

PIC

Next, click on sex as the grouping feature, and then select all of the continuous features on the left side of the popup window, shown as:

PIC

Lastly, leave the default settings that are visible on the popup window for grouped tests:

PIC

Using the Retinol dataset, two-tailed t-tests were performed on the continuous variables with sex as the grouping variable. Recall, the sex categorical feature has two levels: 1-males, 2-females. The two-tailed tests are based on the belief (alternative hypothesis) Ha : x̄1 ≠ x̄2. Results are listed in Figure 5.1. For each continuous variable, the appropriate t-test is used, along with the Mann-Whitney test and Bartlett's variance test. Tail probabilities for the t-test and Mann-Whitney test will be colored either red or blue, depending on the results of Bartlett's test. Whenever Bartlett's test for equality of variances is significant for a given continuous variable, NXG Explorer automatically applies the Welch t-test assuming unequal variances; otherwise, the equal-variance t-test is applied. The non-parametric Mann-Whitney test is not an alternative to the Welch t-test when variances are unequal between groups, but is nevertheless performed for each 2-group test.

PIC

Figure 5.1: Two-tailed t-tests (Ha : ¯x1 ⁄= ¯x2  ) for continuous variables in the Retinol dataset, using sex as the grouping variable.

Example 2 - One tailed t-tests, (Ha : ¯x1 > ¯x2  ). In this example, one-tailed t-tests were performed using the same continuous variables with sex as the grouping variable. The specific alternative hypothesis for the tests is Ha : ¯x1 > ¯x2  . Results are listed in Figure 5.2.

PIC

Figure 5.2: One-tailed t-tests (Ha :x¯1 > ¯x2  ) for continuous variables in the Retinol dataset, using sex as the grouping variable.

Example 3 - One tailed t-tests, (Ha : ¯x1 < ¯x2  ). In this example, one-tailed t-tests were performed using the same continuous variables with sex as the grouping variable. The specific alternative hypothesis for the tests is Ha : ¯x1 < ¯x2  . Results are listed in Figure 5.3.

PIC

Figure 5.3: One-tailed t-tests (Ha :x¯1 < ¯x2  ) for continuous variables in the Retinol dataset, using sex as the grouping variable.

5.2 Geometric and Harmonic Means

NXG Logic Explorer can use geometric or harmonic means during 2- and k -sample hypothesis tests for inequality of means. When the geometric mean is specified for a continuous feature, the mean is defined as

$$\mu_g = \exp\!\left(\frac{1}{n}\sum_i^n \log_e(x_i)\right), \qquad x_i > 0$$
(5.20)

and the geometric standard deviation is

$$\sigma_g = \exp\!\left(\sqrt{\frac{\sum_i^n \left(\log_e(x_i) - \log_e(\mu_g)\right)^2}{n - 1}}\right), \qquad x_i > 0.$$
(5.21)

If the harmonic mean is specified, then the mean of the reciprocals is

$$m = \frac{\sum_i^n \frac{1}{x_i}}{n}, \qquad x_i \neq 0$$
(5.22)

and the harmonic mean is

$$\mu_h = \frac{n}{\sum_i^n \frac{1}{x_i}} = \frac{1}{m}, \qquad x_i \neq 0.$$
(5.23)

The variance of 1/x_i is

$$s^2 = \frac{1}{n}\sum_i^n \left(\frac{1}{x_i} - m\right)^2, \qquad x_i \neq 0,$$
(5.24)

and the standard deviation is then defined as

$$\sigma_h = \sqrt{\frac{s^2}{n m^4}}.$$
(5.25)

5.3 k Unrelated Samples

5.3.1 Analysis of Variance (ANOVA)

Applies To

Hypothesis testing to determine if one or more pairs of means are significantly different in k > 2  unrelated groups

Origin

Parametric statistics (based on means, variances, etc.)

Null Hypothesis

H0 : μ1 = μ2 = ⋅⋅⋅μc  , all means are equal

Alternative Hypothesis

Ha : not all μj are equal (j = 1,2,...,c), i.e., at least one pair of means is significantly different

Variables

One normally distributed continuously-scaled variable, one grouping variable

Variants

Welch ANOVA

Analysis of variance (ANOVA) is essentially like the 2-sample t-test for inequality of means; however, ANOVA is employed when there are k = 3 or more groups. Recall, ANOVA is for comparing means of a single continuously-scaled variable across 3 or more groups. The null hypothesis is that all of the group means are equal, and the alternative hypothesis, i.e., the belief, is that one or more pairs of means are significantly different.

Using notation, the null and alternative hypotheses for ANOVA are

$$H_0 : \mu_1 = \mu_2 = \cdots = \mu_c$$
$$H_a : \text{not all } \mu_j \text{ are equal } (j = 1,2,\ldots,c).$$

Define $Y_{ij}$ as a continuously-scaled observation in group j of sample size $n_j$ (i = 1,2,...,n_j; j = 1,2,...,c), where c is the number of groups. The ANOVA model states that each random observation can be represented by the model

Yij = μ + αj + 𝜀ij,

where μ is the grand mean, $\alpha_j$ is the effect of treatment j, and $\varepsilon_{ij}$ is the error. Estimates of the model parameters are

$$\hat{\mu} = \bar{Y}_{\cdot\cdot}$$
$$\hat{\alpha}_j = \bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot} \quad \text{(error between treatments)}$$
$$\hat{\varepsilon} = Y_{ij} - \bar{Y}_{\cdot j} \quad \text{(error within treatment)}.$$

A typical data setup for ANOVA is given as follows:







Group 1    Group 2    ⋯   Group j    ⋯   Group c
Y11        Y12        ⋯   Y1j        ⋯   Y1c
Y21        Y22        ⋯   Y2j        ⋯   Y2c
Y31        Y32        ⋯   Y3j        ⋯   Y3c
⋮          ⋮               ⋮               ⋮
Yi1        Yi2        ⋯   Yij        ⋯   Yic
⋮          ⋮               ⋮               ⋮
Y_{n1,1}   Y_{n2,2}   ⋯   Y_{nj,j}   ⋯   Y_{nc,c}






The grand mean is determined as

$$\bar{Y}_{\cdot\cdot} = \frac{\sum_{j=1}^{c}\sum_{i=1}^{n_j} Y_{ij}}{n},$$

and the treatment mean for group j is

$$\bar{Y}_{\cdot j} = \frac{\sum_{i=1}^{n_j} Y_{ij}}{n_j}.$$

Redefining the error terms, we have

$$Y_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j}).$$

Subtracting $\bar{Y}_{\cdot\cdot}$ from both sides yields

$$Y_{ij} - \bar{Y}_{\cdot\cdot} = (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}) + (Y_{ij} - \bar{Y}_{\cdot j}),$$

and squaring all terms and summing over i and j gives us the sum of squares

$$\sum_{j=1}^{c}\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = \sum_{j=1}^{c}\sum_{i=1}^{n_j}(\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2 + \sum_{j=1}^{c}\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_{\cdot j})^2.$$

A more numerically efficient method for calculating the sum of squares is in the form

$$\sum_{j=1}^{c}\sum_{i=1}^{n_j} Y_{ij}^2 - \frac{Y_{\cdot\cdot}^2}{n} = \left(\sum_{j=1}^{c}\frac{Y_{\cdot j}^2}{n_j} - \frac{Y_{\cdot\cdot}^2}{n}\right) + \left(\sum_{j=1}^{c}\sum_{i=1}^{n_j} Y_{ij}^2 - \sum_{j=1}^{c}\frac{Y_{\cdot j}^2}{n_j}\right),$$

where the grand and treatment-specific sums are

$$Y_{\cdot\cdot} = \sum_{j=1}^{c}\sum_{i=1}^{n_j} Y_{ij}, \qquad Y_{\cdot j} = \sum_{i=1}^{n_j} Y_{ij}.$$

In summary, the sum of squares total is equal to the treatment sum of squares (error between treatments) and the error sum of squares (within treatments), given as

SSTO (total) = SST (between) + SSE (within).

Dividing each of these sums of squares by their individual degrees of freedom, one can obtain the mean squares due to treatment and the mean squares due to error:

$$MST = \frac{SST}{c - 1}, \qquad MSE = \frac{SSE}{n - c},$$
(5.26)

$$F = \frac{MST}{MSE} \sim F_{c-1,\,n-c}.$$

Worked problem. In the example below, the expression of a single serum plasma protein, called “PSA” (prostate-specific antigen), was measured in laboratory mice treated with Drug A or Drug B. A third group was used for placebo or untreated “Control” mice. Perform an F-ratio test using ANOVA to determine if mean expression is significantly different between the 3 groups.

      Table 1. Mouse PSA expression after treatment with Drug A and Drug B.






Control   Drug A   Drug B
8.24      16.94    2.92
11.79     21.46    2.18
9.44      21.48    1.80
12.82     18.24    2.15
7.82      15.79    3.02
11.36     18.34    2.37
9.32               2.33
7.84

n_j:              8         6          7        n = 21
Y_{.j}:           78.63     112.25     16.76    Y_{..} = 207.64
Σ_i Y²_{ij}:      798.84    2127.38    41.29    Σ_j Σ_i Y²_{ij} = 2967.52





Substituting in our results, for the sum of squares treatment (SST) we have:

$$SST = \sum_{j=1}^{c} n_j(\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2 = \sum_{j=1}^{c}\frac{Y_{\cdot j}^2}{n_j} - \frac{Y_{\cdot\cdot}^2}{n} = \frac{(78.63)^2}{8} + \frac{(112.25)^2}{6} + \frac{(16.76)^2}{7} - \frac{(207.64)^2}{21} = 859.89$$
(5.27)

The degrees of freedom for SST are determined next, based on the number of groups:

$$df = c - 1 = 3 - 1 = 2$$
(5.28)

The SSE, or “residual error,” captures the error between each observation and its treatment mean, revealing the within-treatment error. When SSE is large, the residual error can outweigh the treatment effects, resulting in a non-significant test result.

$$SSE = \sum_{j=1}^{c}\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_{\cdot j})^2 = \sum_{j=1}^{c}\sum_{i=1}^{n_j} Y_{ij}^2 - \sum_{j=1}^{c}\frac{Y_{\cdot j}^2}{n_j} = 2967.52 - \left[\frac{(78.63)^2}{8} + \frac{(112.25)^2}{6} + \frac{(16.76)^2}{7}\right] = 54.50$$
(5.29)

$$df = n - c = 21 - 3 = 18$$
(5.30)

The F-statistic can now be calculated as the ratio of the two mean squares:

$$F = \frac{MST}{MSE} = \frac{859.89/2}{54.50/18} = \frac{429.94}{3.03} = 142.01$$
(5.31)

The tabled F_{0.95;2,18} is 3.55; since the calculated F-statistic of 142.01 greatly exceeds this critical value, we conclude that the means are not all equal, with a p-value less than 0.05.
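The worked F-ratio can be checked directly from the Table 1 data (a sketch using SciPy, not NXG Explorer):

    from scipy.stats import f_oneway

    control = [8.24, 11.79, 9.44, 12.82, 7.82, 11.36, 9.32, 7.84]
    drug_a  = [16.94, 21.46, 21.48, 18.24, 15.79, 18.34]
    drug_b  = [2.92, 2.18, 1.80, 2.15, 3.02, 2.37, 2.33]

    F, p = f_oneway(control, drug_a, drug_b)
    print(round(F, 1), p)   # F approx 142 on (2, 18) d.f., p far below 0.05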

5.3.2 Welch ANOVA Test for Unequal Variances

If Bartlett’s test for equality of variances is significant, meaning that the variances are significantly different across groups, i.e., heteroscedasticity, then the Welch ANOVA test should be used [7]. Calculation of the F-ratio statistics and adjusted degrees of freedom are provided below. The weight for each k th group is defined as

$$w_k = \frac{n_k}{s_k^2},$$
(5.32)

and the sum of the weights is

$$w_\cdot = \sum_k w_k.$$
(5.33)

A weighted mean is then calculated as

$$\mu_\cdot = \frac{\sum_k w_k \mu_k}{w_\cdot}.$$
(5.34)

The F-ratio statistic is given as

$$F = \frac{\frac{\sum_k w_k(\mu_k - \mu_\cdot)^2}{k - 1}}{1 + \frac{2(k-2)}{k^2 - 1}\sum_k\left(\frac{1}{n_k - 1}\right)\left(1 - \frac{w_k}{w_\cdot}\right)^2},$$
(5.35)

where the numerator and denominator degrees of freedom are

$$df_1 = k - 1, \qquad df_2 = \frac{k^2 - 1}{3\sum_k\left(\frac{1}{n_k - 1}\right)\left(1 - \frac{w_k}{w_\cdot}\right)^2}.$$
(5.36)
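Eqs. (5.32)-(5.36) can be sketched from raw group data as follows (Python with NumPy/SciPy assumed; the function name welch_anova and the reuse of the Table 1 data are illustrative only):

    import numpy as np
    from scipy.stats import f as f_dist

    def welch_anova(*groups):
        k = len(groups)
        n = np.array([len(g) for g in groups], dtype=float)
        m = np.array([np.mean(g) for g in groups])
        s2 = np.array([np.var(g, ddof=1) for g in groups])
        w = n / s2                                          # Eq. (5.32)
        w_dot = w.sum()                                     # Eq. (5.33)
        mu_dot = (w * m).sum() / w_dot                      # Eq. (5.34)
        A = (w * (m - mu_dot) ** 2).sum() / (k - 1)
        B = ((1.0 / (n - 1)) * (1 - w / w_dot) ** 2).sum()
        F = A / (1 + (2.0 * (k - 2) / (k ** 2 - 1)) * B)    # Eq. (5.35)
        df1, df2 = k - 1, (k ** 2 - 1) / (3 * B)            # Eq. (5.36)
        return F, df1, df2, f_dist.sf(F, df1, df2)

    g1 = [8.24, 11.79, 9.44, 12.82, 7.82, 11.36, 9.32, 7.84]
    g2 = [16.94, 21.46, 21.48, 18.24, 15.79, 18.34]
    g3 = [2.92, 2.18, 1.80, 2.15, 3.02, 2.37, 2.33]
    print(welch_anova(g1, g2, g3))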

5.3.3 Kruskal-Wallis Test

Applies To

Hypothesis testing to determine if three or more samples (i.e., k ≥ 3  ) are from the same distribution

Origin

Non-parametric statistics (based on ranks)

Null Hypothesis

Ho : x1,x2,...,xk ∼ 𝒟 , variates from all samples are from the same distribution

Alternative Hypothesis

Ha : x1,x2,...,xk ⁄∼ 𝒟 , samples are not from the same distribution

Variables

One test variable that is at least ordinal, and one categorical variable

Variants

Friedman test

The Kruskal-Wallis test is used to determine if data for a single variable in k groups are from the same distribution.

Worked problem. A research team performed a study to explore concentrations of a serum marker in laboratory mice 24 hours after treatment with Drug A, Drug B, or no treatment (Control). Perform a Kruskal-Wallis test using the data listed in the table below, and provide the calculated test statistic, and comment on the p-value and your conclusion.

      Table: Mouse serum plasma concentration of protein marker 24 hours after treatment.







Control   Rank   Drug A   Rank   Drug B   Rank
8.24      10     16.94    17     2.92     6
11.79     14     21.46    20     2.18     3
9.44      12     21.48    21     1.80     1
12.82     15     18.24    18     2.15     2
7.82      8      15.79    16     3.02     7
11.36     13     18.34    19     2.37     5
9.32      11                     2.33     4
7.84      9

n1 = 8, R1 = 92      n2 = 6, R2 = 111      n3 = 7, R3 = 28






The Kruskal-Wallis test statistic is approximately chi-squared distributed with k − 1 degrees of freedom:

$$\chi^2_{k-1} = \frac{12}{n(n+1)}\sum_{j=1}^{k}\frac{R_j^2}{n_j} - 3(n+1) = \frac{12}{21(21+1)}\left[\frac{(92)^2}{8} + \frac{(111)^2}{6} + \frac{(28)^2}{7}\right] - 3(21+1)$$
$$= \frac{12}{462}\left[\frac{8464}{8} + \frac{12321}{6} + \frac{784}{7}\right] - 66 = (0.0260)(1058.0 + 2053.5 + 112.0) - 66 = (0.0260)(3223.5) - 66 = 17.73$$
(5.37)

Since there are k = 3 groups, the chi-squared statistic is approximately distributed with 2 d.f. The calculated test statistic of 17.73 is significant, since it exceeds the 1 − α critical value of chi-squared for α = 0.05 and 2 d.f., χ²_{0.95,2} = 5.99. Therefore, the p-value would be less than 0.05, i.e., p < 0.05.
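The calculated statistic can be verified with SciPy (a sketch only; there are no ties in these data, so no tie correction is involved):

    from scipy.stats import kruskal

    control = [8.24, 11.79, 9.44, 12.82, 7.82, 11.36, 9.32, 7.84]
    drug_a  = [16.94, 21.46, 21.48, 18.24, 15.79, 18.34]
    drug_b  = [2.92, 2.18, 1.80, 2.15, 3.02, 2.37, 2.33]

    H, p = kruskal(control, drug_a, drug_b)
    print(round(H, 2), p)   # H approx 17.73 with 2 d.f., p well below 0.05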

Example 4 - ANOVA and Kruskal-Wallis Tests. In this example, we provide NXG Explorer results for ANOVA and KW analysis of the continuous variables in the Retinol dataset, with vitamin use status vituse as the grouping variable. Recall, there are 3 levels of the vitamin use category: 1=Yes, 2=Sometimes, 3=No. To begin, import the Retinol.xlsx file. Next, select Analysis→ Independence→ Unpaired 2 and k-sample tests shown as

PIC

Next, specify vituse as the grouping feature, and select all of the remaining features (left panel) as the test features:

PIC

Then click on Apply in the next popup window that appears:

PIC

When the run is complete, click on the Grouped analysis icon showing in the treenodes:

PIC

The panel with the tabular results of the ANOVA test will appear:

PIC

The above results were generated with the following default setting in Preferences→ General default settings

PIC

which will exclude non-parametric test results for the Kruskal-Wallis test as well as the Bartlett’s equality of variance test results. Un-check the box shown, and then rerun the ANOVA for the same features, and the following results will be generated:

PIC

which shows the Kruskal-Wallis test results, as well as Bartlett’s equality of variance test results.

You can also specify default settings for outputting only parametric or non-parametric test results. Another important point is that NXG Explorer will default to the Welch ANOVA when Bartlett’s equality of variance test is significant, and in this case, you will notice that d.f. for the tests will change. This cannot be changed, since it is inappropriate to not use the Welch ANOVA in the presence of heteroscedasticity.

Note: NXG Explorer always performs Bartlett’s test for equality of variance for 2-sample and k-sample tests. If heteroscedasticity exists, then the “unequal variance” t-test will be used for 2-sample tests, and the Welch ANOVA test will be used for k-sample tests. You can determine when these tests are used by noticing a significant Bartlett’s test and a change in the test’s d.f. which is reported.

5.4 Related Samples

5.4.1 T-test for Paired Samples

Confounding occurs when variables not being tested (e.g., age, gender) are independently related to both the outcome and the exposure. Matching is used to make subjects in different exposure groups more similar in terms of extraneous variables that are independently associated with the outcome. The goal of matching is therefore to ensure that differences in outcome between the exposure groups are more likely due to the exposure rather than to confounding variables. The overall effect of matching may reduce the standard deviation of differences rather than the differences themselves. Since the standard deviation is in the denominator of the t-test, the reduced standard deviation increases the test statistic. This reduction in variance causes an increase in statistical efficiency, which is the power to detect a difference for a fixed sample size.

For n paired samples, the t -test is

$$t = \frac{\bar{d} - \delta}{s_d/\sqrt{n}} = \frac{\bar{d}}{s_d/\sqrt{n}},$$
(5.38)

where ¯d is the mean of the n pairwise differences di = xi1 − xi2  for each record, and sd  is the standard deviation of differences given as

$$s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}} = \sqrt{\frac{\sum_i d_i^2 - \frac{(\sum_i d_i)^2}{n}}{n - 1}}.$$
(5.39)

Worked problem. Perform a matched paired t-test to determine if concentration of a protein A is significantly different in tissue and plasma. The table below shows the data:






ID   Protein A (plasma)   Protein A (tissue)   d_i      d_i²
1    0.289                1.027                -0.738   0.54
2    -0.243               0.592                -0.835   0.70
3    0.082                0.795                -0.713   0.51
4    -0.168               1.393                -1.561   2.44
5    0.007                0.814                -0.808   0.65

d̄ = −0.93,   Σ d_i = −4.65,   Σ d_i² = 4.84





Substituting in the data, we have for the s.d. 

$$s_d = \sqrt{\frac{\sum_i d_i^2 - \frac{(\sum_i d_i)^2}{n}}{n - 1}} = \sqrt{\frac{4.84 - \frac{(-4.65)^2}{5}}{4}} = \sqrt{\frac{4.84 - 4.33}{4}} = \sqrt{\frac{0.51}{4}} = \sqrt{0.13} = 0.36$$
(5.40)

and for the t-statistic, we get:

$$t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{-0.93}{0.36/\sqrt{5}} = \frac{-0.93}{0.16} = -5.85$$
(5.41)

The tabled critical value, t(0.975;4)  , is equal to 2.78 , and since |t| > t(0.975;4)  we conclude that the two means are significantly different. In fact, the p-value is 0.0042, which is less than α = 0.05  .
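The worked result can be checked with SciPy's paired t-test (a sketch, not NXG Explorer output):

    from scipy.stats import ttest_rel

    plasma = [0.289, -0.243, 0.082, -0.168, 0.007]
    tissue = [1.027, 0.592, 0.795, 1.393, 0.814]

    t, p = ttest_rel(plasma, tissue)
    print(round(t, 2), round(p, 4))   # roughly t = -5.85 with 4 d.f., p approx 0.004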

Wilcoxon Signed Rank Test. The Wilcoxon signed rank test is the non-parametric variant of the paired t-test. Let’s perform a Wilcoxon signed rank test for matched pairs data and then show worked equations.









ID   Protein A   Protein B   d_i      |d_i|   R(|d_i|)   R(|d_i| | d−)   R(|d_i| | d+)
1    0.122       0.004       0.117    0.117   3                          3
2    0.456       0.418       0.038    0.038   2                          2
3    -0.139      0.522       -0.661   0.661   8          8
4    0.155       -0.048      0.203    0.203   5                          5
5    0.158       0.481       -0.323   0.323   6          6
6    -0.112      0.066       -0.179   0.179   4          4
7    0.374       0.380       -0.006   0.006   1          1
8    -0.032      0.373       -0.404   0.404   7          7

Σ R(d−) = 26      Σ R(d+) = 10








In the table above, di  is the difference between the two values in a row, |di| is the absolute value of the difference, R(|di|)  is the rank of the absolute difference, R (|di| | d− )  is the rank of |di| when the original (raw) difference is negative, and R (|di| | d+)  is the rank of |di| when the original (raw) difference is positive.

Next, we substitute the value of $R_{d^+}$ along with the sample size n to get

$$Z = \frac{R_{d^+} - \frac{n(n+1)}{4}}{\sqrt{n(n+1)(2n+1)/24}} = \frac{10 - \frac{8(8+1)}{4}}{\sqrt{8(8+1)(2\times 8 + 1)/24}} = \frac{10 - 18.00}{7.14} = \frac{-8.00}{7.14} = -1.12$$
(5.42)

Since |− 1.12| < 1.96  , we cannot reject the null hypothesis that the two concentrations are equal, and conclude that they are not significantly different.

Example 5 - Paired t-test and Wilcoxon signed rank tests. In this example, both the paired t-test and Wilcoxon signed rank tests results are run for the fat and fiber features of the Retinol dataset. To begin, specify the “Paired mean inequality tests” option of the Independence pull-down menu, shown as follows:

PIC

Figure 5.4: Paired t-test commands.

Next, select two continuous variables for testing:

PIC

Figure 5.5: Paired t-test options; selecting 2 continuous variables.

The output for both the paired t-test and the Wilcoxon signed rank tests will be visible after clicking on the output icon, which is shown below

PIC

Figure 5.6: Paired t-test and Wilcoxon signed rank test results.

Results of the tests indicate that the mean values for fat and fiber measurements are significantly different.

5.5 Equality Tests for 2 Independent Proportions

Tests for equality of 2 independent proportions are commonly used in statistics. The notation used for tests of proportions uses  p1  and p2  , where p1  is the proportion of successes(failures) out of the total sample size for sample 1, and p2  is the proportion of successes(failures) among the total population in sample 2. Let xij  be a binary outcome feature taking on values (0,1) which may represent (1=success,0=failure), (1=tumor,0=normal), etc., for the i th object in the j th group (i = 1,2,...,n;j = 1,2)  . The proportion in each group can be determined by either summing the x-values within each group to obtain the total number of occurrences for which xij = 1  , or by using a logical statement reflecting the total number of ones in each group.

$$p_1 = \frac{\sum_{i=1}^{n_1} x_{i1}}{n_1} = \frac{\#\{i : x_{i1} = 1\}}{n_1},$$
(5.43)

where #{i,xi1 = 1} represents the total number of ones among all values of xi1  . Similarly, p2  is defined as

$$p_2 = \frac{\sum_{i=1}^{n_2} x_{i2}}{n_2} = \frac{\#\{i : x_{i2} = 1\}}{n_2}.$$
(5.44)

NXG Explorer employs a pooled and unpooled version of the test, whose Z statistic is distributed standard normal, N(0,1). The pooled version is functionally composed as

$$Z_{pooled} = \frac{p_1 - p_2}{\sqrt{\bar{p}\bar{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},$$
(5.45)

where $\bar{p} = (p_1 n_1 + p_2 n_2)/(n_1 + n_2)$ and $\bar{q} = 1 - \bar{p}$. The unpooled variant of the test is

$$Z_{unpooled} = \frac{p_1 - p_2}{\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}}.$$
(5.46)
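Eqs. (5.45)-(5.46) can be sketched as follows (Python with NumPy/SciPy assumed; the function name two_proportion_z and the success counts in the call are hypothetical, not taken from the Retinol dataset):

    import numpy as np
    from scipy.stats import norm

    def two_proportion_z(x1, n1, x2, n2):
        # x1, x2 are success counts; returns pooled Z, unpooled Z, lower-tail p for Ha: p1 < p2
        p1, p2 = x1 / n1, x2 / n2
        p_bar = (x1 + x2) / (n1 + n2)          # same as (p1*n1 + p2*n2)/(n1 + n2)
        q_bar = 1 - p_bar
        z_pooled = (p1 - p2) / np.sqrt(p_bar * q_bar * (1 / n1 + 1 / n2))
        z_unpooled = (p1 - p2) / np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return z_pooled, z_unpooled, norm.cdf(z_pooled)

    print(two_proportion_z(x1=12, n1=42, x2=51, n2=273))   # hypothetical counts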

Example 6 - Test for 2 independent proportions. For this example, we will use NXG Explorer to create the binary outcome feature. To begin, open the Retinol dataset and select the test of proportions option from the Independence command of the Analysis pull-down menu, shown as

PIC

Figure 5.7: Pull-down command menu for test of two independent proportions.

Using the Retinol dataset, select sex as the grouping variable, and smokstat as the outcome variable. Next, we will define a success xij  as subjects that have values of 3 for smoking status (smokstat=3) and failure for subjects with other values of smokstat equal to 1 or 2. (Recall, the categorical levels of smoking status in the Retinol dataset are 1=Never, 2=Former, 3=Current Smoker). In addition, select an alternative hypothesis of p1 < p2  in order to specify a one-tailed test. This can be done using options in the popup window for test of proportions:

PIC

Figure 5.8: User control for test of proportions.

The results of the one-tailed test of proportions are shown below, which indicate that the proportion of subjects with values of smokstat=3 are not significantly different because the p-values of 0.726 and 0.713 for the pooled and unpooled tests are not less than 0.05.

PIC

Figure 5.9: Test results for two independent proportions.

5.6 Chi-Squared Contingency Table Analysis

Data are often collected to identify the number of subjects which fall into categories. Examples are the number of new patients with breast cancer, lung cancer, or no cancer; low-intermediate-high risk; or diagnostic category A, B, C, etc. The χ² test is suitable for analyzing count data arranged into k = 2 or more categories [9]. The method is a goodness-of-fit test to determine if a significant difference exists between observed and expected counts based on the null hypothesis. In order to test a hypothesis, we need to be able to determine expected frequencies. The null hypothesis (Ho) states the proportions of counts expected to fall into the categories under the presumed population; thus, we deduce the expected frequencies from the null hypothesis. The χ² test determines how close the observed frequencies are to the expected frequencies under Ho, and is tested with the chi-squared function

$$\chi^2 = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i},$$
(5.47)

where Oi  is the number of observed counts and Ei  is the number of expected counts in the i th cell. When differences between the observed and expected (Oi − Ei)  are small, the chi-squared test statistic will be small. Whereas for large differences in (Oi − Ei)  , the chi-squared test statistic will be large. Squaring the non-zero differences results in positive values, and therefore the chi-squared test statistic is always greater than or equal to zero. For a one-way chi-squared test, the number of degrees of freedom (d.f.) is equal to k− 1  . For a multiway chi-squared test, the d.f. are equal to the number of rows minus one times the number of columns minus one, that is, d.f.= (r− 1)(c − 1)  .

5.7 Two-way Contingency Tables

A two-way table provides a cross-tabulation of counts as a function of population and class. This can be shown in the following configuration:

                 Class 1   Class 2   Total
Population 1     O11       O12       n1
Population 2     O21       O22       n2
Total            C1        C2        n

If we consider the counts which are used for the chi-squared test, we have





                       Characteristic B
Characteristic A       Present   Absent   Total
Present                n11       n12      n1.
Absent                 n21       n22      n2.
Total                  n.1       n.2      n..

which can be transformed into proportions according to the tabular notation:

                       Characteristic B
Characteristic A       Present   Absent   Total
Present                p11       p12      p1.
Absent                 p21       p22      p2.
Total                  p.1       p.2      p..




Thus, the chi-squared test is really a test of equal proportions among the various categories involved

$$\chi^2 = n_{..}\sum_{i=1}^{2}\sum_{j=1}^{2}\frac{\left(|p_{ij} - p_{i.}p_{.j}| - \frac{1}{2n_{..}}\right)^2}{p_{i.}p_{.j}},$$
(5.48)

which translates to our original formulation

$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$
(5.49)

The expected number of counts in each cell, $E_{ij}$, is determined by the relationship

$$E_{ij} = \frac{n_i C_j}{n},$$
(5.50)

and ni  is the total counts in row i (i = 1,2,...,r)  and Cj  are the total counts in column j (j = 1,2,...,c)  . Let’s consider an example chi-squared test involving the frequency of breast cancer among mothers whose first age at giving birth was less than or greater than age 30:





                 First age at giving birth
Cancer status    30+           ≤ 29           Total
Case             O11 = 683     O12 = 2537     n1 = 3220
Control          O21 = 1498    O22 = 8787     n2 = 10285
Total            C1 = 2181     C2 = 11324     n = 13505

First, we need to calculate the number of expected cases in each cell. Recall, the expected frequency in each cell is

$$E_{ij} = \frac{n_i C_j}{n}.$$
(5.51)

Upon substitution of the data, we have

$$E_{11} = n_1 C_1/n = (3220)(2181)/13505 = 520$$
$$E_{12} = n_1 C_2/n = (3220)(11324)/13505 = 2700$$
$$E_{21} = n_2 C_1/n = (10285)(2181)/13505 = 1661$$
$$E_{22} = n_2 C_2/n = (10285)(11324)/13505 = 8624$$

and finally, substituting the observed and expected frequencies into the equation for chi-squared, we have

$$\chi^2_{(1)} = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
(5.52)

$$\chi^2_{(1)} = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} = \frac{(683 - 520)^2}{520} + \frac{(2537 - 2700)^2}{2700} + \frac{(1498 - 1661)^2}{1661} + \frac{(8787 - 8624)^2}{8624} = 79.99$$
(5.53)

The calculated value of 79.99 far exceeds the lookup critical value of 3.84 for χ²_{1−α} with 1 d.f. Hence, we conclude that there is a significant association between first age at giving birth and breast cancer based on these data.
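The worked example can be checked with SciPy's contingency-table routine (a sketch; correction=False disables the Yates continuity correction so the statistic matches Eq. (5.49)):

    import numpy as np
    from scipy.stats import chi2_contingency

    # rows: case/control; columns: first birth at 30+ / <= 29
    table = np.array([[683, 2537],
                      [1498, 8787]])

    stat, p, dof, expected = chi2_contingency(table, correction=False)
    print(round(stat, 1), dof, p)    # roughly 80 with 1 d.f., p far below 0.001
    print(np.round(expected, 0))     # approx [[520, 2700], [1661, 8624]]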

5.8 Fisher’s Exact Test

Fisher’s exact test is a statistical test used to determine if there are nonrandom associations between two categorical variables [10]. Let there exist two such variables X and Y , with r and c observed states, respectively. Now form an r × c matrix in which the entries nij  (i = 1,2,...,r,j = 1,2,...,c)  represent the number of observations in which x = i and y = j .





                 Marker
Disease          Positive   Negative   Total
Present          n11        n12        n1.
Absent           n21        n22        n2.
Total            n.1        n.2        n..




where the row total is given by

$$n_{i.} = \sum_{j=1}^{c} n_{ij},$$
(5.54)

and column total by

$$n_{.j} = \sum_{i=1}^{r} n_{ij},$$
(5.55)

with a grand table total

$$n_{..} = \sum_{i=1}^{r}\sum_{j=1}^{c} n_{ij} = \sum_{i=1}^{r} n_{i.} = \sum_{j=1}^{c} n_{.j}$$
(5.56)

Next, calculate the conditional probability of getting the actual matrix given the particular row and column sums, given by

$$P = \frac{(n_{1.}!\, n_{2.}! \cdots n_{r.}!)(n_{.1}!\, n_{.2}! \cdots n_{.c}!)}{n_{..}!\prod_i\prod_j n_{ij}!},$$
(5.57)

which is a multivariate generalization of the hypergeometric probability function. Next, find all possible matrices of nonnegative integers consistent with the row and column sums ni.  and n.j  . For each one, calculate the associated conditional probability P (b)  , where the sum of these probabilities must be 1.

To compute the p-value of the test, the tables must then be ordered by some criterion that measures dependence, and those tables that represent equal or greater deviation from independence than the observed table are the ones whose probabilities are added together. There are a variety of criteria that can be used to measure dependence. In the 2 × 2  case, which is the one Fisher looked at when he developed the exact test, either the Pearson chi-squared or the difference in proportions (which are equivalent) is typically used. Other measures of association, such as the likelihood-ratio-test, G-squared, or any of the other measures typically used for association in contingency tables, can also be used.

The test is most commonly applied to 2× 2  matrices, and is computationally unwieldy for large r or c . For tables larger than 2 × 2  , the difference in proportions can no longer be used, but the other measures mentioned above remain applicable (and in practice, the Pearson statistic is most often used to order the tables). In the case of the 2× 2  matrix, the p-value of the test can be simply computed by the sum of all P(b) ≤ P .

For an example application of the 2 × 2 test, let X represent the positive or negative result of a protein marker, and Y be the number of patients with and without a disease. If there are 5 diseased patients and 1 non-diseased patient which test positive, and 0 diseased and 4 non-diseased that test negative, then the configuration of these data would be





          Marker
Disease   +   −   Total
Present   5   0   5
Absent    1   4   5
Total     6   4   10




Computing P gives

P = (5!5!6!4!)∕(10!(5!0!1!4!)) = 0.0238  ,

and the other possible matrices and their P s are





          Marker
Disease   +   −   Total
Present   4   1   5
Absent    2   3   5
Total     6   4   10

P(1) = (5!5!6!4!)/(10!(4!1!2!3!)) = 0.2381;





          Marker
Disease   +   −   Total
Present   3   2   5
Absent    3   2   5
Total     6   4   10

P(2) = (5!5!6!4!)/(10!(3!2!3!2!)) = 0.4762;





          Marker
Disease   +   −   Total
Present   2   3   5
Absent    4   1   5
Total     6   4   10

P(3) = (5!5!6!4!)/(10!(2!3!4!1!)) = 0.2381;





          Marker
Disease   +   −   Total
Present   1   4   5
Absent    5   0   5
Total     6   4   10

P(4) = (5!5!6!4!)/(10!(1!4!5!0!)) = 0.0238,

which indeed sum to 1, as required. The sum of P(b)  values less than or equal to p=0.0238 is then 0.0476 which, because it is less than 0.05, is significant. Therefore, in this case, there would be a statistically significant association between the marker and disease.
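The same two-sided p-value is produced by SciPy's exact test on the observed table (a sketch, not NXG Explorer output):

    from scipy.stats import fisher_exact

    table = [[5, 0],
             [1, 4]]

    odds_ratio, p = fisher_exact(table, alternative='two-sided')
    print(round(p, 4))   # 0.0476, the sum of all P(b) <= 0.0238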

When the table being analyzed is too large to consider all possible configurations, permutations are used by re-shuffling categories for one of the two categorical variables, each time recalculating $P^{(b)}$ $(b = 1,2,\ldots,B)$ in order to simulate the null distribution. After B iterations, the empirical p-value is

$$P = \frac{\#\{b : P^{(b)} < P\}}{B}.$$
(5.58)

5.9 Multiway Contingency Tables

Tables with more than two rows and columns are called multiway contingency tables. The table below shows the setup of a multiway contingency table:

                Class
                1      2      ...    c      Totals
Population 1    O11    O12    ⋯     O1c    n1
Population 2    O21    O22    ⋯     O2c    n2
⋮               ⋮      ⋮             ⋮      ⋮
Population r    Or1    Or2    ⋯     Orc    nr
Totals          C1     C2     ...    Cc     n

The notation, using r for the number of rows and c for the number of columns, is the same as for two-way and one-way tables [11]. Using an r × c table, we can test whether cell probabilities differ from sample to sample. We can also use an r × c table for a single sample whose counts have been partitioned into r different categories according to one criterion and into c different categories according to another criterion. The application of chi-squared testing for both methods is the same.

Example 7 - r× c Contingency table analysis. Gleason scoring is one of the most reliable pathological scoring systems for staging prostate cancer severity and growth. The higher the Gleason score for a tumor tissue sample, the more severe the prostate cancer. The table below lists the number of prostate cancer patients as a function of Gleason score category and diagnostic grade.

                 Diagnostic grade (observed counts, with expected counts in parentheses)
                 1            2             3             4            Totals
Gleason 1-3      23 (7.3)     40 (30.3)     16 (38.0)     2 (5.4)      81
Gleason 4-6      11 (18.6)    75 (77.5)     107 (97.1)    14 (13.8)    207
Gleason 7-9      1 (9.1)      31 (38.2)     60 (47.9)     10 (6.8)     102
Totals           35           146           183           26           390

Null Hypothesis: The proportion of cases in Gleason categories is the same for all stages. The alternative hypothesis is that the proportions of cases in Gleason categories are different across stage.

Statistical test:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$
(5.59)

The expected number of counts in each cell is

$$E_{ij} = \frac{n_i C_j}{n},$$
(5.60)

where ni  is the total counts in row i (i = 1,2,...,r)  and Cj  are the total counts in column j (j = 1,2,...,c)  .

Significance: Let α = 0.05  , n = 390  , the total number of samples.

Sampling Distribution: The χ2  distribution for an r× c table has (r− 1)(c− 1)  d.f. Specifically, for this example we get (3− 1)(4 − 1) = 6  d.f.

Rejection Region: Ho will be rejected if the probability of occurrence of χ2 with 6 d.f. under the null hypothesis is less than or equal to α = 0.05.

Decision Rule: Reject Ho if the calculated χ2 exceeds χ2_{1−α} for 6 d.f.

Example calculation:

\begin{aligned}
\chi^2 &= \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\
 &= \frac{(23-7.3)^2}{7.3} + \frac{(40-30.3)^2}{30.3} + \frac{(16-38.0)^2}{38.0} + \frac{(2-5.4)^2}{5.4} \\
 &\quad + \frac{(11-18.6)^2}{18.6} + \frac{(75-77.5)^2}{77.5} + \frac{(107-97.1)^2}{97.1} + \frac{(14-13.8)^2}{13.8} \\
 &\quad + \frac{(1-9.1)^2}{9.1} + \frac{(31-38.2)^2}{38.2} + \frac{(60-47.9)^2}{47.9} + \frac{(10-6.8)^2}{6.8} \\
 &= 33.8 + 3.1 + 12.7 + 2.1 + 3.1 + 0.08 + 1.0 + 0.003 + 7.3 + 1.4 + 3.1 + 1.5 \\
 &= 69.2
\end{aligned}
(5.61)

The calculated test statistic of 69.2 far exceeds the critical value of 22.46 for χ2 at the 0.999 quantile with 6 d.f., so the decision rule is to reject Ho, and the conclusion is that there exists a significant association between Gleason score and tumor grade, with p < 0.001.
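For readers who want to verify the hand calculation, a minimal Python sketch using scipy's chi2_contingency follows; it reproduces the expected counts and gives χ2 of roughly 69 with 6 d.f. (the small difference from 69.2 comes from rounding of the expected counts above).

import numpy as np
from scipy.stats import chi2_contingency

# observed counts: rows = Gleason 1-3, 4-6, 7-9; columns = the four grade categories
obs = np.array([[23, 40,  16,  2],
                [11, 75, 107, 14],
                [ 1, 31,  60, 10]])

chi2, p, dof, expected = chi2_contingency(obs)
print(round(chi2, 1), dof, p)      # roughly 69.4, 6, p << 0.001
print(np.round(expected, 1))       # matches the expected counts shown in the table above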

5.10 Exact Tests for Multiway Tables

Consider the table

          1      2     ...    j     ...    c     Totals
1        n11    n12    ...   n1j    ...   n1c     n1.
2        n21    n22    ...   n2j    ...   n2c     n2.
...      ...    ...          ...          ...     ...
i        ni1    ni2    ...   nij    ...   nic     ni.
...      ...    ...          ...          ...     ...
r        nr1    nr2    ...   nrj    ...   nrc     nr.
Totals   n.1    n.2    ...   n.j    ...   n.c     n

For each row, the multinomial probability of the cell counts {n_{ij}} given the row total n_{i.} is

P(\{n_{i.}\}) = \frac{n_{i.}!}{\prod_j n_{ij}!} \prod_j \pi_{j|i}^{n_{ij}},
(5.62)

where πj|i  is the conditional probability in column j (j = 1,2,...,c ) given row i (Freeman and Halton, 1951; Agresti, 2002).

The probability of the table \{n_{ij}\} is the product of all of the multinomial row probabilities

P(\{n_{ij}\}) = \prod_i \left( \frac{n_{i.}!}{\prod_j n_{ij}!} \prod_j \pi_{j|i}^{n_{ij}} \right).
(5.63)

Under independence, we assume that within each column the conditional cell probabilities are equal across rows, that is, \pi_{j|1} = \pi_{j|2} = \cdots = \pi_{j|r} = \pi_{.j}. Substituting \pi_{.j} for \pi_{j|i} in the equation above and collecting the exponents over rows (so that \sum_i n_{ij} = n_{.j}) gives

P(\{n_{ij}\}) = \prod_i \left( \frac{n_{i.}!}{\prod_j n_{ij}!} \prod_j \pi_{.j}^{n_{ij}} \right)
             = \frac{\prod_i n_{i.}! \; \prod_j \pi_{.j}^{n_{.j}}}{\prod_i \prod_j n_{ij}!}.
(5.64)

The distribution of the cell counts in each row, however, also depends on \pi_{.j}. Fisher treated these as nuisance parameters and conditioned on their sufficient statistics so that the resulting conditional probability does not depend on them. The contribution of \pi_{.j} to the product multinomial distribution depends on the data only through the column totals n_{.j}, which are the sufficient statistics. The set of column totals \{n_{.j}\} has the multinomial distribution

P(\{n_{.j}\}) = \frac{n!}{\prod_j n_{.j}!} \prod_j \pi_{.j}^{n_{.j}}.
(5.65)

The probability of {nij} conditional on {n.j} equals the probability of {nij} divided by the probability of {n.j} , so we invert the above ratio and multiply as

P(\{n_{ij}\}\mid\{n_{.j}\}) = \left( \frac{\prod_i n_{i.}! \; \prod_j \pi_{.j}^{n_{.j}}}{\prod_i \prod_j n_{ij}!} \right) \left( \frac{\prod_j n_{.j}!}{n! \, \prod_j \pi_{.j}^{n_{.j}}} \right)
 = \frac{\prod_i n_{i.}! \; \prod_j n_{.j}!}{n! \, \prod_i \prod_j n_{ij}!}.
(5.66)

P above is first calculated from the observed data. Permutations are then used by re-shuffling the categories of one of the two categorical variables, each time recalculating P^{(b)} (b = 1,2,\ldots,B) in order to simulate the null distribution. After B iterations, the exact p-value is

P = \frac{\#\{b : P^{(b)} < P\}}{B}.
(5.67)

Randomization Test. You will notice that even for small tables, e.g. 3×4, 4×4, 4×5, 5×5, the number of configurations required for fully implementing the exact form of the test can exceed 10^4. Therefore, NXG Explorer uses the following randomization test:

  1. Calculate the raw chi-squared value for the table, call this χ2.
  2. Randomly shuffle (permute) one of the grouping variable's values.
  3. Calculate chi-squared using the grouping variables (with one of them shuffled), call this χ2(b).
  4. Repeat steps 2-3 B = 10,000 times (default).

The p-value for this randomization test is

P = \frac{\#\{b : \chi^2_{(b)} \ge \chi^2\}}{B},
(5.68)

that is, the number of times the chi-squared value for the permuted configurations exceeds or is equal to the chi-squared value for the raw data configuration (table), divided by the number of iterations.
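A minimal Python sketch of this randomization scheme is shown below; it assumes the two categorical variables are supplied as equal-length label vectors (the variable names in the usage comment are illustrative only).

import numpy as np
from scipy.stats import chi2_contingency

def randomization_chi2(x, y, B=10_000, seed=0):
    """Permutation p-value for an r x c table built from two label vectors x and y."""
    rng = np.random.default_rng(seed)
    xi = np.unique(x, return_inverse=True)[1]        # integer codes for the row variable
    yi = np.unique(y, return_inverse=True)[1]        # integer codes for the column variable

    def chi2_of(a, b):
        table = np.zeros((xi.max() + 1, yi.max() + 1))
        np.add.at(table, (a, b), 1)                  # cross-tabulate the two code vectors
        return chi2_contingency(table, correction=False)[0]

    chi2_obs = chi2_of(xi, yi)
    exceed = sum(chi2_of(xi, rng.permutation(yi)) >= chi2_obs for _ in range(B))
    return chi2_obs, exceed / B

# e.g. chi2_obs, p = randomization_chi2(gleason_labels, grade_labels)  # hypothetical inputs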

5.11 McNemar Test for Paired Binary Outcomes

McNemar’s test is a chi-squared test of the significance of change among objects for which the presence (+) or absence (-) of a trait is determined before and after an exposure (treatment) is given. As an example, consider an intervention study to determine whether hearing loss could be significantly improved after cochlear implant therapy among patients with severe tinnitus (ringing in the ears). The number of patients testing positive for severe tinnitus before and after cochlear implant therapy is listed in the table below. McNemar’s test will allow you to determine whether the therapy significantly changed the frequency of tinnitus, and whether the outcome after therapy was better or worse.

The data table for patients undergoing cochlear implant therapy and the presence of tinnitus is given below:





                       Before
After             +           −          Total
  +              330         141        n1. = 471
  −              351         383        n2. = 734
Total        n.1 = 681    n.2 = 524     n = 1205




\begin{aligned}
\chi^2_1 &= \frac{(b - c)^2}{b + c} \\
 &= \frac{(141 - 351)^2}{141 + 351} \\
 &= \frac{(-210)^2}{492} \\
 &= 89.63
\end{aligned}
(5.69)

The degrees of freedom (d.f.) for this test is equal to 1. The calculated test statistic of 89.63 is significant, since it exceeds (or is equal to) the 1− α critical value of chi-squared for α = 0.05  and 1 d.f., χ20.95,1 = 3.84  . Therefore the p-value would be less than 0.05, i.e., p < 0.05  .
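As a quick check of the arithmetic, a minimal Python sketch of the McNemar statistic follows (b and c denote the discordant counts from the table above).

from scipy.stats import chi2

b, c = 141, 351                      # discordant pairs: +/− and −/+ before vs. after
chi2_stat = (b - c) ** 2 / (b + c)   # 89.63
p_value = chi2.sf(chi2_stat, df=1)   # far below 0.05
print(round(chi2_stat, 2), p_value)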

Example 8 - McNemar Test. Open the mcnemar_data.xlsx file that is in the Data folder. Next, select Independence→ McNemar in the Analysis pull-down menu:

PIC

Figure 5.10: Independence pull-down menu for the pair categorical McNemar test.

It is important to realize that both features used for the McNemar test must be binary (0,1) codes. Binary features will have a PIC icon associated with them. Next, select both binary features:

PIC

Figure 5.11: Feature selection during McNemar test.

The results are shown below. Although the change is not statistically significant, 22% more subjects changed from positive to negative than changed from negative to positive, so the pattern indicates that, on average, the cochlear implant therapy resulted in an improvement. Obviously, in clinical medicine, for an efficacious treatment you would expect many more subjects to change from positive to negative, and few subjects whose conditions worsened.

PIC

Figure 5.12: McNemar test results.

Bibliography

[1]   M.S. Bartlett. Properties of sufficiency and statistical tests. Proc. Royal Stat. Soc., Series A. 160: 268–282; 1937.

[2]   W. Behrens. Ein beitrag zu fehlerberechnung bei wenige beobachtungen. Landwirtsscheftliche Jahrbucher. 68:807-837, 1929.

[3]   R. Fisher. The comparison of samples with possibly unequal variances. Annals of Eugenics, 9:174–180, 1939.

[4]   W. Dixon, F. Massey. Introduction to Statistical Analysis. New York (NY), McGraw-Hill, 1969.

[5]   H. Mann, D. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat., 18:50–60, 1947.

[6]   A.M. Mood. On the asymptotic efficiency of certain non-parametric two-sample tests. Ann. Math. Stat., 25: 514–522, 1954.

[7]   B.L. Welch. On the comparison of several mean values: an alternative approach. Biometrika, 38(3/4): 330–336, 1951.

[8]   W. Kruskal, W. Wallis. Use of ranks in one-criterion variance analysis. J. Amer. Stat. Assoc., 47:583–621, 1952.

[9]   S. Siegel. Nonparametric Statistics. New York(NY), McGraw-Hill, 1956.

[10]   G.H. Freeman, J.H. Halton. Note on an exact treatment of contingency, goodness of fit and other problems of significance. Biometrika, 38: 141–149, 1951.

[11]   A. Agresti. Categorical Data Analysis. New York(NY), John Wiley and Sons, 2002.

Chapter 6
ASSOCIATION

6.1 Covariance

Association is defined here as the degree of linear relationship between two continuously-scaled features. The classical example of two associated features in clinical medicine is age and systolic blood pressure (mmHg): as age increases, there is a tendency for systolic blood pressure to increase as well. Recall that the variance of a normally-distributed, continuously-scaled random variable x_i is defined in the form

s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.
(6.1)

If we consider another continuous feature yi  , then the covariance between x and y is

s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).
(6.2)

The above definitions apply to the bivariate case, where association between a pair of features is considered. The covariance for multiple features is covered at the end of this chapter.

6.2 Parametric Correlation: Pearson Product Moment

Pearson Product Moment Correlation. The Pearson product moment correlation coefficient is similar to covariance, but does not include the effect of each feature’s scale. Pearson correlation between features x and y is the ratio of their covariance to the product of their standard deviations given as

r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}.
(6.3)

The correlation coefficient is scale independent and dimensionless, with range ± 1  . When rxy  approaches 1, then there is a strong linear association between x and y . However, as rxy  approaches 0, the linear association between x and y gets weaker. A negative value of rxy  indicates that as one feature goes up in scale, the other feature goes down. Recall that the population-based notation for covariance and standard deviation are σxy  and σx  , σy  , respectively, and thus correlation can be written as

\rho_{xy} = \frac{\sigma_{xy}}{\sqrt{\sigma_x^2 \sigma_y^2}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}.
(6.4)


Figure 6.1: Examples of correlation values as a function of sample size (with significance testing results).

Correlation: Z-score implementation. A much simpler formula for correlation involves the cross-product of the Z-scores for x and y,

z_x(i) = \frac{x_i - \bar{x}}{s_x},
(6.5)

z_y(i) = \frac{y_i - \bar{y}}{s_y},
(6.6)

so that

r_{xy} = \frac{\sum_{i=1}^{n} z_x(i)\, z_y(i)}{n-1}.
(6.7)

We can test the significance of a correlation coefficient using the null hypothesis Ho : r = r0 vs. the alternative hypothesis Ha : r ≠ r0 by use of Fisher's z-transformation. For the sample estimate we use

z_r = 0.5 \log_e\!\left(\frac{1+r}{1-r}\right),
(6.8)

whereas for the hypothesized coefficient r0  we use

z_{r_0} = 0.5 \log_e\!\left(\frac{1+r_0}{1-r_0}\right).
(6.9)

The two statistics are then used to calculate a standard normal deviate

z = \frac{z_r - z_{r_0}}{\sigma_x},
(6.10)

where the standard error is

\sigma_x = \sqrt{\frac{1}{n-3}}.
(6.11)

Worked problem. A research study measured log(fluorescent intensity) values of expression of Corin (protein) in (a) prostate tumors and (b) serum plasma of n = 8 prostate cancer patients. [Note: log(fluorescent intensity) implies that expression values can be negative, e.g., log(0.6) = −0.51.] Determine the covariance and Pearson correlation between the two "tissue compartments", and test whether the correlation is significantly different from r_0 = 0.

Table: Covariance and Pearson correlation between expression of Corin in prostate tumor tissue and serum plasma.









Plasma           Tumor            (x_i1 − x̄1)  (x_i2 − x̄2)   Cov.      z_i1      z_i2     Corr.
log(intensity)   log(intensity)      (A)          (B)        (A)(B)     (C)       (D)     (C)(D)
 -1.69            -1.65              -1.07        -1.12       1.20     -0.77     -0.81     0.62
  0.10             0.98               0.71         1.51       1.08      0.51      1.09     0.56
 -0.04            -0.31               0.57         0.23       0.13      0.41      0.17     0.07
 -0.89            -0.68              -0.27        -0.14       0.04     -0.20     -0.10     0.02
 -2.25            -2.22              -1.64        -1.68       2.76     -1.17     -1.21     1.42
  1.94             1.29               2.55         1.82       4.65      1.82      1.32     2.39
  0.00             0.47               0.61         1.01       0.62      0.44      0.73     0.32
 -2.08            -2.17              -1.46        -1.63       2.39     -1.04     -1.18     1.23
x̄1 = -0.61        x̄2 = -0.54                                Σ = 12.87                     Σ = 6.62
s1 = 1.40         s2 = 1.39                            Σ/(n−1) = 1.8384              Σ/(n−1) = 0.9455

Here z_i1 = (x_i1 − x̄1)/s1 and z_i2 = (x_i2 − x̄2)/s2.








The calculated value of the correlation coefficient is r =0.9455. Let’s now test whether this value of correlation is significantly different from zero, using the null hypothesis Ho : r =  0.00. Substituting in the values, we have

\begin{aligned}
z_r &= 0.5 \ln\!\left(\frac{1 + 0.9455}{1 - 0.9455}\right) \\
 &= 0.5 \ln\!\left(\frac{1.9455}{0.0545}\right) \\
 &= 0.5 \ln(35.6972) \\
 &= 1.7875.
\end{aligned}
(6.12)

For r0  =0.00 we use

\begin{aligned}
z_{r_0} &= 0.5 \ln\!\left(\frac{1 + 0.00}{1 - 0.00}\right) \\
 &= 0.5 \ln\!\left(\frac{1.00}{1.00}\right) \\
 &= 0.5 \ln(1.0000) \\
 &= 0.0000.
\end{aligned}
(6.13)

Substituting in the remaining items, we get

\sigma_x = \sqrt{\frac{1}{8-3}} = 0.4472
(6.14)

\begin{aligned}
Z &= \frac{z_r - z_{r_0}}{\sigma_x} \\
 &= \frac{1.7875 - 0.0000}{0.4472} \\
 &= 4.00
\end{aligned}
(6.15)

So we reject the null hypothesis of zero correlation, since |Z| ≥ 1.96  .
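The worked example can be verified with a short script; the following is a minimal numpy/scipy sketch of the Fisher z-test for a correlation coefficient, using the plasma and tumor values from the table above.

import numpy as np
from scipy.stats import norm

plasma = np.array([-1.69, 0.10, -0.04, -0.89, -2.25, 1.94, 0.00, -2.08])
tumor  = np.array([-1.65, 0.98, -0.31, -0.68, -2.22, 1.29, 0.47, -2.17])

r = np.corrcoef(plasma, tumor)[0, 1]          # about 0.945
n, r0 = len(plasma), 0.0
z_r  = 0.5 * np.log((1 + r)  / (1 - r))       # Fisher z-transform of the sample estimate
z_r0 = 0.5 * np.log((1 + r0) / (1 - r0))      # 0 for r0 = 0
se   = np.sqrt(1.0 / (n - 3))                 # 0.4472
Z    = (z_r - z_r0) / se                      # about 4.0, so reject H0
p    = 2 * norm.sf(abs(Z))                    # two-sided p-value
print(round(r, 4), round(Z, 2), p)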

6.3 Non-parametric Correlation: Spearman Rank

The Spearman rank correlation coefficient, r_s, is an alternative method for determining the correlation between two features x and y. Spearman correlation is a non-parametric method that does not depend on the means and standard deviations of the pair of vectors x and y. Let R^x_1, R^x_2, \ldots, R^x_n be the ranks of x and R^y_1, R^y_2, \ldots, R^y_n be the ranks of y. Spearman correlation is determined as

r_s = 1 - \frac{6D}{n^3 - n},
(6.16)

where:

D = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} \left(R^x_{(i)} - R^y_{(i)}\right)^2.
(6.17)

Significance testing of r_s is based on the standard normal distribution, using the expected value of D,

E(D) = \frac{n^3 - n}{6} = \frac{n(n-1)(n+1)}{6},
(6.18)

the variance of D,

V(D) = \frac{n^2 (n+1)^2 (n-1)}{36},
(6.19)

and the test statistic

Z = \frac{D - E(D)}{\sqrt{V(D)}}.
(6.20)

Worked problem. A research group studied log(expression) of two serum plasma proteins (A & B) in n = 8 laboratory mice after several weeks of feeding on a high-fat diet. [Note: log(expression) implies that expression values can be negative, e.g., log(0.6) = −0.51.] Determine the Spearman rank correlation between the two proteins, and test whether the correlation is significantly different from r_0 = 0.

Table: Spearman correlation between expression values of two serum plasma proteins.







Protein A         Protein B         Rx    Ry    (Rx − Ry)   (Rx − Ry)²
log(expression)   log(expression)   (A)   (B)   (A)-(B)     ((A)-(B))²
 -0.18              0.26             5     7       -2           4
 -0.57             -1.03             4     2        2           4
  0.59              0.18             7     6        1           1
  1.78              1.50             8     8        0           0
 -1.97             -1.46             1     1        0           0
  0.53             -0.26             6     5        1           1
 -0.83             -0.75             2     4       -2           4
 -0.69             -0.85             3     3        0           0
                                                            Σ = 14






Substituting in the results, we get:

D = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} \left(R^x_{(i)} - R^y_{(i)}\right)^2 = 14 \text{ (from the table)}
(6.21)

\begin{aligned}
r_s &= 1 - \frac{6D}{n^3 - n} \\
 &= 1 - \frac{(6)(14)}{512 - 8} \\
 &= 1 - \frac{84}{504} \\
 &= 1 - 0.1667 \\
 &= 0.8333
\end{aligned}
(6.22)

Significance testing of rs  is based on the standard normal distribution. Using the expected value of D as

\begin{aligned}
E(D) &= \frac{n^3 - n}{6} \\
 &= \frac{n(n-1)(n+1)}{6} \\
 &= \frac{(8)(7)(9)}{6} \\
 &= \frac{504}{6} \\
 &= 84,
\end{aligned}
(6.23)

and the variance of D as

\begin{aligned}
V(D) &= \frac{n^2 (n+1)^2 (n-1)}{36} \\
 &= \frac{8^2 (8+1)^2 (8-1)}{36} \\
 &= \frac{(64)(81)(7)}{36} \\
 &= \frac{36288}{36} \\
 &= 1008.00
\end{aligned}
(6.24)

and upon substitution, we have

\begin{aligned}
Z &= \frac{D - E(D)}{\sqrt{V(D)}} \\
 &= \frac{14 - 84}{\sqrt{1008}} \\
 &= \frac{-70}{31.75} \\
 &= -2.20
\end{aligned}
(6.25)

So we reject the null hypothesis of zero correlation, since |Z| ≥ 1.96.
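A minimal Python sketch of the Spearman calculation follows, using the protein values from the table above; scipy's spearmanr is shown alongside the large-sample Z-statistic defined in (6.18)-(6.20).

import numpy as np
from scipy.stats import spearmanr

protein_a = np.array([-0.18, -0.57, 0.59, 1.78, -1.97,  0.53, -0.83, -0.69])
protein_b = np.array([ 0.26, -1.03, 0.18, 1.50, -1.46, -0.26, -0.75, -0.85])

rs, _ = spearmanr(protein_a, protein_b)                 # 0.8333 (no ties in these data)
n = len(protein_a)
ranks_a = protein_a.argsort().argsort()                 # 0-based ranks; differences are unchanged
ranks_b = protein_b.argsort().argsort()
D = np.sum((ranks_a - ranks_b) ** 2)                    # 14
Z = (D - (n**3 - n) / 6) / np.sqrt(n**2 * (n + 1)**2 * (n - 1) / 36)   # about -2.20
print(round(rs, 4), int(D), round(Z, 2))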

The advantages of Spearman correlation are as follows:

  1. Less sensitive to bias due to outlier effects;
  2. Does not require assumption of normality, although requires assumptions about symmetry of a Gaussian-like distribution.

The disadvantages are:

  1. Lower power if assumptions for using parametric (Pearson) are true;
  2. Calculations become tedious in spreadsheets;
  3. Ties are important and must be factored into computation.

6.4 Force Plots from Minimum Spanning Trees

NXG Explorer also provides force plots derived from the minimum spanning trees of the correlation matrices. First, the distance between features (objects) j and k is determined as d_{jk} = \sqrt{2(1 - r_{jk})}, where r_{jk} is the Pearson or Spearman rank correlation. Prim's [1] algorithm is then applied to the distance matrix to construct the minimum spanning tree.
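A minimal Python sketch of this construction is shown below, assuming the data are supplied as a numpy array with features in columns; the force-directed layout itself is not reproduced here, and scipy's MST routine is used in place of Prim's algorithm.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def correlation_mst(X):
    """Minimum spanning tree over the features of X using d_jk = sqrt(2(1 - r_jk))."""
    r = np.corrcoef(X, rowvar=False)               # feature-by-feature Pearson correlation
    d = np.sqrt(np.clip(2.0 * (1.0 - r), 0.0, None))
    # note: exactly zero distances (perfectly correlated features) would need special handling,
    # since minimum_spanning_tree treats zero entries as missing edges
    mst = minimum_spanning_tree(d)                 # sparse matrix; nonzero entries are tree edges
    edges = np.transpose(mst.nonzero())
    return d, edges

# e.g. d, edges = correlation_mst(data_matrix); `edges` can feed a force-directed layout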

Example 1 - Pearson and Spearman rank correlation. This run will determine the Pearson and Spearman correlation between all pairs of continuous features in the Retinol data. To begin, first open the Retinol dataset. Next, select Analysis → Association → Correlation, Covariance, or Euclidean distance from the Analysis pull-down menu.

PIC

Next, select all the continuous features in the Retinol dataset, then use the default settings in the popup window shown below:

PIC

The resulting correlation matrix will be shown after clicking on the treenode icon for “Pearson correlation (html)”:

PIC

Using html format, the Pearson and Spearman correlation coefficients are provided along with color coding to distinguish significant negative and positive correlation coefficients. In other words, a test of significance is automatically performed at run-time, and testing results are revealed via color coding:

PIC

An additional plot is made showing the p-value results based on the same color coding used for the html output:

PIC

Click on the Force plot(Pearson) and Force plot(Spearman) icons in the treenodes, and the following images will appear:

PIC

The minimum spanning trees shown above depict the various relationships between the features employed; differences between force plots constructed from Pearson and Spearman correlation matrices are due to the range and scale of the features used. In cases where outliers are present in the data, Spearman rank-based force plots will reveal a better picture of the true minimum spanning tree.

Example 2 - Object-by-Object correlation. Correlation is commonly applied to the features, however, it can also be applied to objects to identify objects with similar correlation across their feature values. This example will use a dataset based on expression of 23 genes in 4 classes of n = 63  small round blue cell tumors (SRBCT). First, open the SRBCT_wo_classes.xlsx dataset. Transpose the entire dataset by selecting the following commands:

PIC

Then select all the features in the dataset, without selecting the Text-based features.

PIC

Then, select all of the objects for correlation analysis. The results of the significance plot for Pearson correlation are shown below:

PIC

The 23 genes in the SRBCT dataset were identified as the best class predictors for the four tumor types, and the high co-regulation (co-expression) among genes within the four classes (EWS-Ewing's sarcoma, BL-Burkitt's lymphoma, NB-neuroblastoma, and RMS-rhabdomyosarcoma) can be seen in the correlation significance plot, where large red areas indicate statistically significant positive correlation. The significant positive correlation (red) is apparent within each class, but not between classes. This object-by-object approach to correlation can add considerable strength to exploratory analysis by confirming impressions about correlation between objects.

6.5 Multivariate Forms of Covariance and Correlation

Let M_{n,p}(ℜ) be the set of all n × p matrices over the field of real numbers, and M_p(ℜ) the set of symmetric p-square matrices over the field of real numbers. Define the data matrix Y ∈ M_{n,p}(ℜ) with n rows and p columns as a function on the pairs of integers (i,j), 1 ≤ i ≤ n, 1 ≤ j ≤ p, with values in Y in which y_{ij} designates the value of Y at the pair (i,j), shown as

\underset{n\times p}{Y} = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ y_{21} & y_{22} & \cdots & y_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{np} \end{pmatrix}.
(6.26)

The sample mean for the j th column of Y is

\bar{y}_j = \frac{1}{n} \sum_{i=1}^{n} y_{ij},
(6.27)

so that the p ×1  sample mean vector is given by

\bar{y} = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \vdots \\ \bar{y}_p \end{pmatrix}.
(6.28)

The n × p standardized data matrix of Y is

\underset{n\times p}{Z} = \begin{pmatrix} z_{11} & z_{12} & \cdots & z_{1p} \\ z_{21} & z_{22} & \cdots & z_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n1} & z_{n2} & \cdots & z_{np} \end{pmatrix},
(6.29)

with elements

z_{ij} = \frac{y_{ij} - \bar{y}_j}{s_{jj}}.
(6.30)
(6.30)

The sample covariance matrix of Y is

\underset{p\times p}{S} = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix},
(6.31)

with diagonal elements determined with the form

s_{jj} = \frac{1}{n-1} \sum_{i=1}^{n} (y_{ij} - \bar{y}_j)^2,
(6.32)

and off-diagonal elements as

s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k).
(6.33)

Below is a plot of the covariance matrix for all features in the Retinol dataset:

PIC

The p × p symmetric sample correlation matrix of Y is

\underset{p\times p}{R} = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & r_{pp} \end{pmatrix}
(6.34)

with elements

r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj} s_{kk}}}.
(6.35)

6.6 Euclidean Distance

The proximity matrix D contains the Euclidean distances between each pair of features and appears in the form

\underset{p\times p}{D} = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1p} \\ d_{21} & d_{22} & \cdots & d_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ d_{p1} & d_{p2} & \cdots & d_{pp} \end{pmatrix}
(6.36)

with elements

d_{jk} = \left( \sum_{i=1}^{n} (y_{ij} - y_{ik})^2 \right)^{1/2}.
(6.37)

The Euclidean distance matrix for all Retinol dataset features is shown below:

PIC

Columns of Y are usually measured on different scales (units), and it is therefore possible for one or more features to strongly influence the elements in D simply because of large scale differences. To remove scale differences, the correlation matrix R is often used in place of the covariance matrix because its elements are scale-independent and show potentially much less variation across features. We advocate use of the correlation matrix R generated from the standardized data matrix Z rather than from Y in order to minimize additive and relative changes across features while removing possible scale effects.
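A minimal numpy sketch of the matrices defined in this and the previous section follows; it assumes the data matrix Y is supplied with objects in rows and features in columns.

import numpy as np

def association_matrices(Y):
    """Sample covariance S, correlation R, and between-feature Euclidean distances D."""
    Y = np.asarray(Y, dtype=float)
    S = np.cov(Y, rowvar=False)                 # p x p covariance, (n-1) denominator
    R = np.corrcoef(Y, rowvar=False)            # p x p correlation
    diff = Y[:, :, None] - Y[:, None, :]        # n x p x p pairwise column differences
    D = np.sqrt((diff ** 2).sum(axis=0))        # d_jk = sqrt( sum_i (y_ij - y_ik)^2 )
    return S, R, D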

6.7 Matrix Formulation of Covariance and Correlation

The definitions of bivariate and multivariate covariance and correlation described in the preceding sections can be further expounded on from the perspective of matrix algebra. The use of covariance and correlation matrices allows us to consider additional concepts involving their determinants, eigenanalysis, and orthogonality.

Design (Data) Matrix. Consider the following design (data) matrix:

\underset{(n\times p)}{X} = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1j} & \cdots & x_{1p} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2j} & \cdots & x_{2p} \\ \vdots & & & \ddots & \vdots & & \vdots \\ x_{i1} & x_{i2} & x_{i3} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & & & & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nj} & \cdots & x_{np} \end{bmatrix}.
(6.38)

By definition, the population covariance matrix is

\Sigma \overset{\text{def}}{=} E\!\left[(x - \mu)^{\top}(x - \mu)\right],
(6.39)

where x is a p -tuple (xi1,xi2,...,xip)  for the i th object and μ is a p -tuple with means for the p features. Because it is impossible to measure Σ , we are limited to working with the sample covariance matrix, given as

S = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^{\top}(x_i - \bar{x}).
(6.40)

Let D = diag(S)^{1/2} be the diagonal matrix of feature standard deviations; the sample correlation matrix is then

R = D^{-1} S D^{-1}.
(6.41)

During computation, elements rjk  of R can be determined individually using the relationship

r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj} s_{kk}}} = \frac{s_{jk}}{s_j s_k}.
(6.42)

If the means of columns of X are known, covariance elements sjk  of S can be directly calculated using the functional form

s_{jk} = \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k),
(6.43)

whereas, if the standard deviations and correlations are already known, then the covariance matrix elements can be calculated using

sjk = rjksjsk.
(6.44)

Notation.

  1. Σ : Population covariance matrix
  2. ρ : Population correlation matrix
  3. C : Sample covariance matrix
  4. R : Sample correlation matrix
  5. CE = EΛ   (C = EΛE^⊤)
  6. RE = EΛ   (R = EΛE^⊤)
  7. λ_1 ≥ λ_2 ≥ ⋅⋅⋅ ≥ λ_p > 0 : Eigenvalues of Σ or ρ
  8. l_1 ≥ l_2 ≥ ⋅⋅⋅ ≥ l_p > 0 : Eigenvalues of C or R

Properties of Covariance Matrix.

  1. Elements σ_{jk} = ρ_{jk} σ_j σ_k
  2. \sum_{j=1}^{p} λ_j = tr(Σ) = \sum_{j=1}^{p} σ_{jj}
  3. Loadings ρ_{y_j,x_k} = e_{kj} \sqrt{λ_j} / \sqrt{σ_{kk}}
  4. Determinant |Σ| = \prod_{j=1}^{p} λ_j

Properties of Correlation Matrix.

  1. Elements ρ_{jk} = σ_{jk} / \sqrt{σ_{jj} σ_{kk}}
  2. \sum_{j=1}^{p} λ_j = tr(ρ) = \sum_{j=1}^{p} ρ_{jj} = p
  3. Loadings ρ_{y_j,x_k} = e_{kj} \sqrt{λ_j}
  4. Determinant |ρ| = \prod_{j=1}^{p} λ_j
  5. |C| ≠ |R| in general, since each determinant is the product of its own eigenvalues

Independence and Multicollinearity. Under independence of features (no covariance), C and R have the orthogonality properties listed below; multicollinearity among features is treated in the regression diagnostics of Section 7.2.1.

Orthogonality

  1. Features are orthogonal if their covariance (correlation) is zero.
  2. Under orthogonality, C = I , R = I . An example of an identity matrix is
     \underset{(p\times p)}{I} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 & \cdots & 0 \\ \vdots & & & \ddots & & & \vdots \\ 0 & 0 & 0 & \cdots & 1 & \cdots & 0 \\ \vdots & & & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix}
    (6.45)

Bibliography

[1]   R.C. Prim. Shortest connection networks and some generalizations. Bell System Technical Journal. 36(6): 1389–1401, 1957.

Chapter 7
DEPENDENCY

7.1 Linear Regression: Single Predictor

Linear regression focuses on the dependency between a dependent variable (feature) y and an independent variable (feature) x  [12]. Define the following estimators:

\begin{aligned}
\beta_0 &: \text{regression coefficient for the } y\text{-intercept, i.e., value of } y \text{ when } x = 0 \\
\beta_1 &: \text{regression coefficient for slope, i.e., change in } y \text{ for a 1-unit change of } x \\
y_i &: \text{observed } y\text{-value for the } i\text{th object} \\
\hat{y}_i = \beta_0 + \beta_1 x_i &: \text{predicted value of } y \text{ for the } i\text{th object} \\
e_i = y_i - \hat{y}_i &: \text{residual, difference between observed and predicted } y\text{-values}
\end{aligned}
(7.1)

Next, let the objective function Q(\beta_0,\beta_1) depend on the parameters \beta_0 and \beta_1. The objective function is minimized using the method of least squares, for which the sum of the squared residuals, \sum e_i^2, is minimized. Recall that the residuals are e_i = y_i - \hat{y}_i = y_i - \beta_0 - \beta_1 x_i.

\min_{\beta_0,\beta_1} Q(\beta_0,\beta_1) = \min_{\beta_0,\beta_1} \sum_{i=1}^{n} e_i^2 = \min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.
(7.2)

The above section introduced linear regression based on a single predictor; however, it is more common to solve regression problems with multiple predictors in order to adjust for the effects of confounders or nuisance factors. Therefore, we focus more on multiple linear regression, introduced below.

7.2 Multiple Linear Regression

Multiple linear regression employs more than one predictor feature, and still only uses one outcome y -feature. The calculations used for determining regression coefficients are based on matrix algebra, which is described in the section below.

Matrix Form of Multiple Linear Regression. A standard multiple linear regression model is

y = X β + e.
(7.3)

In matrix form, the model equation is expressed as

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix},
(7.4)

where X is the data matrix with dimensions n × p, n is the number of rows (objects) in X, p is the number of columns (features) in X, i indexes the ith row (i = 1,2,\ldots,n), and j indexes the jth feature (column) (j = 1,2,\ldots,p).

In theory, since we cannot access an infinitely large universal population to make inferences, we have to generalize results using a sample. When doing so, we replace the β terms with the maximum likelihood estimates βˆ . Without loss of generality, we nevertheless use βˆj  interchangeably with βj  .

By use of calculus (involving derivatives of several matrices) it can be shown that the normal equations can be rearranged to solve for the regression coefficients in the form

\beta = (X^{\top} X)^{-1} X^{\top} y.
(7.5)

The necessary matrix operations for solving for the regression coefficients are

\underset{(p\times 1)}{\beta} = \left( \underset{(p\times n)}{X^{\top}}\, \underset{(n\times p)}{X} \right)^{-1} \underset{(p\times n)}{X^{\top}}\, \underset{(n\times 1)}{y}
 = \left( \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pn} \end{bmatrix} \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \right)^{-1} \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pn} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
(7.6)

Matrix notation when no constant term is used. When no constant term is used, the matrix notation is:

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
(7.7)

Matrix notation when constant term is used. Using expanded form, when a constant term is introduced into the regression model, a one is “padded” into the left side of the  X matrix in column one:

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
(7.8)

Once regression coefficients β are fitted, we can cross-multiply them against their relevant x-values and add the products to obtain the predicted y-value, also known as “y-hat,” given as

\hat{y}_i = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_{ij},
(7.9)

where ˆβ0  is an estimate of the intercept, and ˆβj  is an estimate of the j th regression coefficient for feature xj  . Using the y-hats, the residual for each observation can then be determined as

ei = yi − ˆyi.
(7.10)
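A minimal numpy sketch of this fitting step is shown below, assuming the predictors are supplied without the constant column (a column of ones is padded on the left, as described above); np.linalg.solve is used in place of an explicit inverse for numerical stability.

import numpy as np

def fit_ols(X, y):
    """Least-squares fit via the normal equations: beta = (X'X)^(-1) X'y, with intercept."""
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])  # pad column of ones
    beta = np.linalg.solve(X1.T @ X1, X1.T @ y)      # solves the normal equations
    y_hat = X1 @ beta                                # predicted values
    resid = y - y_hat                                # residuals e_i = y_i - yhat_i
    return beta, y_hat, resid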

Variance Partitioning – Sums of Squares. Multiple linear regression uses matrix algebra to determine the partitioned sum-of-squares. The total sum of squares for multiple regression is

SSTO = y^{\top} y - \left(\frac{1}{n}\right) y^{\top}\mathbf{1}\mathbf{1}^{\top} y,
(7.11)

where

y^{\top} y = \sum_{i=1}^{n} y_i^2,
(7.12)

and

\left(\frac{1}{n}\right) y^{\top}\mathbf{1}\mathbf{1}^{\top} y = \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}.
(7.13)

The sum of squares error is functionally composed as

SSE = y^{\top} y - \beta^{\top} X^{\top} y,
(7.14)

while the sum of squares regression is

SSR = \beta^{\top} X^{\top} y - \left(\frac{1}{n}\right) y^{\top}\mathbf{1}\mathbf{1}^{\top} y.
(7.15)

The sums of squares help us partition the variance attributable to the regression and the error remaining after the regression coefficients are fit. If SSE = 0, the model fits the data perfectly. As SSE increases, the original y-values fall further from their predicted values, \hat{y}_i. If there is too much variation in the residuals, i.e., a large σ2, the SSE will be large; in addition, the straight-line model assumption may be inappropriate. If we assume that the linear straight-line assumption holds, then an estimate of σ2 can be obtained from SSE. We call this estimate the mean square error, or MSE. The square root of the MSE, \sqrt{MSE}, is called the root mean square error.

MSE = \hat{\sigma}^2 = \frac{1}{n-p} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n-p}\, SSE.
(7.16)

Note: MSE is an unbiased estimator of σ2, which is the variance of the residuals, e_i = y_i - \hat{y}_i. σ2 ≈ MSE and its square root, σ ≈ RMSE, are used throughout later calculations, especially for the t-tests for regression coefficients, prediction intervals, and confidence intervals. Please note that an alternative notation for σ2 is S^2_{y|x}.

Finally, we calculate the RMSE using the notation

RMSE = \sqrt{MSE} = \hat{\sigma}.
(7.17)

The mean square due to regression, MSR, is also calculated, but it is used only in the ANOVA table for the global F-test, whereas the MSE is used in multiple later calculations.

After partitioning the variance due to our regression model, we next construct an ANOVA table, which sets up the F-ratio test of the hypothesis H0 : β1 = β2 = ⋅⋅⋅ = βj = ⋅⋅⋅ = βp = 0. The alternative hypothesis (your belief) is that at least one of the coefficients is not equal to zero: H1 : βj ≠ 0. The test statistic for the F-ratio test is F = MSR/MSE. The calculated F statistic approximates a variate from the F-ratio probability distribution, which has two parameters: the numerator degrees of freedom, ν, and the denominator degrees of freedom, ω. In a linear regression problem, the F-statistic's numerator degrees of freedom, ν, is always equal to p − 1, while the denominator degrees of freedom, ω, is always equal to n − p. The operation p − 1 implies subtracting one from the number of regression coefficients; in a single-predictor model p = 2, due to β0 and β1. The denominator degrees of freedom ω is simply the sample size n minus the number of regression coefficients, which again is p = 2 for a single-predictor regression model. The degrees of freedom are required because SSR and SSE are estimators, which are penalized by certain degrees of freedom during their formation. As the F-statistic increases, it suggests that the regression model explains more of the variance of y relative to the error. The ANOVA table is provided below.







Source of variation    Computational formula             df       MS                   F
Due to regression      SSR = Σ_{i=1}^{n} (ŷi − ȳ)²       p − 1    MSR = SSR/(p − 1)    F = MSR/MSE
Due to error           SSE = Σ_{i=1}^{n} (yi − ŷi)²      n − p    MSE = SSE/(n − p)
Total                  SSTO = Σ_{i=1}^{n} (yi − ȳ)²      n − 1






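A minimal numpy/scipy sketch of the variance partitioning and global F-test follows; applied to the 20-row example below, it should reproduce (up to rounding) SSR ≈ 2.35, SSE ≈ 3.74, F ≈ 3.35, and R2 ≈ 0.386.

import numpy as np
from scipy.stats import f as f_dist

def anova_table(X, y):
    """SSTO/SSR/SSE decomposition and global F-test for a linear model with intercept."""
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
    y_hat = X1 @ beta
    n, p = X1.shape                              # p counts the intercept as a coefficient
    ssto = np.sum((y - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    ssr = ssto - sse
    msr, mse = ssr / (p - 1), sse / (n - p)
    F = msr / mse
    p_value = f_dist.sf(F, p - 1, n - p)
    r2 = ssr / ssto                              # coefficient of determination
    return ssr, sse, ssto, F, p_value, r2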
Example. A researcher was interested in studying the dependency of a dependent feature, Y , on three independent features, X1  , X2  , and X3  . Determine the percentage of variance explanation of Y by the X -predictors. Assume that linear regression is appropriate for determining the relationship between Y and the independent features. The table below shows the values of the features assumed.






ID     Y     X1     X2     X3
 1    1.2   -1.0   -0.7    0.9
 2    2.0   -0.9    1.5    0.4
 3    1.9    0.9   -0.2    1.3
 4    1.7   -1.5    0.1   -0.1
 5    1.6    1.4    0.9   -0.9
 6    1.8   -0.9    1.5    0.7
 7    0.0   -1.1   -3.0    0.6
 8    0.6   -0.3    1.0   -1.5
 9    1.9    0.6   -1.5    0.7
10    1.5   -1.4    1.5   -1.4
11    0.9    0.1   -1.4   -1.6
12    1.0    0.3   -0.2   -0.1
13    1.5   -0.8    0.1    1.5
14    1.5   -0.3    0.3   -0.3
15    0.6   -0.6   -0.6    0.3
16    1.1    1.8   -1.1   -1.1
17    2.2    0.9   -0.9    0.2
18    0.9   -0.2   -0.4    1.1
19    1.5    0.4   -0.1   -0.1
20    1.0   -1.1    1.5    0.2





ANOVA Table Results. Below is the resulting ANOVA table:






Source of variation     SS     df     MS     F
Regression             2.35     3    0.78   3.35
Error                  3.74    16    0.23
Total                  6.09    19





In the ANOVA table, the formal statement of the null and alternative hypotheses is

H_0 : \beta_1 = \beta_2 = \cdots = \beta_j = \cdots = \beta_p = 0
H_1 : \beta_j \ne 0 \text{ for at least one } j
(7.18)

If H0 holds, F ∼ F_{p−1,n−p}. Large values of F lead to conclusion H1. For this example, since the calculated test statistic F = 3.35 exceeded the critical value F_{0.95,3,16} = 3.24 (based on 1 − α = 1 − 0.05 = 0.95, ν = 3, ω = 16), we conclude that H1 is true, suggesting that at least one of the coefficients is not equal to zero.

Notation for the variance-covariance matrix of coefficients is:

\sigma^2 (X^{\top}X)^{-1} = V(\beta) = \begin{bmatrix} \sigma^2(\beta_1) & \sigma(\beta_1,\beta_2) & \sigma(\beta_1,\beta_3) & \sigma(\beta_1,\beta_4) \\ \sigma(\beta_2,\beta_1) & \sigma^2(\beta_2) & \sigma(\beta_2,\beta_3) & \sigma(\beta_2,\beta_4) \\ \sigma(\beta_3,\beta_1) & \sigma(\beta_3,\beta_2) & \sigma^2(\beta_3) & \sigma(\beta_3,\beta_4) \\ \sigma(\beta_4,\beta_1) & \sigma(\beta_4,\beta_2) & \sigma(\beta_4,\beta_3) & \sigma^2(\beta_4) \end{bmatrix}
(7.19)

Results for the variance-covariance matrix of coefficients:

\sigma^2 (X^{\top}X)^{-1} = V(\beta) = \begin{bmatrix} 0.0124 & 0.0031 & 0.0015 & 0.0003 \\ 0.0031 & 0.0153 & 0.0037 & 0.0036 \\ 0.0015 & 0.0037 & 0.0098 & 0.0028 \\ 0.0003 & 0.0036 & 0.0028 & 0.0150 \end{bmatrix}
(7.20)

The source of s.e.(βj)  for the denominator of the t-test for each βj  is:

\sigma^2 (X^{\top}X)^{-1} = V(\beta) = \begin{bmatrix} \sqrt{\sigma^2(\beta_1)} & \sigma(\beta_1,\beta_2) & \sigma(\beta_1,\beta_3) & \sigma(\beta_1,\beta_4) \\ \sigma(\beta_2,\beta_1) & \sqrt{\sigma^2(\beta_2)} & \sigma(\beta_2,\beta_3) & \sigma(\beta_2,\beta_4) \\ \sigma(\beta_3,\beta_1) & \sigma(\beta_3,\beta_2) & \sqrt{\sigma^2(\beta_3)} & \sigma(\beta_3,\beta_4) \\ \sigma(\beta_4,\beta_1) & \sigma(\beta_4,\beta_2) & \sigma(\beta_4,\beta_3) & \sqrt{\sigma^2(\beta_4)} \end{bmatrix}
(7.21)

After taking the square root of each diagonal element, we get:

s.e.(\beta_1) = 0.1114, \quad s.e.(\beta_2) = 0.1239, \quad s.e.(\beta_3) = 0.0988, \quad s.e.(\beta_4) = 0.1224.
(7.22)

Let us now determine whether a regression coefficient \beta_j is statistically significant. If \beta_j is a reliable estimate of the change in Y for a 1-unit change in feature X_j, then its value will be large in magnitude, either positive or negative. A positive slope suggests that Y increases with increasing X_j, whereas a negative slope indicates that Y decreases with increasing X_j. Using the standard error of each regression coefficient, s.e.(\beta_j), we calculate the t-test for each coefficient \beta_j (j = 1,2,\ldots,p) using the relationship

t_j = \frac{\beta_j}{s.e.(\beta_j)}.
(7.23)

The formal statement of the null and alternative hypotheses for any regression coefficient is:

H_0 : \beta_j = 0
H_1 : \beta_j \ne 0
(7.24)

If H0  holds, t ∼ t1−α∕2;n−p  . Large values of the calculated test statistic |tj| lead to concluding H1  is true.
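A minimal numpy/scipy sketch of these coefficient t-tests follows; it assumes a model with an intercept and predictors supplied as columns of X.

import numpy as np
from scipy.stats import t as t_dist

def coefficient_tests(X, y):
    """t-tests for regression coefficients using sigma^2 (X'X)^(-1)."""
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    n, p = X1.shape
    XtX_inv = np.linalg.inv(X1.T @ X1)
    beta = XtX_inv @ X1.T @ y
    mse = np.sum((y - X1 @ beta) ** 2) / (n - p)
    cov_beta = mse * XtX_inv                     # variance-covariance matrix of coefficients
    se = np.sqrt(np.diag(cov_beta))              # standard errors from the diagonal
    t_stats = beta / se
    p_values = 2 * t_dist.sf(np.abs(t_stats), n - p)
    return beta, se, t_stats, p_values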

Regression coefficient results. Below are the regression coefficients, and t-test results:






Variable    βj        s.e.(βj)    t         Prob
Const       1.39434   0.11144     12.5119   0.0000
X1          0.26015   0.12386      2.1004   0.0519
X2          0.26492   0.09878      2.6821   0.0164
X3          0.22297   0.12244      1.8210   0.0874





We notice that feature X2  is a significant predictor of Y , since (P < 0.05)  , but that X1  and X3  are not.

Multiple Coefficient of Determination, R2. One of the overall goals of regression is to estimate the degree of variance explanation of Y by introduction of the regression coefficients. This is determined using the coefficient of determination, R2, and the multiple correlation coefficient, R. Notice that R2 reveals the "percentage of variance explanation" of the dependent feature, y, by the independent x-predictors in the model.

R^2 = \frac{SSR}{SSTO} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{2.34856}{6.09066} = 0.38560
(7.25)

R = \sqrt{R^2} = \sqrt{0.38560} = 0.62097
(7.26)

Similarity between regression and correlation. Although similar to correlation, the difference is that correlation measures the tightness of data points around a straight line irrespective of slope, whereas regression focuses on the dependency of the dependent y-feature(s) on one or more predictor x-features. The correlation coefficient is a standardized regression coefficient for the regression Y = \beta_1 X, since

\beta_1 = \rho_{xy} \sqrt{\frac{\sigma_y^2}{\sigma_x^2}}.
(7.27)

The goal of regression is to determine the amount of variance explained by the independent features and the slope, measured by the regression coefficient, for each independent feature. The degree of slope reflects the linear relationship between the dependent feature(s) and the independent features.

Centering and estimation of Intercept Term. Prior to each regression run, NXG Explorer centers each predictor feature by subtracting off the mean value from each observation:

xij = xij − ¯xj,
(7.28)

where ¯xj  is the mean for feature j . After a regression run, NXG Explorer determines the y -intercept (constant term) as follows:

\beta_0 = \bar{y} - \sum_{j=1}^{p} \beta_j \bar{x}_j.
(7.29)

The standard error of the y -intercept is determined using the relationship:

s.e.(\beta_0) = \sqrt{ \frac{(n-1)\, s_y^2\, \frac{SSE}{SSTO}}{n(n-p)} + \sum_{j=1}^{p} \bar{x}_j^2\, \sigma^2(\beta_j) + 2 \sum_{j=1}^{p-1} \sum_{k=j+1}^{p} \bar{x}_j \bar{x}_k\, cov(\beta_j,\beta_k) },
(7.30)

where n is the number of objects, p is the number of features in the model, s2y  is the variance of y , and cov(βj,βk)  is element (j,k)  of the variance-covariance matrix of coefficients, σ2(X⊤X )−1  .

7.2.1 Regression Diagnostics

Regression diagnostics are statistical techniques for detecting conditions which can lead to inaccurate or invalid regression results. The techniques described below include residual analysis, influence measures (leverage, Cook's distance, DFFITS, DFBETAS), and multicollinearity diagnostics (variance inflation factors).

Model Interpretation: DACOD Acronym. We can use the DACOD acronym for assuring valid regression analysis and interpretation:

  1. Distribution - is normality assumed?
  2. Adjustments - what else is in the model?
  3. Coefficients - s.e.(βj)  , direction, i.e., βj > 0,βj < 0  , and p-values.
  4. Outliers - can strongly bias results.
  5. Diagnostics - goodness-of-fit (R2  ), leverage, multicollinearity, etc.

Most regression analyses include a constant term (y-intercept) in the regression model. The basic data setup for regression diagnostics starts with the residuals, which are based on subtracting the predicted y, or "y-hat", \hat{y}_i, from the observed y-value, y_i. Recall that the subscript i represents the ith object or subject (human, mouse, rabbit, etc). Let's look at the ways the regression model can be written:

Recall, for multiple linear regression the vector and matrix notation is

y = X β + e,
(7.31)

which is equal to

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
(7.32)
(7.32)

To calculate the predicted y-hat ˆyi  for the i th object, we simply cross multiply the x-values for the i th object with the corresponding fitted coefficients and add the constant term, given as

\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} = \beta_0 + x^{\top}\beta = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j,
(7.33)

where x^{\top}\beta (also written \sum x\beta or xb) is called the linear predictor.

Residuals, e_i. Residuals are defined as

e_i = y_i - \hat{y}_i,
(7.34)

and reflect the disagreement between the observed and predicted y-values. Residuals are distributed with mean zero and common variance, N(0, \sigma^2).

A data setup including the y-hats and residuals would be listed as

y     x1     x2     ...   xp     yhat   resid
y1    x11    x12    ...   x1p    ŷ1     e1
y2    x21    x22    ...   x2p    ŷ2     e2
y3    x31    x32    ...   x3p    ŷ3     e3
y4    x41    x42    ...   x4p    ŷ4     e4
y5    x51    x52    ...   x5p    ŷ5     e5
...   ...    ...    ...   ...    ...    ...
yi    xi1    xi2    ...   xip    ŷi     ei
...   ...    ...    ...   ...    ...    ...
yn    xn1    xn2    ...   xnp    ŷn     en
Once the residuals are determined, the sum of square error can also be estimated as follows:

SSE = \sum_{i=1}^{n} e_i^2
(7.35)

\hat{\sigma}^2 = MSE = \frac{1}{n-p} \sum_{i=1}^{n} e_i^2
(7.36)

\hat{\sigma} = RMSE = \sqrt{MSE}.
(7.37)
(7.37)

Residuals

The initial focus in regression diagnostics involves determination of residuals for identifying outliers. An outlier is an observation (record) that is overly influential on the regression coefficients, leading to biased coefficients. The sections below define the most commonly used residuals for multiple linear regression.

Standardized residuals, z_i. Standardized residuals are defined as

z_i = \frac{e_i}{s},
(7.38)

where s = \sqrt{MSE}. Standardized residuals are distributed N(0,1). If a standardized residual for an observation is below −2 or above 2 (very close to ±1.96), the residual e_i = y_i - \hat{y}_i is more than two standard deviations from the mean. Any observation whose standardized residual is more than two standard deviations from the mean of all residuals is viewed as an outlier.

Studentized Residuals, t_i. Studentized residuals are defined as

t_i = \frac{e_i}{s\sqrt{1 - h_{ii}}},
(7.39)

where s = \sqrt{MSE} and h_{ii} is the leverage of the ith observation (leverages are discussed below). Studentized residuals are t-distributed with n - p - 1 d.f.

Deletion Variance, s_{-i}^2. Another measure of the residual is the change in variance resulting from deletion of the ith observation from a regression analysis. Deletion variance is functionally defined as

s_{-i}^2 = \frac{SSE - e_i^2/(1 - h_{ii})}{n - p - 1},
(7.40)

where h_{ii} is the leverage of the ith observation.

Jackknife Residuals, t_{-i}. Jackknife ("deletion") residuals are defined as

t_{-i} = \frac{e_i}{s_{-i}\sqrt{1 - h_{ii}}}.
(7.41)

Jackknife residuals are t-distributed with n - p - 2 d.f. Thus, if the absolute value |t_{-i}| exceeds the 95th percentile of the relevant t-distribution (i.e., α = 0.10), it is reasonable to say that the ith observation is an outlier.

Cook's Distance, D_i. Cook's distance reflects the change in the regression coefficients when an observation is deleted from a regression model:

D_i = \frac{e_i^2\, h_{ii}}{(1 - h_{ii})^2\, p\, s^2}.
(7.42)

Observations for which D_i > 1 should be scrutinized.

Leverages. The leverage, hii  , is a measure of extremeness of an observation and is derived from the diagonal elements of the n× n Hat matrix, given as

H = X (X⊤X )−1X ⊤.
(7.43)

Another interpretation is that leverage is equal to the distance between the i th observation and the mean of all observations. The average leverage value is

\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n}.
(7.44)

Hoaglin and Welsh [3] recommended scrutinizing observations for which h_{ii} > 2p/n. Some useful properties of the leverage are 0 ≤ h_{ii} ≤ 1 and \sum_i h_{ii} = p.

DFFITS, DFFITS_i. DFFITS is a scaled measure of the change in the predicted value for the ith observation, \hat{y}_i, calculated by deleting the ith observation. DFFITS are in the form

DFFITS_i = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}.
(7.45)

Large values of DFFITS indicate influential observations. A general cutoff to consider is 2; a size-adjusted cutoff recommended by Belsley, Kuh, and Welsch [4] is 2\sqrt{p/n}.

DFBETAS, DFBETAS_{ij}. DFBETAS measure the change in the jth regression coefficient as a result of removing the ith observation from a regression run:

DFBETAS_{ij} = \frac{t_i Z_{ij}}{\sqrt{1 - h_{ii}}\, \sqrt{[(X^{\top}X)^{-1}]_{jj}}},
(7.46)

where Z_{ij} is the element of Z corresponding to observation i and coefficient j, with

Z = (X^{\top}X)^{-1} X^{\top}.
(7.47)

Large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch [4] recommend 2 as a general cutoff value to indicate influential observations and 2/\sqrt{n} as a size-adjusted cutoff.
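The residual and influence measures defined above can be computed directly from the hat matrix. The following is a minimal numpy sketch that transcribes formulas (7.38)-(7.47); it assumes a model with an intercept.

import numpy as np

def regression_diagnostics(X, y):
    """Leverages, standardized/studentized/jackknife residuals, Cook's D, DFFITS, DFBETAS."""
    y = np.asarray(y, dtype=float)
    X1 = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    n, p = X1.shape
    XtX_inv = np.linalg.inv(X1.T @ X1)
    H = X1 @ XtX_inv @ X1.T                      # hat matrix
    h = np.diag(H)                               # leverages h_ii
    e = y - H @ y                                # residuals
    sse = np.sum(e ** 2)
    s = np.sqrt(sse / (n - p))                   # root mean square error
    std_resid = e / s                            # (7.38)
    stud_resid = e / (s * np.sqrt(1 - h))        # (7.39)
    s_del = np.sqrt((sse - e ** 2 / (1 - h)) / (n - p - 1))   # deletion variance (7.40)
    jack_resid = e / (s_del * np.sqrt(1 - h))    # (7.41)
    cooks_d = (e ** 2 * h) / ((1 - h) ** 2 * p * s ** 2)      # (7.42)
    dffits = stud_resid * np.sqrt(h / (1 - h))   # (7.45)
    Z = XtX_inv @ X1.T                           # (7.47)
    dfbetas = (stud_resid * Z) / (np.sqrt(1 - h) * np.sqrt(np.diag(XtX_inv))[:, None])  # (7.46)
    return h, std_resid, stud_resid, jack_resid, cooks_d, dffits, dfbetas.T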

Multicollinearity

Multicollinearity is a problem caused by correlation between predictor features. Multicollinearity is not really associated with the dependency between dependent feature and independent features, but rather, the correlation among independent features. Multicollinearity mostly biases the coefficients and their standard errors, causing the regression model to become too sensitive to the data. Using our input data matrix X with n rows and p columns, i.e., p features, we first determine the correlation coefficient r for all possible pairs of predictor features. We then arrange these correlation coefficients in a correlation matrix denoted by R. We also know that R is a square symmetric matrix, which is given in the form

R = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix},
(7.48)

where the diagonal consists of ones, and the off-diagonals are comprised of the between-feature correlation coefficients rjk  , where j ⁄= k are features j,k = 1,2,...,p . In the next section, we introduce variance inflation factors, which are used for detecting multicollinearity. Rather than exploit the correlation matrix R, the VIFs are based on the model R2  .

Variance Inflation Factors, V IFj  . Collinearity involves association or correlation between predictor features. Variance inflation factors  [5] represent the degree of collinearity between an independent feature of interest compared with other independent features.

VIF_j = \frac{1}{1 - R_j^2},
(7.49)

where R_j^2 is the R^2 value when feature X_j is regressed on the remaining independent features. Independent features whose VIF is greater than 10 represent a severe collinearity problem.

Residual Criteria Based on n and p . The table below lists the residual screening criteria based on the values of n and p for this worked example with n =20 objects (records) and p =3 predictor features:




Residual     Outlier criterion       Calculated result
StdResid     ±2                      ±2
StudResid    ±t(n−p−1, 0.95)         ±1.753
DelResid     ±t(n−p−2, 0.95)         ±1.761
CooksD       1                       1
Leverage     2p/n                    0.400
DFFITS       ±2√(p/n)                ±0.894
DFBETAS      ±2/√n                   ±0.447
VIFs         10                      10



Residual Analysis Results. The table below lists the calculated residuals, with outliers highlighted in red, for the example 3-predictor model:









Object   Resid      StdResid   StudResid   DelResid   CooksD    Leverage   DFFITS
  1      0.06714    0.13884    0.15009     0.14543    0.00095   0.14431    0.05972
  2      0.35000    0.72372    0.79182     0.78215    0.03089   0.16461    0.34719
  3      0.01340    0.02772    0.03244     0.03141    0.00010   0.26989    0.01910
  4      0.70871    1.46544    1.60404     1.69539    0.12742   0.16534    0.75459
  5     -0.17228   -0.35624   -0.42753    -0.41634    0.02012   0.30569   -0.27626
  6      0.15012    0.31042    0.34535     0.33564    0.00709   0.19204    0.16363
  7     -0.47882   -0.99009   -1.41193    -1.46115    0.51516   0.50827   -1.48553
  8     -0.65117   -1.34646   -1.52667    -1.59925    0.16640   0.22214   -0.85464
  9      0.61214    1.26576    1.38208     1.42602    0.09180   0.16124    0.62525
 10      0.40670    0.84095    1.02515     1.02689    0.12770   0.32707    0.71591
 11      0.25318    0.52351    0.64305     0.63084    0.05260   0.33723    0.44999
 12     -0.44711   -0.92453   -0.95618    -0.95346    0.01592   0.06511   -0.25161
 13     -0.03844   -0.07949   -0.08852    -0.08573    0.00047   0.19366   -0.04201
 14      0.19589    0.40506    0.41841     0.40735    0.00293   0.06275    0.10541
 15     -0.53488   -1.10602   -1.14977    -1.16231    0.02667   0.07467   -0.33017
 16     -0.17943   -0.37103   -0.45101    -0.43949    0.02429   0.32324   -0.30373
 17      0.77230    1.59694    1.71366     1.83629    0.11123   0.13158    0.71477
 18     -0.55056   -1.13844   -1.21376    -1.23337    0.05035   0.12026   -0.45601
 19      0.10185    0.21059    0.21832     0.21170    0.00089   0.06953    0.05787
 20     -0.57873   -1.19668   -1.30675    -1.33871    0.08215   0.16138   -0.58726








DFBETAS Results. The table below lists the resulting DFBETAS for the example 3-predictor model:





Obs     X1          X2          X3
  1    -0.03176    -0.02308     0.02208
  2    -0.04873     0.24620     0.11188
  3     0.01269     0.00493     0.01434
  4    -0.62583    -0.13159    -0.18269
  5    -0.20924    -0.13755     0.04991
  6    -0.02122     0.11440     0.07088
  7     0.79397     1.31937     0.11380
  8     0.11546    -0.23569     0.63815
  9     0.25813    -0.29100     0.21920
 10    -0.38005     0.18566    -0.45673
 11    -0.09139    -0.27079    -0.36516
 12    -0.11312    -0.01085     0.01138
 13     0.00446    -0.00729    -0.03367
 14    -0.01739     0.01565    -0.03729
 15     0.13180     0.14106    -0.03280
 16    -0.20027     0.06009     0.11454
 17     0.46860    -0.14184     0.15433
 18    -0.06106     0.00665    -0.33761
 19     0.02980     0.00643     0.00023
 20     0.17956    -0.37680    -0.06967




Variance Inflation Factors. Below is the table of VIFs:

Feature    VIF
X1         1.138
X2         1.133
X3         1.093


Because none of the VIFs exceeded a value of 10, there does not appear to be a multicollinearity problem.

7.2.2 Partial F-tests

This section introduces computational methods used for determining partial F-tests for the predictor features X1  , X2  , and X3  .

Extra Sum-of-Squares for Variable X*. When regressing Y separately on the sets of predictors {X1}, {X1, X2}, and {X1, X2, X3}, we can determine the extra sums of squares, defined as follows:

  1. SS(X1): the sum of squares explained by using only feature X1 to predict Y
  2. SS(X2|X1): the extra sum of squares explained by using X2 in addition to X1 to predict Y
  3. SS(X3|X1,X2): the extra sum of squares explained by using X3 in addition to X1 and X2 to predict Y

The results of regressing Y separately on the sets of predictors {X1} , {X1,X2 } , and {X1,X2, X3} , are listed below:

Regression of Y=X1

Source of variation     SS     df     MS     F
Regression             0.31     1    0.31   0.96
Error                  5.78    18    0.32
Total                  6.09    19

Variable    βj        s.e.(βj)    t         Prob
Const       1.35401   0.12936     10.4669   0.0000
X1          0.13323   0.13607      0.9791   0.3405





Regression of Y=X1,X2

Source of variation     SS     df     MS     F
Regression             1.57     2    0.79   2.96
Error                  4.52    17    0.27
Total                  6.09    19

Variable    βj        s.e.(βj)    t         Prob
Const       1.38915   0.11875     11.6979   0.0000
X1          0.20609   0.12818      1.6078   0.1263
X2          0.22356   0.10247      2.1818   0.0435





Regression of Y=X1,X2,X3

Source of variation     SS     df     MS     F
Regression             2.35     3    0.78   3.35
Error                  3.74    16    0.23
Total                  6.09    19

Variable    βj        s.e.(βj)    t         Prob
Const       1.39434   0.11144     12.5119   0.0000
X1          0.26015   0.12386      2.1004   0.0519
X2          0.26492   0.09878      2.6821   0.0164
X3          0.22297   0.12244      1.8210   0.0874





The extra sum of squares for the first regression run is simply based on the SSR for that model, therefore, SS(X1 ) = SSR . However for the second and third regression runs, we need to use the following formulation to determine the extra sum of squares:

SS(X^*|X_1,X_2,\ldots,X_p) = SSR(X_1,X_2,\ldots,X_p,X^*) - SSR(X_1,X_2,\ldots,X_p),
(7.50)

where X^* denotes a single predictor X-feature, SS(X^*|X_1,X_2,\ldots,X_p) is the extra sum of squares for feature X^*, SSR(X_1,X_2,\ldots,X_p,X^*) is the SSR from the full model containing feature X^*, and SSR(X_1,X_2,\ldots,X_p) is the SSR from the reduced model without feature X^*. Using the SSR from the reduced model for the regression of Y = X_1, and the SSR from the full model for the regression Y = X_1,X_2, we calculate the extra sum of squares SS(X_2|X_1) using the following relationship:

SS(X_2|X_1) = SSR(X_1,X_2) - SSR(X_1) = 1.57 - 0.31 ≈ 1.27
(7.51)

Analogously, SS(X3 |X1,X2  ) is determined as follows:

SS(X_3|X_1,X_2) = SSR(X_1,X_2,X_3) - SSR(X_1,X_2) = 2.35 - 1.57 = 0.78
(7.52)

Partial F-test for Variable X*. Once the extra sum of squares is determined for each predictor X-feature, the partial F-test for feature X^* is determined using the relationship

F(X^*|X_1,X_2,\ldots,X_p) = \frac{SS(X^*|X_1,X_2,\ldots,X_p)}{MSE(X_1,X_2,\ldots,X_p,X^*)}.
(7.53)

As an example, the partial F-test for feature X2  is determined as

F(X_2|X_1) = \frac{SS(X_2|X_1)}{MSE(X_1,X_2)} = \frac{1.2650}{0.2657} = 4.7603
(7.54)
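A minimal numpy/scipy sketch of the extra-sum-of-squares (partial F) test follows; X_reduced and X_full are the reduced and full predictor matrices, and for the example above partial_f_test(X[:, :1], X[:, :2], Y) should give F ≈ 4.76.

import numpy as np
from scipy.stats import f as f_dist

def partial_f_test(X_reduced, X_full, y):
    """Extra-sum-of-squares F-test for the predictor(s) added in X_full over X_reduced."""
    y = np.asarray(y, dtype=float)

    def ssr_and_mse(X):
        X1 = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float).reshape(len(y), -1)])
        beta = np.linalg.solve(X1.T @ X1, X1.T @ y)
        y_hat = X1 @ beta
        ssr = np.sum((y_hat - y.mean()) ** 2)
        mse = np.sum((y - y_hat) ** 2) / (len(y) - X1.shape[1])
        return ssr, mse, X1.shape[1]

    ssr_red, _, p_red = ssr_and_mse(X_reduced)
    ssr_full, mse_full, p_full = ssr_and_mse(X_full)
    extra_ss = ssr_full - ssr_red                  # SS(X*|X1,...,Xp)
    df_added = p_full - p_red                      # 1 when a single feature is added
    F = (extra_ss / df_added) / mse_full
    p_value = f_dist.sf(F, df_added, len(y) - p_full)
    return extra_ss, F, p_value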

Partial F-test Results. The table below lists the partial F-test results:









Regression      X*|X1,...,Xp     SS(X*|X1,...,Xp)   d.f.   MSE    F(X*|X1,...,Xp)   F(crit)   P-value
Y=X1            SS(X1)           0.3080              18    0.32    0.96              4.41      0.3405
Y=X1,X2         SS(X2|X1)        1.2650              17    0.27    4.76              4.45      0.0435
Y=X1,X2,X3      SS(X3|X1,X2)     0.7756              16    0.23    3.32              4.49      0.0874

d.f. - degrees of freedom used for MSE.
F(crit) - the critical value based on F_{0.95,1,n-p}.

Example 1 - Multiple linear regression of Retinol data. This example will use NXG Explorer to perform multiple linear regression of the serum plasma retinol concentration (retplasma) on several predictor features to identify significant predictors. Retinol is one of the A vitamins, and is important for night vision, tooth and bone formation, and skin health. Therefore, it would be desirable clinically to know which features may be predictive of serum retinol in order to make public health recommendations.

Before we begin this example run, we make sure the appropriate criteria for model building strategies are set in the Preferences→ General default settings,

PIC

which specifies that only those predictors whose univariate regression p-value for \beta_1 is less than 0.05, i.e., p < 0.05, are carried into the full multiple linear regression model. The second entry in the default settings specifies that we want all coefficients shown in the multivariable model, based on (p ≤ 1) (which implies all features).

To begin, import the Retinol dataset.

Next, select the command Analysis → Regression → Multiple linear from the Analysis pull-down menu, shown as:

PIC

Then specify the dependent feature, categorical features, and continuously-scaled features, as follows:

PIC

Then accept the default values for the options popup window:

PIC

The univariate regression coefficient results, based on each predictor regressed separately, are listed first:

PIC

Next, the ANOVA table and multiple regression coefficients based on a full model with all predictors regressed simultaneously are presented:

PIC

Coefficients of partial determination are listed next:

PIC

For regression diagnostics, the residual criteria are presented first, followed by the residuals, with outliers (overly influential records) identified by red highlighting in the output:

PIC

PIC

DFBETAS are listed next:

PIC

Lastly, VIFs are listed:

PIC

7.3 Multivariate Linear Regression

We now introduce a linear multivariate regression model for which there are multiple dependent features and multiple independent features  [6]. Let xi = (xi1,xi2,...,xij,...,xip)  be a vector of values for p features for the i th object and yi = (yi1,yi2,...,yiω,...,yiΩ)  be the vector of dependent features for the same object, where Ω  is the total number of y -dependent features. The regression equation for each object and each dependent feature is

yiω = β0ω + β1ωxi1 +β2ωxi2 + ⋅⋅⋅+ βpωxip + 𝜀iω,
(7.55)

where β0ω  is the intercept, or constant term, βjω  is the regression coefficient for feature xj  , and 𝜀iω  is the random error term or residual. Regression coefficients are used for predicting the y -values for each object, using the relationship

ŷiω = β0ω + ∑_{j=1}^{p} βjω xij.
(7.56)

Coefficients are determined directly from the matrix operation

Y = XB  + Ξ.
(7.57)

The matrix form of the dependent features is given as

Y (n×Ω) = ⎡ y11  y12  ⋯  y1Ω ⎤
          ⎢ y21  y22  ⋯  y2Ω ⎥
          ⎢  ⋮    ⋮   ⋱    ⋮  ⎥
          ⎣ yn1  yn2  ⋯  ynΩ ⎦
(7.58)

The typical matrix of x -values is

X (n×(p+1)) = ⎡ 1  x11  x12  ⋯  x1p ⎤
              ⎢ 1  x21  x22  ⋯  x2p ⎥
              ⎢ ⋮   ⋮    ⋮   ⋱   ⋮  ⎥
              ⎣ 1  xn1  xn2  ⋯  xnp ⎦
(7.59)

Regression coefficients are defined using the following matrix notation

B ((p+1)×Ω) = ⎡ β01  β02  ⋯  β0Ω ⎤
              ⎢ β11  β12  ⋯  β1Ω ⎥
              ⎢  ⋮    ⋮   ⋱    ⋮  ⎥
              ⎣ βp1  βp2  ⋯  βpΩ ⎦
(7.60)

and the matrix of random error terms is in the form

Ξ (n×Ω) = ⎡ ε11  ε12  ⋯  ε1Ω ⎤
          ⎢ ε21  ε22  ⋯  ε2Ω ⎥
          ⎢  ⋮    ⋮   ⋱    ⋮  ⎥
          ⎣ εn1  εn2  ⋯  εnΩ ⎦
(7.61)

Rearranging the matrices, we can solve for the regression coefficients using the one-step operation

B = (X⊤X)⁻¹ X⊤Y.
(7.62)

The inverse of the dispersion matrix X⊤X can be obtained using the Jacobi method, singular value decomposition, or some other matrix inversion technique. Once the coefficient matrix B is obtained, we obtain the predicted y-values by applying B to the data matrix X using the matrix operation Ŷ = XB.
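As a concrete illustration of Eq. (7.62), the following Python sketch (with simulated toy data, not the Retinol example) computes the coefficient matrix B and the fitted values Ŷ = XB:

import numpy as np

rng = np.random.default_rng(1)
n, p, Omega = 50, 3, 2                                        # p predictors, Omega dependent features
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])    # design matrix with a leading column of 1s
B_true = rng.normal(size=(p + 1, Omega))
Y = X @ B_true + rng.normal(scale=0.3, size=(n, Omega))       # Y = XB + Xi

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)                     # B = (X'X)^-1 X'Y, without forming the inverse
Y_hat = X @ B_hat                                             # predicted y-values
print(B_hat.round(3))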

The variance-covariance matrix of the coefficients is the direct (Kronecker) product of the residual sum-of-squares and cross-products matrix, scaled by 1/(n − p), and the inverse of the dispersion matrix X⊤X, given as

V(βj, βk) = [(1/(n − p)) Ξ⊤Ξ] ⊗ (X⊤X)⁻¹
          = [(1/(n − p)) (Y⊤Y − Ŷ⊤Ŷ)] ⊗ (X⊤X)⁻¹
          = [(1/(n − p)) (Y⊤Y − B⊤X⊤XB)] ⊗ (X⊤X)⁻¹.
(7.63)

The null hypothesis H0 : B = 0, which applies to all of the regression coefficients, is tested by comparing statistics derived from the eigenvalues of the non-symmetric matrix E⁻¹H, where the hypothesis matrix is

H = B⊤X⊤Y − n ȳ ȳ⊤,
(7.64)

and the sum-of-squares matrix is

E = Y⊤Y − B⊤X⊤Y.
(7.65)

The first test is based on Wilks’ Λ, using the eigenvalues λ1, λ2, ..., λΩ of E⁻¹H, given in the form

Λ = ∏_{j=1}^{Ω} 1/(1 + λj).
(7.66)

Let

p = #Coefficients, w/o intercept,
(7.67)

g = Ω + 1,
(7.68)

a = N − g − (p − g + 2)/2,
(7.69)

b = √[(p²(g − 1)² − 4)/(p² + (g − 1)² − 5)]   if p² + (g − 1)² − 5 > 0;   b = 1 otherwise,
(7.70)

and

c = (p(g − 1) − 2)/2.
(7.71)

Then

F = [(1 − Λ^{1/b}) / Λ^{1/b}] [(ab − c) / (p(g − 1))] ∼ F(p(g−1), ab−c).
(7.72)

The Pillai trace V statistic is also computed as

V = ∑_{j=1}^{Ω} λj/(1 + λj).
(7.73)

Let

s = min(p,g− 1),
(7.74)

t = (|g − 1 − p| − 1)/2,
(7.75)

and

u = (N − g − p − 1)/2.
(7.76)

Here, under Ho

F = [(2u + s + 1)/(2t + s + 1)] [V/(s − V)] ∼ F(s(2t+s+1), s(2u+s+1)).
(7.77)

The Lawley-Hotelling U statistic is

U = ∑_{j=1}^{Ω} λj.
(7.78)

Then

F = [2(su + 1) / (s²(2t + s + 1))] U ∼ F(s(2t+s+1), 2(su+1)).
(7.79)

Lastly, Roy’s test is based on the greatest root λ1  , and the test statistic is

F = [(n − (p + 1))/p] λ1 ∼ F(p, n−(p+1)).
(7.80)
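For readers who want to trace the calculations in Eqs. (7.64)-(7.80), the sketch below (with simulated data, not the output of NXG Explorer) forms H and E, extracts the eigenvalues of E⁻¹H, and computes the four test statistics:

import numpy as np

rng = np.random.default_rng(2)
n, p, Omega = 60, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = rng.normal(size=(n, Omega)) + X[:, 1:2]               # both targets depend on the first predictor

B = np.linalg.solve(X.T @ X, X.T @ Y)
ybar = Y.mean(axis=0, keepdims=True)
H = B.T @ X.T @ Y - n * ybar.T @ ybar                     # hypothesis matrix, Eq. (7.64)
E = Y.T @ Y - B.T @ X.T @ Y                               # error SSCP matrix, Eq. (7.65)

lam = np.linalg.eigvals(np.linalg.solve(E, H)).real       # eigenvalues of E^-1 H
wilks_lambda      = np.prod(1.0 / (1.0 + lam))            # Wilks' Lambda, Eq. (7.66)
pillai_trace      = np.sum(lam / (1.0 + lam))             # Pillai trace V, Eq. (7.73)
lawley_hotelling  = np.sum(lam)                           # Lawley-Hotelling U, Eq. (7.78)
roy_greatest_root = lam.max()                             # Roy's greatest root, Eq. (7.80)
print(wilks_lambda, pillai_trace, lawley_hotelling, roy_greatest_root)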

Example 2 - Multivariate regression of Retinol data. For this example, we will regress the four outcome measurements reflecting subject-specific values of beta-carotene and retinol from dietary assessment and serum plasma measurements on continuously-scaled features to identify significant predictors. First, open the Retinol dataset, and select Analysis → Regression → Multivariate from the Analysis pull-down menu, shown as

PIC

Next, specify the four continuous dependent (“target”) features betadiet, retdiet, betaplasma, and retplasma, and then select the continuous independent features age, quetelet, calories, fat, fiber, alcohol, and cholesterol – which is shown below:

PIC

The results of the regression run will be listed in the treenodes on the left side of the output window:

PIC

Click on the Full model Tests icon, and the following table will appear:

PIC

Results of the full model tests indicate that all of the tests suggest that one or more of the independent features is a significant predictor of one or more of the dependent features. To find out which independent features are significant predictors of one or more of the dependent features, we need to look at the feature-specific test results, which are shown below. Click on the Feature Tests icon, and the following table will appear:

PIC

In the table above for feature-specific tests, we notice that age, quetelet, fiber, and cholesterol were significant. To see the specific contribution of each independent feature to each dependent feature, we look at the tables of regression coefficients by clicking on the Coefficients icon, which will display the following output:

PIC

In the table above, for which betadiet is the dependent feature, we notice that fiber was the only significant predictor. In addition, betadiet increased 138.3 units for each one-unit increase in fiber. For the dependent feature retdiet, we notice that cholesterol was the only significant predictor, with a slight increase in retdiet per unit increase in cholesterol. Coefficient tables for the dependent features betaplasma and retplasma are shown below. While both quetelet and fiber were significant predictors of betaplasma, only age was a significant predictor of retplasma.

PIC

7.4 Logistic Regression

Logistic regression is a type of categorical regression for which the outcome features (dependent features) are discrete, and input features (independent features) are discrete or continuous. There are several types of logistic regression, and these are listed in the table below:





Type of logistic regression   Dependent feature   Scale              Example
Unconditional                 yi = 0,1            Binary             Case: yi = 1, Control: yi = 0
Ordinal                       yi = 1,2,...,k      Ordinal            yi = 2: subject in 2nd group
Conditional                   yi = {1,0,0,0}      Matched-pairs      yi = {1,0,0,0}: record with 1 case and 3 controls
Polytomous                    yi = {0,0,0,1}      Class membership   Subject in class 4: yi = {0,0,0,1}

When there is a binary (yes/no) outcome for each object, the y-dependent (outcome) feature is coded as a binary feature, consisting of zeroes and ones, i.e., (0,1), to represent failure or success. In failure analysis, objects that failed are commonly assigned a y-value of 1, while objects that didn’t fail are assigned a y-value of 0. In clinical medicine and epidemiology, this binary coding scheme is amenable to analysis of case-control studies, where cases (diseased) are assigned y = 1 while controls (non-diseased) are assigned y = 0. Among the various types of logistic regression, NXG Explorer implements two variants: unconditional and polytomous. The theoretical developments for each of these types of logistic regression models are described in the following sections.

7.4.1 Unconditional Logistic Regression

Let yi = 1 represent disease and yi = 0 represent non-disease for subject i having covariate vector x. The probability of disease for subject i is

πi1 = P(yi = 1|x) = e^{g1(x)} / (e^{g0(x)} + e^{g1(x)}),
(7.81)

and the probability of not having disease is

πi0 = 1 − πi1 = P(yi = 0|x) = e^{g0(x)} / (e^{g0(x)} + e^{g1(x)}),
(7.82)

where gj(x)  is the logit for response category j = 0,1  . The general definition for the logit is given as

gj(x) = log(πij / πi0)
      = log[ P(yi = j|x) / P(yi = 0|x) ]
      = x⊤β
      = β0 + β1x1 + β2x2 + ⋯ + βpxp.
(7.83)

Using the general definition, the logit for non-disease is

g0(xi) = log(πi0 / πi0)
       = log(1)
       = 0,
(7.84)

and thus, e^{g0(xi)} = 1. The logit for disease is

g1(xi) = log(πi1 / πi0)
       = x⊤β
       = β0 + β1x1 + β2x2 + ⋯ + βpxp.
(7.85)

Through simple substitution of the logit, we get the conditional probabilities

πi1 = e^{g1(x)} / (1 + e^{g1(x)}),
(7.86)

and

πi0 = 1 / (1 + e^{g1(x)}).
(7.87)

Let xi = 1 represent exposure and xi = 0 represent non-exposure to a risk factor, and let yi = 1 represent disease and yi = 0 no disease. The logit is g(x) = β0 + β1x. Hence, our model has one x-feature and two regression coefficients: β0 and β1.

Cross-tabulation of disease and exposure status for x = 1 vs. x = 0

                       Disease (yi = 1)                    No disease (yi = 0)              Total
Exposed (xi = 1)       πi1 = e^{β0+β1} / (1 + e^{β0+β1})   1 − πi1 = 1 / (1 + e^{β0+β1})    1.0
Non-exposed (xi = 0)   πi0 = e^{β0} / (1 + e^{β0})         1 − πi0 = 1 / (1 + e^{β0})       1.0


The odds ratio can now be determined as

OR = [πi1/(1 − πi1)] / [πi0/(1 − πi0)]
   = [πi1(1 − πi0)] / [πi0(1 − πi1)]
   = [ (e^{β0+β1}/(1 + e^{β0+β1})) (1/(1 + e^{β0})) ] / [ (e^{β0}/(1 + e^{β0})) (1/(1 + e^{β0+β1})) ]
   = e^{β0+β1} / e^{β0}
   = e^{β1}.
(7.88)

For the OR, the null and alternative hypotheses are:

H0 : OR = 1
H1 : OR ≠ 1

Since OR  = eβ  , we know that when β = 0  the OR = e0 = 1  . Therefore, in logistic regression we have

H0 : β = 0
H1 : β ≠ 0

The significance of each regression coefficient is based on a standard normal variate equal to

Z = β / s.e.(β)
(7.89)

and if |Z| > 1.96  then p < 0.05  . The p-value is determined using the standard normal distribution, based on 2 × [1 − Φ (Z)]  if Z ≥ 0  and 2 ×[Φ(Z )]  if Z < 0  . The 100% × (1− α)  CI for OR is

100% × (1 − α) CI of OR = e^{β ± 1.96 × s.e.(β)}
                        = [LL, UL]
                        = [ e^{β − 1.96 × s.e.(β)}, e^{β + 1.96 × s.e.(β)} ].
(7.90)
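The conversion from a fitted coefficient to an OR, p-value, and confidence interval is a short calculation in most environments. The values of β and s.e.(β) below are hypothetical, chosen only to illustrate Eqs. (7.89)-(7.90):

import math
from scipy import stats

beta, se = 1.0261, 0.3937                        # hypothetical coefficient and standard error
z = beta / se                                    # Eq. (7.89)
p_value = 2 * (1 - stats.norm.cdf(abs(z)))       # two-sided p-value from the standard normal
OR = math.exp(beta)
ci_low, ci_high = math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)   # Eq. (7.90)
print(f"OR = {OR:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f}), p = {p_value:.4f}")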

Development of the Likelihood

We now discuss how the coefficients are determined via the iterative maximum likelihood method. The logistic likelihood function is defined as

ℒ(θ) = ∏_{i=1}^{n} (πi0)^{1−yi} (πi1)^{yi}
     = ∏_{i=1}^{n} (1 − πi1)^{1−yi} (πi1)^{yi}.
(7.91)

Taking the natural logarithm gives the log-likelihood in the form

log(ℒ(θ)) = ∑_{i=1}^{n} log(1 − πi1) − yi log(1 − πi1) + yi log(πi1)
          = ∑_{i=1}^{n} log(1 − πi1) + yi { log(πi1) − log(1 − πi1) }
          = ∑_{i=1}^{n} yi log[ πi1/(1 − πi1) ] + log(1 − πi1)
          = ∑_{i=1}^{n} yi log(πi1/πi0) + log(πi0)
          = ∑_{i=1}^{n} yi g1(x) + log[ 1/(1 + e^{g1(x)}) ]
          = ∑_{i=1}^{n} yi g1(x) − log(1 + e^{g1(x)})
          = ∑_{i=1}^{n} yi (x⊤β) − log(1 + e^{x⊤β}).
(7.92)

Once the log-likelihood function is formed, we can use the Newton-Raphson method to perform a gradient ascent approach to find maximum likelihood values of the solution parameters β . This entails calculation of the scores or first derivatives of log(ℒ(𝜃))  w.r.t parameters and the information (Hessian) matrix of second partial derivatives. The score is

sj(β) = ∂ log L(β)/∂βj = ∑_{i=1}^{n} xij [ yi − e^{x⊤β}/(1 + e^{x⊤β}) ],
(7.93)

and element (j,k)  of the information matrix

Ij,k(β) = −∂² log L(β)/(∂βj ∂βk) = ∑_{i=1}^{n} xij xik e^{x⊤β} / (1 + e^{x⊤β})².
(7.94)

Similarly, maximum likelihood estimates of each βj  can be found by determining the vector β iteratively with the matrix manipulation

βi+1 = βi + I(β)⁻¹ s(β)
(7.95)

until convergence is reached when ‖Δβ‖ < ε. Values of ε are typically in the range 10⁻⁸ ≤ ε ≤ 10⁻⁴.
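The score, information matrix, and update step of Eqs. (7.93)-(7.95) translate directly into code. The following Python sketch fits an unconditional logistic model to simulated data; it is a minimal illustration of the algorithm, not NXG Explorer's internal implementation:

import numpy as np

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept plus two covariates
beta_true = np.array([-0.5, 1.0, -0.8])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))    # simulated 0/1 outcome

beta = np.zeros(X.shape[1])
for _ in range(25):
    pi1 = 1.0 / (1.0 + np.exp(-X @ beta))                    # pi_i1 of Eq. (7.86)
    score = X.T @ (y - pi1)                                   # Eq. (7.93)
    info = X.T @ (X * (pi1 * (1.0 - pi1))[:, None])           # Eq. (7.94)
    step = np.linalg.solve(info, score)
    beta += step                                              # Eq. (7.95)
    if np.linalg.norm(step) < 1e-8:                           # converge on the size of the update
        break

se = np.sqrt(np.diag(np.linalg.inv(info)))
print("coefficients:", beta.round(3), " ORs:", np.exp(beta[1:]).round(3), " s.e.:", se.round(3))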

Assessing Model Goodness of Fit

Recall that the residual for a sample is essentially (yi − ŷi). The fitted regression value for the i th subject and the j th class, π̂ij, is determined as

π̂ij = P(y = j|x) = e^{gj(x)} / (1 + ∑_{k=1}^{Ω−1} e^{gk(x)}).
(7.96)

The Pearson residual is

r(yij, π̂ij) = (yij − π̂ij) / √[ π̂ij(1 − π̂ij) ].
(7.97)

The Pearson goodness-of-fit (GOF) statistic is

χ²(n−p+1) = ∑_{i=1}^{n} r(yij, π̂ij)².
(7.98)

The deviance residual is

d(yij, π̂ij) = √[ 2 |log(π̂ij)| ].
(7.99)

The deviance goodness-of-fit (GOF) statistic is

D = ∑_{i=1}^{n} d(yij, π̂ij)².
(7.100)

The log-likelihood ratio statistic for a feature is minus twice the log of the ratio of the likelihood of a model without the feature to the likelihood of the model with the feature:

G = −2 log(ℒreduced / ℒfull)
  = −2 (LLreduced − LLfull),
(7.101)

and is chi-square distributed with 1 d.f.; that is, G should exceed χ²(1) = 3.84 as each feature is added to the model. Values of G exceeding χ²(1) = 3.84 suggest that the additional feature in the full model makes a statistically significant contribution to the likelihood. Assuming the model fits, the Pearson and deviance GOF statistics are chi-square distributed with n − p + 1 degrees of freedom. For deviance, the GOF statistic D is the likelihood ratio of a model with n parameters and one with p − 1 parameters. GOF values of χ² or D that are less than the d.f. (n − p + 1) suggest an adequate fit, that is, the model used and the data appropriately describe sample membership in the outcome classes.

Example 3 - Unconditional logistic regression on Low Birth Weight. This example will employ unconditional logistic regression using the low birth weight data. The feature named low is a binary feature (1-Yes, 0-No) representing the birth of an infant with low birth weight, and will be used as the dependent feature. The remaining features, such as age (age of the mother in years), lwt (weight in pounds at the last menstrual period), race (1=white, 2=black, 3=other), smoke (smoking status during pregnancy: 1=Yes, 0=No), ht (history of hypertension: 1=Yes, 0=No), and ui (presence of uterine irritability: 1=Yes, 0=No), will serve as the independent predictors.

First, open the Low Birth Weight dataset. Select Analysis → Regression → Binary Logistic from the Analysis pull-down menu:

PIC

You will then see the following popup window:

PIC

Select low as the discrete output feature, race as a discrete input feature, and age, lwt, smoke, ht, and ui as continuous input features. A popup window with options will appear, so accept the default values and click Apply. The logistic regression analysis will then run. The treenodes after the run will appear as follows:

PIC

Select the Coefficients icon to display the results. At the top of the output you will notice results for all possible univariate logistic regression models, including univariate odds ratios (OR) and their 95% confidence intervals(CI). This will appear as follows:

PIC

During feature input, NXG Explorer automatically recoded the 3-level race feature into two binary features representing the two race groups (2-Black and 3-Other), respectively. This type of recoding is called corner-point constraints recoding of categorical features, for which the effect of the lowest category appears in the Constant term (or y-intercept). An alternative approach would involve use of sum-to-zero constraints recoding, which would force all three binary features into the model. For the univariate models, the features lwt, smoke, ht, and ui were significant predictors of low birth weight. The univariate ORs and 95% CIs are also listed in the coefficient table. Toward the bottom of the Coefficient output, we can see the multiple logistic regression runs, for which all the specified independent features are used in a single model. The multiple logistic regression coefficients are followed by a tabular listing of the ORs and 95% CIs.

PIC

Notice that in the multiple logistic regression model, in addition to lwt, smoke, ht, and ui being significant, the two binary features (0,1) for black and other race categories were significant (these categories were not significant in the univariate models). Looking at the ORs, we notice that the OR for smoking (smoke) was 2.79 (95% CI, 1.29-6.05), the OR for hypertension (ht) was 6.41 (95% CI, 1.66-24.72), the OR for uterine irritability (ui) was 2.45 (95% CI, 1.02-5.90), the OR for black race (b_race_2) was 3.60 (95% CI, 1.28-10.10), and the OR for other race (b_race_3) was 2.46 (95% CI, 1.05-5.77), which are all quite elevated. Using results of these observations, a public health-based recommendation could be reported as follows: “given the observed results from empirical data, these findings indicate that smoking and hypertension during pregnancy as well as race category were significant predictors of having a low birth weight infant.” Lastly, the class membership predictions are listed in tabular notation:

PIC

Next, click on the Residuals icon, and you will see a listing of records (objects) that are considered to be outliers due to their higher-than-expected residual values. For each outlier observation, the criterion which didn’t pass is highlighted in red, as shown below:

PIC

7.4.2 Polytomous Logistic Regression

Polytomous logistic regression was fundamentally developed for multiple class problems, unlike the dichotomous 2-class nature of unconditional logistic regression described in the previous section. To begin, assume a multiclass problem in which the dependent feature yji  for each i th sample is coded 0 or 1 to represent membership in the j th class, where j = 0,1,...,J , and J is the number of classes minus one. Let’s assume that the reference class (corner point) is the first class for which j = 0  . The general definition of the logit for polytomous logistic regression remains the same, as in

gj(x) = log(πij / πi0)
      = log[ P(yi = j|x) / P(yi = 0|x) ]
      = x⊤βj
      = βj0 + βj1x1 + βj2x2 + ⋯ + βjpxp.
(7.102)

However, the conditional probability adds together the exponentiated logits in the denominator, shown in functional form as

πij = P(yi = j|x) = e^{gj(x)} / (1 + ∑_{k=1}^{J} e^{gk(x)}).
(7.103)

For a 3-class example (j = 0,1,2)  the probability that a sample x has membership in each of the 3 classes is

πi0 = e^{g0(x)} / (e^{g0(x)} + e^{g1(x)} + e^{g2(x)}) = 1 / (1 + e^{g1(x)} + e^{g2(x)})
πi1 = e^{g1(x)} / (1 + e^{g1(x)} + e^{g2(x)})
πi2 = e^{g2(x)} / (1 + e^{g1(x)} + e^{g2(x)}).
(7.104)
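To make the shared denominator in Eqs. (7.103)-(7.104) concrete, the snippet below evaluates the three class-membership probabilities for a single observation; the covariate values and coefficient vectors are purely illustrative, not fitted values:

import numpy as np

x = np.array([1.0, 0.5, -1.2])                  # leading 1 for the intercept, then two covariates
beta = {1: np.array([0.2, 0.8, -0.3]),          # logit g1(x): class 1 vs. reference class 0
        2: np.array([-0.5, 0.1, 0.6])}          # logit g2(x): class 2 vs. reference class 0

g = np.array([x @ beta[1], x @ beta[2]])        # the two logits
denom = 1.0 + np.exp(g).sum()                   # 1 + e^{g1(x)} + e^{g2(x)}
probs = np.concatenate(([1.0], np.exp(g))) / denom   # (pi_i0, pi_i1, pi_i2), Eq. (7.104)
print(probs, probs.sum())                       # the three probabilities sum to 1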

The likelihood is

ℒ(θ) = ∏_{i=1}^{n} (πi0)^{1 − ∑_{j=1}^{J} yij} ∏_{j=1}^{J} (πij)^{yij}
     = ∏_{i=1}^{n} (1 − ∑_{j=1}^{J} πij)^{1 − ∑_{j=1}^{J} yij} ∏_{j=1}^{J} (πij)^{yij}.
(7.105)

Taking the natural logarithm gives the log-likelihood in the form

log(ℒ(θ)) = ∑_{i=1}^{n} log(1 − ∑_{j=1}^{J} πij) − log(1 − ∑_{j=1}^{J} πij) ∑_{j=1}^{J} yij + ∑_{j=1}^{J} yij log(πij)
          = ∑_{i=1}^{n} log(1 − ∑_{j=1}^{J} πij) + ∑_{j=1}^{J} yij { log(πij) − log(1 − ∑_{j=1}^{J} πij) }
          = ∑_{i=1}^{n} log(1 − ∑_{j=1}^{J} πij) + ∑_{j=1}^{J} yij log[ πij / (1 − ∑_{j=1}^{J} πij) ]
          = ∑_{i=1}^{n} ∑_{j=1}^{J} yij log(πij / πi0) + log(1 − ∑_{j=1}^{J} πij)
          = ∑_{i=1}^{n} ∑_{j=1}^{J} yij gj(x) + log(πi0)
          = ∑_{i=1}^{n} ∑_{j=1}^{J} yij gj(x) + log[ 1 / (1 + ∑_{j=1}^{J} e^{gj(x)}) ]
          = ∑_{i=1}^{n} ∑_{j=1}^{J} yij gj(x) − log(1 + ∑_{j=1}^{J} e^{gj(x)})
          = ∑_{i=1}^{n} ∑_{j=1}^{J} yij (x⊤βj) − log(1 + ∑_{j=1}^{J} e^{x⊤βj}).
(7.106)

Notice that in unconditional (binary) logistic regression the coefficients were obtained in the form of a p × 1 column vector β, whereas in polytomous logistic regression there is a total of p(J − 1) coefficients, resulting in a p(J − 1) × 1 column vector, which is equal to the total number of features times the number of classes minus one. The score value for the j th logit (j = 1, 2, ..., J) and the k th feature (k = 1, 2, ..., p) will occupy the s(p(j − 1) + k) element in the score vector, given as

s_{p(j−1)+k}(β) = ∂ log L(β)/∂βjk = ∑_{i=1}^{n} xik [ yij − e^{x⊤βj} / (1 + ∑_{k=1}^{J} e^{x⊤βk}) ]
                                  = ∑_{i=1}^{n} xik (yij − πij).
(7.107)

The information matrix I, or Hessian matrix, is a partitioned matrix of size p(J − 1) × p(J − 1). Thus, if there are 3 classes (J = 3), there will be 2 logit functions, each of which is represented with a p × p matrix Ijj

I(β) = ⎡ I11(β)  I12(β) ⎤
       ⎣ I21(β)  I22(β) ⎦
(7.108)

By definition, the second partial derivative for different features (k,k′)  within the same logit (j,j)  , that is, same matrix is

I_{jk,jk′}(β) = −∂² log L(β)/(∂βjk ∂βjk′) = −∑_{i=1}^{n} xik′ xik πij(1 − πij),
(7.109)

whereas the second partial derivative for different features (k, k′) in different logits (j, j′), that is, different matrices, is

I_{jk,j′k′}(β) = −∂² log L(β)/(∂βjk ∂βj′k′) = ∑_{i=1}^{n} xik′ xik πij πij′.
(7.110)

Similarly, using the Newton-Raphson method, maximum likelihood estimates of each βjk can be found by determining the vector β iteratively with the matrix manipulation

βi+1 = βi + I(β)⁻¹ s(β)
(7.111)

until convergence is reached when ‖Δβ‖ < ε. Values of ε are typically in the range 10⁻⁸ ≤ ε ≤ 10⁻⁴.

Example 4 - Tumor class prediction with SRBCT gene expression dataset. This example uses the class-specific dataset for SRBCT data. First, import the Excel dataset using class-specific format as shown below:

PIC

Next, specify the commands Analysis → Regression → Polytomous logistic, as shown below:

PIC

Next, specify the Class feature as the class feature, and all continuous features as the predictors, as shown below:

PIC

Regarding the output, the treenode will only contain a Coefficients icon (see below):

PIC

Once you click on the Coefficients icon, you will first see the following tabular notation with class-specific univariate model results:

PIC

The first thing you will notice for a polytomous logistic analysis is that results will be presented for #classes-1 since the first class is treated as a referent group. In the output listed above, notice that each predictor is significant for predicting at least one of the diagnostic classes. This is expected, since the set of 23 gene expression features in this dataset were identified as the best predictors of the four diagnostic classes out of 2,308 features. Following the class-specific univariate results, the log-likelihood values are listed as a function of iteration:

PIC

This is followed by the class-specific multivariate logistic regression coefficients:

PIC

Lastly, the predicted class membership probabilities are listed:

PIC

7.5 Additive and Multiplicative Poisson Regression

Poisson regression models are commonly used for modeling risk factors and other covariates when the rates of disease are small, e.g., cancer [7, 8, 9, 10, 11]. An area of research in which Poisson regression of cancer rates is commonly employed is the investigation of radiation-induced cancer among Hiroshima and Nagasaki atomic bomb survivors [12, 13, 14] and individuals occupationally [15], medically [16] and environmentally [17] exposed to ionizing radiation. Poisson regression has also been used to investigate the effects of information bias on lifetime risks [18], diagnostic misclassification [19, 20], and dose-response curves for radiation-induced chromosome aberrations [21]. This section documents the NXG Explorer Poisson regression approach for fitting additive, multiplicative, and power models, and regression diagnostics using leverages of the Hat matrix, deletion and standardized residuals. The critical data required to perform Poisson regression using NXG Explorer are the number of events (e.g., deaths) in each cell, the person-years in each cell, and the covariates that define the cells.

7.5.1 Poisson Assumption

The Poisson assumption states that, for stratified data, the number of deaths, d, in each cell takes the values x = 0, 1, 2, ... with probability given by

P(d = x) = (λn)^x exp(−λn) / x!,
(7.112)

where λ is the rate parameter, which is equal to the number of deaths in each cell divided by the respective number of person-years (n) in that same cell. According to McCullagh and Nelder [22], a generalized linear model is composed of three parts: a random component specifying the distribution of the response, a systematic component (the linear predictor), and a link function connecting the two.

Here λi is the observed rate parameter and λi(xi⊤β) is the predicted rate parameter. It should be pointed out that the rate parameter, λi = λi(xi⊤β), serves as its own link function. Thus, the rate parameter provides a linearization of our distributional models so that we can use multiple linear regression to obtain estimators. In effect, the rate parameter allows us to reduce non-linear (non-Gaussian) models to linear models so that we can solve the system of linear equations. The rate parameter can take on various linear and non-linear functions. These are shown in Table 7.1. The power transformation was adapted from [23].


Table 7.1: Link function and predicted rates.

Model            Type of risk   Link function   Fitted rate
Additive         Absolute       λ               λi(xi⊤β) = xi⊤β
Multiplicative   Relative       log(λ)          λi(xi⊤β) = exp(xi⊤β)
Power            Geometric      λ^ρ             λi(xi⊤β) = (xi⊤β)^{1/ρ} if 0 < ρ ≤ 1;  exp(xi⊤β) if ρ = 0
Parameters β are solved through the use of iteratively reweighted least squares, where the parameter update vector at each iteration is

Δβ = (Z⊤WZ)⁻¹ Z⊤Wy,
(7.115)

with data matrix elements

zij = ni ∂λ̂i/∂βj,
(7.116)

with weights

wii = (ni λ̂i)⁻¹,
(7.117)

and dependent features

yi = di − ni λ̂i,
(7.118)

where the fitted rates are

λ̂i = λi(xi⊤β).
(7.119)

Iterations start by setting the vector of regression coefficients β equal to small near-zero starting values. Successive iterations involve adding the gradient-based update vector to the most recent vector of coefficients in the form

β(t+1) = β(t) + Δ β,
(7.120)

until convergence is reached when ‖Δβ‖ < 10⁻⁴.
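For the multiplicative model of Table 7.1, the iteratively reweighted least squares scheme of Eqs. (7.115)-(7.120) can be sketched as follows. The grouped data are simulated and the column layout is an assumption, so this illustrates the algorithm rather than reproducing NXG Explorer's internal code:

import numpy as np

rng = np.random.default_rng(5)
m = 10                                                       # number of cells (strata)
X = np.column_stack([np.ones(m), rng.binomial(1, 0.5, m)])   # intercept + one exposure indicator
py = rng.uniform(1_000, 20_000, m)                           # person-years n_i in each cell
d = rng.poisson(py * np.exp(X @ np.array([-6.0, 0.7])))      # simulated deaths d_i

beta = np.array([-5.0, 0.0])                                 # near-zero starting values
for _ in range(50):
    lam = np.exp(X @ beta)                                   # fitted rates for the log link
    Z = (py * lam)[:, None] * X                              # z_ij = n_i d(lambda_i)/d(beta_j), Eq. (7.116)
    W = np.diag(1.0 / (py * lam))                            # w_ii = (n_i lambda_i)^-1, Eq. (7.117)
    yv = d - py * lam                                        # y_i = d_i - n_i lambda_i, Eq. (7.118)
    delta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ yv)       # update vector, Eq. (7.115)
    beta += delta                                            # Eq. (7.120)
    if np.linalg.norm(delta) < 1e-4:
        break
print("coefficients:", beta.round(4), "  RR =", np.exp(beta[1]).round(3))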

7.5.2 Residuals and Regression Diagnostics

The error residual, ei, has mean 0 and measures the lack of concordance of the observed deaths di against the fitted value d̂i = ni λ̂i. The greater the value of ei, the worse the fit of the model. We will now introduce a number of residuals which are helpful for determining whether or not certain observations are overly influential on the fit of a model. Firstly, the Pearson residual, rP, for each observation is estimated as

rP = (di − d̂i) / √d̂i,
(7.121)

and provides a measure of the residual variation. If we take into account the number of fitted parameters (coefficients) in a model, we can determine an adjusted residual as

rA = rP / √(1 − hi),
(7.122)

where hi  is the leverage for the i th observation (described later). Values for rA  that exceed 1.96 tend to be outliers at the α = 0.05  level of significance. The deviance residuals

rD = di log(di/d̂i) + (d̂i − di),
(7.123)

allow the investigator to determine the goodness of fit of the model, i.e., how well the observed data agree with the fitted values.

The leverage  [22] for each observation, hi  , provides information about the geometric distance between a given point (zi1  , zi2  , ⋅⋅⋅ , zij  ) and the centroid of all points (¯zi1  , ¯zi2  , ⋅⋅⋅ , z¯ij  ) in the predictor space. Individual leverages, hi  , are obtained from the diagonal of the “Hat” matrix given by

H = √W Z (Z⊤WZ)⁻¹ Z⊤ √W.
(7.124)

Because ∑ hi = trace(H) = p, observations with high leverage can approach unity. A criterion introduced by Hoaglin and Welsch [24] for determining high leverage for an observation states that a point has high leverage if hi > 2p/n. Points with high leverage are located at a substantial distance from the fitted regression line, and act to “pull” the fitted line toward their location. Increases in the goodness of fit of a model can be obtained by re-fitting a model after the observation(s) with high leverage have been removed.
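The leverages of Eq. (7.124) and the Hoaglin-Welsch 2p/n cut-off are easy to compute once the working matrices are available. In the sketch below, Z and W are illustrative stand-ins for the converged IRLS quantities from a Poisson fit (randomly generated here, not taken from the British doctors example):

import numpy as np

rng = np.random.default_rng(6)
m, p_params = 10, 2
Z = rng.normal(size=(m, p_params))                 # illustrative working design matrix from a converged fit
W = np.diag(rng.uniform(0.1, 1.0, m))              # illustrative diagonal weight matrix w_ii

sqrtW = np.sqrt(W)                                 # W is diagonal, so the square root is elementwise
H = sqrtW @ Z @ np.linalg.solve(Z.T @ W @ Z, Z.T) @ sqrtW   # Hat matrix, Eq. (7.124)
h = np.diag(H)                                     # individual leverages h_i
print("trace(H) =", h.sum().round(3))              # equals the number of fitted parameters p
print("high-leverage observations:", np.where(h > 2 * p_params / m)[0])   # Hoaglin-Welsch criterion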

Cook [25] introduced the residual Δ(β)−ij, which measures the difference between β when the observation is included in the model and β when the observation is not in the model. The more positive the value of Δ(β)−ij, the more influence an observation has when compared with other observations. Individual Δ(β)−ij residuals are obtained from the n × p matrix Δ(β)−ij using the relationship

Δ(β)−ij = −(Z⊤WZ)⁻¹ zij yi wij / (1 − hij).
(7.125)

7.5.3 Goodness-of-Fit Statistics

The Pearson χ² goodness of fit is

χ² = ∑_{i=1}^{n} rP²,
(7.126)

and the deviance goodness of fit is

D = ∑_{i=1}^{n} rD,
(7.127)

which is also χ² distributed. If a model fits, its χ² and D will be lower than the degrees of freedom (n − p).

Example 5 - British Doctors data - Purely Additive Model (ρ = 1). This example will use the British doctors data in Table 7.2 to determine the additive risk (ρ = 1; see Table 7.1) of coronary-related deaths due to smoking. Table 7.2 lists the British doctors data for coronary disease and smoking [26]. The rates for non-smokers, λi(0), are simply the ratio of the number of deaths to person-years ni in age group i and exposure group 0, whereas the rates for smokers, λi(1), are the ratio of deaths to person-years ni in age group i for the exposure group 1.

Table 7.2: Deaths from coronary disease among British male doctors [26].








                    Non-smokers                     Smokers
Age group, i    di      ni       λi(0)       di      ni       λi(1)
35-44            2    18,790     0.1064      32    52,407     0.6106
45-54           12    10,673     1.1243     104    43,248     2.4047
55-64           28     5,710     4.9037     206    28,612     7.1998
65-74           28     2,585    10.8317     186    12,663    14.6885
75-84           31     1,462    21.2038     102     5,317    19.1838
Note: In the table of rates for coronary deaths among British doctors, the rate for each group was multiplied by 1,000. This is commonly done so that the majority of rates are greater than one, in order to facilitate the numerical methods used.

First, import the British doctors coronary death dataset, and you will see the following arrangement in the datagrid:

PIC

Next, select Analysis → Regression → Poisson from the Analysis pull-down menu, and select the person-year feature, count feature, and the covariates for age and smoke:

PIC

Next, select Additive from the model choices, and click Apply (shown below):

PIC

After the run, the treenodes resulting from this run will appear as follows:

PIC

Click on the Coefficients icon, and the additive risks will be listed in the display as follows:

PIC

Firstly, notice the value of the deviance GOF of 7.43. Because this run used a purely additive model (ρ = 1)  , the predicted baseline rate for each age group is equal to the value of the regression coefficients for the specific age group. The absolute risk (AR) is equal to the regression coefficient for smoke (βsmoke = 0.591  ), and is actually an excess rate that adds to the baseline rate. A key point to be made is that the AR is a rate that serves as a fixed constant which adds to the baseline rate of any group. The table below lists the calculated baseline rates (deaths per 1,000 person-years) for each age group based on the fitted coefficients, as well as the fitted rate among smokers for each age group.





Feature      Coef.     Baseline rate (non-smokers)   Rate among smokers
age_35-44     0.0841    0.0841                       0.0841 + 0.5907 = 0.6748
age_45-54     1.6407    1.6407                       1.6407 + 0.5907 = 2.2314
age_55-64     6.3035    6.3035                       6.3035 + 0.5907 = 6.8942
age_65-74    13.5241   13.5241                       13.5241 + 0.5907 = 14.1148
age_75-84    19.1696   19.1696                       19.1696 + 0.5907 = 19.7603
smoke         0.5907   AR = 0.5907
Next, click on the Residuals icon, and the following table will be shown:

PIC

We notice that two observations (#1 and 6) were overly influential on the model’s fit. Next, click on the DFBETAs icon, and the following table will be shown:

PIC

Recall that the DFBETAS reveal the influence of each record on each covariate if the record were removed and the model refit. In the DFBETAS output listed above, there appear to be more influential values for the older age groups.

Example 6 - British Doctors data - Purely Multiplicative Model (ρ = 0). In this example, we will employ a purely multiplicative model to identify the relative risk (RR) of smoking. Import the British doctors data again, select Analysis → Regression → Poisson from the Analysis pull-down menu, select the features, and select Multiplicative for the model choices (shown below):

PIC

Click on the Coefficients icon and the following table will be displayed:

PIC

The value of the deviance GOF for this multiplicative model is 12.13, which implies a worse fit when compared with the purely additive model. Using the coefficient values, the fitted baseline rates (deaths per 1,000 person-years) among non-smokers and smokers are listed in the table below:





Feature      Coef.      Baseline rate (non-smokers)   Rate among smokers
age_35-44    -1.0116    exp(-1.0116) = 0.3636         0.3636 × 1.4255 = 0.5184
age_45-54     0.4724    exp(0.4724) = 1.6038          1.6038 × 1.4255 = 2.2863
age_55-64     1.6159    exp(1.6159) = 5.0324          5.0324 × 1.4255 = 7.1737
age_65-74     2.3389    exp(2.3389) = 10.3698         10.3698 × 1.4255 = 14.7822
age_75-84     2.6885    exp(2.6885) = 14.7096         14.7096 × 1.4255 = 20.9685
smoke         0.3545    RR = exp(0.3545) = 1.4255
The relative risk (RR) is determined by exponentiating the coefficient for exposure (i.e., smoke) and multiplying the RR by each fitted baseline rate. Next, click on the Residuals icon, and the following table will be shown:

PIC

We notice that four observations were overly influential on the model’s fit, and since there is a greater number of outliers for the multiplicative model, this may imply that the multiplicative model fits worse. Next, click on the DFBETAs icon, and the following table will be shown:

PIC

The DFBETAS results indicate fewer issues with the multiplicative model.

Example 7 - British Doctors data - Power Model (ρ = 0.55  ). NXG Explorer can be used to evaluate power models with varying values of ρ where (0 < ρ ≤ 1). To begin, import the British doctors’ dataset, select Analysis → Regression→ Poisson from the Analysis pull-down menu, select the features, and then specify “Evaluate rho at 0.1 intervals,” from the popup windows (shown below):

PIC

After the run, you will see the following icons in the treenode:

PIC

Click on the Power evaluation icon, and the following output table will appear:

PIC

Note that the deviance GOF is lowest when ρ = 0.55, which indicates that a power model with ρ = 0.55 fits better than a purely additive model (ρ = 1) or a purely multiplicative model (ρ = 0). Below is a plot of the deviance values as a function of ρ:

PIC

Next let’s run a Poisson regression model using a fixed value ρ = 0.55  . Import the dataset, select the Poisson regression command and features, and now specify ρ = 0.55  in the options popup window:

PIC

Click on the Coefficients icon, and the following table will appear:

PIC

Note the low value of the deviance, 2.14, which is similar to the interpolated value in the power table listed above, and is much lower than the deviance of 7.43 for the purely additive model and 12.13 for the purely multiplicative model. Again, using the coefficient values, the fitted baseline rates (deaths per 1,000 person-years) among non-smokers and smokers are listed in the table below:





Feature      Coef.     Baseline rate (non-smokers)      Rate among smokers
age_35-44    0.276     (0.276)^(1/0.55) = 0.0963        0.0963 + 0.2767 = 0.3730
age_45-54    1.1145    (1.1145)^(1/0.55) = 1.2179       1.2179 + 0.2767 = 1.4946
age_55-64    2.4563    (2.4563)^(1/0.55) = 5.1239       5.1239 + 0.2767 = 5.4006
age_65-74    3.8593    (3.8593)^(1/0.55) = 11.6514      11.6514 + 0.2767 = 11.9281
age_75-84    4.7632    (4.7632)^(1/0.55) = 17.0822      17.0822 + 0.2767 = 17.3589
smoke        0.4933    PR = (0.4933)^(1/0.55) = 0.2767
There were no outliers for this model:

PIC

Nor were there any overly influential DFBETAs for this power model:

PIC

Summary of fitted rates. To summarize the additive, multiplicative, and power models, Table 7.3 shows the basic interpretation of coefficients and how the AR, RR, and PR are estimated and then applied to the baseline rate among non-smokers to obtain the rate among smokers.


Table 7.3: Link function and predicted rates.

Link function   Fitted rate                   Rate among exposed
λ               λi(xi⊤β) = xi⊤β               λe = λi(xi⊤β) + AR = βj + βexp
log(λ)          λi(xi⊤β) = exp(xi⊤β)          λe = λi(xi⊤β) × RR = exp(βj) × exp(βexp)
λ^ρ             λi(xi⊤β) = (xi⊤β)^{1/ρ}       λe = λi(xi⊤β) + PR = βj^{1/ρ} + βexp^{1/ρ}
Bibliography

[1]   F. Galton. Regression towards mediocrity in hereditary stature. J. Anthropol. Inst., 15:246–263, 1886.

[2]   M.S. Bartlett. Further aspects on the theory of multiple regression. Proc. Cambridge Phil. Soc, 34:33–40, 1938.

[3]   D.C. Hoaglin, R.E. Welsch. The hat matrix in regression and ANOVA. Am. Stat. 32:17-22, 1978.

[4]   D.A. Belsley, E. Kuh, R.E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York (NY), John Wiley and Sons, 1980.

[5]   J. Neter, W. Wasserman, M.H. Kutner. Applied Linear Statistical Models. 2nd Ed. Homewood (IL), Richard D. Irwin, 1985.

[6]   M.S. Bartlett. Multivariate analysis. J. Royal Stat. Soc., 9:176–197, 1947.

[7]   E.L. Frome, M.H. Kutner, J.J. Beauchamp. Regression analysis of Poisson-distributed data. JASA. 68:935–940, 1973.

[8]   E.L. Frome. The analysis of rates using Poisson regression methods. Biometrics. 39:665–674, 1983.

[9]   E.L. Frome. PREG: A computer program for Poisson regression analysis. Oak Ridge Associated Universities Report ORAU-178 (1981). Oak Ridge National Laboratory. Oak Ridge (TN), ORAU.

[10]   C.R. Muirhead, S.C. Darby. Modelling the relative and absolute risks of radiation-induced cancers. J. R. Stat. Soc. A. 150:83–118, 1987.

[11]   N.E. Breslow, N.E. Day. Statistical Methods in Cancer Research. Volume II: The design and analysis of cohort studies. Chapter 4. IARC Scientific Publication No. 82. Lyon, IARC, 1987.

[12]   D.A. Pierce, Y. Shimizu, D.L. Preston, M. Vaeth, K. Mabuchi. Studies of mortality of atomic bomb survivors. Report 12, Part I. Cancer: 1950-1990. Radiat. Res. 146:1–27, 1996.

[13]   D.E. Thompson, K. Mabuchi, E. Ron, M. Soda, M. Tokunaga, S. Ochikubo, S. Sugimoto, S. Ikeda, M. Terasaki, S. Izumi. Cancer Incidence in Atomic Bomb Survivors. Part II: Solid Tumors, 1958-1987. Radiat. Res. 137:S17-S67, 1994.

[14]   D.L. Preston, S. Kusimi, M. Tomonaga, S. Izumi, E. Ron, A. Kuramoto, N. Kamada, H. Dohy, T. Matsui, H. Nonaka, D.E. Thompson, M. Soda, K. Mabuchi. Cancer Incidence in Atomic Bomb Survivors. Part III: Leukemia, Lymphoma and Multiple Myeloma, 1950-87. Radiat. Res. 137, S68–S97, 1994.

[15]   E. Cardis, E.S. Gilbert, L. Carpenter, G. Howe, I. Kato, B.K. Armstrong, V. Beral, G. Cowper, A. Douglas, J. Fix, S.A. Fry, J. Kaldor, C. Lavé, L. Salmon, P.G. Smith, G.L. Voelz, L.D. Wiggs. Effects of low doses and low dose rates of external ionizing radiation: Cancer mortality among nuclear industry workers in three countries. Radiat. Res. 142:117–132, 1995.

[16]   E. Ron, J.H. Lubin, R.E. Shore, K. Mabuchi, B. Modan, L.M. Pottern, A.B. Schneider, M.A. Tucker, J.D. Boice, Jr. Thyroid cancer after exposure to external radiation: a pooled analysis of seven studies. Radiat. Res. 141:259-277, 1995

[17]   R.A. Kerber, J.E. Till, S.E. Simon, J.L. Lyon, D.C. Thomas, S. Preston-Martin, M.L. Rallison, R.D. Lloyd, W.S. Stevens. A cohort study of thyroid disease in relation to fallout from nuclear weapons testing. J. Am. Med. Assoc. 270:2076–2082, 1993.

[18]   L.E. Peterson, W.J. Schull, B.R. Davis, P.A. Buffler. Information Bias and Lifetime Mortality Risks of Radiation-Induced Cancer. U. S. NRC NUREG Report GR-0011. Nuclear Regulatory Commission, Washington, D.C., 1994.

[19]   R. Sposto, D.L. Preston, Y. Shimizu, K. Mabuchi. The effect of diagnostic misclassification on non-cancer and cancer mortality dose-response in A-bomb survivors. Biometrics. 48:605-617, 1992.

[20]   A.S. Whittemore, G. Gong. Poisson regression with misclassified counts: application to cervical cancer mortality rates. Appl. Stat. 40; 81–93, 1991.

[21]   E.L. Frome, R.J. DuFrain. Maximum likelihood estimation for cytogenetic dose-response curves. Biometrics. 42:73–84, 1986.

[22]   P. McCullagh, J.A. Nelder. Generalized Linear Models. London, Chapman and Hall, 1983.

[23]   F.J. Aranda-Ordaz. An extension of the proportional-hazards model for grouped data. Biometrics. 39:109–117, 1983.

[24]   D.C. Hoaglin, R.E. Welsch. The Hat matrix in regression and ANOVA. Am. Statistician. 32:17–22, 1978.

[25]   R.D. Cook. Detection of influential observations in linear regression. Technometrics. 19:15–18, 1977.

[26]   R. Doll, A.B. Hill. Mortality of British doctors in relation to smoking: Observations on coronary thrombosis. Nat’l. Cancer Inst. Monogr. 19:205–268, 1966.

Chapter 8
SURVIVAL ANALYSIS

Survival data analysis involves the use of time-to-event data for both failures and censored outcomes. In order to perform survival analysis, two outcome measures are required from each object (animal, human subject, plant, etc.): (i) the time under observation from the beginning of the study until the object fails (“failure”) or the object withdraws or survives until study end without failure (“censored”), and (ii) whether or not the object failed (0-No, 1-Yes). NXG Explorer can handle two kinds of survival data analyses, which are Kaplan-Meier group-based analysis and Cox proportional hazards regression analysis.

8.1 Kaplan-Meier Analysis and Logrank Test

8.1.1 Product Limit Estimate of Survivorship Function

The primary goal of Kaplan-Meier (KM) analysis is to determine the product limit estimate of the survivorship function, S(t), based on the survival times and censoring status of n objects which have been observed for failure over a given follow-up period [1]. The survivorship function, S(t), is defined as the probability of surviving (not failing) beyond time t, and is only determined for failure times, not censoring times for objects which have dropped out of follow-up monitoring (i.e., lost to follow-up, missing). Objects that have been censored, that is, have not failed throughout the entire study or dropped out from follow-up, are assigned a censoring time equal to the time they left the study early (without failing) or the time until the end of the study. If a human research subject withdrew consent from a clinical trial before the end of the trial at 4.5 months after enrollment, then the subject’s survival (censoring) time is 4.5+ months, where + denotes censored (did not fail). Otherwise, if a subject failed at 13.7 months after treatment by having a primary outcome, for example, tumor progression, stroke, myocardial infarction, etc., then the survival (failure) time for this subject would be 13.7 months.

The table below lists n = 10  subjects who were monitored during a 65-day study (6 who failed and 4 who were censored) along with their survival times (failure times or censoring times) and product limit estimates of the survivorship function values, S (t)  calculated at the unique failure times.






Time, t   Rank, r   h(t) = (n−r)/(n−r+1)   S(t) = ∏_{tr≤t} h(tr)
5          1        9/10                   0.9
13+        2        -                      -
21+        3        -                      -
28         4        6/7                    0.771 = 9/10 × 6/7
31         5        5/6                    0.643 = 9/10 × 6/7 × 5/6
33+        6        -                      -
45         7        3/4                    0.482 = 9/10 × 6/7 × 5/6 × 3/4
53+        8        -                      -
58         9        1/2                    0.241 = 9/10 × 6/7 × 5/6 × 3/4 × 1/2
63         10       0                      0

+ denotes censored object, i.e. did not fail.

To construct the table, we first sort the censoring and failure times of objects in ascending order, so that the shortest time is first and the longest time is last. Next, assign a rank, r , to each observation using values ri = 1,2,...,n . At each unique failure time, calculate the hazard rate

h(t) = (n − r)/(n − r + 1),
(8.1)

where r is the rank of the observation at failure time t . The survivorship function value at each failure time t is then calculated as the product of all hazard rates up to time t using the relationship

S(t) = ∏_{tr ≤ t} h(tr).
(8.2)

The Nelson-Aalen baseline cumulative hazard function is then determined as

H0(t) = − log(S(t)).
(8.3)
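The product-limit calculation in the table above can be verified with a few lines of Python; the survival times and censoring flags below are exactly the 10-subject example, and the loop applies Eqs. (8.1) and (8.2) at each failure time:

import numpy as np

time  = np.array([5, 13, 21, 28, 31, 33, 45, 53, 58, 63], dtype=float)
event = np.array([1,  0,  0,  1,  1,  0,  1,  0,  1,  1])   # 1 = failed, 0 = censored (the "+" rows)

order = np.argsort(time)
time, event = time[order], event[order]
n = len(time)
S = 1.0
for r, (t, failed) in enumerate(zip(time, event), start=1):
    if failed:                                   # S(t) changes only at failure times
        S *= (n - r) / (n - r + 1)               # hazard h(t) of Eq. (8.1), folded into the product of Eq. (8.2)
        print(f"t = {t:>4.0f}   S(t) = {S:.3f}")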

8.1.2 Log-rank Test

Let τ be the greatest observed study time among the subjects being studied. Let t1 < t2 < ⋯ < ti < ... < tD be the ordered unique failure (death) times for the pooled sample. At time ti (i = 1, 2, ..., D) we observe dij deaths in the j th sample (j = 1, 2, ..., K) out of nij subjects at risk.

The hypotheses applied to test for inequality of hazard functions for each group is

H0 : h1(t) = h2(t) = ⋯ = hK(t) for all t ≤ τ
Ha : at least one of the hj(t) is different for some t ≤ τ

In order to develop a hypothesis test, we first obtain differences between the observed and expected number of failures, using the form

zj(τ) = ∑_{i=1}^{D} [ dij − nij (di/ni) ].
(8.4)

Following Lawless [2], the (K − 1) × (K − 1) variance-covariance matrix of the zj(τ) has elements

vjk(τ) = ∑_{i=1}^{D} [ di(ni − di) / ((ni − 1) ni) ] nij (δ − nik/ni),   j,k = 1, 2, ..., K − 1,
(8.5)

where δ = 1 if j = k. Note that ∑_{j=1}^{K} zj(τ) = 0, so the χ² test statistic with K − 1 d.f. is constructed by selecting any K − 1 of the zj(τ). The variance-covariance matrix of these statistics is given by the (K − 1) × (K − 1) matrix V, and again, only K − 1 of the variances and covariances are needed. The test statistic is based on the quadratic form

z⊤V⁻¹z = [ z1(τ)  z2(τ)  ⋯  zK−1(τ) ] V⁻¹ [ z1(τ)  z2(τ)  ⋯  zK−1(τ) ]⊤   (a 1 × 1 scalar)
(8.6)

An α level test of H0  rejects when χ2  is greater than the α th upper percentage point of a χ2  random variable with K − 1  degrees of freedom.
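The log-rank statistic of Eqs. (8.4)-(8.6) can also be computed directly. The sketch below builds the observed-minus-expected vector z and covariance matrix V from scratch for a small, made-up two-group dataset (not the PBC example that follows):

import numpy as np
from scipy import stats

def logrank(time, event, group):
    """K-group log-rank chi-square and p-value per Eqs. (8.4)-(8.6)."""
    groups = np.unique(group)
    K = len(groups)
    z = np.zeros(K)
    V = np.zeros((K, K))
    for t in np.unique(time[event == 1]):                 # loop over unique failure times
        at_risk = time >= t
        n_i = at_risk.sum()
        d_i = ((time == t) & (event == 1)).sum()          # total deaths at time t
        n_ij = np.array([(at_risk & (group == g)).sum() for g in groups])
        d_ij = np.array([((time == t) & (event == 1) & (group == g)).sum() for g in groups])
        z += d_ij - n_ij * d_i / n_i                      # observed minus expected, Eq. (8.4)
        if n_i > 1:
            c = d_i * (n_i - d_i) / (n_i - 1)
            p = n_ij / n_i
            V += c * (np.diag(p) - np.outer(p, p))        # Eq. (8.5)
    z, V = z[:-1], V[:-1, :-1]                            # any K-1 of the z_j, Eq. (8.6)
    chi2 = float(z @ np.linalg.solve(V, z))
    return chi2, 1.0 - stats.chi2.cdf(chi2, K - 1)

time  = np.array([5, 8, 12, 16, 23, 27, 30, 33, 41, 48], dtype=float)
event = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
group = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(logrank(time, event, group))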

Example 1 - Kaplan-Meier Group Test (Edema) of Primary Biliary Cirrhosis. In this example, we will run a KM analysis on the Edema categorical feature in the Primary Biliary Cirrhosis (PBC) data. To begin, import the pbc.xlsx Excel file. Next, select Analysis → Survival analysis → Kaplan-Meier group test from the Analysis pull-down menu, shown as follows:

PIC

Select edema as the grouping feature, fudays as the survival time feature, and dead as the failure feature, shown as follows:

PIC

Following the run, the treenodes will appear as follows:

PIC

Click on the Kaplan-Meier Survival By Group icon, and the survival plot will appear as follows:

PIC

Note: You can generate your own survivorship function, S(t)  , plots by opening the .csv output file, which contains sets of columns with time and S(t)  for each group. You would need to employ a graphics package which can generate line plots from multiple XYXY data, where each group consists of a pair of x- and y-values.

Click on the Logrank Test icon, and the results of the logrank test for equality of hazard (survival) functions between the 3 edema groups are provided:

PIC

The p-value indicates the hazard functions for the 3 groups are significantly different.

Kaplan-Meier Group Test (Drug) of Primary Biliary Cirrhosis. In this example, we will perform KM analysis of the drug effect, with two options for recoding. Recall that the drug feature is coded as 1 for drug and 2 for placebo. We now show how to create a new feature called treated (0-placebo, 1-drug) based on recoding and labeling of the drug and placebo groups. To get started, select Preferences → Labels and transforms, and in the editor open the script file called pbclabel.nxgs. Select Run, so that the new treated feature is generated. Next, select Analysis → Survival analysis → Kaplan-Meier group test, and specify dead as the failure feature, fudays as survival time, and treated as the group feature. In addition, repeat these steps while using drug (1-drug, 2-placebo). The results below illustrate how changing the coding resulted in a change in the plot and the KM logrank observed and expected counts. Quite basically, the groups are only swapped, but the point of this example is to remind the user to pay particular attention to coding schemes.

PIC

The logrank test results indicate that there is no significant difference in hazard (survivorship) functions between the two groups. Furthermore, you can notice how the survival curves overlap one another, providing visual confirmation that there is no significant drug effect on survival.

8.2 Cox Proportional Hazards Regression

Cox proportional hazards (PH) regression is a multiplicative maximum-likelihood technique to derive parameters for modeling “time-to-event” failures or survival times [3]. Each object used for fitting the model requires a failure(censoring) time and a binary outcome response indicating whether the object failed or was censored, that is, either left(migrated out of) the study or did not fail before the study ended. When an object fails, the time value is set to the duration of time under observation until the failure time (i.e., survival time), and its failure status is assigned a one (1) to indicate failure. However, an object that doesn’t fail under study will have a failure status of zero (0) and a survival time set equal to the time it was observed until the end of the study or the time it dropped out or migrated out of the study. Either way, every object under study receives two outcome codes: one for the survival time and one for the binary code indicating failure or no failure (censoring).

To begin, assume the ordered unique failure times t(1) < t(2) < ⋅⋅⋅ < t(k)  . We recall that subjects in the risk set at time t(i)  , i.e., R (t(i))  , are those subjects with survival times that are at least t(i)  . In other words, the group of subjects at risk at time t(i)  are subjects which survived up to time t(i)  and subjects who failed or were censored during or after t(i)  (since they too were also at risk before failure time t(i)  ). The probability contribution of each failure at time t(i)  conditional on those subjects at risk, R(t(i))  , is

P(fail | R(t(i)), β) = exp(∑_{j=1}^{p} βj xij) / ∑_{ℓ∈R(t(i))} exp(∑_{j=1}^{p} βj xℓj).
(8.7)

For k total failures, the conditional likelihood becomes

L(β) = ∏_{i=1}^{k} exp(∑_{j=1}^{p} βj xij) / ∑_{ℓ∈R(t(i))} exp(∑_{j=1}^{p} βj xℓj).
(8.8)

We first take the log of the likelihood above. Taking the log of the numerator gives us

log[ exp(∑_{j=1}^{p} βj xij) ] = ∑_{j=1}^{p} βj xij.
(8.9)

The denominator term in the likelihood cannot be reduced by taking the log and remains as

log[ ∑_{ℓ∈R(t(i))} exp(∑_{j=1}^{p} βj xℓj) ].
(8.10)

Looking at the likelihood kernel P(fail | R(t(i)), β), we also notice that it is a ratio, so we have to use the property for logarithms log(A/B) = log(A) − log(B). Combining the logarithms of the numerator and denominator of the kernel yields the log likelihood function

LL(β) = ∑_{i=1}^{k} { ∑_{j=1}^{p} βj xij − log[ ∑_{ℓ∈R(t(i))} exp(∑_{j=1}^{p} βj xℓj) ] }.
(8.11)

Next, let’s only work with partial derivatives for each i th risk set, so for now we can drop the summation symbol ∑_{i=1}^{k}. Therefore, we are dealing with the log likelihood for each risk set in the form

LL(β, R(t(i))) = ∑_{j=1}^{p} βj xij − log[ ∑_{ℓ∈R(t(i))} exp(∑_{j=1}^{p} βj xℓj) ].
(8.12)

Now we take the first partial derivative of the left term ∑_{j=1}^{p} βj xij in the log likelihood function w.r.t. each β term to get the score vector. We already know that for the function

f (x) = β0 + β1x1 + β2x2,
(8.13)

the partial derivative of the function with respect to β1  is

∂f(x)/∂β1 = x1.
(8.14)

Therefore, the first partial derivative for each β in the left term of the log likelihood is

∂(∑_{j=1}^{p} βj xij)/∂βj = xij.
(8.15)

At this point, we know that the first part of the score function is

s(βj) = xij ...
(8.16)

We are left with the right term of the log likelihood in the form

log[ ∑_{ℓ∈R(t(i))} exp(∑_{j=1}^{p} βj xℓj) ].
(8.17)

Notice that this is a log function. We know that the derivative of the logarithm of some function u, log(u), is ∂u/u, i.e., the derivative of u divided by u itself. This gives us a hint that part of the right term of the score contains the ratio

(∂u/∂βj) / ∑_{ℓ∈R(t(i))} exp(∑_{j=1}^{p} βj xℓj).
(8.18)
Since we already handled the logarithm, we can now attack the function itself. Our target is βj  , and we note that the subscript  j indicates the coefficients are not a function of each subject ℓ in the risk set R(t(i))  considered. Thus, β values don’t change across risk sets, so their addition across risk sets (and anywhere they appear in the summation) has to be recognized and left intact. Given this, we can now work with the terms for a single individual ℓ in the risk set who contributes the factor

exp(∑_{j=1}^{p} βj xℓj).
(8.19)

Taking the first partial derivative of this function w.r.t. βj  gives

∂[ exp(∑_{j=1}^{p} βj xℓj) ]/∂βj = xℓj exp(∑_{j=1}^{p} βj xℓj).
(8.20)

For the entire group of subjects in the risk set, we can now include the first partial derivative into the summation and rewrite this as

∑_{ℓ∈R(t(i))} xℓj exp(∑_{j=1}^{p} βj xℓj).
(8.21)

Going back to our original log function on the right side of the log likelihood, we now have

\frac{\sum_{\ell \in R(t_{(i)})} x_{\ell j} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}{\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}.
(8.22)

For the risk set considered, we now have the score

s(\beta_1) = x_{i1} - \frac{\sum_{\ell \in R(t_{(i)})} x_{\ell 1} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}{\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}.
(8.23)

Now that the first partial derivative (score) of the log-likelihood function w.r.t. \beta_1 is solved, we can calculate the second partial derivative of the log-likelihood function w.r.t. \beta_2. First, we drop the term x_{i1} on the left of the first partial derivative since it is independent of \beta_2. Looking at the remaining terms, we notice that the first partial derivative is a quotient in the form

\frac{\sum_{\ell \in R(t_{(i)})} x_{\ell 1} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}{\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)},
(8.24)

so we use the quotient rule for differentiation. The quotient rule is

\frac{\partial}{\partial x}\left[\frac{f(x)}{g(x)}\right] = \frac{g(x)\frac{\partial f(x)}{\partial x} - f(x)\frac{\partial g(x)}{\partial x}}{g(x)^2},
(8.25)

so we have

\frac{\partial}{\partial \beta_2}\left[\frac{f(x)}{g(x)}\right] = \left[\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right) \frac{\partial \sum_{\ell \in R(t_{(i)})} x_{\ell 1} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}{\partial \beta_2} - \sum_{\ell \in R(t_{(i)})} x_{\ell 1} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right) \frac{\partial \sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}{\partial \beta_2}\right] \Bigg/ \left[\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)\right]^{2},
(8.26)

and after rearranging gives

\frac{\partial}{\partial \beta_2}\left[\frac{f(x)}{g(x)}\right] = \left[\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right) \sum_{\ell \in R(t_{(i)})} x_{\ell 1} x_{\ell 2} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right) - \sum_{\ell \in R(t_{(i)})} x_{\ell 1} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right) \sum_{\ell \in R(t_{(i)})} x_{\ell 2} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)\right] \Bigg/ \left[\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)\right]^{2},
(8.27)

and finally

\frac{\partial}{\partial \beta_2}\left[\frac{f(x)}{g(x)}\right] = \frac{\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right) \sum_{\ell \in R(t_{(i)})} x_{\ell 1} x_{\ell 2} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}{\left[\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)\right]^{2}} - \frac{\sum_{\ell \in R(t_{(i)})} x_{\ell 1} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right) \sum_{\ell \in R(t_{(i)})} x_{\ell 2} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}{\left[\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)\right]^{2}}.
(8.28)

Cancelling out terms, we get

I(1,2) = \frac{\sum_{\ell \in R(t_{(i)})} x_{\ell 1} x_{\ell 2} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}{\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)} - \frac{\sum_{\ell \in R(t_{(i)})} x_{\ell 1} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right) \sum_{\ell \in R(t_{(i)})} x_{\ell 2} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)}{\left[\sum_{\ell \in R(t_{(i)})} \exp\left(\sum_{j=1}^{p} \beta_j x_{\ell j}\right)\right]^{2}},
(8.29)

which forms element (1,2) of the information matrix (the negative of the Hessian matrix of second partial derivatives). Each element in the information matrix is calculated using the equation above by substituting in the appropriate values for subjects in all risk sets. A running total of the elements in the score vector and information matrix is kept as the covariates (e.g., x_{\ell 1}) for each subject are looped through within each risk set and over each \beta_j term.

A gradient ascent approach is employed using the Newton-Raphson method to find maximum likelihood estimates of the solution parameters β in the form

\beta_{i+1} = \beta_{i} + I(\beta)^{-1} s(\beta).
(8.30)

Convergence is reached when ‖β_{i+1} − β_{i}‖ < 𝜖. Values of 𝜖 are typically in the range 10^{−8} ≤ 𝜖 ≤ 10^{−4}.
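The sketch below (again Python/NumPy, for illustration only and not Explorer's internal code) accumulates the score vector and information matrix over the risk sets as in (8.23) and (8.29), and then applies the Newton-Raphson updates of (8.30) with the convergence check described above. The data values are hypothetical and assume no tied failure times.

```python
import numpy as np

def cox_score_info(beta, times, events, X):
    """Score vector (8.23) and information matrix (8.29), accumulated over risk sets."""
    n, p = X.shape
    w = np.exp(X @ beta)                         # exp(sum_j beta_j x_lj) per subject
    score = np.zeros(p)
    info = np.zeros((p, p))
    for i in np.where(events == 1)[0]:           # one term per failure (risk set)
        at_risk = times >= times[i]
        wr, Xr = w[at_risk], X[at_risk]
        s0 = wr.sum()                            # sum_l exp(...)
        s1 = Xr.T @ wr                           # sum_l x_lj exp(...)
        s2 = (Xr * wr[:, None]).T @ Xr           # sum_l x_lj x_lk exp(...)
        xbar = s1 / s0
        score += X[i] - xbar                     # x_ij minus risk-set weighted mean (8.23)
        info += s2 / s0 - np.outer(xbar, xbar)   # element (j,k) as in (8.29)
    return score, info

def cox_newton_raphson(times, events, X, eps=1e-8, max_iter=25):
    """Newton-Raphson updates (8.30) with convergence check on the step size."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        s, I = cox_score_info(beta, times, events, X)
        step = np.linalg.solve(I, s)             # I(beta)^{-1} s(beta)
        beta = beta + step
        if np.linalg.norm(step) < eps:
            break
    return beta, np.linalg.inv(I)                # coefficients and V(beta_j, beta_k)

# Hypothetical toy data (assumed values, untied failure times).
times  = np.array([5.0, 8.0, 12.0, 3.0, 9.0, 6.5, 2.0, 11.0])
events = np.array([1,   1,   0,    1,   1,   0,   1,   1   ])
X      = np.column_stack([np.array([0.5, 1.2, 0.3, 2.0, 0.7, 1.5, 0.1, 0.9]),
                          np.array([1,   0,   1,   0,   1,   0,   1,   0  ])])
beta_hat, V = cox_newton_raphson(times, events, X)
print(beta_hat)
```

The inverse information matrix returned by the sketch plays the role of the variance-covariance matrix of the coefficients used by the residual diagnostics in the next section.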

8.2.1 Regression Diagnostics

This section covers diagnostics for assessing the fit of a Cox proportional hazards regression run, as well as determining overly influential observations used during an analysis.

Cox-Snell (CS) residuals. The Cox-Snell (CS) residual  [4] for the i th observation is

r_i^{cs} = \hat{H}_0(t) \exp\left(\sum_{j=1}^{p} x_{ij} \beta_j\right),
(8.31)

where x_{ij} is the covariate value for the ith object and jth feature, \beta_j is the regression coefficient for the jth feature, and \hat{H}_0(t) is Breslow's cumulative hazard function [4], defined as

\hat{H}_0(t) = \sum_{t_i \le t} \left( \frac{d_i}{\sum_{\ell \in R(t_i)} \exp\left(\sum_{j=1}^{p} x_{\ell j} \beta_j\right)} \right).
(8.32)

Once the CS residuals are determined for each object, their values are employed as the survival time during a new KM analysis, with the same censoring information for each object. The values of S(t) from the KM analysis are then transformed to new values of the Nelson-Aalen (NA) baseline cumulative hazard using (8.3). A plot of the CS residual vs. the Nelson-Aalen baseline cumulative hazard derived from using CS residuals as survival times is then generated. If the Cox PH model fits well, then the CS residuals will fall closely on a line with a slope of one, assuming that the CS residuals follow an exponential distribution with parameter unity, i.e., exp(1).
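As an illustration of this goodness-of-fit check, the following sketch (Python/NumPy, hypothetical variable names and values) computes Breslow's baseline cumulative hazard (8.32), the CS residuals (8.31), and then a cumulative hazard of the residuals. Here a direct Nelson-Aalen estimator is used in place of the KM-then-transform route described above; this is an assumed shortcut that yields essentially the same diagnostic.

```python
import numpy as np

def breslow_cumhaz(times, events, lp):
    """Breslow baseline cumulative hazard (8.32), evaluated at each subject's time."""
    w = np.exp(lp)
    event_times = np.unique(times[events == 1])
    # increment d_i / sum_{l in R(t_i)} exp(lp_l) at each unique failure time
    increments = np.array([np.sum(events[times == t]) / np.sum(w[times >= t])
                           for t in event_times])
    return np.array([increments[event_times <= t].sum() for t in times])

def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard estimate at each subject's time."""
    event_times = np.unique(times[events == 1])
    increments = np.array([np.sum(events[times == t]) / np.sum(times >= t)
                           for t in event_times])
    return np.array([increments[event_times <= t].sum() for t in times])

# Hypothetical inputs: times/events and the fitted linear predictor lp = X @ beta_hat.
times  = np.array([5.0, 8.0, 12.0, 3.0, 9.0, 6.5, 2.0, 11.0])
events = np.array([1,   1,   0,    1,   1,   0,   1,   1   ])
lp     = np.array([0.2, -0.1, 0.4, 0.8, -0.3, 0.1, 0.6, -0.2])

r_cs = breslow_cumhaz(times, events, lp) * np.exp(lp)   # Cox-Snell residuals (8.31)
H_cs = nelson_aalen(r_cs, events)                       # cumulative hazard treating r_cs as "times"
# Goodness of fit: the points (r_cs, H_cs) should lie near the 45-degree line.
print(np.column_stack([r_cs, H_cs]))
```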

Martingale (M) residuals. The Martingale residual  [4] is determined for each risk set (for objects having unique failure times) using the relationship

r_i^{M} = d_i - r_i^{cs},
(8.33)

where di  is the censoring indicator for the i th risk set.

Deviance (D) residuals. The deviance residual [4] is also determined for each risk set using the relationship

r_i^{dev} = \mathrm{sign}(r_i^{M}) \sqrt{-2\left[ r_i^{M} + d_i \log(d_i - r_i^{M}) \right]},
(8.34)

where r_i^{M} is the Martingale residual and d_i is the censoring indicator for the ith risk set.
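A short sketch of (8.33) and (8.34), assuming hypothetical CS residuals and censoring indicators, is:

```python
import numpy as np

# Hypothetical Cox-Snell residuals r_cs and censoring indicators d (assumed values).
r_cs = np.array([0.35, 0.80, 0.15, 1.40, 0.55])
d    = np.array([1,    1,    0,    1,    0   ])

r_mart = d - r_cs                                       # Martingale residuals (8.33)

# Deviance residuals (8.34); the d_i*log(...) term is taken as 0 when d_i = 0.
log_term = np.where(d > 0, d * np.log(d - r_mart), 0.0)
r_dev = np.sign(r_mart) * np.sqrt(-2.0 * (r_mart + log_term))

print(np.column_stack([r_mart, r_dev]))
```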

Schoenfeld (SCH) residuals. Schoenfeld residuals  [4] are determined for each observation and each feature, based on the functional form

r_{ij}^{sch} = d_i (x_{ij} - a_{ij}),
(8.35)

where a_{ij} is

a_{ij} = \frac{\sum_{\ell \in R(t_i)} x_{\ell j} \exp\left(\sum_{m=1}^{p} x_{\ell m} \beta_m\right)}{\sum_{\ell \in R(t_i)} \exp\left(\sum_{m=1}^{p} x_{\ell m} \beta_m\right)},
(8.36)

and R(ti)  is the risk set of individuals at risk up to time ti  for the i th observation.
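The following sketch (Python/NumPy, hypothetical fitted values) computes the Schoenfeld residuals of (8.35) using the risk-set weighted mean a_{ij} of (8.36); rows for censored observations are zero because d_i = 0.

```python
import numpy as np

def schoenfeld_residuals(times, events, X, beta):
    """Schoenfeld residuals (8.35) with the risk-set weighted mean a_ij of (8.36)."""
    n, p = X.shape
    w = np.exp(X @ beta)
    resid = np.zeros((n, p))
    for i in range(n):
        at_risk = times >= times[i]              # risk set R(t_i)
        a_i = (X[at_risk].T @ w[at_risk]) / w[at_risk].sum()
        resid[i] = events[i] * (X[i] - a_i)      # zero rows for censored observations
    return resid

# Hypothetical fitted model (assumed values for times, events, X, beta_hat).
times  = np.array([5.0, 8.0, 12.0, 3.0, 9.0])
events = np.array([1,   1,   0,    1,   1  ])
X      = np.array([[0.5, 1.0], [1.2, 0.0], [0.3, 1.0], [2.0, 0.0], [0.7, 1.0]])
beta_hat = np.array([0.4, -0.2])
print(schoenfeld_residuals(times, events, X, beta_hat))
```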

Scaled Schoenfeld (SCA) residuals. Scaled Schoenfeld residuals [4], on the other hand, are determined as

r_i^{sca} = \hat{\beta} + d\, V(\hat{\beta}_j, \hat{\beta}_k)\, r_i^{sch},
(8.37)

where \hat{\beta} is the p × 1 vector of regression coefficients, d is the total number of failures, V(\hat{\beta}_j, \hat{\beta}_k) is the p × p variance-covariance matrix of the coefficients, and r_i^{sch} is a p × 1 vector of Schoenfeld residuals for the ith observation.

The table below lists the representation of and existence at each level of measurement for the residuals and hazards described above. The Cox-Snell, Martingale, and deviance residuals along with the Nelson-Aalen cumulative hazard are only determined for observations with unique failure times, whereas the Schoenfeld and scaled Schoenfeld residuals are determined for each observation and each feature.




Residual                   Representation  Existence
Cox-Snell                  Risk sets       Unique failure time, t
Martingale                 Risk sets       Unique failure time, t
Deviance                   Risk sets       Unique failure time, t
Nelson-Aalen cum. hazard   Risk sets       Unique failure time, t
Schoenfeld                 Observations    Each observation and feature, i,j
Scaled Schoenfeld          Observations    Each observation and feature, i,j



Plots for proportional hazards assumption. Plots of − log(− log[S(t)])  vs. log(t)  for categorical features such as treatment group or diagnostic group can be constructed using stratum-specific baseline cumulative hazard, defined as

\hat{H}_k(t) = \sum_{t_{ik} \le t} \left( \frac{d_{ik}}{\sum_{\ell \in R_k(t)} \exp\left(\sum_{j=1}^{p} x_{\ell j} \beta_j\right)} \right),
(8.38)

where k represents the k th category, tik  is a failure time in the k th stratum, dik  is the number of failures at time tik  , and Rk (t)  is the risk set for censored and failed objects exclusively in the k th category at risk up to time tik  . This calculation requires use of only the covariate values, unique failure times, and risk sets containing objects in the k th group, which results in stratum-specific baseline cumulative hazard. The stratum-specific survivorship function value for each observation in the k th category and t th risk set is then determined as

S_{ik}(t) = \exp\left\{ -\hat{H}_k(t) \exp\left(\mathbf{x}_i^{\top} \beta\right) \right\}.
(8.39)
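A minimal sketch of the stratum-specific calculation is shown below, assuming a fitted linear predictor and a hypothetical two-level stratum feature; approximately parallel curves of −log(−log S) vs. log(t) across strata support the PH assumption.

```python
import numpy as np

def stratum_cumhaz(times, events, lp, strata, k):
    """Stratum-specific baseline cumulative hazard (8.38) for category k."""
    in_k = strata == k
    t_k, e_k, w_k = times[in_k], events[in_k], np.exp(lp[in_k])
    fail_times = np.unique(t_k[e_k == 1])
    increments = np.array([np.sum(e_k[t_k == t]) / np.sum(w_k[t_k >= t])
                           for t in fail_times])
    return fail_times, np.cumsum(increments)

# Hypothetical data: linear predictor lp from the fitted model, a 2-level stratum feature.
times  = np.array([5.0, 8.0, 12.0, 3.0, 9.0, 6.5, 2.0, 11.0])
events = np.array([1,   1,   0,    1,   1,   0,   1,   1   ])
lp     = np.array([0.2, -0.1, 0.4, 0.8, -0.3, 0.1, 0.6, -0.2])
strata = np.array([1,   1,   1,    1,   2,   2,   2,   2   ])

for k in (1, 2):
    t_k, H_k = stratum_cumhaz(times, events, lp, strata, k)
    S_k = np.exp(-H_k)                           # baseline survivorship per stratum
    # Coordinates for the PH check: plot -log(-log S) against log(t) for each stratum.
    print(k, np.column_stack([np.log(t_k), -np.log(-np.log(S_k))]))
```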

Tests of proportional hazards assumption. Grambsch and Therneau  [5] introduced a test based on scaled Schoenfeld residuals to determine if the PH assumption holds for each feature, and for the global collection of features used in a model. Under the null hypothesis, the PH assumption will hold true if the slope of scaled Schoenfeld residuals over some function of time, f(t)  , is zero. The effect of a specific feature on violation of the PH assumption can be tested using the following zero-slope test statistic

\chi_j^2 = \frac{\left[ \sum_{i=1}^{n} \{ d_i f(t_i) - \bar{f}(t) \}\, r_{ij}^{sca} \right]^2}{v_{jj} \sum_{i=1}^{n} \{ d_i f(t_i) - \bar{f}(t) \}^2 \sum_{i=1}^{n} d_i},
(8.40)

where f(t) is either the survival time, log survival time, etc., \bar{f}(t) = (\sum d_i t_i)/(\sum d_i), and v_{jj} is a diagonal element of the variance-covariance matrix V(\hat{\beta}_j, \hat{\beta}_k) of the coefficients. Any feature yielding a significant slope test should either be used as a strata term for stratified Cox PH regression, or be removed from the model. In addition, an overall global zero-slope test for all features can be evaluated using the following formulation

\chi_g^2 = \mathbf{e}^{\top} \mathbf{Y} \mathbf{e},
(8.41)

where each element e_j in the p × 1 vector e is \sum_{i=1}^{n} \{ d_i f(t_i) - \bar{f}(t) \} r_{ij}^{sca}, and each element y_{jk} in the p × p matrix Y is v_{jk} \left( \sum_{i=1}^{n} \{ d_i f(t_i) - \bar{f}(t) \}^2 \right) \left( \sum_{i=1}^{n} d_i \right). If the global zero-slope test is significant, then the features employed in the model result in violation of the PH assumption.
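The following sketch illustrates the per-feature zero-slope statistic in the spirit of (8.40). It restricts the sums to the failure observations, so the constant \hat{\beta} part of the scaled Schoenfeld residuals (8.37) cancels after centering f(t); the inputs (Schoenfeld residuals and variance-covariance matrix) are hypothetical values that would normally come from the fitted model.

```python
import numpy as np

def ph_zero_slope_tests(times, events, sch, V, f=None):
    """Per-feature zero-slope statistics in the spirit of (8.40).

    sch : n x p Schoenfeld residuals, V : p x p variance-covariance of the coefficients,
    f   : transform of time (identity, i.e. survival time itself, by default).
    Sums run over the failure observations, so the constant beta-hat term of the
    scaled Schoenfeld residuals (8.37) drops out after centering f(t).
    """
    ev = events == 1
    ft = times[ev] if f is None else f(times[ev])
    d_tot = ev.sum()
    sca = d_tot * (sch[ev] @ V)                  # scaled residuals, beta-hat term omitted
    centered = ft - ft.mean()                    # f(t_i) minus its mean over the failures
    num = (centered[:, None] * sca).sum(axis=0) ** 2
    den = np.diag(V) * np.sum(centered ** 2) * d_tot
    return num / den                             # one chi-square statistic (1 df) per feature

# Hypothetical inputs (assumed values); sch and V would come from the fitted Cox model.
times  = np.array([5.0, 8.0, 12.0, 3.0, 9.0])
events = np.array([1,   1,   0,    1,   1  ])
sch    = np.array([[ 0.1, -0.2], [-0.3, 0.1], [0.0, 0.0], [0.4, -0.1], [-0.2, 0.2]])
V      = np.array([[0.05, 0.01], [0.01, 0.08]])
print(ph_zero_slope_tests(times, events, sch, V))
```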

Example 2 - Cox PH regression of Primary Biliary Cirrhosis. In this example, we will be using the Mayo Clinic data for Primary Biliary Cirrhosis again to perform Cox PH regression.

First, import the PBC survival dataset. From the Analysis pull-down menu, select Survival analysis → Cox proportional hazards regression, shown as:

PIC

Once the following popup window appears, specify dead as the censoring feature, fudays as the survival time feature, select edema and stage as categorical features, and finally select the rest of the continuous and binary variables as covariates for the model:

PIC

Once the run is complete, the following treenode icons will appear:

PIC

Click on the Coefficients icon, and the univariate regression coefficients and their hazards ratios (95% CI) will be listed at the top of the output:

PIC

The univariate regression coefficient for treatment reveals there is no significant treatment effect on survival, which coincides with the KM example performed on these data in the previous section. Many of the other covariates are significantly related to survival, and these should be included in the multiple regression model. Next, the multiple regression coefficients are listed in the output:

PIC

We notice that treatment is not significant when adjusting for potential nuisance and medical history factors among the subjects. Finally, the hazard ratios (95% CI) for the multiple regression model are listed:

PIC

Among the hazard ratios, we can observe quite high values for the edema_3 and stage_4 levels of edema and stage of disease, further attesting to the fact that these covariates should be included with the treatment feature (treatment) in the multiple regression model.

The proportional hazard assumption tests are shown below:

PIC

and indicate that the group 2 category for edema violates the PH assumption. See the −log(−log[S(t)]) vs. log(t) plot later in this section to observe this violation.

The predicted survival for each edema group is shown below:

PIC

The predicted cumulative hazard for each edema category is shown below:

PIC

The −log(−log[S(t)]) vs. log(t) plot for the 3 edema groups for this model is shown below:

PIC

and it is apparent that the group 2 survival curve crosses the other curves, suggesting that the proportional hazard assumption is violated for the categorical edema feature. The Nelson-Aalen cumulative hazard vs. Cox-Snell residual plot is shown below:

PIC

The Martingale residuals are as follows:

PIC

The deviance residuals are as follows:

PIC

Bibliography

[1]   E.T. Lee. Statistical Methods for Survival Data Analysis. Belmont(CA), Lifelong Learning Publications, 1980. (ISBN: 0-534-97987-4).

[2]   J.F. Lawless. Statistical Models and Methods for Lifetime Data, 2nd Edition. New York(NY), John Wiley and Sons, 2002. (ISBN: 978-0-471-37215-8).

[3]   J.D. Kalbfleisch, R.L. Prentice. The Statistical Analysis of Failure Time Data. New York(NY), John Wiley and Sons, 1980. (ISBN: 978-0-471-05519-0).

[4]   J.P. Klein, M.L. Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. New York(NY), Springer-Verlag Press, 1997. (ISBN: 0-387-94829-5).

[5]   P.M. Grambsch, T.M. Therneau. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika. 81:515-526, 1994.

Chapter 9
DATASETS USED

9.1 Plasma Retinol

The Retinol dataset is based on results of a cross-sectional study to investigate the relationship between personal characteristics and dietary factors, and plasma concentrations of retinol, beta-carotene, and other carotenoids [1, 2, 3]. Study subjects (n = 314) were patients who had an elective surgical procedure during a three-year period to biopsy or remove a lesion of the lung, colon, breast, skin, ovary, or uterus that was found to be non-cancerous. Plasma beta-carotene levels were log-transformed prior to the analyses due to severe asymmetry of the residuals on the original scale. For log beta-carotene, dietary intake, regular use of vitamins, and intake of fiber were associated with higher plasma concentrations, while Quetelet Index (defined as weight/height² in units of kg/m²) and cholesterol intake were associated with lower plasma levels, adjusting for the other factors (R² = 0.5). There was one extremely high leverage point in alcohol consumption that was deleted prior to the analyses.

Plasma concentrations of retinol and beta-carotene were not correlated. The authors conclude that there is wide variability in plasma concentrations of these micronutrients in humans, and that much of this variability is associated with dietary habits and personal characteristics. A better understanding of the physiological relationship between some personal characteristics and plasma concentrations of these micronutrients will require further study.

Authorization: This data set can be used to demonstrate multiple regression, transformations, categorical variables, outliers, pooled tests of significance and model building strategies.




Feature      Description                                            Scale
age          Age (years)                                            Continuous
sex          Sex (1=Male, 2=Female)                                 Categorical(1,2)
smokstat     Smoking status (1=Never, 2=Former, 3=Current Smoker)   Categorical(1,2,3)
quetelet     Quetelet (weight/(height^2))                           Continuous
vituse       Vitamin Use (1=Yes, 2=Sometimes, 3=No)                 Categorical(1,2,3)
calories     Number of calories consumed per day                    Continuous
fat          Grams of fat consumed per day                          Continuous
fiber        Grams of fiber consumed per day                        Continuous
alcohol      Number of alcoholic drinks consumed per week           Continuous
cholesterol  Cholesterol consumed (mg per day)                      Continuous
betadiet     Dietary beta-carotene consumed (mcg per day)           Continuous
retdiet      Dietary retinol consumed (mcg per day)                 Continuous
betaplasma   Plasma beta-carotene (ng/ml)                           Continuous
retplasma    Plasma Retinol (ng/ml)                                 Continuous



SIZE: Excel with 314 objects, 14 features.

9.2 Low Birth Weight

Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman’s behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.

The goal of this study [4] was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on n = 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.

The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.




Feature  Description                                                                   Scale
id       Identification Code                                                           NA
low      Low Birth Weight (0=Birth Weight ≥ 2500 g, 1=Birth Weight < 2500 g)           Binary(0,1)
age      Age of the Mother in Years                                                    Continuous
lwt      Weight in Pounds at the Last Menstrual Period                                 Continuous
race     Race (1=White, 2=Black, 3=Other)                                              Categorical(1,2,3)
smoke    Smoking Status During Pregnancy (1=Yes, 0=No)                                 Binary(0,1)
ptl      History of Premature Labor (0=None, 1=One, etc.)                              Binary(0,1)
ht       History of Hypertension (1=Yes, 0=No)                                         Binary(0,1)
ui       Presence of Uterine Irritability (1=Yes, 0=No)                                Binary(0,1)
ftv      Number of Physician Visits During the First Trimester (0=None, 1=One, etc.)   Continuous
bwt      Birth Weight in Grams                                                         Continuous



SIZE: Excel with 189 objects, 11 features.

9.3 Small Round Blue-Cell Tumors

The purpose of this study  [5] was to develop a method of classifying cancers to specific diagnostic categories based on their gene expression signatures using artificial neural networks (ANNs). ANNs were trained using the small, round blue-cell tumors (SRBCTs) as a model. These cancers belong to four distinct diagnostic categories (Ewing’s sarcoma, Burkitt’s lymphoma, neuroblastoma, and rhabdomyosarcoma) and often present diagnostic dilemmas in clinical practice. The ANNs correctly classified all samples and identified the genes most relevant to the classification. Expression of several of these genes has been reported in SRBCTs, but most have not been associated with these cancers. To test the ability of the trained ANN models to recognize SRBCTs, the study team analyzed additional blinded samples that were not previously used for the training procedure, and correctly classified them in all cases. This study demonstrates the potential applications of these methods for tumor diagnosis and the identification of candidate targets for therapy.

The 23 feature genes listed in the table below were identified from the full set of 2,308 genes using greedy PTA feature selection.




Feature                          Description       Scale
fc fragment of igg               Gene expression   Continuous
growth arrest-specific 1-36582   Gene expression   Continuous
fgfr4                            Gene expression   Continuous
antigen-monoclonal antibodie     Gene expression   Continuous
apj receptor                     Gene expression   Continuous
sarcoglycan                      Gene expression   Continuous
histone deacetylase              Gene expression   Continuous
actin alpha 2                    Gene expression   Continuous
eh domain containing             Gene expression   Continuous
ests-66552                       Gene expression   Continuous
n-acetylglucosamine receptor 1   Gene expression   Continuous
mitogen-activated protein kina   Gene expression   Continuous
non-metastatic cells 4           Gene expression   Continuous
ribonucleaseangiogenin inhibit   Gene expression   Continuous
glycine amidinotransferase       Gene expression   Continuous
translocation protein 1          Gene expression   Continuous
recoverin                        Gene expression   Continuous
igf-2                            Gene expression   Continuous
transmembrane 4 superfamily      Gene expression   Continuous
myh-1c                           Gene expression   Continuous
plasminogen activator            Gene expression   Continuous
ccaat                            Gene expression   Continuous
n-cadherin 2                     Gene expression   Continuous
class                            Diagnostic class (1=Ewing's sarcoma (EWS), 2=Burkitt's lymphoma (BL), 3=neuroblastoma (NB), 4=rhabdomyosarcoma (RMS))   Categorical(1,2,3,4)



SIZE: Excel with 63 objects, 24 features.

9.4 British Doctors Cardiovascular and Smoking Data

This dataset is very amenable to Poisson regression because the data are presented in terms of the number of cases and person-years of follow-up of British doctors, with coronary disease as a function of smoking (yes/no). Table 9.1 lists the number of coronary deaths as a function of age among non-smoking and smoking doctors.

Table 9.1: Deaths from coronary disease among British male doctors  [6].




Feature    Description                          Scale
Cases      # cases                              Continuous
PY         Person-years of follow-up            Continuous
Age 35-44  Age category 35-44 (1=Yes, 0=No)     Binary(0,1)
Age 45-54  Age category 45-54 (1=Yes, 0=No)     Binary(0,1)
Age 55-64  Age category 55-64 (1=Yes, 0=No)     Binary(0,1)
Age 65-74  Age category 65-74 (1=Yes, 0=No)     Binary(0,1)
Age 75-84  Age category 75-84 (1=Yes, 0=No)     Binary(0,1)
Smoke      Smoking history (1=Yes, 0=No)        Binary(0,1)



SIZE: Excel with 10 rows, 8 features.

9.5 Primary Biliary Cirrhosis

The data are from the Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984 [7]. A total of 424 PBC patients, referred to the Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo-controlled trial of the drug D-penicillamine. The first 312 cases in the data set participated in the randomized trial and contain largely complete data. An additional 112 cases did not participate in the clinical trial, but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis, so the data here are on an additional 106 cases as well as the 312 randomized participants. Missing data items are denoted by a period.

This dataset is nearly identical to the data set found in appendix D of Fleming and Harrington. The differences with this data set are: age is in days, status is coded with 3 levels, and the sex and stage variables are not missing for obs 313-418. Table 9.2 lists the data format.

Table 9.2: Primary Biliary Cirrhosis dataset  [7].




Feature       Description                                                                Scale
id            case number                                                                NA
treatment     0=Placebo, 1=D-penicillamine                                               Binary(0,1)
age           age in years                                                               Continuous
sexfem        sex (0=Male, 1=Female)                                                     Binary(0,1)
ascites       presence of ascites (0=No, 1=Yes)                                          Binary(0,1)
hepatomegaly  presence of hepatomegaly (0=No, 1=Yes)                                     Binary(0,1)
spiders       presence of spiders (0=No, 1=Yes)                                          Binary(0,1)
edema         presence of edema (1=no edema and no diuretic therapy for edema; 2=edema present without diuretics, or edema resolved by diuretics; 3=edema despite diuretic therapy)   Categorical(1,2,3)
bilirubin     serum bilirubin in mg/dl                                                   Continuous
albumin       albumin in gm/dl                                                           Continuous
copper        urine copper in ug/day                                                     Continuous
alkphos       alkaline phosphatase in U/liter                                            Continuous
sgot          SGOT in U/ml                                                               Continuous
platelets     platelets per cubic ml / 1000                                              Continuous
cholesterol   serum cholesterol in mg/dl                                                 Continuous
prothrombin   prothrombin time in seconds                                                Continuous
stage         histologic stage of disease (1,2,3,4)                                      Categorical(1,2,3,4)
fudays        number of days between registration and the earlier of death, transplantation, or study analysis time in July 1986   Continuous
dead          failure status (0=Alive, 1=Dead)                                           Binary(0,1)



SIZE: 312 objects, 19 features.

Bibliography

[1]   D.W. Nierenberg, T.A. Stukel, J.A. Baron, B.J. Dain, E.R. Greenberg. Determinants of plasma levels of beta-carotene and retinol. Am. J. Epidemiol. 130:511–521, 1989.

[2]   http://lib.stat.cmu.edu

[3]   http://staff.pubhealth.ku.dk/~tag/Teaching/share/data/Plasma.html

[4]   D.W. Hosmer, S. Lemeshow, R.X. Sturdivant. Applied Logistic Regression: Third Edition. New York (NY), John Wiley and Sons, 2013.

[5]   J. Khan, J.S. Wei, M. Ringnér, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, P.S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine. 7:673–679, 2001.

[6]   R. Doll, A.B. Hill. Mortality of British doctors in relation to smoking: Observations on coronary thrombosis. Nat’l. Cancer Inst. Monogr. 19:205–268, 1966.

[7]   T.R. Fleming, D.P. Harrington. Counting Processes and Survival Analysis. New York (NY), John Wiley and Sons, 1991.

Index

Association
    Covariance, 83
    Euclidean distance, 83
    Force plots, 88, 90
    Minimum spanning tree, 88
    Pearson correlation, 83
    Prim’s algorithm, 88
    Spearman rank correlation, 83
Average, 18

Behrens-Fisher problem, 52
Binary features, 5

Categorical features, 5
Chi-squared test, 71
Classifier
    Polytomous logistic regression (PLOG), 130
Clipboard
    Copying HTML output, 15
    Copying output images, 15
Coding effects
    Corner-point constraints, 127
    Sum-to-zero constraints, 127
Color
    Background of images, 11
    Cluster heat maps, 11
    Datagrid selected items, 11
    Filtered records, 11
    Histogram bars, 11
    Lines in cluster images, 11
    Preferences, 5
    Text in cluster images, 11
Contingency table analysis, 71
Contingency tables
    r × c, 75
    Multiway, 75
Continuous features, 5
Copying
    HTML output, 15
    images, 15
Correlation
    Matrix, 94
    Pearson product moment, 83
    Spearman rank, 83
Covariance, 83

Datasets
    British doctors, 171
    Low birth weight, 170
    Primary Biliary Cirrhosis, 172
    Retinol, 169
    Small round blue-cell tumors, 170
Default settings, 5
    Categorical feature criterion, 7
    Categorical table iterations, 8
    Class discovery run outputs, 8
    Class prediction run outputs, 9
    Convert input text to numeric, 7
    Data size control, 7
    Feature names, 8
    Filtered record color, 11
    Graphic image format, 12
    Graphics, 11
    Graphics resolution, 11
    Histogram bar color, 11
    Image background color, 11
    Image line color, 11
    Image save format, 7
    Input binary and categorical as continuous, 7
    Input string manipulation, 8
    Minimum sample size, 8
    Missing data codes, 6
    Model building strategy, 8
    Monte Carlo uncertainty analysis, 9
    Output options, 6
    Regression n-2-p ratio, 8
    Replace missing data, 7
    Report test results, 8
    Script editor font, 8
    Table fonts and colors, 9
    Text color, 11
Dependency
    Logistic regression, 122
    Multiple linear regression, 101
        ANOVA table, 103
        Cook’s distance, 108
        DFBETAS, 109
        DFFITS, 109
        Diagnostics, 106
        Jackknife residuals, 108
        Multicollinearity, 109
        Partial F-tests, 111
        Residuals, 108
        Standardized residuals, 108
        Variance inflation factors, 110
        Variance partitioning, 102
    Multivariate linear regression, 116
    Poisson regression, 136
    Polytomous logistic regression, 130
Distance
    Euclidean, 83
Distribution
    normal, 18
    Standard normal, 37

Editor, Filtering records, 33
Editor, Labels for categories, 30
Editor, Mathematical transforms, 31
Eigenvalues, 117
Euclidean distance, 83

Feature types
    Binary, 5
    Categorical, 5
    Continuous, 5
    Text, 5
File
    Category code map file, 16
    Commands, 15
    Export
        Comma-delimited (.csv), 16
        Tab-delimited text (.txt), 16
    Import
        Category codes, 16, 33
        Comma-delimited (.csv), 16
        Excel (xlsx, xls), 16
        Excel file (xlsx, xls), 20
        Tab-delimited text (.txt), 16
    Merge Excel files, 16
    Open Explorer project, 15
    Save Explorer project, 16
Font
    Preferences, 5
Fonts
    Script editor, 8

Geometric mean, 58
Geometric standard deviation, 58

Harmonic mean, 58
Harmonic standard deviation, 58
Heteroscedasticity, 49, 65
Homogeneity of variance test, 49
Homoscedasticity, 49
Hypothesis test
    Bartlett’s equality of variances, 49
    Chi-squared, 71
    Homogeneity of variance, 49
    Kruskal-Wallis, 62
    Mann-Whitney U, 53
    Proportions, two independent, 69
    T-test, 49
    T-test, assuming equal variances, 51
    T-test, assuming unequal variances, 52
    T-test, independent samples, 50
    T-test, paired samples, 66
    T-test, variances assumed unequal, 51, 52
    Welch ANOVA, 65
    Welch ANOVA test, 62

Import
    Category code file, 16, 33
    Comma-delimited file (.csv), 16
    Excel file (xlsx, xls), 16, 20
    Tab-delimited text file (.txt), 16

Jacobi method, 117

Kruskal-Wallis Test, 62
Kurtosis, 19

Likert scale, 13
Linear predictor, 107
Link function, 136
Logistic regression, 122

Mann-Whitney U Test, 53
Mathematical transforms
    ∗ , 31
    +  , 31
    − , 31
    ∕ , 31
    < , 31
    > , 31
    exp(x)  , 31
    ≥ , 31
    ≤ , 31
    log10(x)  , 31
    loge(x)  , 31
    √x, 31
    a^b, 31
Matrix
    Correlation, R, 94, 96
    Covariance, Σ  , 95
    Covariance, S, 94
    Data, multivariate, Y, 93
    Data, X, 17
    Determinant, 96
    Distance, D, 95
    Hessian, 125, 131, 157
    Identity, 97
    Regression
        Coefficient, β , 117
        Data, X, 101
        Hat, H, 108
        Variance-covariance of coefficients, V (β)  , 104
        Variance-covariance of coefficients, V (βj,βk)  , 117
    Standardized data, Z, 94
    Variance-covariance, 106
    Variance-covariance, S, 94
Matrix setup of data, 17
Mean, 18
Mean-zero standardization, 37
Median, 19

Newton-Raphson method, 125, 132, 157
Normal distribution, 18
Normal variates, 18
Normality test, see Shapiro-Wilk, 19
Normalization, 37

Paired t-test, 66
Percentiles, 40
Polytomous logistic regression, 130
Preferences, 5

Quantile, 19

Range, 19
Recoding features
    Fuzzification, 41
    Logarithm, 38
    Nominal-to-binary, 41
    Normalization, 37
    Percentiles, 40
    Quantiles, 39
    Ranks, 39
    Standardization, 37
    van der Waerden scores, 40
    Z-scores, 37
Regression
    Cox proportional hazards, 155
    Deviance, 126
    Diagnostics, 106, 157
    Goodness-of-fit, 125
    Logistic (polytomous), 130
    Logistic regression, 122
    Logistic, unconditional (binary), 123
    Multicollinearity, 109
    Multiple linear regression, 101
    Multivariate linear regression, 116
    Newton-Raphson method, 125, 132, 157
    Partial F-tests, 111
    Poisson, 136
    Polytomous logistic regression, 130
    Residuals, 108, 125
    Simple linear, 101

Screenshots
    Copying HTML output to clipboard, 15
    Copying images to clipboard, 15
Shapiro-Wilk normality test, 19
Singular value decomposition, 117
Skew-zero transform, see van der Waerden transform, 41
Skewness, 19
Standard deviation, 18
Standard normal distribution, 37
Standardization, 37
Survival analysis
    Breslow’s cumulative hazard, 158
    Cox proportional hazards regression, 155
    Cox-Snell residuals, 158
    Deviance residuals, 158
    Hazard function, 151
    Kaplan-Meier analysis, 151
    Log-rank test, 151
    Martingale residuals, 158
    Nelson-Aalen baseline cumulative hazard, 152
    Product limit estimate of survival, 151
    Scaled Schoenfeld residuals, 158
    Schoenfeld residuals, 158
    Stratum-specific baseline hazard, 159
    Survivorship function, 151
    Tests of PH assumption, 159

Tests of hypothesis
    2-by-2 table frequencies, 71
    ANOVA (one-way), 59
    Bartlett’s equality of variances, 49
    Categorical data, 71
    Chi-squared, 71
    Contingency tables, 71
    Kruskal-Wallis, 62
    Mann-Whitney U test, 53
    McNemar paired binary outcomes, 77
    Paired t-test, 66
    Permutation, 77
    Proportions, two independent, 69
    r-by-c table frequencies, 71
    Randomization, 77
    T-test, 49
    T-test, assuming equal variances, 51
    T-test, assuming unequal variances, 52
    T-test, independent samples, 50
    T-test, paired, 66
    Wilcoxon rank sum, 55
    Wilcoxon signed rank, 67
Text features, 5
Transformations
    Categorical-to-Class, 45
    Continuous-to-categorical, 42
    Fuzzification, 41
    Inside CV folds, 45
    Logarithm, 38
    Nominal-to-binary, 41
    Normalization, 37
    Percentiles, 40
    Quantiles, 39
    Ranks, 39
    Standardization, 37
    Text-to-categorical, 43, 44
    van der Waerden scores, 40
    Z-scores, 37, 38

van der Waerden scores, 40
Variance, 18

Welch ANOVA test, 62