Explorer - User’s Guide
Standard (Free) Edition
©NXG Logic, LLC.
October 7, 2019
©Copyright 2019 by NXG Logic, LLC.
All rights reserved. This book or any portion thereof may not be reproduced or used in any manner whatsoever without the express
written permission of the publisher except for the use of brief quotations in a user guide review.
Printed in the United States of America
ISBN: 978-1-5323-6817-2
United States Library of Congress - US Copyright Office Registration: 1-6268467801
NXG Logic, LLC.
Houston, Texas 77030
www.nxglogic.com
There are four types of features used in NXG Explorer: continuous, binary, categorical, and text. Feature type is determined when a dataset is input. To distinguish the three numeric types, NXG Explorer enumerates the number of unique values identified for each feature during data input. The number of unique values is compared with a default setting called the categorical threshold (default=10). If the number of unique feature values exceeds the categorical threshold, the feature is specified as continuous. If both the number of unique values and the maximum value are less than the categorical threshold, the feature is specified as categorical. However, if a feature has exactly two unique values with min=0 and max=1, the feature is specified as binary. Any time text is identified within an input feature's values, the feature is specified as a text feature.
The table below illustrates the feature types along with the graphic icon used during feature selection for all analyses.
NXG Explorer searches for a decimal in the input values, and if found, the feature is set to the continuous type.
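To make these typing rules concrete, the sketch below restates them in Python. It is only an illustration of the logic described above, not NXG Explorer's actual source code, and the function name and threshold argument are ours.

```python
def infer_feature_type(values, categorical_threshold=10):
    """Illustrative restatement of the feature-typing rules described above."""
    numeric = []
    for v in values:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            return "text"                      # any text value -> text feature
    if any(not x.is_integer() for x in numeric):
        return "continuous"                    # a decimal anywhere forces continuous
    unique = sorted(set(numeric))
    if unique == [0.0, 1.0]:
        return "binary"                        # exactly two values, 0 and 1
    if len(unique) > categorical_threshold or max(unique) >= categorical_threshold:
        return "continuous"                    # too many (or too large) unique values
    return "categorical"

print(infer_feature_type([1, 2, 3, 2, 1]))     # categorical
print(infer_feature_type([0, 1, 1, 0]))        # binary
print(infer_feature_type([1.5, 2.7, 3.1]))     # continuous
```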
An example of the feature codes for the Retinol dataset in Explorer is shown below:
From time to time, users will want to modify some of the default settings in NXG Explorer, such as heat map colors, text and line colors on charts, and text formatting during data input. To change default settings, select Preferences → General default settings as shown below:
and the following popup window will appear:
The following options available in the popup window are described below.
Remove spaces and special symbols on input. This filter provides a useful way to remove common errors in input files, such as empty spaces, dashes, Excel error codes, or special symbols. The identified data cells are set to null; no periods or replacement codes are inserted.
Missing data code. If a preset missing data code such as NA or 9999 was used during coding, it can be entered in this textbox. NXG Explorer will replace any table cells containing the missing data code with a null value.
Remove empty rows in Excel. This option will remove empty rows from the input dataset.
Required proportion of records with string entries for text features. This option helps guard against the input of messy features containing, for example, half text and half numeric values. If string characters are found in a feature’s values, then this proportion can be used as a threshold for determining which features are text-based. For example, when set to 0.8, if less than 80% of the input records for an input feature contain text, then the feature will not be identified as a text-based feature, and any records with text will be set to nothing. On the other hand, if more than 80% of the input values for this feature contain text, then the entire feature will be input as a text feature. Use of this threshold is a good way to enforce that true text-based features do indeed contain mostly text.
Empty output folder at program startup. Due to Windows security issues, especially in Windows 10, there are limitations on the folders which can freely be used for output by installed software. By default, at run-time, NXG Explorer will create an output folder called "NXG Explorer" in the user's Documents folder, which will contain all output files. When this option is selected, the entire folder will be deleted before each run, preventing the accumulation of old files from previous runs. We have used this option most of the time and have never run into any problems, knowing full well that if an output is to be kept, we specifically save outputs to working folders used for report-writing or manuscript preparation. Otherwise, if this option is not selected, the output files from all runs will be placed in this folder without being purged.
Remove logos. This option will remove NXG Logic logos from various output files.
Apply. When checked, will apply the following selected option (only one selection allowed).
With zeroes. Missing continuous values will be replaced with zeroes.
With feature means. Missing continuous values will be replaced with the average of the valid input values.
With feature medians. Missing continuous values will be replaced with the median of the valid input values.
Apply. When checked, will apply the following selected option (only one selection allowed).
With zeroes. Missing binary values will be replaced with zeroes.
With ones. Missing binary values will be replaced with ones.
Apply. When checked, will apply the following selected option (only one selection allowed).
With zeroes. Missing categorical values will be replaced with the lowest feature category value (level of the factor).
With ones. Missing categorical values will be replaced with the greatest feature category value (level of the factor).
Available formats. NXG Explorer available default image formats are png, wmf, jpg, gif, tif, bmp, emf, exi, ico, and memorybmp.
Max limit for display of input data points. For large datasets, limitations can be set on the maximum size of data which can be displayed in the visible datagrid.
Recode text categories into numeric codes on input. This option (true, false) will prompt the user with text-to-numeric conversion options during data input to assign numeric class membership to various levels of a categorical text field. If text is found in a feature’s value when set to true, NXG Explorer will tabulate the number of unique text strings found among the features data and prompt the user if the unique text words should be transformed into numeric class values. Therefore, the command will result in text-to-numeric class assignments when specified by the user. When set to false, and text is found in features upon input, the features will remain as text-based features.
Default level of categorical features: 10. This default value, known as the categorical threshold, forms the basis for determining which features are to be specified as categorical and continuous. The section above on “Feature types” describes how this threshold is used. Quite basically, if the number of unique feature values exceeds this default setting, the feature will be specified as a continuous feature, and categorical otherwise.
Input binary and categorical as continuous. When checked, this option will change the feature type from binary or categorical to continuous.
Minimum sample size for grouped tests. This setting specifies the smallest sample size allowed for grouped hypothesis tests. Below a sample size of 10, which for a 2-sample test averages 5 records per group, the robustness and reliability of a hypothesis test degrade as sample size decreases. Users can set this criterion to whatever value they choose.
Report reduced test results table. For any 2- or k-sample test of equality of means for continuous features, this option will cause NXG Explorer to output a reduced table of results based on the choices listed below. This option is particularly useful for presentations and publications, when not all of the results are needed in published tables. The more detailed tables generated when this option is not specified are more amenable to research and exploration, when more details are needed.
Regression: Minimum n-to-p ratio. This option determines the minimum allowable ratio of sample size (n) to the number of model features (p). If the ratio of the sample size to model-based features is less than the specified value, regression runs will be disabled. It is recommended that an n:p ratio of 10 be used, which follows the typical recommendation of having 10 records for every regression parameter used.
Model building strategy screen p-value. During regression, a univariate regression model whose p-value is less than the specified value will be included in the following multi-feature regression model.
Model building strategy table p-value. Regression covariates whose p-values are less than the specified value will be listed in the multi-feature regression output.
Iterations for R × C categorical tables - randomization tests (Fisher's Exact). This value determines the number of iterations used during an empirical p-value test for chi-squared contingency table analysis, in order to mimic the Fisher exact test. A default value of 100,000 is recommended; however, values of e.g. 10,000 should not result in unreliable results, based on Monte Carlo statistics.
Change font. This default setting changes the font typeface for the label and transform editor.
Font size. This default setting changes the font size for the label and transform editor.
Create unique feature names on input. If there are long feature names among the input features, this option can be beneficial, since many long feature names can be drastically shortened while remaining unique. By default, this option is turned off.
Input feature names as lower case. This option allows users to reduce any upper case characters among input feature names to lower case. Many software packages now work with lower case feature names as an attempt to get away from use of case-sensitive feature names.
Drop special characters from input feature names. When this option is selected, special characters such as #, $, %, etc. will be filtered out of input feature names.
Dimension reduction only. Class discovery runs include non-linear manifold learning (dimension reduction) techniques, which have varying output. When this option is specified, only the reduced 2D feature values are written to an Excel file, along with a 2D scatter plot of the scores.
Full output. This option will result in the full output from class discovery runs, which can include a variety of different charts and Excel files, depending on the algorithm used.
Use corr matrix for CMF output(Cov when unchecked). During PCA, component subtraction, and Monte Carlo analysis, when checked, the correlation matrix will be used, vs. use of the covariance matrix when unchecked.
Transform features during CV. When checked, this option will employ feature transformation over all objects within training folds, and apply parameters to objects in the testing fold.
Mean-zero standardization. Specifies that feature value mean-zero standardization will be used for feature transformation. Feature mean and s.d. will be derived from objects in the training folds, and will be applied to objects in the testing fold.
Normalize into range [0,1]. Specifies that feature value normalization to the range [0,1] will be used for feature transformation. The feature minimum and maximum will be derived from objects in the training folds, and will be applied to objects in the testing fold.
Discrete features are zero-based. When checked, this option will establish a zero base for any simulated discrete features.
Add distribution parameters to feature names. When checked, this option will append feature names with an abbreviated name of the probability distribution used as well as the parameter values.
This section applies to the format of the table with summary statistics (command: Analysis → Summarize) and hypothesis test results (command: Analysis → Independence → Unpaired 2- and k-sample tests). To specify the default table fonts and colors, select Preferences → Default table fonts and colors, as shown below:
and you will see the following popup window:
The various options in the popup window are described below.
Font. The font used for the entire table.
Font size. The font size used for the entire table.
Header font color. Color of the header text.
Row font color. Color of the text in rows.
Table border width. Width of the border.
Table border color. Color of the border.
Header border width. Width of the header’s border.
Header border color. Color of the header's border.
Cell border width. Width of cell border.
Cell border color. Color of the cell border.
Header background color. Background color of the header.
Odd row background color. Background color for odd rows. Note: row numbering starts with row 0 below the header row.
Even row background color. Background color for even rows.
Make header background same as odd row background. This option will set the background of the header to be the same as the background for odd rows.
Auto-color picker (base color). This color option allows the user to set the base color for the entire table.
To change color and graphics default values, select Preferences → Color and graphic defaults, shown below:
and the following popup window will appear:
Low resolution. This option results in low resolution graphics near 96 dpi.
High resolution. This option results in higher resolution graphics at 300 dpi.
Histogram bar color. This default color option specifies the color choice for histogram bars.
Histogram spline color. The default color of the line when spline fitting is specified for histograms.
Histogram bins. The default value of the number of histogram bins.
Datagrid selection color. This default option specifies the color of either the alternating row colors in the datagrid, or selected rows after record filtering has been invoked.
Gradient 1st color. This gradient color defines the color starting at the top of the cluster image.
Gradient 2nd color. This gradient color defines the color at the bottom of the cluster image.
Line color. This color specifies the color of the dendrograms for objects and features.
Font color. This color specifies the color of all object and feature names, as well as header information regarding the distance metric used and scale for heat map colors. Note that if “Alternate object name colors” is checked, then those alternating colors override the default text font color.
Alternate object name colors. This option only applies when the true classes are input with the dataset. When checked, the color of object names within classes will vary in the order blue, green, red, and black. Modulo 4 is used for the object’s class numbers. If no classes are input with the data, then the default font color will be applied.
Missing values. This color specifies the heat map cell colors for missing values.
This default setting specifies the default color gradient used for the U-matrix and feature-specific component plots, which are generated when self-organizing maps are used.
This default setting specifies the default color combination used for heat maps in cluster images.
Some example cluster heat map results are shown as follows:
Default image (save) format. This default setting specifies the file format for all graphic output files. The choices are png, wmf, jpg, gif, tif, bmp, emf, exi, ico, and memorybmp.
Windows color palettes. Numerous Windows-based color palettes are available for use with 2D, 3D, line, and X-Y scatter plots. If object classes are specified at input, then classes will be assigned the series colors which are selected as default. Otherwise, in YYYY and XYYY line or scatter plots, the series-specific colors will be applied to the features or dimensions used.
Custom color palette. You can specify which default series colors are to be used by clicking on each series color, and then selecting the desired color. After clicking on Apply, the default custom colors will be applied, as long as the option above the custom colors is selected.
You can specify which default series colors are to be used for class-specific colors in 2D and 3D plots generated during unsupervised class discovery.
Default series colors are also available for object class in the multichart plot generated at the end of unsupervised class discovery.
Default fonts can be selected for class discovery plots, as well as line and scatter plots (YYYY, XYYY), and multipanel plots.
Line only. This selection will result in lines between data points in YYYY and XYYY line (scatter) plots.
Points only. This selection will result in points being plotted in YYYY and XYYY line (scatter) plots.
Line and points. This selection will result in lines and points being plotted in YYYY and XYYY line (scatter) plots.
X- and Y-axis thickness. This selection specifies the X- and Y-axis line thickness in YYYY and XYYY line (scatter) plots.
Changing categorical features to continuous. On occasion, you may need to change categorical features to continuous. As an example, suppose you have integer-based scores on a Likert scale for which the item response is 1-Strongly disagree, 2-Disagree, 3-Neutral, 4-Agree, 5-Strongly agree. By default, this particular feature will be read in as a categorical type; however, for the analysis you may want to use means and standard deviations, perhaps for t-tests, ANOVA, or regression. There are two ways to change categorical features to continuous:
Then click on Apply. You will need to reload the input data, and then, for each text feature, NXG Explorer will code the feature as continuous.
The following popup window will appear:
Simply click on continuous for the categorical features you want changed.
Changing text features to categorical. You cannot use the popup window above to change text features to categorical. Rather, this has to be done in one of two ways:
Then click on Apply. You will need to reload the input data, and then, for each text feature, NXG Explorer will identify the number of unique categories based on the text values, and prompt you to convert to categorical for each text feature found in the data. (If you have a lot of text features, this method is not recommended, see the next option).
Then select the text feature you want converted to categorical and NXG Explorer will identify the number of unique values among the text values and convert the feature to categorical.
Right-clicking on HTML output. Whenever the Output tab is activated after clicking on a treenode that was generated during a run, a region of interest can be selected for copying and pasting by right-clicking anywhere in the HTML output and then dragging the rectangular "rubber band" around the region of interest. As soon as the mouse button is released, the selected ROI will be copied into memory and can be pasted into any software that accepts Windows objects (Word, PowerPoint, etc.).
Shift-Right-clicking on HTML output. If the Shift key is pressed and then the mouse pointer is right-clicked on an HTML output, the Windows printer options will appear.
Right-clicking on any output image will raise a popup window, which will show the following choices:
If Region of interest is selected for copying, then click on the left mouse button and drag the rectangular "rubber band" around the region of interest. As soon as the left mouse button is released, the selected ROI will be copied to the Windows clipboard and can be pasted into any software that accepts Windows objects (Word, PowerPoint, etc.). If Copy is selected, the entire image will be copied to the Windows clipboard, which can then be pasted into other Windows applications that accept Windows objects. Otherwise, if Save is selected, several image format options will appear before saving the image.
This option opens an already-saved Explorer project, for which the filename extension is .nxgp.
This option saves an already-saved Explorer project, for which the filename extension is .nxgp.
This option saves a dataset in memory to an Explorer project, for which the filename extension is .nxgp.
Excel. This option imports either an Excel file (.xlsx, or .xls), comma-delimited file (.csv), or a tab-delimited text file (.txt).
Category codes. This option allows the import of a tab-delimited text file (.txt) to affix category labels to levels of the factors for which labels are provided. The command can be invoked (used) before or after data input. See the catcodes_Retinol.txt file for an example of category label formats for the three categorical factors in the Retinol dataset (sex, smoking status, and vitamin use).
Category code map. This option allows the import of a tab-delimited text file (.txt) with categorical feature names followed by a mapped term which is used in the dataset. For example, consider the following example categorical features and their counterpart mapped terms which are used in a dataset:
Factor | Feature name | Mapped name |
Arm | arm | armcode |
Sex | sex | sexcode |
Smoking status | smokstat | smokstatcode |
In the case of the Arm factor, the dataset contains the mapped term armcode, however, the user wants to use a feature name arm in all the analyses. Use of this command option will allow the user to translate a mapped term to any desired feature name.
Merge Excel. Use this option for merging multiple files together. Since 1-to-1 and 1-to-many merges cannot be specified, you should import the largest file (with a primary key) first, followed by smaller files (with primary keys). In this fashion, many-to-fewer merges can be applied successively to achieve the desired final merged file.
Comma-delimited (.csv). Selecting this option will export data in memory to a comma-delimited .csv file, for which the name must be provided.
Tab-delimited text (.txt). Selecting this option will export data in memory to a tab-delimited text (.txt) file, for which the name must be provided.
This chapter covers NXG Explorer’s Summarize methods for summarization of variables in a dataset. Construction of a typical dataset is first discussed, which is followed by measures of central tendency for continuously-scaled variables, followed by a brief discussion of categorical data analyses.
A set of continuously-scaled data are often arranged into individual records and variables. A record represents an individual measurement or observation made from one unit of study, such as a patient, mouse, cell, gene, or protein. Variables, on the other hand, represent the collection of observations for all the records. When arranging the data in tabular form, records are arranged in rows, where the index $i$ is used to denote each record, with range $i = 1, 2, \ldots, n$. Variables are arranged in columns, where the index $j$ is used to denote each variable, with range $j = 1, 2, \ldots, p$. Let us consider the random variable $x_{ij}$ for representing a single observation, where the subscript $i$ denotes the $i$th record and subscript $j$ denotes the $j$th variable. A collection of multiple $x_{ij}$'s for many observations can be arranged in tabular form according to the arrangement in Table 2.1. The tabular arrangement of data in Table 2.1 can also be represented by use of a matrix. For $n$ records and $p$ variables, we introduce the matrix X in the form:
$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \qquad (2.1)$$
Each row in X represents a record, and each column represents a variable. A matrix is characterized, among other things, by its dimension. The dimension of X for our example is $n \times p$, called "$n$ by $p$," representing the $n$ records and $p$ variables. Each element, denoted $x_{ij}$, represents the observation for the $i$th record and the $j$th variable in the data set. Because the matrix notation can be used to represent any record or variable in the data set, matrices occupy a very powerful area of statistics based on what is known as matrix algebra. As we shall see in the remaining chapters, matrix algebra is fundamental to many areas of statistics, including analysis of variance, maximum likelihood modeling, etc.
However, before we address more advanced topics, let's stick with methods used for describing the distribution of a random variable like $x$. When considering a set of data, independent of study design or analysis, a fundamental question among all statisticians is: "How are the data distributed?" With regard to scale of measurement, are the data categorical and measured on the nominal scale, ordinal scale, or are the data continuous and measured on the ratio scale? In terms of probability distributions, when the data are arranged into a histogram, are they normally distributed with a bell-curve shape, distributed according to the binomial distribution, Poisson distribution, or multinomial distribution? A frequency histogram (distribution) of most data used in statistics can be characterized by a specific probability distribution that captures information related to the scale of measurement, observation response, and type of variable.
There are many kinds of continuous distributions. The one we will be most concerned with is called the normal distribution. Normal distributions are also known as Gaussian distributions, and consist of a sample of normal variates. A frequency histogram of normal variates is typically bell-shaped, with the bulk of the data lying in the middle of the curve and smaller, less frequent data residing in the vanishing tails. The particular shape of a histogram is characterized by its central tendency and dispersion.
For normal distributions, two parameters used to distinguish central tendency and dispersion are the mean, or average, and the standard deviation. Let $x_1, x_2, \ldots, x_n$ be a random sample of size $n$ from the distribution of random variable $x$. The mean of the variates is denoted $\bar{x}$ and is a measure of central tendency that reflects the average value of $x$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (2.2)$$
The mean is also equal to the expected value of $x$, in the form
| (2.3) |
Terminology used for the mean is $\mu$ for a population and $\bar{x}$ for a sample. Typically $\bar{x}$ is used, since one usually never knows the true population mean, but rather an estimate of the sample mean obtained through some kind of sampling strategy. The sample variance of variable $x$ is a measure of dispersion reflecting the overall width of the distribution, and takes the form
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad (2.4)$$
and the standard deviation of variable $x$ is simply the square root of the variance, functionally composed as

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (2.5)$$
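As a quick check of Eqs. (2.2), (2.4), and (2.5), the short Python sketch below computes the sample mean, variance, and standard deviation for an arbitrary list of values; the function name and example data are ours.

```python
import math

def summary_stats(x):
    """Sample mean, variance (n-1 denominator), and standard deviation."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    return mean, variance, math.sqrt(variance)

mean, var, sd = summary_stats([5, 7, 9, 10, 11, 17, 23, 27, 31, 62])
print(f"mean={mean:.2f}, variance={var:.2f}, sd={sd:.2f}")
```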
The variance can also be determined by using expected values. The expected value of a random variable is
| (2.6) |
and the variance of the random variable becomes
| (2.7) |
Note that there was a shift in the notation used for several of the parameters introduced above. Various forms of notation are used interchangeably. These are described as follows.
| (2.8) |
There are several other important summary statistics for the normal distribution, which are the range, median, mode, skewness, and kurtosis. The range can be simply determined by use of the relationship
$$\text{range} = x_{\max} - x_{\min} \qquad (2.9)$$
The range of a distribution is informative about how far apart the most extreme data points are, which is useful when comparing one variable with another. For example, if $x_{\max}$ and $x_{\min}$ for variable $x$ are 2.53 and 0.4, the range is 2.13; whereas, if for variable $y$ the values of $x_{\max}$ and $x_{\min}$ are 79.6 and 21.2, the range is 58.4. A large difference in range is also known as a large change in scale, affecting the mean difference and covariance between two variables. Another statistic of interest is the median, which reflects the value of a variable at the midpoint of the distribution. The median can easily be found by sorting the data in a distribution in ascending order and identifying the midpoint. Leaving the variable subscript aside, let $x_1, x_2, \ldots, x_n$ be a set of normal variates and $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ be their rank-ordered counterpart. The median is equal to the middle rank-ordered variate. Definition of the median introduces a new concept called the $q$th quantile. A quantile is any value among the rank-ordered variates $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$. Consider 100 rank-ordered normal variates such that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(100)}$. The lower quartile is the 25th percentile, which is equal to $x_{(25)}$. As mentioned previously, the median is the 50th percentile and for this example would be equal to $x_{(50)}$. The upper quartile is the 75th percentile and is equal to $x_{(75)}$.
In cases when the most frequent value is sought, the mode is determined, which is the most frequently occurring value. The mode occurs at the peak of the distribution, since the $x$-value at this location is the most frequent.
We now introduce the $r$th moment about the mean of a distribution. The $r$th moment about the mean is defined as

$$m_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r \qquad (2.10)$$
The coefficient of skewness is

$$\sqrt{b_1} = \frac{m_3}{m_2^{3/2}} \qquad (2.11)$$
and the coefficient of kurtosis is
$$b_2 = \frac{m_4}{m_2^2} \qquad (2.12)$$
Shapiro-Wilk Normality Test. NXG Logic Explorer also performs a normality test on each continuously-scaled feature to determine if the feature is normally-distributed. The Shapiro-Wilk test [1] first sorts the values of each feature in ascending order: $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Next, a standard normal variate $m_i$ is generated for each $x_{(i)}$ based on its rank-based percentile; these are then squared and summed to give
| (2.13) |
Next, calculate , and determine a series of values using the relationships:
| (2.14) |
| (2.15) |
and finally, for , we use
| (2.16) |
where
| (2.17) |
The lower values on the other end of the scale are , and . The statistic is calculated as
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (2.18)$$
The Shapiro-Wilk test statistic is standard normal distributed with
| (2.19) |
where
| (2.20) |
and
| (2.21) |
Extreme values of the standardized test statistic indicate departure of the data being tested from the normal distribution. Hence, if the p-value of the Shapiro-Wilk test is less than 0.05, the test result indicates that the feature being tested is not normally-distributed.
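Explorer runs this test automatically for each continuous feature. To reproduce the same decision rule outside Explorer, one option is SciPy's implementation of the Shapiro-Wilk test; the sketch below assumes only a numeric sample and the conventional 0.05 cutoff.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(size=200)            # a deliberately skewed (non-normal) sample

w, p = stats.shapiro(x)                # Shapiro-Wilk W statistic and p-value
if p < 0.05:
    print(f"W={w:.3f}, p={p:.4f}: evidence against normality")
else:
    print(f"W={w:.3f}, p={p:.4f}: no evidence against normality")
```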
Example 1 - Summary Statistics. This example involves importing a Microsoft Excel file into NXG Explorer and then generating summary statistics for the variables in the file. Open the Retinol dataset, which includes records for research subjects who participated in a nutritional investigation of the antioxidants retinol and beta-carotene.
First, import the retinol.xlsx file in the Data directory where NXG Explorer is installed. This is done by selecting File → Import → Excel, and then selecting the Retinol.xlsx file.
Another popup window will appear which allows the user to specify the input format of the Excel workbook which is being imported. Let the default format be the choice, then click Apply:
After the Retinol.xlsx filename is selected and double-clicked, the data tab will show the data that were input (see image below).
Next, select the Summarize command in the Analysis pull-down menu:
You will now see a list of variables,
Click on Select all:
You will then notice check marks placed in all of the checkboxes,
Next, click on the Apply button, and the summary statistics will be generated, revealing a tree-node in the treeview on the left:
Click on the Summary statistics treenode, and the results will be shown in the Output tab:
Click on the Category frequencies icon, and a listing of categorical summary statistics will be shown for the categorical features:
Next, click on the Histogram icon, and you will notice a histogram plot that was generated for each feature:
To manually generate histograms for all the continuous variables, click on the Histogram command of the Graphics pull-down menu:
Next, uncheck the categorical variables representing sex, smoking status (smokstat), and vitamin use (vituse):
Click on Apply in the next popup window:
Select the betadiet variable (treenode) from the treeview:
The following plot shows the histogram for the betadiet variable, revealing a log-normally distributed variable with right-tail skewness.
[1] Royston, P. An extension of Shapiro and Wilk's W test for normality to large samples. Applied Statistics, 31: 115–124; 1982.
NXG Explorer provides a label and mathematical transform editor which can be employed to assign category labels to the various levels of a nominal or ordinal categorical feature. The editor can also be used for creating new features based on mathematical transforms of existing features, or dummy indicator (binary) features with values (0,1).
The editor is invoked by selecting the Labels and transforms command of the Preferences pull-down menu.
The image below shows the editor after it opens.
Next, navigate to the C:\Program Files\NXG Logic\NXG Logic Explorer\Scripts folder and open the script file called retinol.nxgs, as shown by
In the image shown below, after opening the script file, you will notice any existing script within the file.
The scripts are divided into two sections that begin with either labels and/or trans, followed by the relevant syntax used in the specific section.
Recall, the categorical features in the Retinol dataset are as follows:
Variable | Explanation (Unit) |
sex | Sex (1=Male, 2=Female) |
smokstat | Smoking status (1=Never, 2=Former, 3=Current Smoker) |
vituse | Vitamin Use (1=Yes, 2=Sometimes, 3=No) |
Labels for various levels of categorical factors (features) can be assigned quite simply by entering the level of the factor and listing the name of the category (level) immediately after, as shown below:
Number of categories | Command |
2 | catvarname^1 cat1label^2 cat2label |
3 | catvarname^1 cat1label^2 cat2label^3 cat3label |
4 | catvarname^1 cat1label^2 cat2label^3 cat3label^4 cat4label |
5 | catvarname^1 cat1label^2 cat2label^3 cat3label^4 cat4label^5 cat5label |
k | catvarname^1 cat1label^2 ... ^k catklabel |
An example for the categorical factors in the Retinol dataset for smoking status, vitamin use, and gender, would be:
To invoke (implement) the label and transformations listed in the table above, open the Retinol dataset, open the script file, and then select Run.
The labeling results can be observed in the following image which reveals the category labels in an ANOVA test of several continuously-scaled and categorical variables in the Retinol dataset on the 3-category feature vituse.
Examples of mathematical transformations available in NXG Explorer are listed in the following table:
Type | Command | Result |
Binary | male=sex==1; | males=1, females=0 |
 | female=sex==2; | females=1, males=0 |
 | smok1=smokstat==1; | smokstat(1)=1, other=0 |
 | smok2=smokstat==2; | smokstat(2)=1, other=0 |
 | smok3=smokstat==3; | smokstat(3)=1, other=0 |
 | agegt70=age>70; | age>70 = 1, age<=70 = 0 |
 | agege70=age>=70; | age>=70 = 1, age<70 = 0 |
 | agelt70=age<70; | age<70 = 1, age>=70 = 0 |
 | agele70=age<=70; | age<=70 = 1, age>70 = 0 |
Math | ageplus5=age+5; | age+5 |
 | ageminus5=age-5; | age-5 |
 | agetimes5=age*5; | age*5 |
 | ageover5=age/5; | age/5 |
^2 | agesq=age^2; | age*age |
sqrt | agesqrt=sqrt(age); | sqrt(age) |
log | logage=log(age); | log_e(age) |
log10 | log10age=log10(age); | log_10(age) |
exp | expage=exp(age); | exp(age) |
Combination | femagegt70=female*agegt70; | females with age>70 = 1, other = 0 |
 | malesmok3=male*smok3; | males with smokstat(3)=1 set to 1, other = 0 |
Examples of new features are:
Results of several of the mathematical feature transformations are listed in the figure below.
Another method for inputting and affixing category labels is to import them from a text file directly, instead of using the Editor.
Example 1 - Assign category labels to the Retinol dataset. First, import the retinol.xlsx file in the Data folder where Explorer is installed. Next, import the category-label text file distributed with Explorer, using the pull-down menu command FileImport DataCategory codes (labels), and then select the catcodes_Retinol.txt file. Once the file is imported, the category labels will be assigned to the levels of the categorical features (sex, smoking status, and vitamin intake). You will now be able to see the categorical labels in most of the analyses which use categorical features.
Consider the case where only females in a dataset are to be filtered out and used for hypothesis testing. The first effort should entail filtering on a gender-related feature. Let’s use the Retinol dataset, for which the gender feature is called sex.
To get started, open the Filter form by selecting Filter on the Analysis pull-down menu:
Next, specify the sex categorical feature and level 2, then click on Apply.
The filtered records will then be highlighted in the datagrid, and also will be selected in memory for future use.
Summary statistics for female records in the Retinol data set are shown below:
NXG Explorer can perform grouped analysis of most of the analytic modules. Let’s look at an example involving summary statistics of the Retinol dataset, grouped on smoking status. For this example, we will also label the smoking categories using methods outlined earlier in this chapter.
Example 2 - Summary statistics of Retinol dataset, grouped by smoking status.
1. To begin, import the Retinol dataset.
2. Apply the labels provided in the retinol.nxgs file (see earlier labeling example).
3. Select Analysis → Grouped Analysis, as shown in the image below:
4. Next, select the smokstat feature, shown as:
5. Click on Apply, and you will notice that a red background is now shown in the warning textbox to the upper right of the user screen, shown as:
6. In the next popup window that appears, select the features age, quetelet, calories, fat, and fiber.
7. When the run is complete, you will notice three red icons representing the three smoking status groups, with the accompanying group names and categorical codes listed in parentheses next to each red icon:
Click on the Summary statistics icon beneath the red icon with label “Summarize(smokstat=3),” and the summary statistics table for current smokers will be shown as follows:
When using Explorer, filtering and grouping do not work together. Filtering records should only be followed by non-grouped analyses such as summary statistics, correlation, or regression after the specified records have been selected. Grouped analysis, on the other hand, can be used for 2- and k-group hypothesis testing, summary statistics, regression, and Kaplan-Meier grouped analysis.
In summary: use the following sequence:
Filtering (gender=males): can run summary statistics, correlation, regression, but not independence tests (T-tests, MW, ANOVA, KW) or KM grouped survival analysis.
Grouped analysis (gender): can run summary statistics, independence (T-tests, MW, ANOVA, KW), regression, correlation, and KM survival analysis.
For example, if you want to run ANOVA separately by gender, then perform a grouped analysis using gender as the grouping variable, and then run ANOVA using e.g. treatment arm as the categorical factor, and age, BMI, and protein expression as continuous features.
The range and scale of a feature can have a large impact on the results of an analysis, especially if the heterogeneity of range and scale among the inputs are widely varying. NXG Explorer can transform features (variables) using various preset methods to remove skewness or to change the scale and range.
Once a dataset is opened, the preset transformations can be accessed from the Transformations command of the Analysis pull-down menu, as shown below.
The following sections describe the various preset transformations provided in NXG Explorer.
Normalization re-scales the values of a continuous feature to be in the range [0,1] using the relationship
$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \qquad (4.1)$$

where $x_i$ is the $i$th value of the feature, $x_{\min}$ is the minimum value of the feature, and $x_{\max}$ is the maximum.
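A minimal sketch of Eq. (4.1) in Python (the function name is ours):

```python
def normalize_01(x):
    """Re-scale values into [0,1]: (x_i - min) / (max - min), per Eq. (4.1)."""
    x_min, x_max = min(x), max(x)
    return [(xi - x_min) / (x_max - x_min) for xi in x]

print(normalize_01([5, 7, 9, 10, 62]))   # 5 maps to 0.0 and 62 maps to 1.0
```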
Normal distributions can vary by central location and dispersion based on the mean and variance. A frequency histogram for a group of patients having a mean weight of 130 pounds and standard deviation 12, will be centrally located 70 units (pounds) to the left of a histogram for patients with mean weight 200 pounds and standard deviation 12, when plotted on the same graph. Distributions with varying means and standard deviations are called scale and range dependent, since their location and spread on the same histogram will be different. A powerful transformation which can be applied to any normal distribution to remove the effects of scale and range is called the Z-transformation.
Let $\bar{x}$ and $s$ be the mean and standard deviation of a normal distribution of sample size $n$ = 1,000. The $Z$-score for each of the normal variates is given as

$$z_i = \frac{x_i - \bar{x}}{s} \qquad (4.2)$$
where $x_i$ represents a normal variate, and $z_i$ is a newly derived standard normal variate. The process of subtracting the mean from each normal variate and dividing by the standard deviation is called standardization, which results in a standard normal distribution. A common property of all standard normal distributions is that they all have a mean equal to zero and a standard deviation of one, with variance equal to one ($\sigma^2 = 1$). In addition, the Z-score equates to the number of standard deviations each normal variate lies away from the mean. Large values of $z$ above 2 are said to be "2-sigma" away from the mean, and are commonly considered as extreme values. For a tail probability of 0.05, the value of $z$ in the negative tail is -1.65, whereas in the positive tail the value of $z$ is 1.65. When the tail probability is divided by 2, that is 0.05/2 = 0.025, the values of $z$ in the negative and positive tails become -1.96 and 1.96, respectively. A probabilistic definition of tail probabilities is that they represent the probability that a given value of $z$ is greater than or equal to a reference value of $z$. For example, $P(z \ge 1.65) = 0.05$ and $P(z \ge 1.96) = 0.025$. Analogously, $P(z \le -1.65) = 0.05$ and $P(z \le -1.96) = 0.025$.
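The standardization in Eq. (4.2) can be written out directly; note, as discussed below, that it does not remove skewness. The function name and example values here are ours.

```python
import statistics

def z_scores(x):
    """Standardize values to mean 0 and s.d. 1 (Eq. 4.2)."""
    mean = statistics.mean(x)
    sd = statistics.stdev(x)                 # sample s.d., n-1 denominator
    return [(xi - mean) / sd for xi in x]

z = z_scores([5, 7, 9, 10, 11, 17, 23, 27, 31, 62])
print([round(zi, 2) for zi in z])            # values above 2 are "2-sigma" extremes
```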
Standard normal distributions also form the basis of many areas in statistics, including
It is important to note that Z-transformations do not remove skewness from an input distribution. If the feature(s) being transformed is skewed, then the resulting Z-scores will also contain the same degree of skewness. To remove skewness, the log-transform is typically used for log-normally-distributed data, or the van der Waerden transform is used, which always removes skewness.
The logarithmic transform is one of the better transforms for rescaling a skewed dataset with a right tail so that it becomes more normally-distributed. As an example, in the Retinol dataset, the distribution of RETDIET is as follows:
whereas the log-transform (base e) of the RETDIET is illustrated as follows:
and it can be seen that the transformed version of RETDIET is more normally-distributed since large outlier values of RETDIET in the right tail of the untransformed RETDIET are essentially removed after taking the log_e transform.
The quantile transform will first determine the percentile value for each observation based on each value's rank divided by (n+1), and then generate a new variable whose codes represent the quantile category that each percentile falls into. For example, if 4 quartiles (the default) are specified, then any original percentile values less than or equal to 25% will be assigned a 1, percentiles greater than 25% and less than or equal to 50% will be coded as a 2, percentiles greater than 50% and less than or equal to 75% will be coded as a 3, and the remaining percentiles greater than 75% will be coded as a 4.
The rank transform will arrange the values of a variable in ascending order (smallest first) and then assign each value the record number corresponding to its sort position. For example, if there are 100 values for a variable, then the original variable values will take on ranks 1, 2, ..., 100. A common approach for displaying values of $x$ that are sorted in ascending order is to use the notation $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, where the parenthesized subscript $(i)$ denotes ranking. Another common notation to reflect ranked variate values is Rank($x$).
The table below lists the ranks for 10 x-values sorted in ascending order.
x | Rank(x) |
5 | 1 |
7 | 2 |
9 | 3 |
10 | 4 |
11 | 5 |
17 | 6 |
23 | 7 |
27 | 8 |
31 | 9 |
62 | 10 |
The percentile value for each observation is equal to the value’s rank divided by (n+1). Percentiles reveal the location of each value on a scale of 0 to 1. Therefore, the p5 is the 5th percentile, p10 is the 10th percentile, p25 is the 25th percentile, p50 is the median or 50th percentile, p75 the 75th percentile, p90 the 90th percentile, and p95 the 95th percentile. The table below lists the percentiles for 10 x-values sorted in ascending order.
x | Rank(x) | Percentile |
5 | 1 | 0.091 |
7 | 2 | 0.182 |
9 | 3 | 0.273 |
10 | 4 | 0.364 |
11 | 5 | 0.455 |
17 | 6 | 0.545 |
23 | 7 | 0.636 |
27 | 8 | 0.727 |
31 | 9 | 0.818 |
62 | 10 | 0.909 |
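The rank and percentile columns above can be reproduced with a few lines of Python; ties are ignored here for simplicity, and the function name is ours.

```python
def ranks_and_percentiles(x):
    """Ascending ranks and percentiles computed as rank/(n+1)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks, [r / (n + 1) for r in ranks]

x = [5, 7, 9, 10, 11, 17, 23, 27, 31, 62]
ranks, pct = ranks_and_percentiles(x)
print(ranks)                         # 1 .. 10
print([round(p, 3) for p in pct])    # 0.091, 0.182, ..., 0.909
```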
van der Waerden (vdw) scores transform a variable’s values into skew-zero standard normal variates for which the mean is zero and standard deviation is unity (one). Recall that the normal cdf function converts an input Z-score into a tail probability using the normal cdf, whereas the inverse cdf function maps the probability back to the Z-score, as listed in the following table:
Function | Notation | Conversion |
CDF | $\Phi(z)$ | Z-score to tail probability ($p$) |
Inverse CDF | $\Phi^{-1}(p)$ | Tail probability ($p$) to Z-score |
The steps involved in calculating the vdw score for each value of a variable are: (1) rank the values in ascending order, (2) compute each value's percentile as its rank divided by (n+1), and (3) apply the inverse normal CDF to the percentile to obtain the vdw score.
The table below lists the VDW scores for 10 x-values sorted in ascending order.
x | Rank(x) | Percentile | VDW score |
5 | 1 | 0.091 | -1.335 |
7 | 2 | 0.182 | -0.908 |
9 | 3 | 0.273 | -0.605 |
10 | 4 | 0.364 | -0.349 |
11 | 5 | 0.455 | -0.114 |
17 | 6 | 0.545 | 0.114 |
23 | 7 | 0.636 | 0.349 |
27 | 8 | 0.727 | 0.605 |
31 | 9 | 0.818 | 0.908 |
62 | 10 | 0.909 | 1.335 |
The VDW transform is a skew-zero transform, since it always removes skewness that may exist in the input distribution.
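A sketch of the van der Waerden transform using SciPy's inverse normal CDF; it reproduces the scores in the table above (ties are ignored, and the function name is ours).

```python
from scipy.stats import norm

def vdw_scores(x):
    """van der Waerden scores: inverse normal CDF of rank/(n+1)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = float(norm.ppf(rank / (n + 1)))
    return scores

scores = vdw_scores([5, 7, 9, 10, 11, 17, 23, 27, 31, 62])
print([round(s, 3) for s in scores])   # -1.335, -0.908, ..., 0.908, 1.335
```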
The nominal-to-binary transform creates a new binary yes/no variable for each level of a nominally-scaled categorical variable. For example, if there are 3 levels of a category, 3 new variables will be generated, and for each record only one of the three new variables will be set to 1 (0 otherwise), corresponding to the category of the particular record. The table below lists the nominal-to-binary coding results for a fictitious 3-level categorical variable called "catvar," and a short sketch of the coding follows the table.
catvar | level 1 | level 2 | level 3 |
3 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
2 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
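A minimal sketch of this one-hot coding in Python; the function name and example values are ours.

```python
def nominal_to_binary(values):
    """One 0/1 indicator column per observed level of a nominal variable."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}

catvar = [3, 1, 2, 2, 1, 1, 2, 3, 1, 2]
for level, column in nominal_to_binary(catvar).items():
    print(f"catvar level {level}:", column)
```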
Fuzzy logic provides a mixture of methods for flexible information processing of ambiguous data [1, 2, 3]. Fuzzy transformations are used to map the original values of an input feature into 3 fuzzy sets representing linguistic membership functions, in order to facilitate the semantic interpretation of each fuzzy set. The fuzzy sets low, medium, and high exploit uncertainty among the original feature values, reducing the information in order to obtain a robust, less-expensive, and tractable solution. To fuzzify a continuous feature, first determine $x_{\min}$ and $x_{\max}$ as the minimum and maximum values of feature $x$ over all input samples, and $q_{33}$ and $q_{66}$ as the quantile values of $x$ at the 33rd and 66th percentiles. Then, calculate averages from these boundary values. Next, translate each value of feature $x$ into 3 fuzzy membership values in the range [0,1], $\mu_{low}(x)$, $\mu_{med}(x)$, and $\mu_{high}(x)$, using the relationships
| (4.3) |
| (4.4) |
| (4.5) |
The above computations result in 3 fuzzy sets (vectors), $\mu_{low}$, $\mu_{med}$, and $\mu_{high}$, each of length $n$, which replace the original input feature. Figure 4.1 illustrates the values of the membership functions as a function of $x$. When features are fuzzified using the methods described above and then input into class prediction analysis, the classification analysis is called "fuzzy," whereas without feature fuzzification the classification analysis is called "crisp."
This option in the feature transformation module allows one to convert continuously-scaled features to categorical using cutpoints and lower and upper boundary limits. Below is a snapshot of the popup window that appears when performing the continuous to categorical transformations:
The various options are described below.
Cutpoints are lower bounds. For example, if age is the feature, cutpoints 20,30,40 will result in category codes 1 for <=19, 2 for 20-29, 3 for 30-39, and 4 for 40+ (see the sketch following these options).
Cutpoints are upper bounds. For example, if age is the feature, cutpoints 20,30,40 will result in category codes 1 for <=20, 2 for 21-30, 3 for 31-40, and 4 for 41+.
Apply lower limit (30-39, .... This option will enforce a lower bound to the categorization.
Apply upper limit ....51-60). This option will enforce an upper bound to the categorization.
Use real-valued cutpoints (integer is default). This option replaces the default integer-based interval notation in the category labels with real-valued cutpoint notation, and labels the open-ended upper interval as, e.g., "60+."
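To make the two cutpoint interpretations concrete, here is a small Python sketch using the standard library's bisect module; it illustrates the rule described above and is not Explorer's code.

```python
import bisect

def categorize(x, cutpoints, lower_bounds=True):
    """Assign category codes 1..k+1 from k cutpoints."""
    if lower_bounds:
        # cutpoints 20,30,40: <20 -> 1, 20-29 -> 2, 30-39 -> 3, 40+ -> 4
        return [bisect.bisect_right(cutpoints, xi) + 1 for xi in x]
    # cutpoints 20,30,40: <=20 -> 1, 21-30 -> 2, 31-40 -> 3, 41+ -> 4
    return [bisect.bisect_left(cutpoints, xi) + 1 for xi in x]

ages = [19, 20, 35, 41]
print(categorize(ages, [20, 30, 40]))                       # [1, 2, 3, 4]
print(categorize(ages, [20, 30, 40], lower_bounds=False))   # [1, 1, 3, 4]
```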
This feature transformation option will prompt the user to convert all text features in the input dataset. As an example, import the Fisher_Iris_wo_class.xlsx file, and then select Transformations → Text-to-Categorical (converts all text features) from the command pull-down menu as shown in the following image:
Since the class feature is a text feature, its elements will be loaded into a form showing all words found along with their frequencies.
Then click on the Set as Categorical button, and the categorical levels of the class text feature will be transformed into class categorical values of 1, 2, and 3. NXG Logic Explorer will also automatically copy the original text value for each category into memory, so that the labels can be output during analyses. As an example, the image below shows results of running ANOVA on the 4 continuous features in the Fisher Iris dataset, with the category labels shown in the column headers.
This transformation will also perform a text-to-categorical transformation for any text feature, however, the user must select the specific feature to be transformed.
Any categorical feature can be set to be a class feature (for class prediction analysis) as long as the labels have been input. This can be accomplished by either importing a file with class values and their labels, or by transforming text-to-categorical using the above transformations, or by assigning categorical labels using the Editor. The numeric class feature values will then be used for assigning symbol colors in all class discovery runs.
Explorer will transform feature values over objects inside CV folds used for training, and separately for objects used for testing, based on parameters obtained from objects within the training folds. For example, if 10 folds are used for CV, feature transformation occurs over all objects within partitions (folds) 1, 2, ..., 9, and then feature values for testing objects in fold 10 are transformed using parameters from training folds 1, 2, ..., 9. When normalization is specified, the feature minimum and maximum are based on objects in the training folds and applied to objects in the testing fold. Boundary constraints for the range [0,1] are maintained by setting negative feature values for objects in the testing fold to zero, and values greater than 1 are set to 1. For mean-zero standardization, the feature mean and s.d. are obtained from all objects in the training folds, and then are applied to objects in the testing fold.
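The sketch below shows the general pattern: derive the transform parameters from the training folds only and apply them to the held-out fold, clipping normalized test values back into [0,1]. It is a schematic of the behavior described above, not Explorer's internal code, and all names are ours.

```python
import numpy as np

def transform_train_test(x_train, x_test, method="zscore"):
    """Fit transform parameters on the training folds; apply them to the test fold."""
    if method == "zscore":
        mean = x_train.mean(axis=0)
        sd = x_train.std(axis=0, ddof=1)
        return (x_train - mean) / sd, (x_test - mean) / sd
    if method == "normalize":
        lo, hi = x_train.min(axis=0), x_train.max(axis=0)
        train = (x_train - lo) / (hi - lo)
        test = np.clip((x_test - lo) / (hi - lo), 0.0, 1.0)   # keep [0,1] bounds
        return train, test
    raise ValueError(f"unknown method: {method}")

# Example: 10 folds, train on folds 1-9, transform fold 10 with their parameters.
folds = np.array_split(np.random.default_rng(1).normal(size=(50, 3)), 10)
train_t, test_t = transform_train_test(np.vstack(folds[:9]), folds[9], method="normalize")
```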
Default options for feature transformations during CV for class prediction runs can be set in the general default settings popup window, which are shown in the image below:
[1] G.J. Klir, B. Yuan. Fuzzy Sets and Fuzzy Logic, Upper Saddle River(NJ), Prentice-Hall, 1995.
[2] D. Dubois, H. Prade. Fundamentals of Fuzzy Sets, Boston(MA), Kluwer, 2000.
[3] S.K. Pal, P. Mitra. Pattern Recognition Algorithms for Data Mining: Scalability, Knowledge Discovery and Soft Granular Computing. Boca Raton(FL), Chapman & Hall, 2004.
Applies To | Hypothesis testing to determine if means are equal in 2 groups |
Origin | Parametric statistics (based on means, variances, etc.) |
Null Hypothesis | $H_0: \mu_1 = \mu_2$, both means are equal |
Alternative Hypothesis | $H_1: \mu_1 \neq \mu_2$, the two means are significantly different |
Variables | One normally distributed continuously-scaled variable, one grouping variable |
Variants | Welch test, Behrens-Fisher problem |
The t-test is one of the most commonly used tests for determining whether two means are the same. Naturally, since the t-test is based on the average and standard deviation of each sample, the test is of the parametric type. In addition, the t-test assumes normality among the data being tested, so the data should also be continuously-scaled. Specific t-tests have been developed for cases in which the variances of the samples being tested are assumed to be equal or unequal, the latter of which is known as the Behrens-Fisher problem. The null and alternative hypotheses for the t-test are $H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \neq \mu_2$.
Before conducting a t-test, it is instructive to know whether the variances for the two samples are equal or unequal. A simple set of hypotheses can be constructed for this scenario, where the null and alternative hypotheses are $H_0: \sigma_1^2 = \sigma_2^2$ and $H_1: \sigma_1^2 \neq \sigma_2^2$.
Tests for equal variance among groups are also known as homogeneity of variance tests. Equal variances are known as homoscedasticity, whereas unequal variances are known as heteroscedasticity. Bartlett's test [1] is used to test for equality of variances between 2 or more groups of normally-distributed samples. The test statistic is calculated as
$$\chi^2 = \frac{(N-k)\ln s_p^2 - \sum_{i=1}^{k}(n_i - 1)\ln s_i^2}{1 + \frac{1}{3(k-1)}\left(\sum_{i=1}^{k}\frac{1}{n_i - 1} - \frac{1}{N-k}\right)}, \qquad s_p^2 = \frac{1}{N-k}\sum_{i=1}^{k}(n_i - 1)s_i^2 \qquad (5.1)$$

where $N$ is the total sample size, $k$ is the number of independent samples, $n_i$ is the sample size of the $i$th group, and $s_i^2$ is the sample variance for the $i$th group. The test statistic is $\chi^2$-distributed with $k-1$ degrees of freedom. If the calculated statistic exceeds the critical value of $\chi^2_{k-1,\alpha}$, the p-value is less than $\alpha$ and the variances are considered to be unequal. Otherwise, if the p-value exceeds 0.05, the variances are considered equal (not significantly different).
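Explorer reports this test automatically before grouped mean comparisons; to reproduce it outside Explorer, SciPy's implementation can be used. The two groups below are made-up numbers.

```python
from scipy import stats

group_1 = [4.1, 5.0, 6.2, 5.5, 4.8, 5.3]
group_2 = [3.9, 7.5, 2.2, 8.0, 5.1, 6.6]

chi2, p = stats.bartlett(group_1, group_2)     # Bartlett chi-square and p-value
print(f"Bartlett chi-square = {chi2:.3f}, p = {p:.4f}")
print("equal variances assumed" if p >= 0.05 else "unequal variances assumed")
```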
Assume that two samples of size $n_1$ and $n_2$ were drawn from two infinitely large populations. Calculate the sample means $\bar{x}_1$ and $\bar{x}_2$, replace the sample data, resample, recalculate means, resample, etc. The mean differences obtained from multiple samplings would be normally distributed with variance $\sigma_1^2/n_1 + \sigma_2^2/n_2$. A typical question is: is the difference ($\bar{x}_1 - \bar{x}_2$) between two sample means significant? The hypotheses are as follows:

$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_1: \mu_1 \neq \mu_2 \qquad (5.2)$$
Because we are limited to only samples from the infinitely large populations, we obtain samples of size $n_1$ and $n_2$, which have averages $\bar{x}_1$ and $\bar{x}_2$, with sample variances $s_1^2$ and $s_2^2$. For the two independent samples, the test statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \qquad (5.3)$$

follows a $t$ distribution whose degrees of freedom depend on the assumptions made about the variances. When sampling during a study or experiment, a priori values of the population variances can either be known or unknown. If they are known, then they can either be equal or unequal. If they are unknown, it is a common practice to assume they are unequal, since this would be the most conservative approach. In the remaining sections, we address the formulation of the t-test when the sample variances are equal or unequal.
Understanding two-tailed and one-tailed tests. There are times when an investigator believes not simply that the two means are different, but rather that one mean is significantly greater or significantly lower than the other. In such cases, one-tailed tests are employed. Recall, the criterion for whether a test is called two-tailed or one-tailed is based on the form of the alternative hypothesis, that is, the belief formed by the investigator.
Let us look at some example descriptions in the table below:
Hypothesis | Null | Alternative | # ways alternative is true | Terminology |
Equality | $H_0: \mu_1 = \mu_2$ | $H_1: \mu_1 \neq \mu_2$ | 2 ($\mu_1 > \mu_2$, $\mu_1 < \mu_2$) | "Two-tailed" |
Inequality | $H_0: \mu_1 \le \mu_2$ | $H_1: \mu_1 > \mu_2$ | 1 ($\mu_1 > \mu_2$) | "One-tailed" |
Inequality | $H_0: \mu_1 \ge \mu_2$ | $H_1: \mu_1 < \mu_2$ | 1 ($\mu_1 < \mu_2$) | "One-tailed" |
The first alternative hypothesis is based on the belief that the means are significantly different, which can occur two ways: first $\mu_1 > \mu_2$, and second $\mu_1 < \mu_2$. A test performed under this alternative hypothesis is called a two-tailed test. On the other hand, when the belief is only that $\mu_1 > \mu_2$, the test is called a one-tailed test. Analogously, if the only belief is that $\mu_1 < \mu_2$, then this is also called a one-tailed test. Quite basically, the way to remember whether a one-tailed or two-tailed test is employed is to think of the number of ways the alternative hypothesis can be true. For a two-tailed test, there are two ways the two means can be significantly different; however, under the one-tailed test, there is only one way the alternative hypothesis can be true.
Formally, there is a set of hypotheses (null and alternative) for each of the alternative hypotheses shown in the table above. Thus, for two means there can only be three sets of hypotheses, and these are:

For the two-tailed test: $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$.

For the first one-tailed test: $H_0: \mu_1 \le \mu_2$ versus $H_1: \mu_1 > \mu_2$.

For the second one-tailed test: $H_0: \mu_1 \ge \mu_2$ versus $H_1: \mu_1 < \mu_2$.
Variances assumed equal ($\sigma_1^2 = \sigma_2^2$). When the population variances are assumed to be equal, it makes sense to combine the variance from each sample. This is known as the common pooled variance. To calculate the common pooled variance, we weight each sample variance by its sample size in the form

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \qquad (5.4)$$

Going back to the general equation for the t-test in Eq. (5.3), a substitution is made replacing the variance for each sample with the common pooled variance, so that the test statistic now becomes

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (5.5)$$

Significance of the test statistic is compared with tail probabilities for Student's $t$ distribution at $\alpha$ with $n_1 + n_2 - 2$ degrees of freedom.
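A sketch of the equal-variance t-test from Eqs. (5.4)-(5.5), with a two-tailed p-value taken from SciPy's t distribution. The function name and data are ours; the worked problem below uses summary statistics rather than raw data.

```python
import math
from scipy import stats

def pooled_t_test(x1, x2):
    """Two-sample t-test assuming equal variances (pooled variance)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((x - m1) ** 2 for x in x1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in x2) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance, Eq. (5.4)
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))      # Eq. (5.5)
    df = n1 + n2 - 2
    p = 2 * stats.t.sf(abs(t), df)                          # two-tailed p-value
    return t, df, p

print(pooled_t_test([4.1, 5.0, 6.2, 5.5], [6.8, 7.1, 8.0, 7.4]))
```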
Worked problem. A researcher was interested in determining if the mean concentration of a protein in tumor tissue was significantly different from the mean expression in normal tissue. The mean expression in tumors was , with standard deviation , and sample size , while the mean expression in normal tissue was , with standard deviation , and sample size . Are the two means significantly different when assuming equal variances?
For the assumption of equal variances, the pooled variance is:
| (5.6) |
Substituting in the data, we have:
| (5.7) |
Given , we now find that .
| (5.8) |
Significance of the test statistic is compared with cumulative probabilities for Student’s distribution at for degrees of freedom. The tabled critical value, , is equal to 2.05, and since we conclude that the two means are not significantly different. In fact, the p-value is 0.4639, which is not less than .
Variances assumed unequal (). When the sample variances for the populations are unequal, we keep the sample variances intact in the general equation for the t-statistic given as
| (5.9) |
however, the value of no longer represents the distribution. This problem was first recognized by Behrens [2], who introduced one of the first solutions. Behrens’ approach was later verified by Fisher [3]. We implement the approach introduced by Dixon and Massey [4], which involves a correction to the degrees of freedom, , in the form
| (5.10) |
The computed value of is rounded to the nearest integer. The significance of the test statistic is compared with tail probabilities for Student’s t distribution at values of and degrees of freedom.
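A similar check for the unequal-variance case can be made with SciPy's Welch option. Note that SciPy applies the Welch-Satterthwaite degrees of freedom rather than the rounded Dixon-Massey correction described above, so the degrees of freedom may differ slightly. The data below are hypothetical.

```python
# Sketch: two-sample t-test without assuming equal variances (Welch's test);
# the concentration values below are hypothetical.
from scipy import stats

placebo = [5.1, 4.8, 6.3, 5.5, 4.9, 6.0]   # hypothetical placebo concentrations
treated = [7.9, 8.4, 6.8, 9.1, 7.2, 8.8]   # hypothetical treated concentrations

t, p = stats.ttest_ind(placebo, treated, equal_var=False)
print(t, p)
```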
Worked problem. A researcher was interested in determining if the mean concentration levels of a drug in serum plasma of placebo patients (untreated) was significantly different from the mean serum plasma concentrations in treated patients. The mean concentration in placebo patients was , with standard deviation , and sample size , while the mean concentration in treated patients was , with standard deviation , and sample size . Are the two means significantly different when assuming unequal variances?
| (5.11) |
Significance of the test statistic is compared with cumulative probabilities for Student’s distribution at for degrees of freedom. The tabled critical value, , is equal to 1.97, and since we conclude that the two means are significantly different. In fact, the p-value is 0.0008, which is less than .
Applies To | Hypothesis testing to determine if two samples are from the same distribution |
Origin | Non-parametric statistics (based on ranks) |
Null Hypothesis | , both samples are from the same distribution |
Alternative Hypothesis | , both samples are not from the same distribution |
Variables | One test variable that is at least ordinally scaled, and one categorical variable |
Variants | Friedman test |
The Mann-Whitney test is a non-parametric 2-sample test to determine whether two independent samples are from the same distribution [5]. Unlike the t-test and F-test, which assume an underlying normal distribution of the data, the Mann-Whitney test requires only that the data be at least ordinal, not necessarily normally distributed. The asymptotic relative efficiency (A.R.E.) of the Mann-Whitney test approaches 95.5% as increases when applied to data for which parametric tests are appropriate [6]. However, in cases where parametric tests are inappropriate, the Mann-Whitney test will have greater power to reject than the t-test. The Mann-Whitney test should be applied when little is known about the distributional properties of the two samples, or when the assumptions of normality do not hold.
The goal of the Mann-Whitney test is to compare two sets of data to determine if they are from the same distribution. By “same distribution” we mean that values of one distribution do not tend to be less than or greater than values of the other. If values of one distribution tend to be less than or greater than values of the other distribution, then the two distributions are different. Stated formally, the null and alternative hypotheses of the Mann-Whitney test are
: Both samples from the same distribution (null hypothesis) |
: Both samples from different distributions (alternative hypothesis) |
Given the hypotheses, the goal is to discredit or reject the null hypothesis, that is, to show that values from the two distributions are not the same. We begin calculations for the Mann-Whitney test by first determining the statistic for each group in the form:
$$ U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2 \qquad (5.12) $$
where and are the sample sizes for each of the two groups, and and are the sum of the ranks within each group. The statistic is taken as the lowest value among and :
$$ U = \min(U_1, U_2) \qquad (5.13) $$
The test statistic for the Mann-Whitney test is standard normal distributed as
$$ Z = \frac{U - \dfrac{n_1 n_2}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} \qquad (5.14) $$
Worked problem. Let’s consider an example that tests whether expression of a single protein in two independent groups of tissues are from the same distribution. A laboratory assay was carried out in which fluorescent intensity levels were generated for expression of protein A in two samples of mice: wild-type (group 1) and knock-out (group 2). The data are shown in the table below:
ID | Protein A | Group | ||||
1 | -0.017 | 5 | 1 | 5 | ||
2 | -0.349 | 1 | 1 | 1 | ||
3 | 0.136 | 8 | 1 | 8 | ||
4 | 0.111 | 7 | 1 | 7 | ||
5 | -0.149 | 4 | 2 | 4 | ||
6 | -0.181 | 3 | 2 | 3 | ||
7 | -0.307 | 2 | 2 | 2 | ||
8 | 0.071 | 6 | 2 | 6 | ||
Ranks are determined by first sorting the data for both samples in ascending order, and then assigning ranks. Tied expression values are handled by averaging their original rank assignments. Substituting the results into the formulation gives us
| (5.15) |
| (5.16) |
| (5.17) |
| (5.18) |
Since is less than 1.96, the test is not significant, and the conclusion is that the two distributions are not significantly different.
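The same conclusion can be checked with SciPy's Mann-Whitney implementation using the wild-type and knock-out values from the table above. For samples this small SciPy may report an exact p-value rather than the normal approximation of Eqs. (5.12)-(5.14), so the p-value can differ slightly from the hand calculation.

```python
# Mann-Whitney test for the wild-type vs. knock-out protein A data above.
from scipy import stats

wild_type = [-0.017, -0.349, 0.136, 0.111]   # group 1
knock_out = [-0.149, -0.181, -0.307, 0.071]  # group 2

u, p = stats.mannwhitneyu(wild_type, knock_out, alternative='two-sided')
print(u, p)   # not significant, consistent with the worked result
```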
Wilcoxon Rank Sum Test. We now show the similarity between the Wilcoxon rank sum test and the Mann-Whitney test. Assume the following data:
ID | Protein A | Group | ||||
1 | -0.062 | 6 | 1 | 6 | ||
2 | 0.052 | 7 | 1 | 7 | ||
3 | -0.096 | 4 | 1 | 4 | ||
4 | -0.175 | 2 | 1 | 2 | ||
5 | -0.166 | 3 | 2 | 3 | ||
6 | 0.089 | 8 | 2 | 8 | ||
7 | -0.092 | 5 | 2 | 5 | ||
8 | -0.293 | 1 | 2 | 1 | ||
While both the Mann-Whitney and Wilcoxon rank tests use the standard normal -variate as the test statistic, the Wilcoxon rank sum test uses a slightly different form of the numerator of (the denominator is the same). In spite of the similarity between these two non-parametric tests, NXG Explorer employs the Mann-Whitney test for all 2-sample non-parametric tests. Computation of the Wilcoxon rank sum test goes as follows:
| (5.19) |
Example 1 - Two-tailed t-tests (). First, select Analysis→Independence→Unpaired 2 and k-sample tests, shown as
Next, click on sex as the grouping feature, and then select all of the continuous features on the left side of the popup window, shown as:
Lastly, leave the default settings that are visible on the popup window for grouped tests:
Using the Retinol dataset, two-tailed t-tests were performed using continuous variables with sex as the grouping variable. Recall, the sex categorical feature has two levels: 1=males, 2=females. The two-tailed tests are based on the belief (alternative hypothesis) . Results are listed in Figure 5.1. For each continuous variable, the appropriate t-test is used, along with the Mann-Whitney test and Bartlett’s variance test. Tail probabilities for the t-test and Mann-Whitney test will be colored either red or blue, depending on the results of Bartlett’s test. NXG Explorer automatically applies the appropriate t-test for each continuous variable: if Bartlett’s test for equality of variances is significant, the Welch t-test assuming unequal variances is used; otherwise, the t-test assuming equal variances is used. The non-parametric Mann-Whitney test is not an alternative to the Welch t-test when variances are unequal between groups, but it is nevertheless performed for each 2-group test.
Example 2 - One tailed t-tests, (). In this example, one-tailed t-tests were performed using the same continuous variables with sex as the grouping variable. The specific alternative hypothesis for the tests is . Results are listed in Figure 5.2.
Example 3 - One tailed t-tests, (). In this example, one-tailed t-tests were performed using the same continuous variables with sex as the grouping variable. The specific alternative hypothesis for the tests is . Results are listed in Figure 5.3.
NXG Logic Explorer can use geometric or harmonic means during 2- and k-sample hypothesis tests for inequality of means. When the geometric mean is specified for a continuous feature, the mean is defined as
| (5.20) |
and the geometric standard deviation is
| (5.21) |
If the harmonic mean is specified, then the mean of the reciprocals is
| (5.22) |
and the harmonic mean is
| (5.23) |
The variance of is
| (5.24) |
and the standard deviation is then defined as
| (5.25) |
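The geometric and harmonic means can be checked with SciPy as sketched below. The standard deviation definitions in Eqs. (5.20)-(5.25) are specific to NXG Explorer's output, so the log-scale geometric standard deviation shown here is only a conventional approximation, and the data are hypothetical.

```python
# Sketch: geometric and harmonic means for a strictly positive feature.
import numpy as np
from scipy import stats

x = np.array([1.2, 3.4, 2.2, 5.6, 4.1, 2.8])   # hypothetical positive values

geo_mean = stats.gmean(x)                    # exp(mean of log x)
geo_sd = np.exp(np.std(np.log(x), ddof=1))   # conventional geometric SD (back-transformed log-scale SD)
har_mean = stats.hmean(x)                    # n / sum(1/x)

print(geo_mean, geo_sd, har_mean)
```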
Applies To | Hypothesis testing to determine if one or more pairs of means are significantly different in unrelated groups |
|
|
Origin | Parametric statistics (based on means, variances, etc.) |
|
|
Null Hypothesis | , all means are equal |
| |
Alternative Hypothesis | , any pair of means are significantly different |
|
|
Variables | One normally distributed continuously-scaled variable, one grouping variable |
|
|
Variants | Welch ANOVA |
|
|
Analysis of variance (ANOVA) is essentially like the 2-sample t-test for inequality of means; however, ANOVA is employed when there are 3 or more groups. Recall, ANOVA is for comparing means of a single continuously-scaled variable across 3 or more groups. The null hypothesis is that all of the group means are equal, and the alternative hypothesis, i.e., the belief, is that one or more pairs of means are significantly different.
Using notation, the null and alternative hypotheses for ANOVA are
Define as a continuously-scaled observation in group of sample size =1,2,…,;= 1,2,…,, where is the number of groups. The ANOVA model states that each random observation can be represented by the model
where is the grand mean, is the effect of treatment , and is the error. Estimates of the model parameters are
A typical data setup for ANOVA is given as follows:
Group 1 | Group 2 | Group | Group | ||
The grand mean is determined as
and the treatment mean for group is
Redefining the error terms, we have
Subtracting from the right side yields
and squaring all terms and summing over and gives us the sum of squares
A more numerically efficient method for calculating the sum of squares is in the form
where the grand and treatment-specific sums are
and
In summary, the sum of squares total is equal to the treatment sum of squares (error between treatments) and the error sum of squares (within treatments), given as
Dividing each of these sums of squares by their individual degrees of freedom, one can obtain the mean squares due to treatment and the mean squares due to error:
| (5.26) |
Worked problem. In the example below, the expression of a single serum plasma protein, called “PSA” (prostate-specific antigen), was measured in laboratory mice treated with Drug A or Drug B. A third group was used for placebo or untreated “Control” mice. Perform an F-ratio test using ANOVA to determine if mean expression is significantly different between the 3 groups.
Table 1. Mouse PSA expression after treatment with Drug A and Drug B.
Control | Drug A | Drug B | ||
8.24 | 16.94 | 2.92 | ||
11.79 | 21.46 | 2.18 | ||
9.44 | 21.48 | 1.80 | ||
12.82 | 18.24 | 2.15 | ||
7.82 | 15.79 | 3.02 | ||
11.36 | 18.34 | 2.37 | ||
9.32 | 2.33 | |||
7.84 | ||||
8 | 6 | 7 | 21 | |
78.63 | 112.25 | 16.76 | 207.64 | |
798.84 | 2127.38 | 41.29 | 2967.52 | |
Substituting in our results, for the sum of squares treatment (SST) we have:
| (5.27) |
The degrees of freedom for are determined next, based on the number of groups
| (5.28) |
The , or “residual error,” captures the error between each observation and its treatment mean, and thus reflects the within-treatment error. When is large, the residual error can outweigh the treatment effects, resulting in a non-significant test result.
| (5.29) |
| (5.30) |
The F-statistic can now be calculated as the ratio of the two mean square errors
| (5.31) |
The tabled is 3.55, suggesting that the means are not all equal, with a p-value less than 0.05.
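The F-ratio for this worked problem can be verified with SciPy's one-way ANOVA using the Table 1 data:

```python
# One-way ANOVA for the mouse PSA data in Table 1.
from scipy import stats

control = [8.24, 11.79, 9.44, 12.82, 7.82, 11.36, 9.32, 7.84]
drug_a  = [16.94, 21.46, 21.48, 18.24, 15.79, 18.34]
drug_b  = [2.92, 2.18, 1.80, 2.15, 3.02, 2.37, 2.33]

f, p = stats.f_oneway(control, drug_a, drug_b)
print(f, p)   # F is compared with the F distribution on (k-1, N-k) = (2, 18) d.f.
```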
If Bartlett’s test for equality of variances is significant, meaning that the variances are significantly different across groups, i.e., heteroscedasticity, then the Welch ANOVA test should be used [7]. Calculation of the F-ratio statistics and adjusted degrees of freedom are provided below. The weight for each th group is defined as
| (5.32) |
and the sum of the weights is
| (5.33) |
A weighted mean is then calculated as
| (5.34) |
The F-ratio statistic is given as
| (5.35) |
where the numerator and denominator degrees of freedom are
| (5.36) |
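A minimal sketch of the Welch ANOVA calculation described above, applied to the Table 1 data for comparison, is shown below. This is an illustrative implementation of the weighting scheme, not NXG Explorer's own code, and the exact form is assumed to follow Welch [7].

```python
# Sketch of the Welch ANOVA using per-group weights w_j = n_j / s_j^2.
import numpy as np
from scipy import stats

groups = [np.array([8.24, 11.79, 9.44, 12.82, 7.82, 11.36, 9.32, 7.84]),
          np.array([16.94, 21.46, 21.48, 18.24, 15.79, 18.34]),
          np.array([2.92, 2.18, 1.80, 2.15, 3.02, 2.37, 2.33])]

k = len(groups)
n = np.array([len(g) for g in groups])
m = np.array([g.mean() for g in groups])
v = np.array([g.var(ddof=1) for g in groups])

w = n / v                                  # per-group weights
W = w.sum()                                # sum of the weights
xbar = (w * m).sum() / W                   # weighted grand mean

num = (w * (m - xbar) ** 2).sum() / (k - 1)
adj = ((1 - w / W) ** 2 / (n - 1)).sum()
den = 1 + 2 * (k - 2) / (k ** 2 - 1) * adj
f_welch = num / den                        # Welch F-ratio

df1 = k - 1
df2 = 1.0 / (3 / (k ** 2 - 1) * adj)       # adjusted denominator d.f.
p = stats.f.sf(f_welch, df1, df2)
print(f_welch, df1, df2, p)
```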
Applies To | Hypothesis testing to determine if three or more samples (i.e., ) are from the same distribution |
|
|
Origin | Non-parametric statistics (based on ranks) |
|
|
Null Hypothesis | , variates from all samples are from the same distribution |
|
|
Alternative Hypothesis | , samples are not from the same distribution |
|
|
Variables | One test variable that is at least ordinal, and one categorical variable |
|
|
Variants | Friedman test |
|
|
The Kruskal-Wallis test is used to determine whether data for a single variable in k groups are from the same distribution.
Worked problem. A research team performed a study to explore concentrations of a serum marker in laboratory mice 24 hours after treatment with Drug A, Drug B, or no treatment (Control). Perform a Kruskal-Wallis test using the data listed in the table below, report the calculated test statistic, and comment on the p-value and your conclusion.
Table: Mouse serum plasma concentration of protein marker 24 hours after treatment.
Control | Rank | Drug A | Rank | Drug B | Rank |
8.24 | 10 | 16.94 | 17 | 2.92 | 6 |
11.79 | 14 | 21.46 | 20 | 2.18 | 3 |
9.44 | 12 | 21.48 | 21 | 1.80 | 1 |
12.82 | 15 | 18.24 | 18 | 2.15 | 2 |
7.82 | 8 | 15.79 | 16 | 3.02 | 7 |
11.36 | 13 | 18.34 | 19 | 2.37 | 5 |
9.32 | 11 | 2.33 | 4 | ||
7.84 | 9 | ||||
8 | 6 | 7 | |||
92 | 111 | 28 | |||
The test statistic used for the Kruskal-Wallis test is compared with the chi-squared distribution with degrees of freedom.
$$ H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1) \qquad (5.37) $$
Since there are groups, the chi-squared statistic is approximately distributed with 2 d.f. The calculated test statistic of 17.73 is significant, since it exceeds the critical value of chi-squared for and 2 d.f., . Therefore, the p-value would be less than 0.05, i.e., .
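The Kruskal-Wallis statistic for this worked problem can be verified with SciPy:

```python
# Kruskal-Wallis test for the serum marker data in the table above.
from scipy import stats

control = [8.24, 11.79, 9.44, 12.82, 7.82, 11.36, 9.32, 7.84]
drug_a  = [16.94, 21.46, 21.48, 18.24, 15.79, 18.34]
drug_b  = [2.92, 2.18, 1.80, 2.15, 3.02, 2.37, 2.33]

h, p = stats.kruskal(control, drug_a, drug_b)
print(h, p)   # H is compared with chi-squared on k-1 = 2 d.f.
```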
Example 4 - ANOVA and Kruskal-Wallis Tests. In this example, we provide NXG Explorer results for ANOVA and Kruskal-Wallis analysis of the continuous variables in the Retinol dataset, with vitamin use status vituse as the grouping variable. Recall, there are 3 levels of the vitamin use category: 1=Yes, 2=Sometimes, 3=No. To begin, import the Retinol.xlsx file. Next, select Analysis→Independence→Unpaired 2 and k-sample tests, shown as
Next, specify vituse as the grouping feature, and select all of the remaining features (left panel) as the test features:
Then click on the Apply button in the next popup window that appears:
When the run is complete, click on the Grouped analysis icon shown in the treenodes:
The panel with the tabular results of the ANOVA test will appear:
The above results were generated with the following default setting in PreferencesGeneral default settings
which will exclude the non-parametric Kruskal-Wallis test results as well as Bartlett’s equality of variance test results. Un-check the box shown, then rerun the ANOVA for the same features, and the following results will be generated:
which shows the Kruskal-Wallis test results, as well as Bartlett’s equality of variance test results.
You can also specify default settings for outputting only parametric or non-parametric test results. Another important point is that NXG Explorer will default to the Welch ANOVA when Bartlett’s equality of variance test is significant, and in this case you will notice that the d.f. for the tests change. This behavior cannot be changed, since it is inappropriate not to use the Welch ANOVA in the presence of heteroscedasticity.
Confounding occurs when variables not being tested (e.g., age, gender) are independently related to both the outcome and the exposure. Matching is used to make subjects in different exposure groups more similar in terms of extraneous variables that are independently associated with the outcome. The goal of matching is therefore to ensure that differences in outcome between the exposure groups are more likely to be due to exposure rather than to confounding variables. The overall effect of matching may be to reduce the standard deviation of the differences rather than the differences themselves. Since the standard deviation is in the denominator of the t-test, the reduced standard deviation increases the test statistic. This reduction in variance causes an increase in statistical efficiency, which is the power to detect a difference for a fixed sample size.
For paired samples, the -test is
$$ t = \frac{\bar{d}}{s_d / \sqrt{n}} \qquad (5.38) $$
where is the mean of the pairwise differences for each record, and is the standard deviation of differences given as
$$ s_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}} \qquad (5.39) $$
Worked problem. Perform a matched paired t-test to determine if concentration of a protein A is significantly different in tissue and plasma. The table below shows the data:
ID | Protein A (plasma) | Protein A (tissue) | Difference | Squared difference |
1 | 0.289 | 1.027 | -0.738 | 0.54 |
2 | -0.243 | 0.592 | -0.835 | 0.70 |
3 | 0.082 | 0.795 | -0.713 | 0.51 |
4 | -0.168 | 1.393 | -1.561 | 2.44 |
5 | 0.007 | 0.814 | -0.808 | 0.65 |
Substituting in the data, we have for the s.d.
| (5.40) |
and for the t-statistic, we get:
| (5.41) |
The tabled critical value, , is equal to 2.78, and since we conclude that the two means are significantly different. In fact, the p-value is 0.0042, which is less than .
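This paired t-test can be verified with SciPy using the plasma and tissue values from the table above; it should reproduce a p-value of roughly 0.004.

```python
# Paired t-test for the plasma vs. tissue protein A data above.
from scipy import stats

plasma = [0.289, -0.243, 0.082, -0.168, 0.007]
tissue = [1.027, 0.592, 0.795, 1.393, 0.814]

t, p = stats.ttest_rel(plasma, tissue)
print(t, p)   # compared with Student's t on n-1 = 4 d.f.
```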
Wilcoxon Signed Rank Test. The Wilcoxon signed rank test is the non-parametric variant of the paired t-test. Let’s perform a Wilcoxon signed rank test for matched pairs data and then show worked equations.
ID | Protein A | Protein B | |||||
1 | 0.122 | 0.004 | 0.117 | 0.117 | 3 | 3 | |
2 | 0.456 | 0.418 | 0.038 | 0.038 | 2 | 2 | |
3 | -0.139 | 0.522 | -0.661 | 0.661 | 8 | 8 | |
4 | 0.155 | -0.048 | 0.203 | 0.203 | 5 | 5 | |
5 | 0.158 | 0.481 | -0.323 | 0.323 | 6 | 6 | |
6 | -0.112 | 0.066 | -0.179 | 0.179 | 4 | 4 | |
7 | 0.374 | 0.380 | -0.006 | 0.006 | 1 | 1 | |
8 | -0.032 | 0.373 | -0.404 | 0.404 | 7 | 7 | |
In the table above, is the difference between the two values in a row, is the absolute value of the difference, is the rank of the absolute difference, is the rank of when the original (raw) difference is negative, and is the rank of when the original (raw) difference is positive.
Next, we substitute in the value of along with the sample size to get
| (5.42) |
Since , we cannot reject the null hypothesis that the two concentrations are equal, and conclude that they are not significantly different.
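The Wilcoxon signed rank test can be checked with SciPy using the matched protein A and protein B values above; for n = 8 pairs SciPy may report an exact p-value rather than the normal approximation in Eq. (5.42), so small differences from the hand calculation are expected.

```python
# Wilcoxon signed rank test for the matched protein A / protein B data above.
from scipy import stats

protein_a = [0.122, 0.456, -0.139, 0.155, 0.158, -0.112, 0.374, -0.032]
protein_b = [0.004, 0.418, 0.522, -0.048, 0.481, 0.066, 0.380, 0.373]

w, p = stats.wilcoxon(protein_a, protein_b)
print(w, p)   # not significant, consistent with the worked result
```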
Example 5 - Paired t-test and Wilcoxon signed rank tests. In this example, both the paired t-test and Wilcoxon signed rank tests results are run for the fat and fiber features of the Retinol dataset. To begin, specify the “Paired mean inequality tests” option of the Independence pull-down menu, shown as follows:
Next, select two continuous variables for testing:
The output for both the paired t-test and the Wilcoxon signed rank tests will be visible after clicking on the output icon, which is shown below
Results of the tests indicate that the mean values for fat and fiber measurements are significantly different.
Tests for equality of 2 independent proportions are commonly used in statistics. The notation used for tests of proportions uses and , where is the proportion of successes (failures) out of the total sample size for sample 1, and is the proportion of successes (failures) out of the total sample size for sample 2. Let be a binary outcome feature taking on values (0,1), which may represent (1=success, 0=failure), (1=tumor, 0=normal), etc., for the th object in the th group . The proportion in each group can be determined either by summing the x-values within each group to obtain the total number of occurrences for which , or by using a logical statement counting the total number of ones in each group.
| (5.43) |
where represents the total number of ones among all values of . Similarly, is defined as
| (5.44) |
NXG Explorer employs a pooled and unpooled version of the test, whose statistic is distributed standard normal, N(0,1). The pooled version is functionally composed as
| (5.45) |
where and . The unpooled variant of the test is
| (5.46) |
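A minimal sketch of the pooled and unpooled z-tests, following Eqs. (5.45) and (5.46), is shown below; the success counts and sample sizes are hypothetical placeholders.

```python
# Sketch: pooled and unpooled z-tests for two independent proportions.
from math import sqrt
from scipy import stats

x1, n1 = 30, 120    # successes and sample size, group 1 (hypothetical)
x2, n2 = 45, 130    # successes and sample size, group 2 (hypothetical)

p1, p2 = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)           # pooled proportion
q_bar = 1 - p_bar

z_pooled = (p1 - p2) / sqrt(p_bar * q_bar * (1 / n1 + 1 / n2))
z_unpooled = (p1 - p2) / sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Two-tailed p-values from the standard normal distribution
print(z_pooled,   2 * stats.norm.sf(abs(z_pooled)))
print(z_unpooled, 2 * stats.norm.sf(abs(z_unpooled)))
```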
Example 6 - Test for 2 independent proportions. For this example, we will use NXG Explorer to create the binary outcome feature. To begin, open the Retinol dataset and select the test of proportions option from the Independence command of the Analysis pull-down menu, shown as
Using the Retinol dataset, select sex as the grouping variable, and smokstat as the outcome variable. Next, we will define a success as subjects that have values of 3 for smoking status (smokstat=3) and failure for subjects with other values of smokstat equal to 1 or 2. (Recall, the categorical levels of smoking status in the Retinol dataset are 1=Never, 2=Former, 3=Current Smoker). In addition, select an alternative hypothesis of in order to specify a one-tailed test. This can be done using options in the popup window for test of proportions:
The results of the one-tailed test of proportions are shown below, which indicate that the proportion of subjects with values of smokstat=3 are not significantly different because the p-values of 0.726 and 0.713 for the pooled and unpooled tests are not less than 0.05.
Data are often collected to identify the number of subjects which fall into categories. Examples are the number of new patients with breast cancer, lung cancer, or no cancer; low-intermediate-high risk; or diagnostic category A, B, C, etc. The test is suitable for analyzing count data arranged into or more categories [9]. The method is a goodness-of-fit test to determine if a significant difference exists between observed and expected counts based on the null hypothesis. In order to test a hypothesis, we need to be able to determine expected frequencies. The null hypothesis () specifies the proportion of counts expected to fall into each category for a sample drawn from the presumed population. Thus, we deduce the expected frequencies from the null hypothesis. The test determines how close the observed frequencies are to the expected frequencies under , and is tested with the chi-squared function
$$ \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \qquad (5.47) $$
where is the number of observed counts and is the number of expected counts in the th cell. When differences between the observed and expected counts are small, the chi-squared test statistic will be small, whereas for large differences the chi-squared test statistic will be large. Squaring the non-zero differences results in positive values, and therefore the chi-squared test statistic is always greater than or equal to zero. For a one-way chi-squared test, the number of degrees of freedom (d.f.) is equal to . For a multiway chi-squared test, the d.f. are equal to the number of rows minus one times the number of columns minus one, that is, .
A two-way table provides a cross-tabulation of counts as a function of population and class. This can be shown in the following configuration:
Class 1 | Class 2 | Total | |
Population 1 | |||
Population 2 |||
Total | |||
If we consider the counts which are used for the chi-squared test, we have
Characteristic B | |||
Characteristic A | Present | Absent | Total |
Present |||
Absent | |||
Total | |||
which can be transformed into proportions according to the tabular notation:
Characteristic B | |||
Characteristic A | Present | Absent | Total |
Present |||
Absent | |||
Total | |||
Thus, the chi-squared test is really a test of equal proportions among the various categories involved
| (5.48) |
which translates to our original formulation
| (5.49) |
The expected number of counts in each cell, , is determined by the relationship
| (5.50) |
where is the total count in row and is the total count in column . Let’s consider an example chi-squared test involving the frequency of breast cancer among mothers whose first age at giving birth was less than or greater than age 30:
First age at giving birth | |||
Cancer status | Total | ||
Case | |||
Control | |||
Total | |||
First, we need to calculate the number of expected cases in each cell. Recall, the expected frequency in each cell is
| (5.51) |
Upon substitution of the data, we have
and finally, substituting the observed and expected frequencies into the equation for chi-squared, we have
| (5.52) |
| (5.53) |
The calculated value of 79.99 far exceeds the lookup critical value of 3.84 for the with 1 d.f. Hence, we conclude that there is a significant association between first age at giving birth and breast cancer based on these data.
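A chi-squared test on a 2x2 table of observed counts can be run with SciPy as sketched below. The counts are hypothetical placeholders (the worked example's cell counts are not repeated here), and correction=False gives the uncorrected statistic of Eq. (5.49).

```python
# Sketch: chi-squared test for a 2x2 contingency table of observed counts.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[120, 80],    # cases:    age < 30, age >= 30 (hypothetical)
                     [200, 300]])  # controls: age < 30, age >= 30 (hypothetical)

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p, dof)
print(expected)   # expected counts = (row total x column total) / grand total
```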
Fisher’s exact test is a statistical test used to determine if there are nonrandom associations between two categorical variables [10]. Let there exist two such variables and , with and observed states, respectively. Now form an matrix in which the entries represent the number of observations in which and .
Marker | |||
Disease | Positive | Negative | Total |
Present | |||
Absent | |||
Total | |||
where the row total is given by
| (5.54) |
and column total by
| (5.55) |
with a grand table total
| (5.56) |
Next, calculate the conditional probability of getting the actual matrix given the particular row and column sums, given by
| (5.57) |
which is a multivariate generalization of the hypergeometric probability function. Next, find all possible matrices of nonnegative integers consistent with the row and column sums and . For each one, calculate the associated conditional probability , where the sum of these probabilities must be 1.
To compute the p-value of the test, the tables must then be ordered by some criterion that measures dependence, and those tables that represent equal or greater deviation from independence than the observed table are the ones whose probabilities are added together. There are a variety of criteria that can be used to measure dependence. In the case, which is the one Fisher looked at when he developed the exact test, either the Pearson chi-squared statistic or the difference in proportions (which are equivalent) is typically used. Other measures of association, such as the likelihood-ratio test (G-squared) or any of the other measures typically used for association in contingency tables, can also be used.
The test is most commonly applied to matrices, and is computationally unwieldy for large or . For tables larger than , the difference in proportions can no longer be used, but the other measures mentioned above remain applicable (and in practice, the Pearson statistic is most often used to order the tables). In the case of the matrix, the p-value of the test can be simply computed by the sum of all .
For an example application of the test, let represent the positive or negative result of a protein marker, and the number of patients with and without a disease. If there are 5 diseased patients and 1 non-diseased patient who test positive, and 0 diseased and 4 non-diseased patients who test negative, then the configuration of these data would be
Marker | |||
Disease | Total | ||
Present | 5 | 0 | 5 |
Absent | 1 | 4 | 5 |
Total | 6 | 4 | 10 |
Computing gives
,
and the other possible matrices and their s are
Marker | |||
Disease | Total | ||
Present | 4 | 1 | 5 |
Absent | 2 | 3 | 5 |
Total | 6 | 4 | 10 |
;
Marker | |||
Disease | Total | ||
Present | 3 | 2 | 5 |
Absent | 3 | 2 | 5 |
Total | 6 | 4 | 10 |
;
Marker | |||
Disease | Total | ||
Present | 2 | 3 | 5 |
Absent | 4 | 1 | 5 |
Total | 6 | 4 | 10 |
;
Marker | |||
Disease | Total | ||
Present | 1 | 4 | 5 |
Absent | 5 | 0 | 5 |
Total | 6 | 4 | 10 |
,
which indeed sum to 1, as required. The sum of values less than or equal to p=0.0238 is then 0.0476 which, because it is less than 0.05, is significant. Therefore, in this case, there would be a statistically significant association between the marker and disease.
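SciPy's implementation of Fisher's exact test reproduces this two-sided p-value for the table above:

```python
# Fisher's exact test for the 2x2 marker-by-disease table above.
from scipy.stats import fisher_exact

table = [[5, 0],
         [1, 4]]
odds_ratio, p = fisher_exact(table, alternative='two-sided')
print(p)   # approximately 0.0476, matching the sum of table probabilities above
```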
When the table being analyzed is too large to consider all possible configurations, permutations are used by re-shuffling categories for one of the two categorical variables, each time recalculating in order to simulate the null distribution. After iterations, the empirical p-value is
| (5.58) |
Tables with more than two rows and columns are called multiway contingency tables. The table below shows the setup of a multiway contingency table:
Class | |||||
1 | 2 | Totals | |||
Population 1 | |||||
Population 2 | |||||
Population | |||||
Totals | |||||
The notation, using for the number of rows and for the number of columns, is the same as for two-way and one-way tables [11]. Using an table, we can test whether cell probabilities differ from sample to sample. We can also use an table for a single sample whose counts have been partitioned into categories according to one criterion and into categories according to another criterion. The application of chi-squared testing is the same for both methods.
Example 7 - Contingency table analysis. Gleason scoring is one of the most reliable pathological scoring systems for staging prostate cancer severity and growth. The higher the Gleason score for a tumor tissue sample, the more severe the prostate cancer. The table below lists the number of prostate cancer patients as a function of Gleason score category and diagnostic grade.
Gleason score | Class 1 | Class 2 | Class 3 | Class 4 | Totals |
(expected) | 7.3 | 30.3 | 38.0 | 5.4 | |
Gleason 1-3 | 23 | 40 | 16 | 2 | 81 |
(expected) | 18.6 | 77.5 | 97.1 | 13.8 | |
Gleason 4-6 | 11 | 75 | 107 | 14 | 207 |
(expected) | 9.1 | 38.2 | 47.9 | 6.8 | |
Gleason 7-9 | 1 | 31 | 60 | 10 | 102 |
Totals | 35 | 146 | 183 | 26 | 390 |
Null Hypothesis: The proportion of cases in Gleason categories is the same for all stages. The alternative hypothesis is that the proportions of cases in Gleason categories are different across stage.
| (5.59) |
The expected number of counts in each cell is
| (5.60) |
where is the total count in row and is the total count in column .
Significance: Let , , the total number of samples.
Sampling Distribution: The distribution for an table has d.f. Specifically, for this example we get d.f.
Rejection Region: will be rejected if the probability of occurrence of with 6 d.f. under the null hypothesis is less than or equal to .
Decision Rule: Reject values of that exceed for 6 d.f.
| (5.61) |
The calculated test statistic of 69.2 far exceeds the critical value of 22.46 for with 6 d.f., so the decision rule is to reject , and the conclusion is that there exists a significant association between Gleason score and tumor grade, with p < 0.001.
Consider the table
1 | 2 | Totals | |||||
1 | |||||||
2 | |||||||
Totals | |||||||
For each row, the multinomial probability of the row total is
| (5.62) |
where is the conditional probability in column () given row (Freeman and Halton, 1951; Agresti, 2002).
The probability of the table is the product of all of the multinomial row probabilities
| (5.63) |
Under independence, we assume that in each column the conditional cell probabilities are equal (including the row total), shown as . Thus, let’s substitute for and for in the above equation and obtain
| (5.64) |
The distribution of the cell counts in each row, however, also depends on . Fisher treated these as nuisance parameters and conditioned on them so that the resulting conditional probability does not depend on them. The contribution of to the product multinomial distribution depends on the data through , which are the sufficient statistics. Each column total has the multinomial distribution
| (5.65) |
The probability of conditional on equals the probability of divided by the probability of , so we invert the above ratio and multiply as
| (5.66) |
above is first calculated from the observed data. Permutations are then used by re-shuffling the categories of one of the two categorical variables being considered, each time recalculating in order to simulate the null distribution. After iterations, the exact p-value is
| (5.67) |
Randomization Test. You will notice that even for small tables, e.g., 3x4, 4x4, 4x5, 5x5, the number of iterations required to fully implement the exact form of the test can exceed configurations. Therefore, NXG Explorer uses the following randomization test:
The p-value for this randomization test is
| (5.68) |
that is, the number of times the chi-squared value for the permuted configurations exceeds or is equal to the chi-squared value for the raw data configuration (table), divided by the number of iterations.
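A minimal sketch of this randomization test is shown below: the categories of one variable are re-shuffled, the chi-squared statistic is recomputed for each permutation, and the empirical p-value is the proportion of permuted statistics that meet or exceed the observed one. The category labels below are hypothetical, and this is an illustration of the approach, not NXG Explorer's own code.

```python
# Sketch: permutation (randomization) test for an r x c contingency table.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Hypothetical paired category labels (one pair of labels per subject)
a = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1, 2])
b = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

def chi2_stat(x, y):
    # Build the contingency table and return the uncorrected chi-squared value
    table = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        table[xi, yi] += 1
    return chi2_contingency(table, correction=False)[0]

observed = chi2_stat(a, b)
n_iter = 5000
count = sum(chi2_stat(a, rng.permutation(b)) >= observed for _ in range(n_iter))
print(observed, count / n_iter)   # empirical p-value, as in Eq. (5.68)
```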
McNemar’s test is a chi-squared test of the significance of change among objects for which the presence (+) or absence (-) of a trait is determined before and after an exposure (treatment) is given. As an example, consider an intervention study to determine whether hearing loss could be significantly improved after cochlear implant therapy among patients with severe tinnitus (ringing in the ears). The number of patients testing positive for severe tinnitus before and after cochlear implant therapy is listed in the table below. McNemar’s test will allow you to determine whether the therapy significantly changed the frequency of tinnitus, and whether the outcome after therapy was better or worse.
The data table for patients undergoing cochlear implant therapy and the presence of tinnitus is given below:
Before | |||
After | Total | ||
330 | 141 | = 471 | |
351 | 383 | = 734 | |
= 681 | = 524 | = 1205 | |
| (5.69) |
The degrees of freedom (d.f.) for this test is equal to 1. The calculated test statistic of 89.63 is significant, since it exceeds (or is equal to) the critical value of chi-squared for and 1 d.f., . Therefore the p-value would be less than 0.05, i.e., .
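The McNemar statistic can be reproduced directly from the discordant (off-diagonal) counts in the table above; the uncorrected form used here gives 89.63.

```python
# McNemar's chi-squared from the discordant counts in the table above.
from scipy.stats import chi2

b, c = 141, 351                      # off-diagonal (discordant) cell counts
stat = (b - c) ** 2 / (b + c)        # McNemar chi-squared, 1 d.f.
p = chi2.sf(stat, df=1)
print(stat, p)
```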
Example 8 - McNemar Test. Open the mcnemar_data.xlsx file that is in the Data folder. Next, select Independence→McNemar in the Analysis pull-down menu:
It is important to realize that both features used for the McNemar test must be binary (0,1) codes. Binary features will have the binary-type icon associated with them. Next, select both binary features:
The results are shown below. While the change is not significant, there were 22% more subjects who changed from positive to negative compared with subjects who changed from negative to positive, so the pattern indicates that, on average, the cochlear implant therapy resulted in an improvement. Obviously, in clinical medicine, for an efficacious treatment you would expect many more subjects to change from positive to negative, and few subjects whose conditions worsened.
[1] M.S. Bartlett. Properties of sufficiency and statistical tests. Proc. Royal Stat. Soc., Series A, 160: 268–282, 1937.
[2] W. Behrens. Ein Beitrag zur Fehlerberechnung bei wenigen Beobachtungen. Landwirtschaftliche Jahrbücher, 68: 807–837, 1929.
[3] R. Fisher. The comparison of samples with possibly unequal variances. Annals of Eugenics, 9: 174–180, 1939.
[4] W. Dixon, F. Massey. Introduction to Statistical Analysis. New York (NY), McGraw-Hill, 1969.
[5] H. Mann, D. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat., 18:50–60, 1947.
[6] A.M. Mood. On the asymptotic efficiency of certain non-parametric two-sample tests. Ann. Math. Stat., 25: 514–522, 1954.
[7] B.L. Welch. On the comparison of several mean values. Biometrika, 38(3/4): 330–336, 1951.
[8] W. Kruskal, W. Wallis. Use of ranks in one-criterion variance analysis. J. Amer. Stat. Assoc., 47: 583–621, 1952.
[9] S. Siegel. Nonparametric Statistics. New York(NY), McGraw-Hill, 1956.
[10] G.H. Freeman, J.H. Halton. Note on an exact treatment of contingency, goodness of fit and other problems of significance. Biometrika, 38: 141–149, 1951.
[11] A. Agresti. Categorical Data Analysis. New York(NY), John Wiley and Sons, 2002.
Association is defined as the presence of a linear relationship between two continuously-scaled features. The classical example of two associated features in clinical medicine is age and systolic blood pressure (mmHg): as age increases, there is a tendency for systolic blood pressure to increase as well. Recall that the variance of a normally-distributed continuously-scaled random variable is defined in the form
$$ s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (6.1) $$
If we consider another continuous feature , then the covariance between and is
$$ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \qquad (6.2) $$
The above definitions apply to the bivariate case, where association between a pair of features is considered. The covariance for multiple features is covered at the end of this chapter.
Pearson Product Moment Correlation. The Pearson product moment correlation coefficient is similar to covariance, but does not include the effect of each feature’s scale. Pearson correlation between features and is the ratio of their covariance to the product of their standard deviations given as
$$ r_{xy} = \frac{s_{xy}}{s_x s_y} \qquad (6.3) $$
The correlation coefficient is scale-independent and dimensionless, with range . When approaches 1, there is a strong linear association between and ; as approaches 0, the linear association between and gets weaker. A negative value of indicates that as one feature goes up in scale, the other feature goes down. Recall that the population-based notation for covariance and standard deviation are and , , respectively, and thus correlation can be written as
$$ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \qquad (6.4) $$
Correlation: Z-score implementation. A much simpler formula for correlation involves the cross-product of the Z-scores for and . Using the Z-scores
$$ z_{x_i} = \frac{x_i - \bar{x}}{s_x} \qquad (6.5) $$
$$ z_{y_i} = \frac{y_i - \bar{y}}{s_y} \qquad (6.6) $$
the correlation can be written in the form
$$ r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i} \qquad (6.7) $$
We can test the significance of a correlation coefficient using the null hypothesis vs. the alternative hypothesis by use of Fisher’s -transformation. For the sample estimate we use
$$ z_r = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right) \qquad (6.8) $$
whereas for the hypothesized coefficient we use
$$ z_{\rho_0} = \frac{1}{2}\ln\!\left(\frac{1+\rho_0}{1-\rho_0}\right) \qquad (6.9) $$
The two statistics are then used to calculate a standard normal deviate
$$ Z = \frac{z_r - z_{\rho_0}}{\sigma_{z_r}} \qquad (6.10) $$
where the standard error of is
$$ \sigma_{z_r} = \frac{1}{\sqrt{n-3}} \qquad (6.11) $$
Worked problem. A research study measured log(fluorescent intensity) values of expression of Corin (protein) in (a) prostate tumors and (b) serum plasma of =8 prostate cancer patients. [Note: log(fluorescent intensity) implies that expression values can be negative, e.g., log(0.6)=-0.51)]. Determine the covariance and Pearson correlation between the two “tissue compartments”, and test whether correlation is significantly different from .
Table: Covariance and Pearson correlation between expression of Corin in prostate tumor tissue and serum plasma.
Plasma | Tumor | Cov. | Corr. | ||||
log(intensity) | log(intensity) | (A) | (B) | (A)(B) | (C) | (D) | (C)(D) |
-1.69 | -1.65 | -1.07 | -1.12 | 1.20 | -0.77 | -0.81 | 0.62 |
0.10 | 0.98 | 0.71 | 1.51 | 1.08 | 0.51 | 1.09 | 0.56 |
-0.04 | -0.31 | 0.57 | 0.23 | 0.13 | 0.41 | 0.17 | 0.07 |
-0.89 | -0.68 | -0.27 | -0.14 | 0.04 | -0.20 | -0.10 | 0.02 |
-2.25 | -2.22 | -1.64 | -1.68 | 2.76 | -1.17 | -1.21 | 1.42 |
1.94 | 1.29 | 2.55 | 1.82 | 4.65 | 1.82 | 1.32 | 2.39 |
0.00 | 0.47 | 0.61 | 1.01 | 0.62 | 0.44 | 0.73 | 0.32 |
-2.08 | -2.17 | -1.46 | -1.63 | 2.39 | -1.04 | -1.18 | 1.23 |
-0.61 | -0.54 | 12.87 | 6.62 | ||||
1.40 | 1.39 | 1.8384 | 0.9455 | ||||
The calculated value of the correlation coefficient is =0.9455. Let’s now test whether this value of correlation is significantly different from zero, using the null hypothesis 0.00. Substituting in the values, we have
| (6.12) |
For =0.00 we use
| (6.13) |
Substituting in the remaining items, we get
| (6.14) |
| (6.15) |
So we reject the null hypothesis of zero correlation, since .
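The Pearson correlation and the Fisher z-test of zero correlation can be checked with the Corin data from the table above; the z-test below follows the transformation worked through in Eqs. (6.8)-(6.11) and should closely reproduce the worked values.

```python
# Pearson correlation for the Corin plasma/tumor data, plus the Fisher z-test
# of H0: rho = 0.
import numpy as np
from scipy import stats

plasma = np.array([-1.69, 0.10, -0.04, -0.89, -2.25, 1.94, 0.00, -2.08])
tumor  = np.array([-1.65, 0.98, -0.31, -0.68, -2.22, 1.29, 0.47, -2.17])

r, p_r = stats.pearsonr(plasma, tumor)

n = len(plasma)
z_r = 0.5 * np.log((1 + r) / (1 - r))    # Fisher z-transform of the sample r
se = 1 / np.sqrt(n - 3)                  # standard error of z_r
z = z_r / se                             # test of rho = 0 (z_rho0 = 0)
print(r, z, 2 * stats.norm.sf(abs(z)))
```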
The Spearman rank correlation coefficient, , is an alternative method for determining the correlation between two features and . Spearman correlation is a non-parametric method that does not depend on the means and standard deviations of the pair of vectors and . Let be the ranks of and be the ranks of . Spearman correlation is determined as
$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \qquad (6.16) $$
where:
$$ d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i) \qquad (6.17) $$
Significance testing of is based on the standard normal distribution. Using the expected value of as
$$ E(r_s) = 0 \qquad (6.18) $$
and the variance of as
$$ \operatorname{Var}(r_s) = \frac{1}{n-1} \qquad (6.19) $$
the test statistic is
$$ Z = \frac{r_s - E(r_s)}{\sqrt{\operatorname{Var}(r_s)}} = r_s \sqrt{n-1} \qquad (6.20) $$
Worked problem. A research group studied log(expression) of two serum plasma proteins (A & B) in =8 laboratory mice after several weeks of feeding via a high-fat diet. [Note: log(expression) implies that expression values can be negative, e.g., log(0.6)=-0.51)]. Determine the Spearman rank correlation between the two proteins, and test whether correlation is significantly different from
Table: Spearman correlation between expression values of two serum plasma proteins.
Protein A | Protein B | ||||
log(expression) | log(expression) | (A) | (B) | (A)-(B) | ((A)-(B)) |
-0.18 | 0.26 | 5 | 7 | -2 | 4 |
-0.57 | -1.03 | 4 | 2 | 2 | 4 |
0.59 | 0.18 | 7 | 6 | 1 | 1 |
1.78 | 1.50 | 8 | 8 | 0 | 0 |
-1.97 | -1.46 | 1 | 1 | 0 | 0 |
0.53 | -0.26 | 6 | 5 | 1 | 1 |
-0.83 | -0.75 | 2 | 4 | -2 | 4 |
-0.69 | -0.85 | 3 | 3 | 0 | 0 |
14 | |||||
Substituting in the results, we get:
| (6.21) |
| (6.22) |
Significance testing of is based on the standard normal distribution. Using the expected value of as
| (6.23) |
and the variance of as
| (6.24) |
and upon substitution, we have
| (6.25) |
So we reject the null hypothesis, since .
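The Spearman rank correlation for this worked problem can be verified with SciPy; note that SciPy's p-value uses a t approximation, so it may differ somewhat from the normal (Z) approximation used above.

```python
# Spearman rank correlation for the protein A / protein B data above.
from scipy import stats

protein_a = [-0.18, -0.57, 0.59, 1.78, -1.97, 0.53, -0.83, -0.69]
protein_b = [ 0.26, -1.03, 0.18, 1.50, -1.46, -0.26, -0.75, -0.85]

rho, p = stats.spearmanr(protein_a, protein_b)
print(rho, p)   # rho is approximately 0.83 for these ranks
```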
The advantages of Spearman correlation are as follows:
The disadvantages are:
NXG Explorer also provides force plots derived from the minimum spanning trees of the correlation matrices. First, the distance between features (objects) and is determined as , where is the Pearson or Spearman rank correlation. Prim’s algorithm [1] is then applied to the distance matrix to construct the minimum spanning tree.
Example 1 - Pearson and Spearman rank correlation. This run will determine the Pearson and Spearman correlation between all pairs of continuous features in the Retinol data. To begin, first open the Retinol dataset. Next, select Analysis→Association→Correlation, Covariance, or Euclidean distance from the Analysis pull-down menu.
Next, select all the continuous features in the Retinol dataset, then use the default settings in the popup window shown below:
The resulting correlation matrix will be shown after clicking on the treenode icon for “Pearson correlation (html)”:
Using html format, the Pearson and Spearman correlation coefficients are provided along with color coding to distinguish significant negative and positive correlation coefficients. In other words, a test of significance is automatically performed at run-time, and testing results are revealed via color coding:
An additional plot is made showing the p-value results based on the same color coding used for the html output:
Click on the Force plot(Pearson) and Force plot(Spearman) icons in the treenodes, and the following images will appear:
The minimum spanning trees shown above depict the various relationships between the features employed, and differences between force plots constructed from Pearson correlation and Spearman correlation matrices are due to the range and scale of the features used. In cases where outliers are present in the data, Spearman rank-based force plots will give a better picture of the true minimum spanning tree.
Example 2 - Object-by-Object correlation. Correlation is commonly applied to features; however, it can also be applied to objects to identify objects with similar correlation across their feature values. This example will use a dataset based on expression of 23 genes in 4 classes of small round blue cell tumors (SRBCT). First, open the SRBCT_wo_classes.xlsx dataset. Transpose the entire dataset by selecting the following commands:
Then select all the features in the dataset, without selecting the Text-based features.
Then, select all of the objects for correlation analysis. The results of the significance plot for Pearson correlation are shown below:
The 23 genes in the SRBCT dataset were identified as the best class predictors for the four tumor types, and the high co-regulation (co-expression) among genes within the four classes (EWS-Ewing’s sarcoma, BL-Burkitt’s lymphoma, NB-neuroblastoma, and RMS-rhabdomyosarcoma) can be seen in the correlation significance plot, where large red areas indicate statistically significant positive correlation. The significant positive correlation (red) is apparent only within each class, not between classes. This object-by-object approach to correlation can greatly strengthen exploratory analysis by confirming impressions about correlation between objects.
Let be the set of all data features over the field of real numbers, and the set of symmetric -square matrices over the field of real numbers. Define the data matrix with rows and columns as a function on the pairs of integers , with values in Y, in which designates the value of Y at the pair , shown as
| (6.26) |
The sample mean for the th column of Y is
| (6.27) |
so that the sample mean vector is given by
| (6.28) |
The standardized data matrix of Y is
| (6.29) |
with elements
| (6.30) |
The sample covariance matrix of Y is
| (6.31) |
with diagonal elements determined with the form
| (6.32) |
and off-diagonal elements as
| (6.33) |
Below is a plot of the covariance matrix for all features in the Retinol dataset:
The symmetric sample correlation matrix of Y is
| (6.34) |
with elements
| (6.35) |
The proximity matrix D contains the Euclidean distances between each pair of features and appears in the form
| (6.36) |
with elements
| (6.37) |
The Euclidean distance matrix for all Retinol dataset features is shown below:
Columns of Y are usually measured on different scales (units), and it is therefore possible that one or more features will strongly influence the elements in D simply because of their large scale differences. To remove scale differences, the correlation matrix R is often used in place of the covariance matrix because its elements are scale-independent and have potentially much less variation across the features. We advocate use of the correlation matrix R generated from the standardized data matrix Z rather than from Y in order to minimize additive and relative changes across features while removing possible scale effects.
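A short numpy sketch of these matrix quantities is shown below; the small data matrix Y is hypothetical, and the final step illustrates the recommendation above by computing distances on the standardized matrix Z.

```python
# Sketch: covariance, correlation, and feature-by-feature Euclidean distance
# matrices for a data matrix Y (rows = objects, columns = features).
import numpy as np

Y = np.array([[1.2, 3.4, 0.5],
              [2.1, 2.9, 0.9],
              [0.7, 4.1, 0.2],
              [1.9, 3.0, 1.1],
              [1.4, 3.6, 0.4]])   # hypothetical data

C = np.cov(Y, rowvar=False)        # sample covariance matrix (p x p)
R = np.corrcoef(Y, rowvar=False)   # sample correlation matrix (p x p)

# Euclidean distances between pairs of feature columns of Y
cols = Y.T
D = np.sqrt(((cols[:, None, :] - cols[None, :, :]) ** 2).sum(axis=-1))

# Standardized data matrix Z; distances computed on Z are free of scale effects
Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
DZ = np.sqrt(((Z.T[:, None, :] - Z.T[None, :, :]) ** 2).sum(axis=-1))

print(C, R, D, DZ, sep="\n\n")
```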
The definitions of bivariate and multivariate covariance and correlation described in the preceding sections can be further expounded on from the perspective of matrix algebra. The use of covariance and correlation matrices allows us to consider additional concepts involving their determinants, eigenanalysis, and orthogonality.
Design (Data) Matrix Consider the following (data) design matrix:
| (6.38) |
By definition, the population covariance matrix is
| (6.39) |
where is a -tuple for the th object and is a -tuple with means for the features. Because it is impossible to measure , we are limited to working with the sample covariance matrix, given as
| (6.40) |
Let , the sample correlation matrix is
| (6.41) |
During computation, elements of can be determined individually using the relationship
| (6.42) |
If the means of columns of are known, covariance elements of can be directly calculated using the functional form
| (6.43) |
whereas, if the standard deviations and correlations are already known, then the covariance matrix elements can be calculated using
| (6.44) |
Properties of Covariance Matrix.
Properties of Correlation Matrix.
Independence and Multicollinearity. Under independence of features (no covariance), the properties of C and R are:
For multicollinearity:
| (6.45) |
[1] R.C. Prim. Shortest connection networks and some generalizations. Bell System Technical Journal. 36(6): 1389–1401, 1957.
Linear regression focuses on the dependency between a dependent variable (feature) and an independent variable (feature) [1, 2]. Define the following estimators:
| (7.1) |
Next, let the objective function be dependent on the parameters and . The objective function is minimized using the method of least squares, for which the sum of the squared residuals, , is minimized. Recall that the residuals are .
| (7.2) |
The above section introduced linear regression based on a single predictor; however, it is more common to solve regression problems with multiple predictors in order to adjust for the effects of confounders or nuisance factors. Therefore, we focus more on multiple linear regression, introduced below.
Multiple linear regression employs more than one predictor feature, and still only uses one outcome -feature. The calculations used for determining regression coefficients are based on matrix algebra, which is described in the section below.
Matrix Form of Multiple Linear Regression. A standard multiple linear regression model is
| (7.3) |
In matrix form, the model equation is expressed as
| (7.4) |
where is the data matrix with dimensions , is the number of rows in , is the number of columns (# features) in , is the th row , and is the th feature (column) .
In theory, since we cannot access an infinitely large universal population to make inferences, we have to generalize results using a sample. When doing so, we replace the terms with the maximum likelihood estimates . Without loss of generality, we nevertheless use interchangeably with .
By use of calculus (involving derivatives of several matrices) it can be shown that the normal equations can be rearranged to solve for the regression coefficients in the form
$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y} \qquad (7.5) $$
The necessary matrix operations for solving for the regression coefficients are
| (7.6) |
Matrix notation when no constant term is used. When no constant term is used, the matrix notation is:
| (7.7) |
Matrix notation when constant term is used. Using expanded form, when a constant term is introduced into the regression model, a one is “padded” into the left side of the matrix in column one:
| (7.8) |
Once regression coefficients are fitted, we can cross-multiply them against their relevant x-values and add the products to obtain the predicted y-value, also known as “y-hat,” given as
| (7.9) |
where is an estimate of the intercept, and is an estimate of the th regression coefficient for feature . Using the y-hats, the residual for each observation can then be determined as
$$ e_i = y_i - \hat{y}_i \qquad (7.10) $$
Variance Partitioning – Sums of Squares. Multiple linear regression uses matrix algebra to determine the partitioned sum-of-squares. The total sum of squares for multiple regression is
| (7.11) |
where
| (7.12) |
and
| (7.13) |
The sum of squares error is functionally composed as
| (7.14) |
while the sum of squares regression is
| (7.15) |
The sums of squares help us partition the variance attributable to the regression and the resulting error after the regression coefficients are fit. If SSE=0, then the model fits the data perfectly. As SSE increases, the original -values become further from their predicted values, . If there is too much variation in the residuals, i.e., , the SSE will be large; in addition, the straight-line model assumption may be inappropriate. If we assume that the linear straight-line assumption holds, then an estimate of can be made from SSE. We call this estimate the mean square error, or MSE. The square root of the MSE, that is, , is called the root mean square error.
| (7.16) |
Finally, we calculate the RMSE using the notation
| (7.17) |
The mean square due to regression, MSR, is also calculated, but it is only used in the ANOVA table for the global F-test, whereas the MSE is used in multiple later calculations.
After partitioning the variance due to our regression model, we must next construct an ANOVA table, which will allow us to set up the additional calculation for the F-ratio test of the hypothesis =0. The alternative hypothesis (your belief) is that at least one of the coefficients is not equal to zero: . The test statistic for the F-ratio test is equal to . The calculated F statistic approximates a variate from the F-ratio probability distribution, which has two parameters: numerator degrees of freedom, , and denominator degrees of freedom, . In a linear regression problem, the F-statistic’s numerator degrees of freedom, , is always equal to , while the denominator degrees of freedom, , is always equal to . The operation implies subtracting one from the number of regression coefficients, and in our single-predictor model we know that : due to and . The denominator degrees of freedom is simply the sample size minus the number of regression coefficients, which, again, is for a single-predictor regression model. The degrees of freedom are required because SSE and SSR are estimators, which are penalized by a certain number of degrees of freedom during their formation. As the F-statistic increases, it suggests that the regression model explains more of the variance of than the SSE does. The ANOVA table is provided below.
Source of variation | Computational formula | df | MS | ||
Due to regression | |||||
Due to error | |||||
Total | |||||
Example. A researcher was interested in studying the dependency of a dependent feature, , on three independent features, , , and . Determine the percentage of variance of explained by the -predictors. Assume that linear regression is appropriate for determining the relationship between and the independent features. The table below shows the assumed values of the features.
ID | Y | X1 | X2 | X3 |
1 | 1.2 | -1.0 | -0.7 | 0.9 |
2 | 2.0 | -0.9 | 1.5 | 0.4 |
3 | 1.9 | 0.9 | -0.2 | 1.3 |
4 | 1.7 | -1.5 | 0.1 | -0.1 |
5 | 1.6 | 1.4 | 0.9 | -0.9 |
6 | 1.8 | -0.9 | 1.5 | 0.7 |
7 | 0.0 | -1.1 | -3.0 | 0.6 |
8 | 0.6 | -0.3 | 1.0 | -1.5 |
9 | 1.9 | 0.6 | -1.5 | 0.7 |
10 | 1.5 | -1.4 | 1.5 | -1.4 |
11 | 0.9 | 0.1 | -1.4 | -1.6 |
12 | 1.0 | 0.3 | -0.2 | -0.1 |
13 | 1.5 | -0.8 | 0.1 | 1.5 |
14 | 1.5 | -0.3 | 0.3 | -0.3 |
15 | 0.6 | -0.6 | -0.6 | 0.3 |
16 | 1.1 | 1.8 | -1.1 | -1.1 |
17 | 2.2 | 0.9 | -0.9 | 0.2 |
18 | 0.9 | -0.2 | -0.4 | 1.1 |
19 | 1.5 | 0.4 | -0.1 | -0.1 |
20 | 1.0 | -1.1 | 1.5 | 0.2 |
ANOVA Table Results.
Below is the resulting ANOVA table:
Source of variation | SS | df | MS | F |
Regression | 2.35 | 3 | 0.78 | 3.35 |
Error | 3.74 | 16 | 0.23 | |
Total | 6.09 | 19 | ||
In the ANOVA table, the formal statement of the null and alternative hypotheses is
| (7.18) |
If holds, . Large values of lead to conclusion . For this example, since the calculated test statistic F=3.35 exceeded the lookup critical value of F based on , =3, =16, which is 3.24, we conclude that is true, suggesting that at least one of the coefficients is not equal to zero.
Notation for the variance-covariance matrix of coefficients is:
| (7.19) |
Results for the variance-covariance matrix of coefficients:
| (7.20) |
The source of s.e. for the denominator of the t-test for each is:
| (7.21) |
After taking the square root of each diagonal element, we get:
| (7.22) |
Let us now determine if the regression coefficient is statistically significant. If the regression coefficient is a reliable estimate of the change in as a function of a 1-unit change in feature , then its value will be large, either positive or negative. A positive slope suggests that increases with increasing , whereas a negative slope indicates that decreases with increasing . Using the standard error of each regression coefficient, i.e., , we now calculate the t-test for each coefficient, , using the relationship
| (7.23) |
The hypothesis for any regression coefficient is: The formal statement of the null and alternative hypotheses is:
| (7.24) |
If holds, . Large values of the calculated test statistic lead to concluding is true.
Regression coefficient results. Below are the regression coefficients, and t-test results:
Variable | Coefficient | s.e. | t | Prob |
Const | 1.39434 | 0.11144 | 12.5119 | 0.0000 |
X1 | 0.26015 | 0.12386 | 2.1004 | 0.0519 |
X2 | 0.26492 | 0.09878 | 2.6821 | 0.0164 |
X3 | 0.22297 | 0.12244 | 1.8210 | 0.0874 |
We notice that feature is a significant predictor of , since , but that and are not.
Multiple Coefficient of Determination, . One of the overall goals of regression is to estimate the degree of variance explanation of Y by introduction of regression coefficients. This is determined using the coefficient of determination, , and the multiple r-squared, . Notice that reveals the “percentage of variance explanation” of the dependent feature, , by the independent -predictors in the model.
| (7.25) |
| (7.26) |
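For readers who wish to reproduce the worked example outside NXG Explorer, the numpy sketch below follows the matrix formulas in this chapter (normal equations, variance partition, coefficient standard errors and t-tests, the global F-test, and R-squared), using the 20 observations from the data table; small differences from the tabled output may arise from rounding.

```python
# Sketch: multiple linear regression for the worked example, via the normal
# equations and the ANOVA partition described in this chapter.
import numpy as np
from scipy import stats

y = np.array([1.2, 2.0, 1.9, 1.7, 1.6, 1.8, 0.0, 0.6, 1.9, 1.5,
              0.9, 1.0, 1.5, 1.5, 0.6, 1.1, 2.2, 0.9, 1.5, 1.0])
X = np.array([[-1.0, -0.7,  0.9], [-0.9,  1.5,  0.4], [ 0.9, -0.2,  1.3],
              [-1.5,  0.1, -0.1], [ 1.4,  0.9, -0.9], [-0.9,  1.5,  0.7],
              [-1.1, -3.0,  0.6], [-0.3,  1.0, -1.5], [ 0.6, -1.5,  0.7],
              [-1.4,  1.5, -1.4], [ 0.1, -1.4, -1.6], [ 0.3, -0.2, -0.1],
              [-0.8,  0.1,  1.5], [-0.3,  0.3, -0.3], [-0.6, -0.6,  0.3],
              [ 1.8, -1.1, -1.1], [ 0.9, -0.9,  0.2], [-0.2, -0.4,  1.1],
              [ 0.4, -0.1, -0.1], [-1.1,  1.5,  0.2]])

n = len(y)
X1 = np.column_stack([np.ones(n), X])        # pad constant term
k = X1.shape[1]                              # number of coefficients

beta = np.linalg.solve(X1.T @ X1, X1.T @ y)  # Eq. (7.5)
yhat = X1 @ beta
e = y - yhat                                 # residuals

sse = e @ e
sst = ((y - y.mean()) ** 2).sum()
ssr = sst - sse
mse = sse / (n - k)
msr = ssr / (k - 1)

F = msr / mse                                # global F-test
r2 = ssr / sst                               # coefficient of determination

cov_beta = mse * np.linalg.inv(X1.T @ X1)    # variance-covariance matrix of coefficients
se = np.sqrt(np.diag(cov_beta))
t = beta / se
p = 2 * stats.t.sf(np.abs(t), df=n - k)

print("F =", round(F, 2), " R^2 =", round(r2, 3))
for name, b, s, ti, pi in zip(["Const", "X1", "X2", "X3"], beta, se, t, p):
    print(name, round(b, 5), round(s, 5), round(ti, 4), round(pi, 4))
```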
Similarity between regression and correlation. Although regression is similar to correlation, the difference is that correlation essentially measures the tightness of data points around a straight line irrespective of slope, whereas regression focuses on the dependency of the dependent -feature(s) on one or more predictor -features. The correlation coefficient is a standardized regression coefficient for the regression , since
| (7.27) |
The goal of regression is to determine the amount of variance explanation due to the independent features and the slope measured by regression coefficients due to each independent feature. The degree of slope reflects the linear relationship between the dependent feature(s) and independent features.
Centering and estimation of Intercept Term. Prior to each regression run, NXG Explorer centers each predictor feature by subtracting off the mean value from each observation:
| (7.28) |
where is the mean for feature . After a regression run, NXG Explorer determines the -intercept (constant term) as follows:
| (7.29) |
The standard error of the -intercept is determined using the relationship:
| (7.30) |
where is the number of objects, is the number of features in the model, is the variance of , and is element of the variance-covariance matrix of coefficients, .
Regression diagnostics are statistical techniques for detecting conditions which can lead to inaccurate or invalid regression results. Techniques include:
Model Interpretation: DACOD Acronym. We can use the DACOD acronym for assuring valid regression analysis and interpretation:
Most regression analyses include a constant term (y-intercept) in the regression model. The basic data setup for regression diagnostics starts with the residuals, which are based on subtracting the predicted y-value, or “y-hat,” from the observed y-value. Recall that the subscript i represents the ith object or subject (human, mouse, rabbit, etc.). Let’s look at the many ways the regression model can be shown:
Recall, for multiple linear regression the vector and matrix notation is
| (7.31) |
which is equal to
| (7.32) |
To calculate the predicted y-hat for the ith object, we simply cross-multiply the x-values for the ith object with the corresponding fitted coefficients and add the constant term, given as
| (7.33) |
where the weighted sum of the x-values and fitted coefficients is called the linear predictor.
Residuals, . Residuals are defined as
| (7.34) |
and reflect the disagreement between the predicted and observed -values. Residuals are distributed with mean zero and common variance, .
A data setup including the y-hats and residuals, would be listed as
y | x1 | x2 | xp | yhat | resid | |
Once the residuals are determined, the error sum of squares can also be estimated as follows:
| (7.35) |
| (7.36) |
| (7.37) |
The initial focus in regression diagnostics involves determination of residuals for identifying outliers. An outlier is an observation (record) that is overly influential on the regression coefficients, leading to biased coefficients. In the sections below, we define the most commonly used residuals for multiple linear regression.
Standardized residuals, . Standardized residuals are defined as
| (7.38) |
where s is the square root of the residual mean square. Standardized residuals are distributed N(0,1). Obviously, if a standardized residual for a single observation is below -2 or above 2 (very close to 1.96), we know that the residual is more than two standard deviations from the mean. Any observation whose standardized residual is more than two standard deviations from the mean of all residuals is viewed as an outlier.
Studentized Residuals, . Studentized residuals are defined as
| (7.39) |
where s is the square root of the residual mean square and h is the leverage of the ith observation (leverages are discussed below). Studentized residuals are t-distributed with the appropriate degrees of freedom.
Deletion Variance, . Another measure of the residual is the change in variance resulting from deletion of the th observation from a regression analysis. Deletion variance is functionally defined as
| (7.40) |
where h is the leverage of the ith observation.
Jackknife Residuals, . Jackknife (“deletion”) residuals are defined as
| (7.41) |
Jackknife residuals are t-distributed with the appropriate degrees of freedom. Thus, if the absolute value of a jackknife residual exceeds the 95th percentile of the relevant t-distribution, then it is reasonable to say that the ith observation is an outlier.
Cook’s Distance, . Cook’s distance reflects the change in regression coefficients when an observation is deleted from a regression model.
| (7.42) |
Observations for which Cook’s distance exceeds 1 should be scrutinized.
Leverages. The leverage, , is a measure of extremeness of an observation and is derived from the diagonal elements of the Hat matrix, given as
| (7.43) |
Another interpretation is that leverage is equal to the distance between the ith observation and the mean of all observations. The average leverage value is
| (7.44) |
Hoaglin and Welsch [3] recommended scrutinizing observations for which the leverage exceeds 2p/n. Some useful properties of the leverage are that each leverage lies between 0 and 1 and that the leverages sum to the number of fitted parameters.
DFFITS. DFFITS is a scaled measure of the change in the predicted value for the ith observation, and is calculated by deleting the ith observation. DFFITS are in the form
| (7.45) |
Large values of DFFITS indicate influential observations. A general cutoff to consider is 2; a size-adjusted cutoff recommended by Belsley, Kuh, and Welsch [4] is 2√(p/n).
DFBETAS. DFBETAS are a measure of the change in the jth regression coefficient as a result of removing the ith observation from a regression run.
| (7.46) |
where is an element of the matrix
| (7.47) |
Large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch [4] recommend 2 as a general cutoff value to indicate influential observations and 2/√n as a size-adjusted cutoff.
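The residual diagnostics defined above (leverage, standardized, studentized, jackknife, Cook’s distance, DFFITS, DFBETAS) can be sketched with numpy using their standard textbook formulas; the design matrix below is hypothetical rather than the worked-example data, and p counts the constant term:

import numpy as np

# Hypothetical design matrix Xd (constant column included) and response y
rng = np.random.default_rng(1)
n, p = 20, 4
Xd = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = Xd @ np.array([1.4, 0.26, 0.26, 0.22]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(Xd.T @ Xd)
H = Xd @ XtX_inv @ Xd.T                          # Hat matrix
h = np.diag(H)                                   # leverages
b = XtX_inv @ Xd.T @ y                           # fitted coefficients
e = y - Xd @ b                                   # raw residuals
s2 = e @ e / (n - p)                             # residual mean square
std_resid = e / np.sqrt(s2)                      # standardized residuals
stud_resid = e / np.sqrt(s2 * (1 - h))           # studentized residuals
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # deletion variance
jack = e / np.sqrt(s2_del * (1 - h))             # jackknife (deletion) residuals
cooks = (stud_resid**2 / p) * h / (1 - h)        # Cook's distance
dffits = jack * np.sqrt(h / (1 - h))             # DFFITS
c = XtX_inv @ Xd.T                               # p x n matrix used for DFBETAS
dfbetas = (c * e / (1 - h)).T / np.sqrt(s2_del[:, None] * np.diag(XtX_inv))
print(np.column_stack([h, stud_resid, jack, cooks, dffits]))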
Multicollinearity is a problem caused by correlation between predictor features. Multicollinearity is not really associated with the dependency between the dependent feature and the independent features, but rather with the correlation among the independent features. Multicollinearity mostly biases the coefficients and their standard errors, causing the regression model to become too sensitive to the data. Using our input data matrix X with n rows and p columns, i.e., p features, we first determine the correlation coefficient for all possible pairs of predictor features. We then arrange these correlation coefficients in a correlation matrix denoted by R. We also know that R is a square symmetric matrix, which is given in the form
| (7.48) |
where the diagonal consists of ones, and the off-diagonals are comprised of the between-feature correlation coefficients. In the next section, we introduce variance inflation factors (VIFs), which are used for detecting multicollinearity. Rather than exploiting the correlation matrix R, the VIFs are based on regressing each predictor feature on the remaining predictor features.
Variance Inflation Factors (VIFs). Collinearity involves association or correlation between predictor features. Variance inflation factors [5] represent the degree of collinearity between an independent feature of interest and the other independent features.
| (7.49) |
where R² is the coefficient of determination obtained when the feature is regressed on the remaining independent features. Independent features whose VIF is greater than 10 represent a severe collinearity problem.
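A minimal sketch of the VIF calculation, regressing each predictor on the remaining predictors exactly as described above; the vifs helper and the example data are hypothetical, not an NXG Explorer function:

import numpy as np

def vifs(X):
    """Variance inflation factors: regress each column on the remaining columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        b = np.linalg.lstsq(Z, y, rcond=None)[0]
        resid = y - Z @ b
        r2 = 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Example: three moderately correlated hypothetical predictors
rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + rng.normal(scale=2, size=50), rng.normal(size=50)])
print(vifs(X))   # values above 10 would flag severe collinearity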
Residual Criteria Based on n and p.
The table below lists the residual screening criteria based on the values of n and p for this worked example with n=20 objects (records) and p=3 predictor features:
Residual | Outlier criteria | Calculated results |
StdResid | 2 | 2 |
StudResid | 1.753 | |
DelResid | 1.761 | |
CooksD | 1 | 1 |
Leverage | 0.400 | |
DFFITS | 0.894 | |
DFBETAS | 0.447 | |
VIFs | 10 | 10 |
Residual Analysis Results.
The table below lists the calculated residuals, with outliers highlighted in red, for the example 3-predictor model:
Object | Resid | StdResid | StudResid | DelResid | CooksD | Leverage | DFFITS |
1 | 0.06714 | 0.13884 | 0.15009 | 0.14543 | 0.00095 | 0.14431 | 0.05972 |
2 | 0.35000 | 0.72372 | 0.79182 | 0.78215 | 0.03089 | 0.16461 | 0.34719 |
3 | 0.01340 | 0.02772 | 0.03244 | 0.03141 | 0.00010 | 0.26989 | 0.01910 |
4 | 0.70871 | 1.46544 | 1.60404 | 1.69539 | 0.12742 | 0.16534 | 0.75459 |
5 | -0.17228 | -0.35624 | -0.42753 | -0.41634 | 0.02012 | 0.30569 | -0.27626 |
6 | 0.15012 | 0.31042 | 0.34535 | 0.33564 | 0.00709 | 0.19204 | 0.16363 |
7 | -0.47882 | -0.99009 | -1.41193 | -1.46115 | 0.51516 | 0.50827 | -1.48553 |
8 | -0.65117 | -1.34646 | -1.52667 | -1.59925 | 0.16640 | 0.22214 | -0.85464 |
9 | 0.61214 | 1.26576 | 1.38208 | 1.42602 | 0.09180 | 0.16124 | 0.62525 |
10 | 0.40670 | 0.84095 | 1.02515 | 1.02689 | 0.12770 | 0.32707 | 0.71591 |
11 | 0.25318 | 0.52351 | 0.64305 | 0.63084 | 0.05260 | 0.33723 | 0.44999 |
12 | -0.44711 | -0.92453 | -0.95618 | -0.95346 | 0.01592 | 0.06511 | -0.25161 |
13 | -0.03844 | -0.07949 | -0.08852 | -0.08573 | 0.00047 | 0.19366 | -0.04201 |
14 | 0.19589 | 0.40506 | 0.41841 | 0.40735 | 0.00293 | 0.06275 | 0.10541 |
15 | -0.53488 | -1.10602 | -1.14977 | -1.16231 | 0.02667 | 0.07467 | -0.33017 |
16 | -0.17943 | -0.37103 | -0.45101 | -0.43949 | 0.02429 | 0.32324 | -0.30373 |
17 | 0.77230 | 1.59694 | 1.71366 | 1.83629 | 0.11123 | 0.13158 | 0.71477 |
18 | -0.55056 | -1.13844 | -1.21376 | -1.23337 | 0.05035 | 0.12026 | -0.45601 |
19 | 0.10185 | 0.21059 | 0.21832 | 0.21170 | 0.00089 | 0.06953 | 0.05787 |
20 | -0.57873 | -1.19668 | -1.30675 | -1.33871 | 0.08215 | 0.16138 | -0.58726 |
DFBETAS Results.
The table below lists the resulting DFBETAS for the example 3-predictor model:
Obs | X1 | X2 | X3 |
1 | -0.03176 | -0.02308 | 0.02208 |
2 | -0.04873 | 0.24620 | 0.11188 |
3 | 0.01269 | 0.00493 | 0.01434 |
4 | -0.62583 | -0.13159 | -0.18269 |
5 | -0.20924 | -0.13755 | 0.04991 |
6 | -0.02122 | 0.11440 | 0.07088 |
7 | 0.79397 | 1.31937 | 0.11380 |
8 | 0.11546 | -0.23569 | 0.63815 |
9 | 0.25813 | -0.29100 | 0.21920 |
10 | -0.38005 | 0.18566 | -0.45673 |
11 | -0.09139 | -0.27079 | -0.36516 |
12 | -0.11312 | -0.01085 | 0.01138 |
13 | 0.00446 | -0.00729 | -0.03367 |
14 | -0.01739 | 0.01565 | -0.03729 |
15 | 0.13180 | 0.14106 | -0.03280 |
16 | -0.20027 | 0.06009 | 0.11454 |
17 | 0.46860 | -0.14184 | 0.15433 |
18 | -0.06106 | 0.00665 | -0.33761 |
19 | 0.02980 | 0.00643 | 0.00023 |
20 | 0.17956 | -0.37680 | -0.06967 |
Below is the table of VIFs:
Feature | VIF |
X1 | 1.138 |
X2 | 1.133 |
X3 | 1.093 |
Because none of the VIFs exceeded a value of 10, there does not appear to be a multicollinearity problem.
This section introduces computational methods used for determining partial F-tests for the predictor features X1, X2, and X3.
Extra Sum-of-Squares for a Variable. When regressing Y separately on the predictor sets (X1), (X1, X2), and (X1, X2, X3), we can determine the extra sum-of-squares, defined as follows:
The results of regressing Y separately on the predictor sets (X1), (X1, X2), and (X1, X2, X3) are listed below:
Regression of Y=X1
Source of variation | SS | df | MS | F |
Regression | 0.31 | 1 | 0.31 | 0.96 |
Error | 5.78 | 18 | 0.32 | |
Total | 6.09 | 19 | ||
Variable | Coef. | s.e. | t | Prob |
Const | 1.35401 | 0.12936 | 10.4669 | 0.0000 |
X1 | 0.13323 | 0.13607 | 0.9791 | 0.3405 |
Regression of Y=X1,X2
Source of variation | SS | df | MS | F |
Regression | 1.57 | 2 | 0.79 | 2.96 |
Error | 4.52 | 17 | 0.27 | |
Total | 6.09 | 19 | ||
Variable | Coef. | s.e. | t | Prob |
Const | 1.38915 | 0.11875 | 11.6979 | 0.0000 |
X1 | 0.20609 | 0.12818 | 1.6078 | 0.1263 |
X2 | 0.22356 | 0.10247 | 2.1818 | 0.0435 |
Regression of Y=X1,X2,X3
Source of variation | SS | df | MS | F |
Regression | 2.35 | 3 | 0.78 | 3.35 |
Error | 3.74 | 16 | 0.23 | |
Total | 6.09 | 19 | ||
Variable | Coef. | s.e. | t | Prob |
Const | 1.39434 | 0.11144 | 12.5119 | 0.0000 |
X1 | 0.26015 | 0.12386 | 2.1004 | 0.0519 |
X2 | 0.26492 | 0.09878 | 2.6821 | 0.0164 |
X3 | 0.22297 | 0.12244 | 1.8210 | 0.0874 |
The extra sum of squares for the first regression run is simply the SSR for that model, i.e., SS(X1) = SSR(X1). However, for the second and third regression runs, we need to use the following formulation to determine the extra sum of squares:
| (7.50) |
where the extra sum of squares for a given feature is the difference between the SSR from the full model containing that feature and the SSR from the reduced model without it. Using the SSR from the reduced model for the regression of Y on X1, and the SSR from the full model for the regression of Y on X1 and X2, we calculate the extra sum-of-squares for X2 using the following relationship:
| (7.51) |
Analogously, the extra sum of squares for X3 is determined as follows:
| (7.52) |
Partial F-test for a Variable. Once the extra sum of squares is determined for all the predictor x-features, the partial F-test for each feature is determined using the relationship
| (7.53) |
As an example, the partial F-test for feature X3 is determined as
| (7.54) |
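The extra sums of squares and the partial F-test for X3 can be checked directly from the SSR and MSE values reported in the tables above; a small Python sketch (small differences from the tabled results are due to rounding of the printed SSR values):

from scipy import stats

# SSR values from the three nested regression runs above
ssr_x1 = 0.31          # regression of Y on X1
ssr_x1x2 = 1.57        # regression of Y on X1, X2
ssr_x1x2x3 = 2.35      # regression of Y on X1, X2, X3

ss_x2_given_x1 = ssr_x1x2 - ssr_x1            # extra SS for X2 given X1
ss_x3_given_x1x2 = ssr_x1x2x3 - ssr_x1x2      # extra SS for X3 given X1, X2

# Partial F for X3: extra SS (1 d.f.) divided by the full-model MSE
mse_full, df_err = 0.23, 16
F_x3 = ss_x3_given_x1x2 / mse_full
print(F_x3, stats.f.sf(F_x3, 1, df_err))      # ~3.4 with p ~ 0.08 (tabled: 3.32, 0.0874)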
Partial F-test Results.
The table below lists the partial F-test results:
Regression | Extra SS | d.f. | MSE | Partial F | F(0.95) | P-value |
Y=X1 | 0.3080 | 18 | 0.32 | 0.96 | 4.41 | 0.3405 | |
Y=X1,X2 | 1.2650 | 17 | 0.27 | 4.76 | 4.45 | 0.0435 | |
Y=X1,X2,X3 | 0.7756 | 16 | 0.23 | 3.32 | 4.49 | 0.0874 | |
Example 1 - Multiple linear regression of Retinol data. This example will use NXG Explorer to perform multiple linear regression of the serum plasma retinol concentration (retplasma) on several predictor features to identify significant predictors. Retinol is one of the A vitamins, and is important for night vision, tooth and bone formation, and skin health. Therefore, it would be desirable clinically to know which features may be predictive of serum retinol in order to make public health recommendations.
Before we begin this example run, we make sure the appropriate criteria for model building strategies are set in Preferences → General default settings,
which specifies that only predictors whose univariate regression p-value is less than 0.05 are carried into the full multiple linear regression model. The second entry in the default settings specifies that we want all of the coefficients shown in the multivariable model (which implies all features are retained).
To begin, import the Retinol dataset.
Next, select the command Analysis → Regression → Multiple linear from the Analysis pull-down menu, shown as:
Then specify the dependent feature, categorical features, and continuously-scaled features, as follows:
Then accept the default values for the options popup window:
The univariate regression coefficient results, based on each predictor regressed separately, are listed first:
Next, the ANOVA table and multiple regression coefficients based on a full model with all predictors regressed simultaneously are presented:
Coefficients of partial determination are listed next:
For regression diagnostics, the residual criteria are presented first, followed by the residuals, with outliers (overly influential records) identified by red highlighting in the output:
DFBETAS are listed next:
Lastly, VIFs are listed:
We now introduce a linear multivariate regression model for which there are multiple dependent features and multiple independent features [6]. Let be a vector of values for features for the th object and be the vector of dependent features for the same object, where is the total number of -dependent features. The regression equation for each object and each dependent feature is
| (7.55) |
where is the intercept, or constant term, is the regression coefficient for feature , and is the random error term or residual. Regression coefficients are used for predicting the -values for each object, using the relationship
| (7.56) |
Coefficients are determined directly from the matrix operation
| (7.57) |
The matrix form of the dependent features is given as
| (7.58) |
The typical matrix of -values is
| (7.59) |
Regression coefficients are defined using the following matrix notation
| (7.60) |
and the matrix of random error terms is in the form
| (7.61) |
Rearranging the matrices, we can solve for the regression coefficients using the one-step operation
| (7.62) |
The inverse of the dispersion matrix can be obtained using the Jacobi method, singular value decomposition, or some other matrix inversion technique. Once the coefficient matrix is obtained, we obtain the predicted -values by applying to the data matrix using the matrix operation .
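A minimal numpy sketch of this one-step solution, using hypothetical X and Y matrices; the point is only to show that the same (X'X)^-1 X'Y operation handles a whole matrix of dependent features at once:

import numpy as np

# Hypothetical data: n objects, p predictors, m dependent features
rng = np.random.default_rng(3)
n, p, m = 30, 4, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # constant + predictors
B_true = rng.normal(size=(p + 1, m))
Y = X @ B_true + rng.normal(scale=0.3, size=(n, m))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # one-step solution (X'X)^-1 X'Y
Y_hat = X @ B_hat                              # predicted Y-values
E = Y - Y_hat                                  # matrix of residuals
print(B_hat.shape)                             # (p+1) x m coefficient matrix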
The variance-covariance matrix for the coefficients is equal to the direct matrix product of the residual sums of square and the inverse of the variance-covariance matrix of the coefficients, given as
| (7.63) |
The null hypothesis which applies to all of the regression coefficients is determined by comparing statistics derived from the eigenvalues of the non-symmetric matrix , where the hypothesis matrix is
| (7.64) |
and the sum-of-squares matrix is
| (7.65) |
The first test is based on Wilks’ using the eigenvalues of , given in the form
| (7.66) |
Let
| (7.67) |
| (7.68) |
| (7.69) |
| (7.70) |
and
| (7.71) |
Then
| (7.72) |
The Pillai trace statistic is also computed as
| (7.73) |
Let
| (7.74) |
| (7.75) |
and
| (7.76) |
Here, under
| (7.77) |
The Lawley-Hotelling statistic is
| (7.78) |
Then
| (7.79) |
Lastly, Roy’s test is based on the greatest root , and the test statistic is
| (7.80) |
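The four statistics above can be sketched from the eigenvalues of the matrix product described earlier; the helper below uses the standard textbook definitions of Wilks’ lambda, the Pillai trace, the Lawley-Hotelling trace, and Roy’s greatest root, and the H and E matrices are small hypothetical examples rather than NXG Explorer output:

import numpy as np

def manova_stats(H, E):
    """Standard multivariate test statistics from the eigenvalues of inv(E) @ H."""
    eig = np.linalg.eigvals(np.linalg.solve(E, H)).real
    eig = np.clip(eig, 0, None)                      # guard against small negative roundoff
    wilks = np.prod(1.0 / (1.0 + eig))               # Wilks' lambda
    pillai = np.sum(eig / (1.0 + eig))               # Pillai trace
    lawley = np.sum(eig)                             # Lawley-Hotelling trace
    roy = eig.max()                                  # Roy's greatest root
    return wilks, pillai, lawley, roy

# Tiny hypothetical hypothesis (H) and error (E) SSCP matrices
H = np.array([[4.0, 1.0], [1.0, 3.0]])
E = np.array([[10.0, 2.0], [2.0, 8.0]])
print(manova_stats(H, E))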
Example 2 - Multivariate regression of Retinol data. For this example, we will regress the four outcome measurements reflecting subject-specific values of beta-carotene and retinol from dietary assessment and serum plasma measurements on continuously-scaled features to identify significant predictors. First, open the Retinol dataset, and select Analysis → Regression → Multivariate from the Analysis pull-down menu, shown as
Next, specify the four continuous dependent (“target”) features betadiet, retdiet, betaplasma, and retplasma, and then select the continuous independent features age, quetelet, calories, fat, fiber, alcohol, and cholesterol, as shown below:
The results of the regression run will be listed in the treenodes on the left side of the output window:
Click on the Full model Tests icon, and the following table will appear:
Results of the full model tests indicate that all of the tests suggest that one or more of the independent features is a significant predictor of one or more of the dependent features. To find out which independent features are significant predictors of one or more of the dependent features, we need to look at the feature-specific test results, which are shown below. Click on the Feature Tests icon, and the following table will appear:
In the table above for feature-specific tests, we notice that age, quetelet, fiber, and cholesterol were significant. To see the specific contribution of each independent feature to each dependent feature, we look at the tables of regression coefficients by clicking on the Coefficients icon, which will display the following output:
In the table above, for which betadiet is the dependent feature, we notice that fiber was the only significant predictor. In addition, betadiet increased 138.3 units for each one-unit increase in fiber. For the dependent feature retdiet, we notice that cholesterol was the only significant predictor, with a slight increase in retdiet per unit increase in cholesterol. Coefficient tables for the dependent features betaplasma and retplasma are shown below. While both quetelet and fiber were significant predictors of betaplasma, only age was a significant predictor of retplasma.
Logistic regression is a type of categorical regression for which the outcome features (dependent features) are discrete, and input features (independent features) are discrete or continuous. There are several types of logistic regression, and these are listed in the table below:
Type of logistic regression | Dependent feature | Scale | Example |
Unconditional | Binary | Case: , Control: | |
Ordinal | Ordinal | : subject in 2nd group | |
Conditional | Matched-pairs | : record with 1 case and 3 controls | |
Polytomous | Class membership | Subject in class 4: | |
When there is a binary (yes/no) outcome for each object, the dependent (outcome) feature is coded as a binary feature, consisting of zeroes and ones, i.e., (0,1), to represent failure or success. In failure analysis, objects that failed are commonly assigned a -value of 1, while items that didn’t fail are assigned a -value of 0. In clinical medicine and epidemiology, this binary coding scheme is amenable for analysis of case-control studies, where cases (diseased) are assigned while controls (non-diseased) are assigned . Among the various types of logistic regression, NXG Explorer implements two variants: unconditional and polytomous. The theoretical developments for each of these types of logistic regression models are described in the following sections.
Let represent disease and non-disease for subject having covariate vector . The probability of disease for subject is
| (7.81) |
and the probability of not having disease is
| (7.82) |
where is the logit for response category . The general definition for the logit is given as
| (7.83) |
Using the general definition, the logit for non-disease is
| (7.84) |
and thus, . The logit for disease is
| (7.85) |
Through simple substitution of the logit, we get the conditional probabilities
| (7.86) |
and
| (7.87) |
Let x=1 represent exposure and x=0 represent non-exposure to a risk factor, and let y=1 and y=0 represent disease and no disease. The logit is then β0 + β1x. Hence, our model has one x-feature and two regression coefficients: β0 and β1.
Cross-tabulation of disease and exposure status, exposed (x=1) vs. non-exposed (x=0):
Disease () | No disease() | Total | |
Exposed () | 1.0 | ||
Non-exposed () | 1.0 | ||
The odds ratio can now be determined as
| (7.88) |
For the OR, the null and alternative hypotheses are:
Since the OR is equal to the exponentiated regression coefficient for exposure, we know that when the coefficient is zero the OR is 1. Therefore, in logistic regression we have
The significance of each regression coefficient is based on a standard normal variate equal to
| (7.89) |
and if the calculated test statistic is sufficiently large we reject the null hypothesis. The p-value is determined using the standard normal distribution. The CI for the OR is
| (7.90) |
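Given a fitted coefficient and its standard error, the Wald statistic, OR, and 95% CI follow directly from the relationships above; a minimal sketch with hypothetical beta and se values:

import numpy as np

# Hypothetical logistic coefficient and standard error for an exposure feature
beta, se = 1.0253, 0.3945

z = beta / se                                            # Wald statistic
odds_ratio = np.exp(beta)                                # OR = exp(beta)
ci_low, ci_high = np.exp(beta - 1.96 * se), np.exp(beta + 1.96 * se)
print(z, odds_ratio, (ci_low, ci_high))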
We now discuss how the coefficients are determined via the iterative maximum likelihood method. The logistic likelihood function is defined as
| (7.91) |
Taking the natural logarithm gives the log-likelihood in the form
| (7.92) |
Once the log-likelihood function is formed, we can use the Newton-Raphson method to perform a gradient ascent approach to find maximum likelihood values of the solution parameters . This entails calculation of the scores or first derivatives of w.r.t parameters and the information (Hessian) matrix of second partial derivatives. The score is
| (7.93) |
and element of the information matrix
| (7.94) |
Similarly, maximum likelihood estimates of each can be found by determining the vector iteratively with the matrix manipulation
| (7.95) |
until convergence is reached when . Values of are typically in the range .
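A minimal Python sketch of this Newton-Raphson iteration for binary logistic regression, using the score and information matrix defined above; the data are simulated and the convergence tolerance is an illustrative choice, not the NXG Explorer default:

import numpy as np

def logistic_newton_raphson(X, y, tol=1e-8, max_iter=50):
    """Fit binary logistic regression by Newton-Raphson (X includes a constant column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))         # fitted probabilities
        score = X.T @ (y - p)                          # first derivatives
        info = X.T @ (X * (p * (1 - p))[:, None])      # information (Hessian) matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:                 # convergence on the update size
            break
    return beta

# Hypothetical example with two predictors plus a constant
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (rng.random(100) < 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, -0.8]))))).astype(float)
print(logistic_newton_raphson(X, y))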
Recall, the residual for a sample is essentially . The fitted regression value for the th subject and the th class is , is determined as
| (7.96) |
| (7.97) |
The Pearson goodness-of-fit (GOF) statistic is
| (7.98) |
The deviance residual is
| (7.99) |
The deviance goodness-of-fit (GOF) statistic is
| (7.100) |
The log-likelihood ratio statistic for a feature is minus twice the difference between the log-likelihood of a model without the feature and the log-likelihood of the model with the feature:
| (7.101) |
and is chi-square distributed with 1 d.f.; that is, it should exceed 3.84 as each feature is added to the model. Values exceeding 3.84 suggest that the additional feature in the full model is statistically significant in terms of its contribution to the likelihood. Assuming the model fits, the Pearson and deviance GOF statistics are chi-square distributed with the appropriate degrees of freedom. For deviance, the GOF statistic is based on the likelihood ratio of the fitted model and a saturated model. Pearson or deviance GOF values that are less than the degrees of freedom suggest an adequate fit, that is, the model used and data appropriately describe sample membership in the outcome classes.
Example 3 - Unconditional logistic regression on Low Birth Weight. This example will employ unconditional logistic regression using the low birth weight data. The feature named low is a binary feature (1-Yes, 0-No) representing the birth of an infant with low birth weight, and will be used as the dependent feature. The remaining features age (age of the mother in years), lwt (weight in pounds at the last menstrual period), race (1=white, 2=black, 3=other), smoke (smoking status during pregnancy: 1=Yes, 0=No), ht (history of hypertension: 1=Yes, 0=No), and ui (presence of uterine irritability: 1=Yes, 0=No) will serve as the independent predictors.
First, open the Low Birth Weight dataset. Select Analysis → Regression → Binary Logistic from the Analysis pull-down menu:
You will then see the following popup window:
Select low as the discrete output feature, race as a discrete input feature, and age, lwt, smoke, ht, and ui as continuous input features. A popup window with options will appear, so accept the default values and click Apply. The logistic regression analysis will then run. The treenodes after the run will appear as follows:
Select the Coefficients icon to display the results. At the top of the output you will notice results for all possible univariate logistic regression models, including univariate odds ratios (OR) and their 95% confidence intervals (CI). This will appear as follows:
During feature input, NXG Explorer automatically recoded the 3-level race feature into two binary features representing the two race groups (2-Black and 3-Other), respectively. This type of recoding is called corner-point constraints recoding of categorical features, for which the effect of the lowest category appears in the Constant term (or y-intercept). An alternative approach would involve use of sum-to-zero constraints recoding, which would force all three binary features into the model. For the univariate models, the features lwt, smoke, ht, and ui were significant predictors of low birth weight. The univariate ORs and 95% CIs are also listed in the coefficient table. Toward the bottom of the Coefficient output, we can see the multiple logistic regression runs, for which all the specified independent features are used in a single model. The multiple logistic regression coefficients are followed by a tabular listing of the ORs and 95% CIs.
Notice that in the multiple logistic regression model, in addition to lwt, smoke, ht, and ui being significant, the two binary features (0,1) for the black and other race categories were significant (these categories were not significant in the univariate models). Looking at the ORs, we notice that the OR for smoking (smoke) was 2.79 (95% CI, 1.29-6.05), the OR for hypertension (ht) was 6.41 (95% CI, 1.66-24.72), the OR for uterine irritability (ui) was 2.45 (95% CI, 1.02-5.90), the OR for black race (b_race_2) was 3.60 (95% CI, 1.28-10.10), and the OR for other race (b_race_3) was 2.46 (95% CI, 1.05-5.77), which are all quite elevated. Based on these results, a public health recommendation could be reported as follows: “given the observed results from empirical data, these findings indicate that smoking and hypertension during pregnancy as well as race category were significant predictors of having a low birth weight infant.” Lastly, the class membership predictions are listed in tabular notation:
Next, click on the Residuals icon, and you will see a listing of records (objects) that are considered to be outliers due to their higher-than-expected residual values. For each outlier observation, the criterion which didn’t pass is highlighted in red, as shown below:
Polytomous logistic regression was fundamentally developed for multiple-class problems, unlike the dichotomous 2-class nature of unconditional logistic regression described in the previous section. To begin, assume a multiclass problem in which the dependent feature for each sample is coded 0 or 1 to represent membership in each class, and the number of logits is equal to the number of classes minus one. Let’s assume that the reference class (corner point) is the first class. The general definition of the logit for polytomous logistic regression remains the same, as in
| (7.102) |
However, the conditional probability adds together the exponentiated logits in the denominator, shown in functional form as
| (7.103) |
For a 3-class example the probability that a sample has membership in each of the 3 classes is
| (7.104) |
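These class membership probabilities can be sketched directly from the logits; the helper below assumes the referent class has a logit of zero, as described above, and the two logit values are hypothetical:

import numpy as np

def class_probabilities(logits):
    """Membership probabilities for a K-class polytomous model.
    logits: K-1 logit values for one sample; the referent class has logit 0."""
    expd = np.exp(np.concatenate([[0.0], np.asarray(logits, dtype=float)]))
    return expd / expd.sum()

# Hypothetical 3-class example: two logits relative to the referent class
print(class_probabilities([0.7, -0.4]))   # the three probabilities sum to 1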
The likelihood is
| (7.105) |
Taking the natural logarithm gives the log-likelihood in the form
| (7.106) |
Notice that in unconditional (binary) logistic regression the coefficients were obtained in the form of a single column vector, whereas in polytomous logistic regression the total number of coefficients is equal to the total number of features times the number of classes minus one, resulting in a longer column vector. The score value for a given logit and a given feature occupies the corresponding element in the score vector, given as
| (7.107) |
The information matrix, or Hessian matrix, is a partitioned matrix. Thus, if there are 3 classes, there will be 2 logit functions, each of which is represented with a matrix
| (7.108) |
By definition, the second partial derivative for different features within the same logit , that is, same matrix is
| (7.109) |
whereas the second partial derivative for different features in different logits , that is, different matrices is
| (7.110) |
Similarly, using the Newton-Raphson method, maximum likelihood estimates of each can be found by determining the vector iteratively with the matrix manipulation
| (7.111) |
until convergence is reached when . Values of are typically in the range .
Example 4 - Tumor class prediction with SRBCT gene expression dataset. This example uses the class-specific dataset for SRBCT data. First, import the Excel dataset using class-specific format as shown below:
Next, select Analysis → Regression → Polytomous logistic, as shown below:
Next, specify the Class feature as the class feature, and all continuous features as the predictors, as shown below:
Regarding the output, the treenode will only contain a Coefficients icon (see below):
Once you click on the Coefficients icon, you will first see the following tabular notation with class-specific univariate model results:
The first thing you will notice for a polytomous logistic analysis is that results will be presented for #classes-1 since the first class is treated as a referent group. In the output listed above, notice that each predictor is significant for predicting at least one of the diagnostic classes. This is expected, since the set of 23 gene expression features in this dataset were identified as the best predictors of the four diagnostic classes out of 2,308 features. Following the class-specific univariate results, the log-likelihood values are listed as a function of iteration:
This is followed by the class-specific multivariate logistic regression coefficients:
Lastly, the predicted class membership probabilities are listed:
Poisson regression models are commonly used for modeling risk factors and other covariates when the rates of disease are small, e.g., cancer [7, 8, 9, 10, 11]. An area of research in which Poisson regression of cancer rates is commonly employed is the investigation of radiation-induced cancer among Hiroshima and Nagasaki atomic bomb survivors [12, 13, 14] and individuals occupationally [15], medically [16], and environmentally [17] exposed to ionizing radiation. Poisson regression has also been used to investigate the effects of information bias on lifetime risks [18], diagnostic misclassification [19, 20], and dose-response curves for radiation-induced chromosome aberrations [21]. This section documents the NXG Explorer Poisson regression approach for fitting additive, multiplicative, and power models, and regression diagnostics using leverages of the Hat matrix, deletion and standardized residuals. The critical data required to perform Poisson regression using NXG Explorer are:
The Poisson assumption states that, for stratified data, the number of deaths in each cell takes on the values 0, 1, 2, ... according to the formula
| (7.112) |
where is the rate parameter, which is equal to the number of deaths in each cell divided by the respective number of person-years () in that same cell. According to McCullagh and Nelder [22], a generalized linear model is composed of the following three parts:
| (7.113) |
| (7.114) |
where is the observed rate parameter and is the predicted rate parameter. It should be pointed out that the rate parameter, , serves as its own link function. Thus, the rate parameter provides a linearization of our distributional models so that we can use multiple linear regression to obtain estimators. In effect, the rate parameter allows us to reduce non-linear (non-Gaussian) models to linear models so that we can solve the system of linear equations. The rate parameter can take on various linear and non-linear functions. These are shown in Table 7.1. The power transformation was adapted from [23].
Parameters are solved through the use of iteratively reweighted least squares, where the parameter update vector at each iteration is
| (7.115) |
with data matrix elements
| (7.116) |
with weights
| (7.117) |
and dependent features
| (7.118) |
where the fitted rates are
| (7.119) |
Iterations start by setting the vector of regression coefficients equal to small near-zero starting values. Successive iterations involve adding the gradient-based update vector to the most recent vector of coefficients in the form
| (7.120) |
until convergence is reached when .
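As an illustration of the iteratively reweighted least squares idea for the multiplicative (log-link) case only, the sketch below fits counts with a log person-years offset; the grouped data are hypothetical, and the additive and power links used by NXG Explorer would require different working weights and responses:

import numpy as np

def poisson_irls(X, deaths, pyears, tol=1e-8, max_iter=50):
    """Multiplicative (log-link) Poisson regression of counts with a person-years offset,
    fitted by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    offset = np.log(pyears)
    for _ in range(max_iter):
        mu = np.exp(X @ beta + offset)               # fitted counts
        z = X @ beta + (deaths - mu) / mu            # working response (offset removed)
        W = mu                                       # working weights
        XtWX = X.T @ (X * W[:, None])
        new_beta = np.linalg.solve(XtWX, X.T @ (W * z))
        if np.max(np.abs(new_beta - beta)) < tol:
            beta = new_beta
            break
        beta = new_beta
    return beta

# Hypothetical grouped data: two age groups crossed with an exposure indicator
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # const, age2, exposed
deaths = np.array([2.0, 10.0, 15.0, 60.0])
pyears = np.array([1000.0, 900.0, 800.0, 700.0])
print(poisson_irls(X, deaths, pyears))    # exp(coef for 'exposed') estimates the RR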
The error residual has mean 0 and measures the lack of concordance of the observed deaths against the fitted value. The greater its value, the worse the fit of the model. We will now introduce a number of residuals which are helpful for determining whether certain observations are overly influential on the fit of a model. First, the Pearson residual for each observation is estimated as
| (7.121) |
and provides a measure of the residual variation. If we take into account the number of fitted parameters (coefficients) in a model, we can determine an adjusted residual as
| (7.122) |
where h is the leverage for the ith observation (described later). Adjusted residual values that exceed 1.96 tend to be outliers at the 0.05 level of significance. The deviance residuals
| (7.123) |
allow the investigator to determine the goodness of fit of the model, i.e., how well the observed data agree with the fitted values.
The leverage [22] for each observation provides information about the geometric distance between a given point and the centroid of all points in the predictor space. Individual leverages are obtained from the diagonal of the “Hat” matrix given by
| (7.124) |
Because leverages are bounded above by one, observations with high leverage can approach unity. A criterion introduced by Hoaglin and Welsch [24] for determining high leverage for an observation states that a point has high leverage if its leverage exceeds 2p/n. Points with high leverage are located at a substantial distance from the fitted regression line, and act to “pull” the fitted line toward their location. Increases in the goodness of fit of a model can be obtained by re-fitting the model after the observation(s) with high leverage have been removed.
Cook [25] introduced a residual which measures the change in the fitted model between when an observation is included in the model and when it is not. The more positive its value, the more influence an observation has when compared with other observations. Individual values are obtained by the relationship
| (7.125) |
The Pearson goodness of fit is
| (7.126) |
and the deviance goodness of fit is
| (7.127) |
which is also chi-square distributed. If a model fits, its Pearson and deviance GOF statistics will be lower than the degrees of freedom.
Example 5 - British Doctors data - Purely Additive Model. This example will use the British doctors data shown in the table below to determine the additive risk (see Table 7.1) of coronary-related deaths due to smoking; the table lists the British doctors data for coronary disease and smoking [26]. The rates for non-smokers are simply the ratio of the number of deaths to person-years within each age group for the non-smoking group, whereas the rates for smokers are the ratio of deaths to person-years within each age group for the smoking group.
Non-smokers | Smokers | |||||
Age group, | ||||||
35-44 | 2 | 18,790 | 0.1064 | 32 | 52,407 | 0.6106 |
45-54 | 12 | 10,673 | 1.1243 | 104 | 43,248 | 2.4047 |
55-64 | 28 | 5,710 | 4.9037 | 206 | 28,612 | 7.1998 |
65-74 | 28 | 2,585 | 10.8317 | 186 | 12,663 | 14.6885 |
75-84 | 31 | 1,462 | 21.2038 | 102 | 5,317 | 19.1838 |
First, import the British doctors coronary death dataset, and you will see the following arrangement in the datagrid:
Next, select Analysis → Regression → Poisson from the Analysis pull-down menu, and select the person-year feature, count feature, and the covariates for age and smoke:
Next, select Additive from the model choices, and click Apply (shown below):
After the run, the treenodes resulting from this run will appear as follows:
Click on the Coefficients icon, and the additive risks will be listed in the display as follows:
Firstly, notice the value of the deviance GOF of 7.43. Because this run used a purely additive model , the predicted baseline rate for each age group is equal to the value of the regression coefficients for the specific age group. The absolute risk (AR) is equal to the regression coefficient for smoke (), and is actually an excess rate that adds to the baseline rate. A key point to be made is that the AR is a rate that serves as a fixed constant which adds to the baseline rate of any group. The table below lists the calculated baseline rates (deaths per 1,000 person-years) for each age group based on the fitted coefficients, as well as the fitted rate among smokers for each age group.
Feature | Coef. | Baseline rate(non-smokers) | Rate among smokers |
age_35-44 | 0.0841 | 0.0841 | 0.0841+0.5907=0.6748 |
age_45-54 | 1.6407 | 1.6407 | 1.6407+0.5907=2.2314 |
age_55-64 | 6.3035 | 6.3035 | 6.3035+0.5907=6.8942 |
age_65-74 | 13.5241 | 13.5241 | 13.5241+0.5907=14.1148 |
age_75-84 | 19.1696 | 19.1696 | 19.1696+0.5907=19.7603 |
smoke | 0.5907 | AR=0.5907 | |
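The arithmetic in the table can be verified directly, since under the purely additive model the fitted rate among smokers is just the age-specific baseline rate plus the AR; a small sketch:

# Under the purely additive model, fitted rate among smokers = baseline rate + AR
baseline = {"35-44": 0.0841, "45-54": 1.6407, "55-64": 6.3035,
            "65-74": 13.5241, "75-84": 19.1696}   # fitted coefficients = baseline rates
ar_smoke = 0.5907                                  # absolute risk (AR) for smoking
for age, rate in baseline.items():
    print(age, round(rate + ar_smoke, 4))          # e.g. 35-44: 0.0841 + 0.5907 = 0.6748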
Next, click on the Residuals icon, and the following table will be shown:
We notice that two observations (#1 and 6) were overly influential on the model’s fit. Next, click on the DFBETAs icon, and the following table will be shown:
Recall that the DFBETAS reveal the influence of each record on each covariate if the record were removed and the model run again. In the DFBETAS output listed above, there appear to be more influential values for the older age groups.
Example 6 - British Doctors data - Purely Multiplicative Model. In this example, we will employ a purely multiplicative model to identify the relative risk (RR) of smoking. Import the British doctors data again, select Analysis → Regression → Poisson from the Analysis pull-down menu, select the features, and select Multiplicative from the model choices (shown below):
Click on the Coefficients icon and the following table will be displayed:
The value of the deviance GOF for this multiplicative model is 12.13, which implies a worse fit when compared with the purely additive model. Using the coefficient values, the fitted baseline rates (deaths per 1,000 person-years) among non-smokers and smokers are listed in the table below:
Feature | Coef. | Baseline rate(non-smokers) | Rate among smokers |
age_35-44 | -1.0116 | exp(-1.0116)=0.3636 | 0.3636 × 1.4255 = 0.5184
age_45-54 | 0.4724 | exp(0.4724)=1.6038 | 1.6038 × 1.4255 = 2.2863
age_55-64 | 1.6159 | exp(1.6159)=5.0324 | 5.0324 × 1.4255 = 7.1737
age_65-74 | 2.3389 | exp(2.3389)=10.3698 | 10.3698 × 1.4255 = 14.7822
age_75-84 | 2.6885 | exp(2.6885)=14.7096 | 14.7096 × 1.4255 = 20.9685
smoke | 0.3545 | RR=exp(0.3545)=1.4255 | |
The relative risk (RR) is determined by exponentiating the coefficient for exposure (i.e., smoke); the fitted rate among smokers is then obtained by multiplying each fitted baseline rate by the RR. Next, click on the Residuals icon, and the following table will be shown:
We notice that four observations were overly influential on the model’s fit, and since there is a greater number of outliers for the multiplicative model, this may imply that the multiplicative model fits worse. Next, click on the DFBETAs icon, and the following table will be shown:
The DFBETAS results indicate fewer issues with the multiplicative model.
Example 7 - British Doctors data - Power Model. NXG Explorer can be used to evaluate power models with varying values of rho. To begin, import the British doctors dataset, select Analysis → Regression → Poisson from the Analysis pull-down menu, select the features, and then specify “Evaluate rho at 0.1 intervals” in the options popup window (shown below):
After the run, you will see the following icons in the treenode:
Click on the Power evaluation icon, and the following output table will appear:
Note the value of rho at which the deviance GOF is lowest, which indicates that a power model with an intermediate value of rho fits better than a purely additive model or a purely multiplicative model. Below is a plot of the deviance values as a function of rho:
Next, let’s run a Poisson regression model using a fixed value of rho. Import the dataset, select the Poisson regression command and features, and now specify the chosen value of rho in the options popup window:
Click on the Coefficients icon, and the following table will appear:
Note the low deviance value of 2.14, which is similar to the interpolated value in the power table listed above and much lower than the deviance of 7.43 for the purely additive model and 12.13 for the purely multiplicative model. Again, using the coefficient values, the fitted baseline rates (deaths per 1,000 person-years) among non-smokers and smokers are listed in the table below:
Feature | Coef. | Baseline rate(non-smokers) | Rate among smokers |
age_35-44 | 0.276 | 0.0963+0.2767=0.373 | |
age_45-54 | 1.1145 | 1.2179+0.2767=1.4946 | |
age_55-64 | 2.4563 | 5.1239+0.2767=5.4006 | |
age_65-74 | 3.8593 | 11.6514+0.2767=11.9281 | |
age_75-84 | 4.7632 | 17.0822+0.2767=17.3589 | |
smoke | 0.4933 | ||
There were no outliers for this model:
Nor were there any overly influential DFBETAs for this power model:
Summary of fitted rates. To summarize the additive, multiplicative, and power models, Table 7.3 shows the basic interpretation of the coefficients and how the AR and RR are estimated and then applied to the baseline rate among non-smokers to obtain the rate among smokers.
[1] F. Galton. Regression towards mediocrity in hereditary stature. J. Anthropol. Inst., 15:246–263, 1886.
[2] M.S. Bartlett. Further aspects on the theory of multiple regression. Proc. Cambridge Phil. Soc, 34:33–40, 1938.
[3] D.C. Hoaglin, R.E. Welsch. The hat matrix in regression and ANOVA. Am. Stat. 32:17-22, 1978.
[4] D.A. Belsley, E. Kuh, R.E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York (NY), John Wiley and Sons, 1980.
[5] J. Neter, W. Wasserman, M.H. Kutner. Applied Linear Statistical Models. 2nd Ed. Homewood (IL), Richard D. Irwin, 1985.
[6] M.S. Bartlett. Multivariate analysis. J. Royal Stat. Soc., 9:176–197, 1947.
[7] E.L. Frome, M.H. Kutner, J.J. Beauchamp. Regression analysis of Poisson-distributed data. JASA. 68:935–940, 1973.
[8] E.L. Frome. The analysis of rates using Poisson regression methods. Biometrics. 39:665–674, 1983.
[9] E.L. Frome. PREG: A computer program for Poisson regression analysis. Oak Ridge Associated Universities Report ORAU-178 (1981). Oak Ridge National Laboratory. Oak Ridge (TN), ORAU.
[10] C.R. Muirhead, S.C. Darby. Modelling the relative and absolute risks of radiation-induced cancers. J. R. Stat. Soc. A. 150:83–118, 1987.
[11] N.E. Breslow, N.E. Day. Statistical Methods in Cancer Research. Volume II: The design and anaylsis of cohort studies. Chapter 4. IARC Scientific Publication No. 82. Lyon, IARC, 1987.
[12] D.A. Pierce, Y. Shimizu, D.L. Preston, M. Vaeth, K. Mabuchi. Studies of mortality of atomic bomb survivors. Report 12, Part I. Cancer: 1950-1990. Radiat. Res. 146:1–27, 1996.
[13] D.E. Thompson, K. Mabuchi, E. Ron, M. Soda, M. Tokunaga, S. Ochikubo, S. Sugimoto, S. Ikeda, M. Terasaki, S. Izumi. Cancer Incidence in Atomic Bomb Survivors. Part II: Solid Tumors, 1958-1987. Radiat. Res. 137:S17-S67, 1994.
[14] D.L. Preston, S. Kusimi, M. Tomonaga, S. Izumi, E. Ron, A. Kuramoto, N. Kamada, H. Dohy, T. Matsui, H. Nonaka, D.E. Thompson, M. Soda, K. Mabuchi. Cancer Incidence in Atomic Bomb Survivors. Part III: Leukemia, Lymphoma and Multiple Myeloma, 1950-87. Radiat. Res. 137, S68–S97, 1994.
[15] E. Cardis, E.S. Gilbert, L. Carpenter, G. Howe, I. Kato, B.K. Armstrong, V. Beral, G. Cowper, A. Douglas, J. Fix, S.A. Fry, J. Kaldor, C. Lavé, L. Salmon, P.G. Smith, G.L. Voelz, L.D. Wiggs. Effects of low doses and low dose rates of external ionizing radiation: Cancer mortality among nuclear industry workers in three countries. Radiat. Res. 142:117–132, 1995.
[16] E. Ron, J.H. Lubin, R.E. Shore, K. Mabuchi, B. Modan, L.M. Pottern, A.B. Schneider, M.A. Tucker, J.D. Boice, Jr. Thyroid cancer after exposure to external radiation: a pooled analysis of seven studies. Radiat. Res. 141:259-277, 1995
[17] R.A. Kerber, J.E. Till, S.E. Simon, J.L. Lyon, D.C. Thomas, S. Preston-Martin, M.L. Rallison, R.D. Lloyd, W.S. Stevens. A cohort study of thyroid disease in relation to fallout from nuclear weapons testing. J. Am. Med. Assoc. 270:2076–2082, 1993.
[18] L.E. Peterson, W.J. Schull, B.R. Davis, P.A. Buffler. Information Bias and Lifetime Mortality Risks of Radiation-Induced Cancer. U. S. NRC NUREG Report GR-0011. Nuclear Regulatory Commission, Washington, D.C., 1994.
[19] R. Sposto, D.L. Preston, Y. Shimizu, K. Mabuchi. The effect of diagnostic misclassification on non-cancer and cancer mortality dose-response in A-bomb survivors. Biometrics. 48:605-617, 1992.
[20] A.S. Whittemore, G. Gong. Poisson regression with misclassified counts: application to cervical cancer mortality rates. Appl. Stat. 40:81–93, 1991.
[21] E.L. Frome, R.J. DuFrain. Maximum likelihood estimation for cytogenetic dose-response curves. Biometrics. 42:73–84, 1986.
[22] P. McCullagh, J.A. Nelder. Generalized Linear Models. London, Chapman and Hall, 1983.
[23] F.J. Aranda-Ordaz. An extension of the proportional-hazards model for grouped data. Biometrics. 39:109–117, 1983.
[24] D.C. Hoaglin, R.E. Welsch. The Hat matrix in regression and ANOVA. Am. Statistician. 32:17–22, 1978.
[25] R.D. Cook. Detection of influential observations in linear regression. Technometrics. 19:15–18, 1977.
[26] R. Doll, A.B. Hill. Mortality of British doctors in relation to smoking: Observations on coronary thrombosis. Nat’l. Cancer Inst. Monogr. 19:205–268, 1966.
Survival data analysis involves use of time-to-event data for both failures and censored outcomes. In order to perform survival analysis, there are two outcome measures required from each object (animal, human subject, plant, etc.): (i) the time under observation from the beginning of the study until the object fails (“failure”) or the object withdraws or survives until study end without failure (“censored”), and (ii) whether or not the object failed (0-No, 1-Yes). NXG Explorer can handle two kinds of survival data analyses, which are Kaplan-Meier group-based analysis and Cox proportional hazards regression analysis.
The primary goal of Kaplan-Meier (KM) analysis is to determine the product limit estimate of the survivorship function, based on the survival times and censoring status of objects which have been observed for failure over a given follow-up period [1]. The survivorship function is defined as the probability of surviving (not failing) beyond time t, and is only determined at failure times, not at censoring times for objects which have dropped out of follow-up monitoring (i.e., lost to follow-up, missing). Objects that have been censored, that is, have not failed throughout the entire study or dropped out from follow-up, are assigned a censoring time equal to the time they left the study early (without failing) or the time until the end of the study. If a human research subject withdrew consent from a clinical trial before the end of the trial at 4.5 months after enrollment, then the subject’s survival (censoring) time is 4.5+ months, where “+” denotes censored (did not fail). Otherwise, if a subject failed at 13.7 months after treatment by having a primary outcome, for example, tumor progression, stroke, myocardial infarction, etc., then the survival (failure) time for this subject would be 13.7 months.
The table below lists subjects who were monitored during a 65-day study (6 who failed and 4 who were censored) along with their survival times (failure times or censoring times) and product limit estimates of the survivorship function values, calculated at the unique failure times.
Time, | Rank, | |||
5 | 1 | 0.9 | ||
13+ | 2 | - | - | |
21+ | 3 | - | - | |
28 | 4 | |||
31 | 5 | |||
33+ | 6 | - | - | |
45 | 7 | |||
53+ | 8 | - | - | |
58 | 9 | |||
63 | 10 | 0 | 0 | |
“+” denotes a censored object, i.e., did not fail.
To construct the table, we first sort the censoring and failure times of objects in ascending order, so that the shortest time is first and the longest time is last. Next, assign a rank to each observation using the values 1, 2, ..., n. At each unique failure time, calculate the hazard rate
| (8.1) |
where is the rank of the observation at failure time . The survivorship function value at each failure time is then calculated as the product of all hazard rates up to time using the relationship
| (8.2) |
The Nelson-Aalen baseline cumulative hazard function is then determined as
| (8.3) |
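A minimal sketch of the product-limit and Nelson-Aalen calculations, assuming at most one failure per failure time; the times and censoring indicators reproduce the 10-subject table above (the first failure at t=5 gives S=0.9 and the last failure gives S=0):

import numpy as np

def kaplan_meier(times, failed):
    """Product-limit estimate of S(t) and Nelson-Aalen cumulative hazard,
    evaluated at the unique failure times (single failure per time assumed)."""
    order = np.argsort(times)
    times, failed = np.asarray(times)[order], np.asarray(failed)[order]
    n = len(times)
    surv, na_hazard = 1.0, 0.0
    out = []
    for i, (t, d) in enumerate(zip(times, failed)):
        at_risk = n - i                        # subjects still at risk just before t
        if d:                                  # failure time (censored times contribute only to the risk set)
            hazard = 1.0 / at_risk
            surv *= (1.0 - hazard)
            na_hazard += hazard
            out.append((t, surv, na_hazard))
    return out

# The 10-subject worked example: '+' (censored) coded as failed=0
times  = [5, 13, 21, 28, 31, 33, 45, 53, 58, 63]
failed = [1,  0,  0,  1,  1,  0,  1,  0,  1,  1]
for row in kaplan_meier(times, failed):
    print(row)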
Let be the greatest observed study time among the subjects being studied. Let be the ordered unique failure (death) times for the pooled sample. At time we observe deaths in the th sample out of subjects at risk.
The hypotheses applied to test for inequality of hazard functions for each group is
In order to develop a hypothesis test, we first obtain differences between the observed and expected number of failures, using the form
| (8.4) |
Following Lawless [2], the variance-covariance matrix of has elements
| (8.5) |
where if . Note that , so the test statistic with d.f. is constructed by selecting any of the . The variance-covariance matrix of these statistics is given by the matrix , and again, only of the variances and covariances are needed. The test statistic is based on the quadratic form
| (8.6) |
An α-level test of the null hypothesis rejects when the test statistic is greater than the upper α percentage point of a chi-square random variable with the appropriate degrees of freedom.
Example 1 - Kaplan-Meier Group Test (Edema) of Primary Biliary Cirrhosis. In this example, we will run a KM analysis on the edema categorical feature in the Primary Biliary Cirrhosis (PBC) data. To begin, import the pbc.xlsx Excel file. Next, select Analysis → Survival analysis → Kaplan-Meier group test from the Analysis pull-down menu, shown as follows:
Select edema as the grouping feature, fudays as the survival time feature, and dead as the failure feature, shown as follows:
Following the run, the treenodes will appear as follows:
Click on the Kaplan-Meier Survival By Group icon, and the survival plot will appear as follows:
Click on the Logrank Test icon, and the results of the logrank test for equality of the hazard (survival) functions among the 3 edema groups are provided:
The p-value indicates the hazard functions for the 3 groups are significantly different.
Kaplan-Meier Group Test (Drug) of Primary Biliary Cirrhosis. In this example, we will perform KM analysis of the drug effect, with two options for recoding. Recall that the drug feature is coded as 1 for drug and 2 for placebo. We now show how to create a new feature called treated (0-placebo, 1-drug) based on recoding and labeling of the drug and placebo groups. To get started, select Preferences → Labels and transforms, and in the editor open the script file called pbclabel.nxgs. Select Run, so that the new treated feature is generated. Next, select Analysis → Survival analysis → Kaplan-Meier group test, and specify dead as the failure feature, fudays as the survival time, and treated as the group feature. In addition, repeat these steps while using drug (1-drug, 2-placebo). The results below illustrate how changing the coding resulted in a change in the plot and in the KM logrank observed and expected counts. Essentially, the groups are simply swapped, but the point of this example is to remind the user to pay close attention to coding schemes.
The logrank test results indicate that there is no significant difference in the hazard (survivorship) functions between the two groups. Furthermore, you can notice how the survival curves overlap one another, providing visual confirmation that there is no significant drug effect on survival.
Cox proportional hazards (PH) regression is a multiplicative maximum-likelihood technique to derive parameters for modeling “time-to-event” failures or survival times [3]. Each object used for fitting the model requires a failure (censoring) time and a binary outcome response indicating whether the object failed or was censored, that is, either left (migrated out of) the study or did not fail before the study ended. When an object fails, the time value is set to the duration of time under observation until the failure time (i.e., survival time), and its failure status is assigned a one (1) to indicate failure. However, an object that doesn’t fail under study will have a failure status of zero (0) and a survival time set equal to the time it was observed until the end of the study or the time it dropped out or migrated out of the study. Either way, every object under study receives two outcome codes: one for the survival time and one for the binary code indicating failure or no failure (censoring).
To begin, assume the ordered unique failure times . We recall that subjects in the risk set at time , i.e., , are those subjects with survival times that are at least . In other words, the group of subjects at risk at time are subjects which survived up to time and subjects who failed or were censored during or after (since they too were also at risk before failure time ). The probability contribution of each failure at time conditional on those subjects at risk, , is
| (8.7) |
For total failures, the conditional likelihood becomes
| (8.8) |
We first take the log of the likelihood above. Taking the log of the numerator gives us
| (8.9) |
The denominator term in the likelihood cannot be reduced by taking the log and remains as
| (8.10) |
Looking at the likelihood kernel , we also notice that it is a ratio, so we have to use the property for logarithms . Combining the logarithms of the numerator and denominator of the kernel yields the log likelihood function
| (8.11) |
Next, let’s only work with partial derivatives for each th risk set, so for now we can drop the summation symbol . Therefore, we are dealing with the log likelihood for each risk set in the form
| (8.12) |
Now we take the first partial derivative of the left term in the log likelihood function w.r.t. each term to get the score vector. We already know that for the function
| (8.13) |
the partial derivative of the function with respect to is
| (8.14) |
Therefore, the first partial derivative for each in the left term of the log likelihood is
| (8.15) |
At this point, we know that the first part of the score function is
| (8.16) |
We are left with the right term of the log likelihood in the form
| (8.17) |
Notice that this is a log function. We know that the derivative of the logarithm of some function , is the inverse of the function, . This gives us a hint that part of the right term of the score contains the inverse
| (8.18) |
Since we have already handled the logarithm, we can now address the function itself. Our target is the sum of exponentiated linear predictors over the risk set, and we note that the coefficients are not a function of each subject in the risk set considered. Thus, their values don’t change across risk sets, so their contribution across risk sets (and anywhere they appear in the summation) has to be recognized and left intact. Given this, we can now work with the terms for a single individual in the risk set, who contributes the factor
| (8.19) |
Taking the first partial derivative of this function w.r.t. gives
| (8.20) |
For the entire group of subjects in the risk set, we can now include the first partial derivative into the summation and rewrite this as
| (8.21) |
Going back to our original log function on the right side of the log likelihood, we now have
| (8.22) |
For the risk set considered, we now have the score
| (8.23) |
Now that the first partial derivative (score) of the log-likelihood function w.r.t. is solved, we can calculate the second partial derivative of the log-likelihood function w.r.t. . First, we drop the term on the left of the first partial derivative since it is independent of . Looking at the remaining terms, we notice that the first partial derivative is a quotient in the form
| (8.24) |
so we use the quotient rule for differentiation. The quotient rule is
| (8.25) |
so we have
| (8.26) |
and after rearranging gives
| (8.27) |
and finally
| (8.28) |
Cancelling out terms, we get
| (8.29) |
which forms element in the information matrix (second partial derivative “Hessian” matrix). Each element in the information matrix is calculated using the equation above and by substituting in appropriate values for subjects in all risk sets. A running total of elements in the score vector and information matrix is kept as the covariates (e.g., ) for each subject are looped through within each risk set and over each term.
A gradient ascent approach is employed using the Newton-Raphson method to find maximum likelihood estimates of the solution parameters in the form

$$\boldsymbol{\beta}^{new} = \boldsymbol{\beta}^{old} + \mathbf{I}^{-1}(\boldsymbol{\beta}^{old})\,\mathbf{U}(\boldsymbol{\beta}^{old}) \tag{8.30}$$

where $\mathbf{U}(\boldsymbol{\beta})$ is the score vector and $\mathbf{I}(\boldsymbol{\beta})$ is the observed information matrix. Convergence is reached when the change in the coefficient estimates (or in the log likelihood) between successive iterations is less than a small tolerance $\epsilon$. Values of $\epsilon$ are typically in the range $10^{-4}$ to $10^{-8}$.
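To make the mechanics concrete, the following is a minimal Python sketch, not NXG Explorer's internal code, of a Newton-Raphson fit driven by the score (8.23) and information (8.29). It assumes numpy arrays, no tied failure times, and a hypothetical function name.

```python
# Minimal sketch (hypothetical helper, not NXG Explorer's implementation) of
# fitting the Cox PH partial likelihood with Newton-Raphson, assuming no ties.
import numpy as np

def cox_newton_raphson(time, event, X, tol=1e-6, max_iter=25):
    """Return (beta, var_beta) maximizing the Cox partial likelihood."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        U = np.zeros(p)                                # score vector, eq. (8.23)
        I = np.zeros((p, p))                           # observed information matrix
        w = np.exp(X @ beta)                           # exp(beta'x) for all subjects
        for i in np.where(event == 1)[0]:              # loop over failures
            at_risk = time >= time[i]                  # risk set R(t_(i))
            wr, Xr = w[at_risk], X[at_risk]
            s0 = wr.sum()                              # sum of exp(beta'x)
            s1 = Xr.T @ wr                             # sum of x * exp(beta'x)
            s2 = (Xr * wr[:, None]).T @ Xr             # sum of x x' exp(beta'x)
            U += X[i] - s1 / s0                        # score contribution
            I += s2 / s0 - np.outer(s1, s1) / s0**2    # information (minus the Hessian in 8.29)
        step = np.linalg.solve(I, U)                   # I^{-1} U
        beta = beta + step                             # Newton-Raphson update (8.30)
        if np.max(np.abs(step)) < tol:                 # convergence on coefficient change
            break
    return beta, np.linalg.inv(I)                      # estimates and var-cov matrix
```

The returned variance-covariance matrix is the same quantity used later by the scaled Schoenfeld residuals and the zero-slope tests of the PH assumption.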
This section covers diagnostics for assessing the fit of a Cox proportional hazards regression run, as well as identifying overly influential observations in an analysis.
Cox-Snell (CS) residuals. The Cox-Snell (CS) residual [4] for the $i$th observation is

$$r_{CS_i} = \hat{H}_0(t_i)\exp\!\left(\sum_{j=1}^{p}x_{ij}\hat{\beta}_j\right) \tag{8.31}$$

where $x_{ij}$ is the covariate value for the $i$th object and $j$th feature, $\hat{\beta}_j$ is the regression coefficient for the $j$th feature, and $\hat{H}_0(t)$ is Breslow's cumulative hazard function [4], defined as

$$\hat{H}_0(t) = \sum_{t_{(i)}\le t}\frac{d_i}{\sum_{l\in R(t_{(i)})}\exp(\boldsymbol{\beta}^{\top}\mathbf{x}_{l})} \tag{8.32}$$

where $d_i$ is the number of failures at time $t_{(i)}$.
Once the CS residuals are determined for each object, their values are employed as the survival times in a new KM analysis, with the same censoring information for each object. The values of $\hat{S}(t)$ from the KM analysis are then transformed to new values of the Nelson-Aalen (NA) baseline cumulative hazard using (8.3). A plot of the CS residuals vs. the Nelson-Aalen baseline cumulative hazard derived from using the CS residuals as survival times is then generated. If the Cox PH model fits well, the CS residuals will fall closely on a line with a slope of one, since under a well-fitting model the CS residuals follow an exponential distribution with parameter unity, i.e., exp(1).
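As an illustration of this check, here is a minimal Python sketch (assumed helper names, not Explorer's code) that computes the CS residuals from (8.31)-(8.32) and then the Nelson-Aalen cumulative hazard treating those residuals as survival times. Plotting the two against each other should give roughly a unit-slope line when the model fits.

```python
# Minimal sketch of the Cox-Snell goodness-of-fit check (hypothetical helpers).
import numpy as np

def breslow_cumhaz(time, event, risk_score):
    """Breslow baseline cumulative hazard evaluated at each subject's time (8.32)."""
    H0 = np.zeros(len(time))
    for i, t in enumerate(time):
        fails = time[(time <= t) & (event == 1)]       # failure times up to t
        H0[i] = sum(1.0 / risk_score[time >= tf].sum() for tf in fails)
    return H0

def cox_snell_check(time, event, X, beta):
    risk_score = np.exp(X @ beta)
    r_cs = breslow_cumhaz(time, event, risk_score) * risk_score   # eq. (8.31)
    # Nelson-Aalen cumulative hazard using the CS residuals as "survival times"
    order = np.argsort(r_cs)
    r, d = r_cs[order], event[order]
    n_at_risk = np.arange(len(r), 0, -1)
    na_cumhaz = np.cumsum(d / n_at_risk)
    return r, na_cumhaz      # plot na_cumhaz vs. r; expect slope ~1 through the origin
```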
Martingale (M) residuals. The Martingale residual [4] is determined for each risk set (for objects having unique failure times) using the relationship

$$r_{M_i} = \delta_i - r_{CS_i} \tag{8.33}$$

where $\delta_i$ is the censoring indicator for the $i$th risk set ($\delta_i=1$ for a failure, $\delta_i=0$ for a censored observation).
Deviance (D) residuals. The deviance residual [4] is also determined for each risk set using the relationship

$$r_{D_i} = \operatorname{sign}(r_{M_i})\sqrt{-2\left[r_{M_i} + \delta_i\log(\delta_i - r_{M_i})\right]} \tag{8.34}$$

where $r_{M_i}$ is the Martingale residual and $\delta_i$ is the censoring indicator for the $i$th risk set.
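A minimal sketch of (8.33) and (8.34), assuming the Cox-Snell residuals and censoring indicators are already available as numpy arrays:

```python
# Martingale (8.33) and deviance (8.34) residuals from Cox-Snell residuals.
import numpy as np

def martingale_residuals(delta, r_cs):
    return delta - r_cs

def deviance_residuals(delta, r_cs):
    r_m = delta - r_cs
    # the term delta*log(delta - r_m) is taken as 0 for censored observations
    term = np.where(delta == 1, np.log(np.clip(delta - r_m, 1e-300, None)), 0.0)
    return np.sign(r_m) * np.sqrt(-2.0 * (r_m + term))
```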
Schoenfeld (SCH) residuals. Schoenfeld residuals [4] are determined for each observation and each feature, based on the functional form

$$r_{SCH_{ij}} = \delta_i\left(x_{ij} - \bar{x}_{w_{ij}}\right) \tag{8.35}$$

where $\bar{x}_{w_{ij}}$ is the risk-weighted mean covariate

$$\bar{x}_{w_{ij}} = \frac{\sum_{l\in R(t_i)}x_{lj}\exp(\boldsymbol{\beta}^{\top}\mathbf{x}_{l})}{\sum_{l\in R(t_i)}\exp(\boldsymbol{\beta}^{\top}\mathbf{x}_{l})} \tag{8.36}$$

and $R(t_i)$ is the risk set of individuals at risk up to time $t_i$ for the $i$th observation.
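A minimal sketch of (8.35)-(8.36), assuming numpy arrays and a hypothetical function name:

```python
# Schoenfeld residuals: observed covariate minus its risk-weighted mean.
import numpy as np

def schoenfeld_residuals(time, event, X, beta):
    w = np.exp(X @ beta)
    resid = np.zeros_like(X, dtype=float)
    for i in range(len(time)):
        at_risk = time >= time[i]                                              # risk set R(t_i)
        xbar = (X[at_risk] * w[at_risk, None]).sum(axis=0) / w[at_risk].sum()  # eq. (8.36)
        resid[i] = event[i] * (X[i] - xbar)                                    # eq. (8.35), zero if censored
    return resid
```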
Scaled Schoenfeld (SCA) residuals. Scaled Schoenfeld residuals [4], on the other hand, are determined as

$$\mathbf{r}^{*}_{SCH_i} = \hat{\boldsymbol{\beta}} + D\,\widehat{\operatorname{Var}}(\hat{\boldsymbol{\beta}})\,\mathbf{r}_{SCH_i} \tag{8.37}$$

where $\hat{\boldsymbol{\beta}}$ is the vector of regression coefficients, $D$ is the total number of failures, $\widehat{\operatorname{Var}}(\hat{\boldsymbol{\beta}})$ is the variance-covariance matrix of the coefficients, and $\mathbf{r}_{SCH_i}$ is the vector of Schoenfeld residuals for the $i$th observation.
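A one-line sketch of the scaling in (8.37), assuming the Schoenfeld residual matrix, the fitted coefficients, and their variance-covariance matrix are available:

```python
# Scaled Schoenfeld residuals (8.37): beta-hat + D * Var(beta-hat) * r_SCH,i.
import numpy as np

def scaled_schoenfeld(r_sch, beta, var_beta, n_failures):
    # r_sch: (n, p) Schoenfeld residuals; rows for censored observations are zero
    # and are normally dropped before scaling and plotting.
    return beta + n_failures * r_sch @ var_beta.T
```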
The table below lists, for each of the residuals and hazards described above, how it is represented and where it exists. The Cox-Snell, Martingale, and deviance residuals, along with the Nelson-Aalen cumulative hazard, are only determined for observations with unique failure times, whereas the Schoenfeld and scaled Schoenfeld residuals are determined for each observation and each feature.
Residual | Representation | Existence
Cox-Snell | Risk sets | Unique failure times
Martingale | Risk sets | Unique failure times
Deviance | Risk sets | Unique failure times
Nelson-Aalen cum hazard | Risk sets | Unique failure times
Schoenfeld | Observations | Each observation and feature
Scaled Schoenfeld | Observations | Each observation and feature
Plots for proportional hazards assumption. Plots of $-\log\{-\log[\hat{S}(t)]\}$ vs. $\log(t)$ for categorical features such as treatment group or diagnostic group can be constructed using the stratum-specific baseline cumulative hazard, defined as

$$\hat{H}_{0g}(t) = \sum_{t_{(i)g}\le t}\frac{d_{ig}}{\sum_{l\in R(t_{(i)g})}\exp(\boldsymbol{\beta}^{\top}\mathbf{x}_{l})} \tag{8.38}$$

where $g$ represents the $g$th category, $t_{(i)g}$ is a failure time in the $g$th stratum, $d_{ig}$ is the number of failures at time $t_{(i)g}$, and $R(t_{(i)g})$ is the risk set of censored and failed objects exclusively in the $g$th category at risk up to time $t_{(i)g}$. This calculation requires use of only the covariate values, unique failure times, and risk sets containing objects in the $g$th group, which results in a stratum-specific baseline cumulative hazard. The stratum-specific survivorship function value for each observation in the $g$th category and $i$th risk set is then determined as

$$\hat{S}_{ig}(t_{(i)g}) = \exp\left[-\hat{H}_{0g}(t_{(i)g})\exp(\boldsymbol{\beta}^{\top}\mathbf{x}_{i})\right] \tag{8.39}$$
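The sketch below (a hypothetical helper, evaluated at the baseline covariate pattern x = 0 for simplicity) shows how the stratum-specific cumulative hazard (8.38) and survivorship (8.39) translate into the -log(-log[S(t)]) vs. log(t) coordinates used for these plots.

```python
# Stratum-specific -log(-log S) vs. log(t) coordinates for PH assumption plots.
import numpy as np

def loglog_plot_data(time, event, X, beta, group):
    w = np.exp(X @ beta)
    curves = {}
    for g in np.unique(group):
        in_g = group == g
        t_g, e_g, w_g = time[in_g], event[in_g], w[in_g]
        fail_times = np.sort(t_g[e_g == 1])
        # stratum-specific Breslow cumulative hazard (8.38), risk sets within stratum g
        H0g = np.array([sum(1.0 / w_g[t_g >= tf].sum() for tf in fail_times[fail_times <= t])
                        for t in fail_times])
        S0g = np.exp(-H0g)                       # baseline survivorship, (8.39) with x = 0
        curves[g] = (np.log(fail_times), -np.log(-np.log(S0g)))
    return curves                                # roughly parallel curves support the PH assumption
```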
Tests of proportional hazards assumption. Grambsch and Therneau [5] introduced a test based on scaled Schoenfeld residuals to determine if the PH assumption holds for each feature, and for the global collection of features used in a model. Under the null hypothesis, the PH assumption holds if the slope of the scaled Schoenfeld residuals over some function of time, $g(t)$, is zero. The effect of a specific feature on violation of the PH assumption can be tested using the following zero-slope test statistic

$$\chi^{2}_{j} = \frac{\left\{\sum_{i=1}^{D}\left[g(t_i)-\bar{g}\right]r^{*}_{SCH_{ij}}\right\}^{2}}{D\,\widehat{\operatorname{Var}}(\hat{\beta}_j)\sum_{i=1}^{D}\left[g(t_i)-\bar{g}\right]^{2}} \tag{8.40}$$

where $g(t_i)$ is either the survival time, the log survival time, etc., $\bar{g}$ is the mean of $g(t_i)$ over the $D$ failures, and $\widehat{\operatorname{Var}}(\hat{\beta}_j)$ is the $j$th diagonal element of the variance-covariance matrix of the coefficients. Any feature yielding a significant slope test should either be used as a strata term for stratified Cox PH regression, or be removed from the model. In addition, an overall global zero-slope test for all features can be evaluated using the following formulation

$$\chi^{2}_{global} = \mathbf{e}^{\top}\mathbf{Y}\left[D\,\widehat{\operatorname{Var}}(\hat{\boldsymbol{\beta}})\sum_{i=1}^{D}\left[g(t_i)-\bar{g}\right]^{2}\right]^{-1}\mathbf{Y}^{\top}\mathbf{e} \tag{8.41}$$

where each element of the vector $\mathbf{e}$ is $g(t_i)-\bar{g}$ and each element of the matrix $\mathbf{Y}$ is a scaled Schoenfeld residual $r^{*}_{SCH_{ij}}$. If the global zero-slope test is significant, then the features employed in the model result in violation of the PH assumption.
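A minimal sketch of the per-feature and global zero-slope tests as given in (8.40) and (8.41), assuming the scaled Schoenfeld residuals for the D failures are held in a (D, p) array (the function name is hypothetical):

```python
# Grambsch-Therneau zero-slope tests of the PH assumption.
import numpy as np
from scipy import stats

def ph_tests(g_time, r_star, var_beta):
    """g_time: g(t_i) for the D failures; r_star: (D, p) scaled Schoenfeld residuals."""
    D, p = r_star.shape
    e = g_time - g_time.mean()                       # centered time transform g(t_i) - g-bar
    scale = D * (e @ e)                              # D * sum[(g(t_i) - g-bar)^2]
    # per-feature tests, each compared with chi-squared on 1 d.f. (8.40)
    chi_each = (e @ r_star) ** 2 / (scale * np.diag(var_beta))
    p_each = stats.chi2.sf(chi_each, df=1)
    # global test, compared with chi-squared on p d.f. (8.41)
    ey = e @ r_star                                  # e'Y
    chi_global = ey @ np.linalg.solve(scale * var_beta, ey)
    p_global = stats.chi2.sf(chi_global, df=p)
    return chi_each, p_each, chi_global, p_global
```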
Example 2 - Cox PH regression of Primary Biliary Cirrhosis. In this example, we will be using the Mayo Clinic data for Primary Biliary Cirrhosis again to perform Cox PH regression.
First, import the PBC survival dataset. From the Analysis pull-down menu, select Survival analysis→Cox proportional hazards regression, shown as:
Once the following popup window appears, specify dead as the censoring feature, fudays as the survival time feature, select edema and stage as categorical features, and finally select the rest of the continuous and binary variables as covariates for the model:
Once the run is complete, the following treenode icons will appear:
Click on the Coefficients icon, and the univariate regression coefficients and their hazard ratios (95% CI) will be listed at the top of the output:
The univariate regression coefficient for treatment reveals that there is no significant treatment effect on survival, which coincides with the KM example performed on these data in the previous section. Many of the other covariates are significantly related to survival, and these should be included in the multiple regression model. Next, the multiple regression coefficients are listed in the output:
We notice that treatment is not significant when adjusting for potential nuisance and medical history factors among the subjects. Finally, the hazard ratios (95% CI) for the multiple regression model are listed:
Among the hazard ratios, we can observe quite high values for the edema_3 and stage_4 levels of edema and stage of disease, further attesting to the fact that these covariates should be included with the treatment feature (treatment) in the multiple regression model.
The proportional hazard assumption tests are shown below:
and indicate that the group 2 category for edema violates the PH assumption. See the -log(-log[S(t)]) vs. log(t) plot later in this section to observe this violation.
The predicted survival for each edema group is shown below:
The predicted cumulative hazard for each edema category is shown below:
The -log(-log[S(t)]) vs. log(t) plot for the 3 edema groups for this model is shown below:
and it is apparent that the group 2 survival curve crosses the other curves, suggesting that the proportional hazard assumption is violated for the categorical edema feature. The Nelson-Aalen cumulative hazard vs. Cox-Snell residual plot is shown below:
The Martingale residuals are as follows:
The deviance residuals are as follows:
[1] E.T. Lee. Statistical Methods for Survival Data Analysis. Belmont(CA), Lifelong Learning Publications, 1980. (ISBN: 0-534-97987-4).
[2] J.F. Lawless. Statistical Models and Methods for Lifetime Data, 2nd Edition. New York(NY), John Wiley and Sons, 2002. (ISBN: 978-0-471-37215-8).
[3] J.D. Kalbfleisch, R.L. Prentice. The Statistical Analysis of Failure Time Data. New York(NY), John Wiley and Sons, 1980. (ISBN: 978-0-471-05519-0).
[4] J.P. Klein, M.L. Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. New York(NY), Springer-Verlag Press, 1997. (ISBN: 0-387-94829-5).
[5] P.M. Grambsch, T.M. Therneau. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika. 81:515-526, 1994.
The Retinol dataset is based on results of a cross-sectional study to investigate the relationship between personal characteristics and dietary factors, and plasma concentrations of retinol, beta-carotene and other carotenoids [1, 2, 3]. Study subjects were patients who had an elective surgical procedure during a three-year period to biopsy or remove a lesion of the lung, colon, breast, skin, ovary or uterus that was found to be non-cancerous. Plasma beta-carotene levels were log-transformed prior to the analyses due to severe asymmetry of the residuals on the original scale. For log beta-carotene, dietary intake, regular use of vitamins, and intake of fiber were associated with higher plasma concentrations, while Quetelet index (defined as weight/height^2 in units of kg/m^2) and cholesterol intake were associated with lower plasma levels, adjusting for the other factors. There was one extremely high leverage point in alcohol consumption that was deleted prior to the analyses.
Plasma concentrations of retinol and beta-carotene were not correlated. The authors conclude that there is wide variability in plasma concentrations of these micronutrients in humans, and that much of this variability is associated with dietary habits and personal characteristics. A better understanding of the physiological relationship between some personal characteristics and plasma concentrations of these micronutrients will require further study.
Authorization: This data set can be used to demonstrate multiple regression, transformations, categorical variables, outliers, pooled tests of significance and model building strategies.
Feature | Description | Scale |
age | Age (years) | Continuous |
sex | Sex (1=Male, 2=Female) | Categorical(1,2) |
smokstat | Smoking status (1=Never, 2=Former, 3=Current Smoker) | Categorical(1,2,3) |
quetelet | Quetelet (weight/(height^2)) | Continuous |
vituse | Vitamin Use (1=Yes, 2=Sometimes, 3=No) | Categorical(1,2,3) |
calories | Number of calories consumed per day | Continuous |
fat | Grams of fat consumed per day | Continuous |
fiber | Grams of fiber consumed per day | Continuous |
alcohol | Number of alcoholic drinks consumed per week | Continuous |
cholesterol | Cholesterol consumed (mg per day) | Continuous |
betadiet | Dietary beta-carotene consumed (mcg per day) | Continuous |
retdiet | Dietary retinol consumed (mcg per day) | Continuous |
betaplasma | Plasma beta-carotene (ng/ml) | Continuous |
retplasma | Plasma Retinol (ng/ml) | Continuous |
SIZE: Excel with 314 objects, 14 features.
Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman’s behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.
The goal of this study [4] was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.
The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.
Feature | Description | Scale |
id | Identification Code | NA |
low | Low Birth Weight (0=Birth Weight ≥ 2500g, 1=Birth Weight < 2500g) | Binary(0,1)
age | Age of the Mother in Years | Continuous |
lwt | Weight in Pounds at the Last Menstrual Period | Continuous |
race | Race (1=White, 2=Black, 3=Other) | Categorical(1,2,3) |
smoke | Smoking Status During Pregnancy (1=Yes, 0=No) | Binary(0,1) |
ptl | History of Premature Labor (0=None, 1=One, etc.) | Binary(0,1) |
ht | History of Hypertension (1=Yes, 0=No) | Binary(0,1) |
ui | Presence of Uterine Irritability (1=Yes, 0=No) | Binary(0,1) |
ftv | Number of Physician Visits During the First Trimester (0=None, 1=One, 2=Two, etc.) | Continuous |
bwt | Birth Weight in Grams | Continuous |
SIZE: Excel with 189 objects, 11 features.
The purpose of this study [5] was to develop a method of classifying cancers to specific diagnostic categories based on their gene expression signatures using artificial neural networks (ANNs). ANNs were trained using the small, round blue-cell tumors (SRBCTs) as a model. These cancers belong to four distinct diagnostic categories (Ewing’s sarcoma, Burkitt’s lymphoma, neuroblastoma, and rhabdomyosarcoma) and often present diagnostic dilemmas in clinical practice. The ANNs correctly classified all samples and identified the genes most relevant to the classification. Expression of several of these genes has been reported in SRBCTs, but most have not been associated with these cancers. To test the ability of the trained ANN models to recognize SRBCTs, the study team analyzed additional blinded samples that were not previously used for the training procedure, and correctly classified them in all cases. This study demonstrates the potential applications of these methods for tumor diagnosis and the identification of candidate targets for therapy.
The 23 feature genes listed in tabular notation below were identified from the full set of 2,308 genes using greedy PTA feature selection.
Feature | Description | Scale |
fc fragment of igg | Gene expression | Continuous |
growth arrest-specific 1-36582 | Gene expression | Continuous |
fgfr4 | Gene expression | Continuous |
antigen-monoclonal antibodie | Gene expression | Continuous |
apj receptor | Gene expression | Continuous |
sarcoglycan | Gene expression | Continuous |
histone deacetylase | Gene expression | Continuous |
actin alpha 2 | Gene expression | Continuous |
eh domain containing | Gene expression | Continuous |
ests-66552 | Gene expression | Continuous |
n-acetylglucosamine receptor 1 | Gene expression | Continuous |
mitogen-activated protein kina | Gene expression | Continuous |
non-metastatic cells 4 | Gene expression | Continuous |
ribonucleaseangiogenin inhibit | Gene expression | Continuous |
glycine amidinotransferase | Gene expression | Continuous |
translocation protein 1 | Gene expression | Continuous |
recoverin | Gene expression | Continuous |
igf-2 | Gene expression | Continuous |
transmembrane 4 superfamily | Gene expression | Continuous |
myh-1c | Gene expression | Continuous |
plasminogen activator | Gene expression | Continuous |
ccaat | Gene expression | Continuous |
n-cadherin 2 | Gene expression | Continuous |
class | Diagnostic class (1-Ewing’s sarcoma(EWS), 2-Burkitt’s lymphoma(BL), 3-neuroblastoma(NB), 4-rhabdomyosarcoma(RMS) | Categorical(1,2,3,4) |
SIZE: Excel with 63 objects, 24 features.
This dataset is well suited to Poisson regression because the data are presented as the number of cases (coronary deaths) and person-years of follow-up among British doctors as a function of smoking (yes/no). Table 9.1 lists the number of coronary deaths as a function of age among non-smoking and smoking doctors.
Feature | Description | Scale |
Cases | # cases | Continuous |
PY | Person-years of follow-up | Continuous |
Age 35-44 | Age category 35-44 (1-Yes,0-No) | Binary(0,1) |
Age 45-54 | Age category 45-54 (1-Yes,0-No) | Binary(0,1) |
Age 55-64 | Age category 55-64 (1-Yes,0-No) | Binary(0,1) |
Age 65-74 | Age category 65-74 (1-Yes,0-No) | Binary(0,1) |
Age 75-84 | Age category 75-84 (1-Yes,0-No) | Binary(0,1) |
Smoke | Smoking history (1-Yes, 0-No) | Binary(0,1) |
SIZE: Excel with 10 rows, 8 features.
The data is from the Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984 [7]. A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo controlled trial of the drug D-penicillamine. The first 312 cases in the data set participated in the randomized trial and contain largely complete data. The additional 112 cases did not participate in the clinical trial, but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis, so the data here are on an additional 106 cases as well as the 312 randomized participants. Missing data items are denoted by a period.
This dataset is nearly identical to the data set found in Appendix D of Fleming and Harrington [7]. The differences are that age is in days, status is coded with 3 levels, and the sex and stage variables are not missing for observations 313-418. Table 9.2 lists the data format.
Feature | Description | Scale |
id | case number | NA |
treatment | 0-Placebo, 1=D-penicillamine | Binary(0,1) |
age | age in years | Continuous |
sexfem | sex (0=Male, 1=Female) | Binary(0,1) |
ascites | presence of ascites (0=No, 1=Yes) | Binary(0,1) |
hepatomegaly | presence of hepatomegaly (0=No, 1=Yes) | Binary(0,1) |
spiders | presence of spiders (0=No, 1=Yes) | Binary(0,1) |
edema | presence of edema (1=No edema and no diuretic therapy for edema; 2=edema present without diuretics, or edema resolved by diuretics; 3 = edema despite diuretic therapy | Categorical(1,2,3) |
bilirubin | serum bilirubin in mg/dl | Continuous |
albumin | albumin in gm/dl | Continuous |
copper | urine copper in ug/day | Continuous |
alkphos | alkaline phosphatase in U/liter | Continuous |
sgot | SGOT in U/ml | Continuous |
platelets | platelets per cubic ml / 1000 | Continuous |
cholesterol | serum cholesterol in mg/dl | Continuous |
prothrombin | prothrombin time in seconds | Continuous |
stage | histologic stage of disease (1,2,3,4) | Categorical(1,2,3,4) |
fudays | number of days between registration and the earlier of death, transplantation, or study analysis time in July, 1986 | Continuous |
dead | failure status (0=Alive, 1=Dead) | Binary(0,1) |
SIZE: Excel with 312 objects, 19 features.
[1] D.W. Nierenberg, T.A. Stukel, J.A. Baron, B.J. Dain, E.R. Greenberg. Determinants of plasma levels of beta-carotene and retinol. Am. J. Epidemiol. 130:511–521, 1989.
[3] http://staff.pubhealth.ku.dk/~tag/Teaching/share/data/Plasma.html
[4] D.W. Hosmer, S. Lemeshow, R.X. Sturdivant. Applied Logistic Regression: Third Edition. New York (NY), John Wiley and Sons, 2013.
[5] J. Khan, J.S. Wei, M. Ringnér, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, P.S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine. 7:673–679, 2001.
[6] R. Doll, A.B. Hill. Mortality of British doctors in relation to smoking: Observations on coronary thrombosis. Nat’l. Cancer Inst. Monogr. 19:205–268, 1966.
[7] T.R. Fleming, D.P. Harrington. Counting Processes and Survival Analysis. New York (NY), John Wiley and Sons, 1991.
Behrens-Fisher problem, 52
Binary features, 5
Categorical features, 5
Chi-squared test, 71
Classifier
Polytomous logistic regression (PLOG), 130
Clipboard
Copying HTML output, 15
Copying output images, 15
Coding effects
Corner-point constraints, 127
Sum-to-zero constraints, 127
Color
Background of images, 11
Cluster heat maps, 11
Datagrid selected items, 11
Filtered records, 11
Histogram bars, 11
Lines in cluster images, 11
Preferences, 5
Text in cluster images, 11
Contingency table analysis, 71
Contingency tables
Multiway, 75
Continuous features, 5
Copying
HTML output, 15
images, 15
Correlation
Matrix, 94
Pearson product moment, 83
Spearman rank, 83
Covariance, 83
Datasets
British doctors, 171
Low birth weight, 170
Primary Biliary Cirrhosis, 172
Retinol, 169
Small round blue-cell tumors, 170
Default settings, 5
Categorical feature criterion, 7
Categorical table iterations, 8
Class discovery run outputs, 8
Class prediction run outputs, 9
Convert input text to numeric, 7
Data size control, 7
Feature names, 8
Filtered record color, 11
Graphic image format, 12
Graphics, 11
Graphics resolution, 11
Histogram bar color, 11
Image background color, 11
Image line color, 11
Image save format, 7
Input binary and categorical as continuous, 7
Input string manipulation, 8
Minimum sample size, 8
Missing data codes, 6
Model building strategy, 8
Monte Carlo uncertainty analysis, 9
Output options, 6
Regression n-2-p ratio, 8
Replace missing data, 7
Report test results, 8
Script editor font, 8
Table fonts and colors, 9
Text color, 11
Dependency
Logistic regression, 122
Multiple linear regression, 101
ANOVA table, 103
Cook’s distance, 108
DFBETAS, 109
DFFITS, 109
Diagnostics, 106
Jackknife residuals, 108
Multicollinearity, 109
Partial F-tests, 111
Residuals, 108
Standardized residuals, 108
Variance inflation factors, 110
Variance partitioning, 102
Multivariate linear regression, 116
Poisson regression, 136
Polytomous logistic regression, 130
Distance
Euclidean, 83
Distribution
normal, 18
Standard normal, 37
Editor, Filtering records, 33
Editor, Labels for categories, 30
Editor, Mathematical transforms, 31
Eigenvalues, 117
Euclidean distance, 83
Feature types
Binary, 5
Categorical, 5
Continuous, 5
Text, 5
File
Category code map file, 16
Commands, 15
Export
Comma-delimited (.csv), 16
Tab-delimited text (.txt), 16
Import
Category codes, 16, 33
Comma-delimited (.csv), 16
Excel (xlsx, xls), 16
Excel file (xlsx, xls), 20
Tab-delimited text (.txt), 16
Merge Excel files, 16
Open Explorer project, 15
Save Explorer project, 16
Font
Preferences, 5
Fonts
Script editor, 8
Geometric mean, 58
Geometric standard deviation, 58
Harmonic mean, 58
Harmonic standard deviation, 58
Heteroscedasticity, 49, 65
Homogeneity of variance test, 49
Homoscedasticity, 49
Hypothesis test
Bartlett’s equality of variances, 49
Chi-squared, 71
Homogeneity of variance, 49
Kruskal-Wallis, 62
Mann-Whitney U, 53
Proportions, two independent, 69
T-test, 49
T-test, assuming equal variances, 51
T-test, assuming unequal variances, 52
T-test, independent samples, 50
T-test, paired samples, 66
T-test, variances assumed unequal, 51, 52
Welch ANOVA, 65
Welch ANOVA test, 62
Import
Category code file, 16, 33
Comma-delimited file (.csv), 16
Excel file (xlsx, xls), 16, 20
Tab-delimited text file (.txt), 16
Jacobi method, 117
Kruskal-Wallis Test, 62
Kurtosis, 19
Likert scale, 13
Linear predictor, 107
Link function, 136
Logistic regression, 122
Mann-Whitney U Test, 53
Mathematical transforms, 31
Matrix
Correlation, R, 94, 96
Covariance, , 95
Covariance, S, 94
Data, multivariate, Y, 93
Data, X, 17
Determinant, 96
Distance, D, 95
Hessian, 125, 131, 157
Identity, 97
Regression
Coefficient, , 117
Data, X, 101
Hat, H, 108
Variance-covariance of coefficients, , 104
Variance-covariance of coefficients, , 117
Standardized data, Z, 94
Variance-covariance, 106
Variance-covariance, S, 94
Matrix setup of data, 17
Mean, 18
Mean-zero standardization, 37
Median, 19
Newton-Raphson method, 125, 132, 157
Normal distribution, 18
Normal variates, 18
Normality test, see Shapiro-Wilk, 19
Normalization, 37
Paired t-test, 66
Percentiles, 40
Polytomous logistic regression, 130
Preferences, 5
Quantile, 19
Range, 19
Recoding features
Fuzzification, 41
Logarithm, 38
Nominal-to-binary, 41
Normalization, 37
Percentiles, 40
Quantiles, 39
Ranks, 39
Standardization, 37
van der Waerden scores, 40
Z-scores, 37
Regression
Cox proportional hazards, 155
Deviance, 126
Diagnostics, 106, 157
Goodness-of-fit, 125
Logistic (polytomous), 130
Logistic regression, 122
Logistic, unconditional (binary), 123
Multicollinearity, 109
Multiple linear regression, 101
Multivariate linear regression, 116
Newton-Raphson method, 125, 132, 157
Partial F-tests, 111
Poisson, 136
Polytomous logistic regression, 130
Residuals, 108, 125
Simple linear, 101
Screenshots
Copying HTML output to clipboard, 15
Copying images to clipboard, 15
Shapiro-Wilk normality test, 19
Singular value decomposition, 117
Skew-zero transform, see van der Waerden transform, 41
Skewness, 19
Standard deviation, 18
Standard normal distribution, 37
Standardization, 37
Survival analysis
Breslow’s cumulative hazard, 158
Cox proportional hazards regression, 155
Cox-Snell residuals, 158
Deviance residuals, 158
Hazard function, 151
Kaplan-Meier analysis, 151
Log-rank test, 151
Martingale residuals, 158
Nelson-Aalen baseline cumulative hazard, 152
Product limit estimate of survival, 151
Scaled Schoenfeld residuals, 158
Schoenfeld residuals, 158
Stratum-specific baseline hazard, 159
Survivorship function, 151
Tests of PH assumption, 159
Tests of hypothesis
2-by-2 table frequencies, 71
ANOVA (one-way), 59
Bartlett’s equality of variances, 49
Categorical data, 71
Chi-squared, 71
Contingency tables, 71
Kruskal-Wallis, 62
Mann-Whitney U test, 53
McNemar paired binary outcomes, 77
Paired t-test, 66
Permutation, 77
Proportions, two independent, 69
r-by-c table frequencies, 71
Randomization, 77
T-test, 49
T-test, assuming equal variances, 51
T-test, assuming unequal variances, 52
T-test, independent samples, 50
T-test, paired, 66
Wilcoxon rank sum, 55
Wilcoxon signed rank, 67
Text features, 5
Transformations
Categorical-to-Class, 45
Continuous-to-categorical, 42
Fuzzification, 41
Inside CV folds, 45
Logarithm, 38
Nominal-to-binary, 41
Normalization, 37
Percentiles, 40
Quantiles, 39
Ranks, 39
Standardization, 37
Text-to-categorical, 43, 44
van der Waerden scores, 40
Z-scores, 37, 38
van der Waerden scores, 40
Variance, 18
Welch ANOVA test, 62