Explorer - User’s Guide
Standard (Free) Edition
©NXG Logic, LLC.
October 7, 2019
©Copyright 2019 by NXG Logic, LLC.
All rights reserved. This book or any portion thereof may not be reproduced or used in any manner whatsoever without the express
written permission of the publisher except for the use of brief quotations in a user guide review.
Printed in the United States of America
ISBN: 978-1-5323-6817-2
United States Library of Congress - US Copyright Office Registration: 1-6268467801
NXG Logic, LLC.
Houston, Texas 77030
www.nxglogic.com
There are four types of features used in NXG Explorer: continuous, binary, categorical, and text. Feature type is determined when a dataset is input. To distinguish the three numeric types, NXG Explorer enumerates the number of unique values identified for each feature during data input. The number of unique values is compared with a default setting called the categorical threshold (default=10). If the number of unique feature values exceeds the categorical threshold, the feature is specified as continuous. If both the number of unique values and the maximum value are less than the categorical threshold, the feature is specified as categorical. However, if a feature has exactly two unique values with min=0 and max=1, the feature is specified as binary. Any time text is identified within an input feature's values, the feature is specified as a text feature.
The table below illustrates the feature types along with the graphic icon used during feature selection for all analyses.
NXG Explorer searches for a decimal in the input values, and if found, the feature is set to the continuous type.
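To make these typing rules concrete, the sketch below restates them in Python. It is only an illustration of the logic described above, not NXG Explorer's actual source code, and the function name and threshold argument are ours.

```python
def infer_feature_type(values, categorical_threshold=10):
    """Illustrative restatement of the feature-typing rules described above."""
    numeric = []
    for v in values:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            return "text"                      # any text value -> text feature
    if any(not x.is_integer() for x in numeric):
        return "continuous"                    # a decimal anywhere forces continuous
    unique = sorted(set(numeric))
    if unique == [0.0, 1.0]:
        return "binary"                        # exactly two values, 0 and 1
    if len(unique) > categorical_threshold or max(unique) >= categorical_threshold:
        return "continuous"                    # too many (or too large) unique values
    return "categorical"

print(infer_feature_type([1, 2, 3, 2, 1]))     # categorical
print(infer_feature_type([0, 1, 1, 0]))        # binary
print(infer_feature_type([1.5, 2.7, 3.1]))     # continuous
```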
An example of the feature codes for the Retinol dataset in Explorer is shown below:
From time to time, users will want to modify some of the default settings in NXG Explorer, such as heat map colors, text and line colors on charts, and text formatting during data input. To change default settings, select Preferences → General default settings as shown below:
and the following popup window will appear:
The following options available in the popup window are described below.
Remove spaces and special symbols on input. This filter provides a useful way to remove common errors in input files, such as empty spaces, dashes, Excel error codes, or special symbols. The identified data cells are set to null; no periods or replacement codes are inserted.
Missing data code. If a preset missing data code such as NA or 9999 was used during coding, it can be entered in this textbox. NXG Explorer will replace any table cells containing the missing data code with a null value.
Remove empty rows in Excel. This option will remove empty rows from the input dataset.
Required proportion of records with string entries for text features. This option helps guard against the input of messy features containing, for example, half text and half numeric values. If string characters are found in a feature’s values, then this proportion can be used as a threshold for determining which features are text-based. For example, when set to 0.8, if less than 80% of the input records for an input feature contain text, then the feature will not be identified as a text-based feature, and any records with text will be set to nothing. On the other hand, if more than 80% of the input values for this feature contain text, then the entire feature will be input as a text feature. Use of this threshold is a good way to enforce that true text-based features do indeed contain mostly text.
Empty output folder at program startup. Due to Windows security issues, especially in Windows 10, there are limitations on the folders which can freely be used for output by installed software. By default, at run-time, NXG Explorer will create an output folder called "NXG Explorer" in the user's Documents folder, which will contain all output files. When this option is selected, the entire folder will be deleted before each run, preventing the accumulation of old files from previous runs. We have used this option most of the time and have never run into any problems, knowing full well that if an output is to be kept, we specifically save outputs to working folders used for report-writing or manuscript preparation. Otherwise, if this option is not selected, the output files from all runs will be placed in this folder without being purged.
Remove logos. This option will remove NXG Logic logos from various output files.
Apply. When checked, will apply the following selected option (only one selection allowed).
With zeroes. Missing continuous values will be replaced with zeroes.
With feature means. Missing continuous values will be replaced with the average of the valid input values.
With feature medians. Missing continuous values will be replaced with the median of the valid input values.
Apply. When checked, will apply the following selected option (only one selection allowed).
With zeroes. Missing binary values will be replaced with zeroes.
With ones. Missing binary values will be replaced with ones.
Apply. When checked, will apply the following selected option (only one selection allowed).
With zeroes. Missing categorical values will be replaced with the lowest feature category value (level of the factor).
With ones. Missing categorical values will be replaced with the greatest feature category value (level of the factor).
Available formats. NXG Explorer available default image formats are png, wmf, jpg, gif, tif, bmp, emf, exi, ico, and memorybmp.
Max limit for display of input data points. For large datasets, limitations can be set on the maximum size of data which can be displayed in the visible datagrid.
Recode text categories into numeric codes on input. This option (true, false) will prompt the user with text-to-numeric conversion options during data input to assign numeric class membership to various levels of a categorical text field. If text is found in a feature’s value when set to true, NXG Explorer will tabulate the number of unique text strings found among the features data and prompt the user if the unique text words should be transformed into numeric class values. Therefore, the command will result in text-to-numeric class assignments when specified by the user. When set to false, and text is found in features upon input, the features will remain as text-based features.
Default level of categorical features: 10. This default value, known as the categorical threshold, forms the basis for determining which features are to be specified as categorical and continuous. The section above on “Feature types” describes how this threshold is used. Quite basically, if the number of unique feature values exceeds this default setting, the feature will be specified as a continuous feature, and categorical otherwise.
Input binary and categorical as continuous. When checked, this option will change the feature type from binary or categorical to continuous.
Minimum sample size for grouped tests. This setting specifies the smallest sample size allowed for grouped hypothesis tests. Below a sample size of 10, which for a 2-sample test averages 5 records per group, the robustness and reliability of a hypothesis test degrade as sample size decreases. Users can set this criterion to whatever value they choose.
Report reduced test results table. For any 2- or k-sample test of equality of means for continuous features, this option will cause NXG Explorer to output a reduced table of results based on the choices listed below. This option is particularly useful for presentations and publications, when not all of the results are needed in published tables. The more detailed tables generated when this option is not specified are more amenable to research and exploration, when more details are needed.
Regression: Minimum n-to-p ratio. This option determines the minimum allowable ratio of sample size (n) to the number of model features (p). If the ratio of the sample size to model-based features is less than the specified value, regression runs will be disabled. It is recommended that an n:p ratio of 10 be used, which follows the typical recommendation of having 10 records for every regression parameter used.
Model building strategy screen p-value. During regression, a univariate regression model whose p-value is less than the specified value will be included in the following multi-feature regression model.
Model building strategy table p-value. Regression covariates whose p-values are less than the specified value will be listed in the multi-feature regression output.
Iterations for R × C categorical tables - randomization tests (Fisher's Exact). This value determines the number of iterations used during an empirical p-value test for chi-squared contingency table analysis, in order to mimic the Fisher exact test. A default value of 100,000 is recommended; however, values of e.g. 10,000 should not result in unreliable results, based on Monte Carlo statistics.
Change font. This default setting changes the font typeface for the label and transform editor.
Font size. This default setting changes the font size for the label and transform editor.
Create unique feature names on input. If there are long feature names among the input features, this option can be beneficial, since many long feature names can be drastically shortened while remaining unique. By default, this option is turned off.
Input feature names as lower case. This option allows users to reduce any upper case characters among input feature names to lower case. Many software packages now work with lower case feature names as an attempt to get away from use of case-sensitive feature names.
Drop special characters from input feature names. When this option is selected, special characters such as #, $, %, etc. will be filtered out of input feature names.
Dimension reduction only. Class discovery runs include non-linear manifold learning (dimension reduction) techniques, which have varying output. When this option is specified, only the reduced 2D feature values are written to an Excel file, along with a 2D scatter plot of the scores.
Full output. This option will result in the full output from class discovery runs, which can include a variety of different charts and Excel files, depending on the algorithm used.
Use corr matrix for CMF output(Cov when unchecked). During PCA, component subtraction, and Monte Carlo analysis, when checked, the correlation matrix will be used, vs. use of the covariance matrix when unchecked.
Transform features during CV. When checked, this option will employ feature transformation over all objects within training folds, and apply parameters to objects in the testing fold.
Mean-zero standardization. Specifies that feature value mean-zero standardization will be used for feature transformation. Feature mean and s.d. will be derived from objects in the training folds, and will be applied to objects in the testing fold.
Normalize into range [0,1]. Specifies that feature value normalization to the range [0,1] will be used for feature transformation. The feature minimum and maximum will be derived from objects in the training folds, and will be applied to objects in the testing fold.
Discrete features are zero-based. When checked, this option will establish a zero base for any simulated discrete features.
Add distribution parameters to feature names. When checked, this option will append feature names with an abbreviated name of the probability distribution used as well as the parameter values.
This section applies to the format of the table with summary statistics (command: Analysis → Summarize) and hypothesis test results (command: Analysis → Independence → Unpaired 2- and k-sample tests). To specify the default table fonts and colors, select Preferences → Default table fonts and colors, as shown below:
and you will see the following popup window:
The various options in the popup window are described below.
Font. The font used for the entire table.
Font size. The font size used for the entire table.
Header font color. Color of the header text.
Row font color. Color of the text in rows.
Table border width. Width of the border.
Table border color. Color of the border.
Header border width. Width of the header’s border.
Header border color. Color of the header's border.
Cell border width. Width of cell border.
Cell border color. Color of the cell border.
Header background color. Background color of the header.
Odd row background color. Background color for odd rows. Note: row numbering starts with row 0 below the header row.
Even row background color. Background color for even rows.
Make header background same as odd row background. This option will set the background of the header to be the same as the background for odd rows.
Auto-color picker (base color). This color option allows the user to set the base color for the entire table.
To change color and graphics default values, select Preferences → Color and graphic defaults, shown below:
and the following popup window will appear:
Low resolution. This option results in low resolution graphics near 96 dpi.
High resolution. This option results in higher resolution graphics at 300 dpi.
Histogram bar color. This default color option specifies the color choice for histogram bars.
Histogram spline color. The default color of the line when spline fitting is specified for histograms.
Histogram bins. The default value of the number of histogram bins.
Datagrid selection color. This default option specifies the color of either the alternating row colors in the datagrid, or selected rows after record filtering has been invoked.
Gradient 1st color. This gradient color defines the color starting at the top of the cluster image.
Gradient 2nd color. This gradient color defines the color at the bottom of the cluster image.
Line color. This color specifies the color of the dendrograms for objects and features.
Font color. This color specifies the color of all object and feature names, as well as header information regarding the distance metric used and scale for heat map colors. Note that if “Alternate object name colors” is checked, then those alternating colors override the default text font color.
Alternate object name colors. This option only applies when the true classes are input with the dataset. When checked, the color of object names within classes will vary in the order blue, green, red, and black. Modulo 4 is used for the object’s class numbers. If no classes are input with the data, then the default font color will be applied.
Missing values. This color specifies the heat map cell colors for missing values.
This default setting specifies the default color gradient used for the U-matrix and feature-specific component plots, which are generated when self-organizing maps are used.
This default setting specifies the default color combination used for heat maps in cluster images.
Some example cluster heat map results are shown as follows:
Default image (save) format. This default setting specifies the file format for all graphic output files. The choices are png, wmf, jpg, gif, tif, bmp, emf, exi, ico, and memorybmp.
Windows color palettes. Numerous Windows-based color palettes are available for use with 2D, 3D, line, and X-Y scatter plots. If object classes are specified at input, then classes will be assigned the series colors which are selected as default. Otherwise, in YYYY and XYYY line or scatter plots, the series-specific colors will be applied to the features or dimensions used.
Custom color palette. You can specify which default series colors are to be used by clicking on each series color, and then selecting the desired color. After clicking on Apply, the default custom colors will be applied, as long as the option above the custom colors is selected.
You can specify which default series colors are to be used for class-specific colors in 2D and 3D plots generated during unsupervised class discovery.
Default series colors are also available for object class in the multichart plot generated at the end of unsupervised class discovery.
Default fonts can be selected for class discovery plots, as well as line and scatter plots (YYYY, XYYY), and multipanel plots.
Line only. This selection will result in lines between data points in YYYY and XYYY line (scatter) plots.
Points only. This selection will result in points being plotted in YYYY and XYYY line (scatter) plots.
Line and points. This selection will result in lines and points being plotted in YYYY and XYYY line (scatter) plots.
X- and Y-axis thickness. This selection specifies the X- and Y-axis line thickness in YYYY and XYYY line (scatter) plots.
Changing categorical features to continuous. On occasion, you may need to change categorical features to continuous. As an example, suppose you have integer-based scores on a Likert scale for which the item response is 1-Strongly disagree, 2-Disagree, 3-Neutral, 4-Agree, 5-Strongly agree. By default, this particular feature will be read in as a categorical type; however, for the analysis you may want to use means and standard deviations, perhaps for t-tests, ANOVA, or regression. There are two ways to change categorical features to continuous:
Then click on Apply. You will need to reload the input data, and then, for each text feature, NXG Explorer will code the feature as continuous.
The following popup window will appear:
Simply click on continuous for the categorical features you want changed.
Changing text features to categorical. You cannot use the popup window above to change text features to categorical. Rather, this has to be done in one of two ways:
Then click on Apply. You will need to reload the input data, and then, for each text feature, NXG Explorer will identify the number of unique categories based on the text values, and prompt you to convert to categorical for each text feature found in the data. (If you have a lot of text features, this method is not recommended, see the next option).
Then select the text feature you want converted to categorical and NXG Explorer will identify the number of unique values among the text values and convert the feature to categorical.
Right-clicking on HTML output. Whenever the Output tab is activated after clicking on a treenode that was generated during a run, a region of interest can be selected for copying and pasting by right-clicking anywhere in the HTML output and then dragging the rectangular "rubber band" around the region of interest. As soon as the mouse button is released, the selected ROI will be copied into memory and can be pasted into any software that accepts Windows objects (Word, PowerPoint, etc.).
Shift-Right-clicking on HTML output. If the Shift key is pressed and then the mouse pointer is right-clicked on an HTML output, the Windows printer options will appear.
Right-clicking on any output image will raise a popup window, which will show the following choices:
If Region of interest is selected for copying, then click on the left mouse button and drag the rectangular "rubber band" around the region of interest. As soon as the left mouse button is released, the selected ROI will be copied to the Windows clipboard and can be pasted into any software that accepts Windows objects (Word, PowerPoint, etc.). If Copy is selected, the entire image will be copied to the Windows clipboard, which can then be pasted into other Windows applications that accept Windows objects. Otherwise, if Save is selected, several image format options will appear before saving the image.
This option opens an already-saved Explorer project, for which the filename extension is .nxgp.
This option saves an already-saved Explorer project, for which the filename extension is .nxgp.
This option saves a dataset in memory to an Explorer project, for which the filename extension is .nxgp.
Excel. This option imports either an Excel file (.xlsx, or .xls), comma-delimited file (.csv), or a tab-delimited text file (.txt).
Category codes. This option allows the import of a tab-delimited text file (.txt) to affix category labels to levels of the factors for which labels are provided. The command can be invoked (used) before or after data input. See the catcodes_Retinol.txt file for an example of category label formats for the three categorical factors in the Retinol dataset (sex, smoking status, and vitamin use).
Category code map. This option allows the import of a tab-delimited text file (.txt) with categorical feature names followed by a mapped term which is used in the dataset. For example, consider the following example categorical features and their counterpart mapped terms which are used in a dataset:
Factor | Feature name | Mapped name |
Arm | arm | armcode |
Sex | sex | sexcode |
Smoking status | smokstat | smokstatcode |
In the case of the Arm factor, the dataset contains the mapped term armcode, however, the user wants to use a feature name arm in all the analyses. Use of this command option will allow the user to translate a mapped term to any desired feature name.
Merge Excel. Use this option for merging multiple files together. Since 1-to-1 and 1-to-many merges cannot be specified, you should import the largest file (with a primary key) first, followed by smaller files (with primary keys). In this fashion, many-to-fewer merges can be applied successively to achieve the desired final merged file.
Comma-delimited (.csv). Selecting this option will export data in memory to a comma-delimited .csv file, for which the name must be provided.
Tab-delimited text (.txt). Selecting this option will export data in memory to a tab-delimited text (.txt) file, for which the name must be provided.
This chapter covers NXG Explorer’s Summarize methods for summarization of variables in a dataset. Construction of a typical dataset is first discussed, which is followed by measures of central tendency for continuously-scaled variables, followed by a brief discussion of categorical data analyses.
A set of continuously-scaled data are often arranged into individual records and variables. A record represents an individual measurement or observation made from one unit of study, such as a patient, mouse, cell, gene, or protein. Variables, on the other hand, represent the collection of observations for all the records. When arranging the data in tabular form, records are arranged in rows, where the index $i$ is used to denote each record, with range $i = 1, 2, \ldots, n$. Variables are arranged in columns, where the index $j$ is used to denote each variable, with range $j = 1, 2, \ldots, p$. Let us consider the random variable $x_{ij}$ for representing a single observation, where the subscript $i$ denotes the $i$th record and subscript $j$ denotes the $j$th variable. A collection of multiple $x_{ij}$'s for many observations can be arranged in tabular form according to the arrangement in Table 2.1. The tabular arrangement of data in Table 2.1 can also be represented by use of a matrix. For $n$ records and $p$ variables, we introduce the matrix X in the form:
$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \qquad (2.1)$$
Each row in X represents a record, and each column represents a variable. A matrix is characterized, among other things, by its dimension. The dimension of X for our example is $n \times p$, called "$n$ by $p$," representing the $n$ records and $p$ variables. Each element, denoted $x_{ij}$, represents the observation for the $i$th record and the $j$th variable in the data set. Because the matrix notation can be used to represent any record or variable in the data set, matrices occupy a very powerful area of statistics based on what is known as matrix algebra. As we shall see in the remaining chapters, matrix algebra is fundamental to many areas of statistics, including analysis of variance, maximum likelihood modeling, etc.
However, before we address more advanced topics, let's stick with methods used for describing the distribution of a random variable like $x$. When considering a set of data, independent of study design or analysis, a fundamental question among all statisticians is: "How are the data distributed?" With regard to scale of measurement, are the data categorical and measured on the nominal scale, ordinal scale, or are the data continuous and measured on the ratio scale? In terms of probability distributions, when the data are arranged into a histogram, are they normally distributed with a bell-curve shape, distributed according to the binomial distribution, Poisson distribution, or multinomial distribution? A frequency histogram (distribution) of most data used in statistics can be characterized by a specific probability distribution that captures information related to the scale of measurement, observation response, and type of variable.
There are many kinds of continuous distributions. The one we will be most concerned with is called the normal distribution. Normal distributions are also known as Gaussian distributions, and consist of a sample of normal variates. A frequency histogram of normal variates is typically bell-shaped, with the bulk of the data lying in the middle of the curve and smaller, less frequent data residing in the vanishing tails. The particular shape of a histogram is characterized by its central tendency and dispersion.
For normal distributions, two parameters used to distinguish central tendency and dispersion are the mean, or average, and the standard deviation. Let $x_1, x_2, \ldots, x_n$ be a random sample of size $n$ from the distribution of random variable $x$. The mean of the variates is denoted $\bar{x}$ and is a measure of central tendency that reflects the average value of $x$:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (2.2)$$
The mean is also equal to the expected value of $x$, in the form
| (2.3) |
Terminology used for the mean is $\mu$ for a population and $\bar{x}$ for a sample. Typically $\bar{x}$ is used, since one usually never knows the true population mean, but rather an estimate of the sample mean obtained through some kind of sampling strategy. The sample variance of variable $x$ is a measure of dispersion reflecting the overall width of the distribution, and takes the form
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad (2.4)$$
and the standard deviation of variable $x$ is simply the square root of the variance, functionally composed as

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (2.5)$$
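As a quick check of Eqs. (2.2), (2.4), and (2.5), the short Python sketch below computes the sample mean, variance, and standard deviation for an arbitrary list of values; the function name and example data are ours.

```python
import math

def summary_stats(x):
    """Sample mean, variance (n-1 denominator), and standard deviation."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    return mean, variance, math.sqrt(variance)

mean, var, sd = summary_stats([5, 7, 9, 10, 11, 17, 23, 27, 31, 62])
print(f"mean={mean:.2f}, variance={var:.2f}, sd={sd:.2f}")
```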
The variance can also be determined by using expected values. The expected value of a random variable is
| (2.6) |
and the variance of the random variable becomes
| (2.7) |
Note that there was a shift in the notation used for several of the parameters introduced above. Various forms of notation are used interchangeably. These are described as follows.
| (2.8) |
There are several other important summary statistics for the normal distribution, which are the range, median, mode, skewness, and kurtosis. The range can be simply determined by use of the relationship
$$\text{range} = x_{\max} - x_{\min} \qquad (2.9)$$
The range of a distribution is informative about how far apart the most extreme data points are, which is useful when comparing one variable with another. For example, if $x_{\max}$ and $x_{\min}$ for variable $x$ are 2.53 and 0.4, the range is 2.13; whereas, if for variable $y$ the values of $x_{\max}$ and $x_{\min}$ are 79.6 and 21.2, the range is 58.4. A large difference in range is also known as a large change in scale, affecting the mean difference and covariance between two variables. Another statistic of interest is the median, which reflects the value of a variable at the midpoint of the distribution. The median can easily be found by sorting the data in a distribution in ascending order and identifying the midpoint. Leaving the variable subscript aside, let $x_1, x_2, \ldots, x_n$ be a set of normal variates and $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ be their rank-ordered counterpart. The median is equal to the middle rank-ordered variate. Definition of the median introduces a new concept called the $q$th quantile. A quantile is any value among the rank-ordered variates $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$. Consider 100 rank-ordered normal variates such that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(100)}$. The lower quartile is the 25th percentile, which is equal to $x_{(25)}$. As mentioned previously, the median is the 50th percentile and for this example would be equal to $x_{(50)}$. The upper quartile is the 75th percentile and is equal to $x_{(75)}$.
In cases when the most frequent value is sought, the mode is determined, which is the most frequently occurring value. The mode occurs at the peak of the distribution, since the $x$-value at this location is the most frequent.
We now introduce the $r$th moment about the mean of a distribution. The $r$th moment about the mean is defined as

$$m_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r \qquad (2.10)$$
The coefficient of skewness is

$$\sqrt{b_1} = \frac{m_3}{m_2^{3/2}} \qquad (2.11)$$
and the coefficient of kurtosis is
$$b_2 = \frac{m_4}{m_2^2} \qquad (2.12)$$
Shapiro-Wilk Normality Test. NXG Logic Explorer also performs a normality test on each continuously-scaled feature to determine if the feature is normally-distributed. The Shapiro-Wilk test [1] first sorts the values of each feature in ascending order: $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Next, a standard normal variate $m_i$ is generated for each $x_{(i)}$ based on its rank-based percentile; these are then squared and summed to give
| (2.13) |
Next, calculate , and determine a series of values using the relationships:
| (2.14) |
| (2.15) |
and finally, for , we use
| (2.16) |
where
| (2.17) |
The lower values on the other end of the scale are , and . The statistic is calculated as
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (2.18)$$
The Shapiro-Wilk test statistic is standard normal distributed with
| (2.19) |
where
| (2.20) |
and
| (2.21) |
Extreme values of the standardized test statistic indicate departure of the data being tested from the normal distribution. Hence, if the p-value of the Shapiro-Wilk test is less than 0.05, the test result indicates that the feature being tested is not normally-distributed.
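Explorer runs this test automatically for each continuous feature. To reproduce the same decision rule outside Explorer, one option is SciPy's implementation of the Shapiro-Wilk test; the sketch below assumes only a numeric sample and the conventional 0.05 cutoff.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(size=200)            # a deliberately skewed (non-normal) sample

w, p = stats.shapiro(x)                # Shapiro-Wilk W statistic and p-value
if p < 0.05:
    print(f"W={w:.3f}, p={p:.4f}: evidence against normality")
else:
    print(f"W={w:.3f}, p={p:.4f}: no evidence against normality")
```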
Example 1 - Summary Statistics. This example involves importing a Microsoft Excel file into NXG Explorer and then generating summary statistics for the variables in the file. Open the Retinol dataset, which includes records for research subjects who participated in a nutritional investigation of the antioxidants retinol and beta-carotene.
First, import the retinol.xlsx file in the Data directory where NXG Explorer is installed. This is done by selecting File → Import → Excel, and then selecting the Retinol.xlsx file.
Another popup window will appear which allows the user to specify the input format of the Excel workbook which is being imported. Let the default format be the choice, then click Apply:
After the Retinol.xlsx filename is selected and double-clicked, the data tab will show the data that were input (see image below).
Next, select the Summarize command in the Analysis pull-down menu:
You will now see a list of variables,
Click on Select all:
You will then notice check marks placed in all of the checkboxes,
Next, click on the Apply button, and the summary statistics will be generated, revealing a tree-node in the treeview on the left:
Click on the Summary statistics treenode, and the results will be shown in the Output tab:
Click on the Category frequencies icon, and a listing of categorical summary statistics will be shown for the categorical features:
Next, click on the Histogram icon, and you will notice a histogram plot that was generated for each feature:
To manually generate histograms for all the continuous variables, click on the Histogram command of the Graphics pull-down menu:
Next, uncheck the categorical variables representing sex, smoking status (smokstat), and vitamin use (vituse):
Click on Apply in the next popup window:
Select the betadiet variable (treenode) from the treeview:
The following plot shows the histogram for the betadiet variable, revealing a log-normally distributed variable with right-tail skewness.
[1] Royston, P. An extension of Shapiro and Wilk's W test for normality to large samples. Applied Statistics, 31: 115–124; 1982.
NXG Explorer provides a label and mathematical transform editor which can be employed to assign category labels to the various levels of a nominal or ordinal categorical feature. The editor can also be used for creating new features based on mathematical transforms of existing features, or dummy indicator (binary) features with values (0,1).
The editor is invoked by selecting the Labels and transforms command of the Preferences pull-down menu.
The image below shows the editor after it opens.
Next, navigate to the C:\Program Files\NXG Logic\NXG Logic Explorer\Scripts folder and open the script file called retinol.nxgs, as shown by
In the image shown below, after opening the script file, you will notice any existing script within the file.
The scripts are divided into two sections that begin with either labels and/or trans, followed by the relevant syntax used in the specific section.
Recall, the categorical features in the Retinol dataset are as follows:
Variable | Explanation (Unit) |
sex | Sex (1=Male, 2=Female) |
smokstat | Smoking status (1=Never, 2=Former, 3=Current Smoker) |
vituse | Vitamin Use (1=Yes, 2=Sometimes, 3=No) |
Labels for various levels of categorical factors (features) can be assigned quite simply by entering the level of the factor and listing the name of the category (level) immediately after, as shown below:
Number of categories | Command |
2 | catvarname^1 cat1label^2 cat2label |
3 | catvarname^1 cat1label^2 cat2label^3 cat3label |
4 | catvarname^1 cat1label^2 cat2label^3 cat3label^4 cat4label |
5 | catvarname^1 cat1label^2 cat2label^3 cat3label^4 cat4label^5 cat5label |
k | catvarname^1 cat1label^2 ... ^k catklabel |
An example for the categorical factors in the Retinol dataset for smoking status, vitamin use, and gender, would be:
To invoke (implement) the label and transformations listed in the table above, open the Retinol dataset, open the script file, and then select Run.
The labeling results can be observed in the following image which reveals the category labels in an ANOVA test of several continuously-scaled and categorical variables in the Retinol dataset on the 3-category feature vituse.
Examples of mathematical transformations available in NXG Explorer are listed in the following table:
Type | Command | Result |
Binary | male=sex==1; | males=1, females=0 |
 | female=sex==2; | females=1, males=0 |
 | smok1=smokstat==1; | smokstat(1)=1, other=0 |
 | smok2=smokstat==2; | smokstat(2)=1, other=0 |
 | smok3=smokstat==3; | smokstat(3)=1, other=0 |
 | agegt70=age>70; | age>70 = 1, age<=70 = 0 |
 | agege70=age>=70; | age>=70 = 1, age<70 = 0 |
 | agelt70=age<70; | age<70 = 1, age>=70 = 0 |
 | agele70=age<=70; | age<=70 = 1, age>70 = 0 |
Math | ageplus5=age+5; | age+5 |
 | ageminus5=age-5; | age-5 |
 | agetimes5=age*5; | age*5 |
 | ageover5=age/5; | age/5 |
^2 | agesq=age^2; | age*age |
sqrt | agesqrt=sqrt(age); | sqrt(age) |
log | logage=log(age); | log_e(age) |
log10 | log10age=log10(age); | log_10(age) |
exp | expage=exp(age); | exp(age) |
Combination | femagegt70=female*agegt70; | females with age>70 = 1, other = 0 |
 | malesmok3=male*smok3; | males with smokstat(3)=1 set to 1, other = 0 |
Examples of new features are:
Results of several of the mathematical feature transformations are listed in the figure below.
Another method for inputting and affixing category labels is to import them from a text file directly, instead of using the Editor.
Example 1 - Assign category labels to the Retinol dataset. First, import the retinol.xlsx file in the Data folder where Explorer is installed. Next, import the category-label text file distributed with Explorer, using the pull-down menu command FileImport DataCategory codes (labels), and then select the catcodes_Retinol.txt file. Once the file is imported, the category labels will be assigned to the levels of the categorical features (sex, smoking status, and vitamin intake). You will now be able to see the categorical labels in most of the analyses which use categorical features.
Consider the case where only females in a dataset are to be filtered out and used for hypothesis testing. The first effort should entail filtering on a gender-related feature. Let’s use the Retinol dataset, for which the gender feature is called sex.
To get started, open the Filter form by selecting Filter on the Analysis pull-down menu:
Next, specify the sex categorical feature and level 2, then click on Apply.
The filtered records will then be highlighted in the datagrid, and also will be selected in memory for future use.
Summary statistics for female records in the Retinol data set are shown below:
NXG Explorer can perform grouped analysis of most of the analytic modules. Let’s look at an example involving summary statistics of the Retinol dataset, grouped on smoking status. For this example, we will also label the smoking categories using methods outlined earlier in this chapter.
Example 2 - Summary statistics of Retinol dataset, grouped by smoking status.
1. To begin, import the Retinol dataset.
2. Apply the labels provided in the retinol.nxgs file (see earlier labeling example).
3. Select Analysis → Grouped Analysis, as shown in the image below:
4. Next, select the smokstat feature, shown as:
5. Click on Apply, and you will notice that a red background is now shown in the warning textbox to the upper right of the user screen, shown as:
6. In the next popup window that appears, select the features age, quetelet, calories, fat, and fiber.
7. When the run is complete, you will notice three red icons representing the three smoking status groups, with the accompanying group names and categorical codes listed in parentheses next to each red icon:
Click on the Summary statistics icon beneath the red icon with label “Summarize(smokstat=3),” and the summary statistics table for current smokers will be shown as follows:
When using Explorer, filtering and grouping do not work together. Filtering records should only be followed by non-grouped analyses such as summary statistics, correlation, or regression after the specified records have been selected. Grouped analysis, on the other hand, can be used for 2- and k-group hypothesis testing, summary statistics, regression, and Kaplan-Meier grouped analysis.
In summary: use the following sequence:
Filtering (gender=males): can run summary statistics, correlation, regression, but not independence tests (T-tests, MW, ANOVA, KW) or KM grouped survival analysis.
Grouped analysis (gender): can run summary statistics, independence (T-tests, MW, ANOVA, KW), regression, correlation, and KM survival analysis.
For example, if you want to run ANOVA separately by gender, then perform a grouped analysis using gender as the grouping variable, and then run ANOVA using e.g. treatment arm as the categorical factor, and age, BMI, and protein expression as continuous features.
The range and scale of a feature can have a large impact on the results of an analysis, especially if the heterogeneity of range and scale among the inputs are widely varying. NXG Explorer can transform features (variables) using various preset methods to remove skewness or to change the scale and range.
Once a dataset is opened, the preset transformations can be accessed from the Transformations command of the Analysis pull-down menu, as shown below.
The following sections describe the various preset transformations provided in NXG Explorer.
Normalization re-scales the values of a continuous feature to be in the range [0,1] using the relationship
$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \qquad (4.1)$$

where $x_i$ is the $i$th value of the feature, $x_{\min}$ is the minimum value of the feature, and $x_{\max}$ is the maximum.
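A minimal sketch of Eq. (4.1) in Python (the function name is ours):

```python
def normalize_01(x):
    """Re-scale values into [0,1]: (x_i - min) / (max - min), per Eq. (4.1)."""
    x_min, x_max = min(x), max(x)
    return [(xi - x_min) / (x_max - x_min) for xi in x]

print(normalize_01([5, 7, 9, 10, 62]))   # 5 maps to 0.0 and 62 maps to 1.0
```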
Normal distributions can vary by central location and dispersion based on the mean and variance. A frequency histogram for a group of patients having a mean weight of 130 pounds and standard deviation 12, will be centrally located 70 units (pounds) to the left of a histogram for patients with mean weight 200 pounds and standard deviation 12, when plotted on the same graph. Distributions with varying means and standard deviations are called scale and range dependent, since their location and spread on the same histogram will be different. A powerful transformation which can be applied to any normal distribution to remove the effects of scale and range is called the Z-transformation.
Let $\bar{x}$ and $s$ be the mean and standard deviation of a normal distribution of sample size $n$ = 1,000. The $Z$-score for each of the normal variates is given as

$$z_i = \frac{x_i - \bar{x}}{s} \qquad (4.2)$$
where $x_i$ represents a normal variate, and $z_i$ is a newly derived standard normal variate. The process of subtracting the mean from each normal variate and dividing by the standard deviation is called standardization, which results in a standard normal distribution. A common property of all standard normal distributions is that they all have a mean equal to zero and a standard deviation of one, with variance equal to one ($\sigma^2 = 1$). In addition, the Z-score equates to the number of standard deviations each normal variate lies away from the mean. Large values of $z$ above 2 are said to be "2-sigma" away from the mean, and are commonly considered as extreme values. For a tail probability of 0.05, the value of $z$ in the negative tail is -1.65, whereas in the positive tail the value of $z$ is 1.65. When the tail probability is divided by 2, that is 0.05/2 = 0.025, the values of $z$ in the negative and positive tails become -1.96 and 1.96, respectively. A probabilistic definition of tail probabilities is that they represent the probability that a given value of $z$ is greater than or equal to a reference value of $z$. For example, $P(z \ge 1.65) = 0.05$ and $P(z \ge 1.96) = 0.025$. Analogously, $P(z \le -1.65) = 0.05$ and $P(z \le -1.96) = 0.025$.
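The standardization in Eq. (4.2) can be written out directly; note, as discussed below, that it does not remove skewness. The function name and example values here are ours.

```python
import statistics

def z_scores(x):
    """Standardize values to mean 0 and s.d. 1 (Eq. 4.2)."""
    mean = statistics.mean(x)
    sd = statistics.stdev(x)                 # sample s.d., n-1 denominator
    return [(xi - mean) / sd for xi in x]

z = z_scores([5, 7, 9, 10, 11, 17, 23, 27, 31, 62])
print([round(zi, 2) for zi in z])            # values above 2 are "2-sigma" extremes
```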
Standard normal distributions also form the basis of many areas in statistics, including
It is important to note that Z-transformations do not remove skewness from an input distribution. If the feature(s) being transformed is skewed, then the resulting Z-scores will also contain the same degree of skewness. To remove skewness, the log-transform is typically used for log-normally-distributed data, or the van der Waerden transform is used, which always removes skewness.
The logarithmic transform is one of the better transforms for rescaling a skewed dataset with a right tail so that it becomes more normally-distributed. As an example, in the Retinol dataset, the distribution of RETDIET is as follows:
whereas the log-transform (base e) of the RETDIET is illustrated as follows:
and it can be seen that the transformed version of RETDIET is more normally-distributed since large outlier values of RETDIET in the right tail of the untransformed RETDIET are essentially removed after taking the log_e transform.
The quantile transform will first determine the percentile value for each observation based on each value's rank divided by (n+1), and then generate a new variable whose codes represent the quantile category that each percentile falls into. For example, if 4 quartiles (the default) are specified, then any original percentile values less than or equal to 25% will be assigned a 1, percentiles greater than 25% and less than or equal to 50% will be coded as a 2, percentiles greater than 50% and less than or equal to 75% will be coded as a 3, and the remaining percentiles greater than 75% will be coded as a 4.
The rank transform will arrange the values of a variable in ascending order (smallest first) and then assign each value the record number corresponding to its sort position. For example, if there are 100 values for a variable, then the original variable values will take on ranks 1, 2, ..., 100. A common approach for displaying values of $x$ that are sorted in ascending order is to use the notation $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, where the parenthesized subscript $(i)$ denotes ranking. Another common notation to reflect ranked variate values is Rank($x$).
The table below lists the ranks for 10 x-values sorted in ascending order.
x | Rank(x) |
5 | 1 |
7 | 2 |
9 | 3 |
10 | 4 |
11 | 5 |
17 | 6 |
23 | 7 |
27 | 8 |
31 | 9 |
62 | 10 |
The percentile value for each observation is equal to the value’s rank divided by (n+1). Percentiles reveal the location of each value on a scale of 0 to 1. Therefore, the p5 is the 5th percentile, p10 is the 10th percentile, p25 is the 25th percentile, p50 is the median or 50th percentile, p75 the 75th percentile, p90 the 90th percentile, and p95 the 95th percentile. The table below lists the percentiles for 10 x-values sorted in ascending order.
x | Rank(x) | Percentile |
5 | 1 | 0.091 |
7 | 2 | 0.182 |
9 | 3 | 0.273 |
10 | 4 | 0.364 |
11 | 5 | 0.455 |
17 | 6 | 0.545 |
23 | 7 | 0.636 |
27 | 8 | 0.727 |
31 | 9 | 0.818 |
62 | 10 | 0.909 |
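The rank and percentile columns above can be reproduced with a few lines of Python; ties are ignored here for simplicity, and the function name is ours.

```python
def ranks_and_percentiles(x):
    """Ascending ranks and percentiles computed as rank/(n+1)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks, [r / (n + 1) for r in ranks]

x = [5, 7, 9, 10, 11, 17, 23, 27, 31, 62]
ranks, pct = ranks_and_percentiles(x)
print(ranks)                         # 1 .. 10
print([round(p, 3) for p in pct])    # 0.091, 0.182, ..., 0.909
```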
van der Waerden (vdw) scores transform a variable’s values into skew-zero standard normal variates for which the mean is zero and standard deviation is unity (one). Recall that the normal cdf function converts an input Z-score into a tail probability using the normal cdf, whereas the inverse cdf function maps the probability back to the Z-score, as listed in the following table:
Function | Notation | Conversion |
CDF | $\Phi(z)$ | Z-score to tail probability ($p$) |
Inverse CDF | $\Phi^{-1}(p)$ | Tail probability ($p$) to Z-score |
The steps involved in calculating the vdw score for each value of a variable are: (1) rank the values in ascending order, (2) compute each value's percentile as its rank divided by (n+1), and (3) apply the inverse normal CDF to the percentile to obtain the vdw score.
The table below lists the VDW scores for 10 x-values sorted in ascending order.
x | Rank(x) | Percentile | VDW score |
5 | 1 | 0.091 | -1.335 |
7 | 2 | 0.182 | -0.908 |
9 | 3 | 0.273 | -0.605 |
10 | 4 | 0.364 | -0.349 |
11 | 5 | 0.455 | -0.114 |
17 | 6 | 0.545 | 0.114 |
23 | 7 | 0.636 | 0.349 |
27 | 8 | 0.727 | 0.605 |
31 | 9 | 0.818 | 0.908 |
62 | 10 | 0.909 | 1.335 |
The VDW transform is a skew-zero transform, since it always removes skewness that may exist in the input distribution.
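A sketch of the van der Waerden transform using SciPy's inverse normal CDF; it reproduces the scores in the table above (ties are ignored, and the function name is ours).

```python
from scipy.stats import norm

def vdw_scores(x):
    """van der Waerden scores: inverse normal CDF of rank/(n+1)."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = float(norm.ppf(rank / (n + 1)))
    return scores

scores = vdw_scores([5, 7, 9, 10, 11, 17, 23, 27, 31, 62])
print([round(s, 3) for s in scores])   # -1.335, -0.908, ..., 0.908, 1.335
```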
The nominal-to-binary transform creates a new binary yes/no variable for each level of a nominally-scaled categorical variable. For example, if there are 3 levels of a category, 3 new variables will be generated, and for each record only one of the three new variables will be set to 1 (0 otherwise), corresponding to the category of the particular record. The table below lists the nominal-to-binary coding results for a fictitious 3-level categorical variable called "catvar," and a short sketch of the coding follows the table.
catvar | level 1 | level 2 | level 3 |
3 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
2 | 0 | 1 | 0 |
1 | 1 | 0 | 0 |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 1 | 0 |
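A minimal sketch of this one-hot coding in Python; the function name and example values are ours.

```python
def nominal_to_binary(values):
    """One 0/1 indicator column per observed level of a nominal variable."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}

catvar = [3, 1, 2, 2, 1, 1, 2, 3, 1, 2]
for level, column in nominal_to_binary(catvar).items():
    print(f"catvar level {level}:", column)
```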
Fuzzy logic provides a mixture of methods for flexible information processing of ambiguous data [1, 2, 3]. Fuzzy transformations are used to map the original values of an input feature into 3 fuzzy sets representing linguistic membership functions, in order to facilitate the semantic interpretation of each fuzzy set. The fuzzy sets low, medium, and high exploit uncertainty among the original feature values, reducing the information in order to obtain a robust, less-expensive, and tractable solution. To fuzzify a continuous feature, first determine $x_{\min}$ and $x_{\max}$ as the minimum and maximum values of feature $x$ over all input samples, and $q_{33}$ and $q_{66}$ as the quantile values of $x$ at the 33rd and 66th percentiles. Then, calculate averages from these boundary values. Next, translate each value of feature $x$ into 3 fuzzy membership values in the range [0,1], $\mu_{low}(x)$, $\mu_{med}(x)$, and $\mu_{high}(x)$, using the relationships
| (4.3) |
| (4.4) |
| (4.5) |
The above computations result in 3 fuzzy sets (vectors), $\mu_{low}$, $\mu_{med}$, and $\mu_{high}$, each of length $n$, which replace the original input feature. Figure 4.1 illustrates the values of the membership functions as a function of $x$. When features are fuzzified using the methods described above and then input into class prediction analysis, the classification analysis is called "fuzzy," whereas without feature fuzzification the classification analysis is called "crisp."
This option in the feature transformation module allows one to convert continuously-scaled features to categorical using cutpoints and lower and upper boundary limits. Below is a snapshot of the popup window that appears when performing the continuous to categorical transformations:
The various options are described below.
Cutpoints are lower bounds. For example, if age is the feature, cutpoints 20,30,40 will result in category codes 1 for <=19, 2 for 20-29, 3 for 30-39, and 4 for 40+ (see the sketch following these options).
Cutpoints are upper bounds. For example, if age is the feature, cutpoints 20,30,40 will result in category codes 1 for <=20, 2 for 21-30, 3 for 31-40, and 4 for 41+.
Apply lower limit (30-39, .... This option will enforce a lower bound to the categorization.
Apply upper limit ....51-60). This option will enforce an upper bound to the categorization.
Use real-valued cutpoints (integer is default). This option replaces the default integer-based interval notation in the category labels with real-valued cutpoint notation, and labels the open-ended upper interval as, e.g., "60+."
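To make the two cutpoint interpretations concrete, here is a small Python sketch using the standard library's bisect module; it illustrates the rule described above and is not Explorer's code.

```python
import bisect

def categorize(x, cutpoints, lower_bounds=True):
    """Assign category codes 1..k+1 from k cutpoints."""
    if lower_bounds:
        # cutpoints 20,30,40: <20 -> 1, 20-29 -> 2, 30-39 -> 3, 40+ -> 4
        return [bisect.bisect_right(cutpoints, xi) + 1 for xi in x]
    # cutpoints 20,30,40: <=20 -> 1, 21-30 -> 2, 31-40 -> 3, 41+ -> 4
    return [bisect.bisect_left(cutpoints, xi) + 1 for xi in x]

ages = [19, 20, 35, 41]
print(categorize(ages, [20, 30, 40]))                       # [1, 2, 3, 4]
print(categorize(ages, [20, 30, 40], lower_bounds=False))   # [1, 1, 3, 4]
```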
This feature transformation option will prompt the user to convert all text features in the input dataset. As an example, import the Fisher_Iris_wo_class.xlsx file, and then select Transformations → Text-to-Categorical (converts all text features) from the command pull-down menu as shown in the following image:
Since the class feature is a text feature, its elements will be loaded into a form showing all words found along with their frequencies.
Then click on the Set as Categorical button, and the categorical levels of the class text feature will be transformed into class categorical values of 1, 2, and 3. NXG Logic Explorer will also automatically copy the original text value for each category into memory, so that the labels can be output during analyses. As an example, the image below shows results of running ANOVA on the 4 continuous features in the Fisher Iris dataset, with the category labels shown in the column headers.
This transformation will also perform a text-to-categorical transformation for any text feature, however, the user must select the specific feature to be transformed.
Any categorical feature can be set to be a class feature (for class prediction analysis) as long as the labels have been input. This can be accomplished by either importing a file with class values and their labels, or by transforming text-to-categorical using the above transformations, or by assigning categorical labels using the Editor. The numeric class feature values will then be used for assigning symbol colors in all class discovery runs.
Explorer will transform feature values over objects inside CV folds used for training, and separately for objects used for testing, based on parameters obtained from objects within the training folds. For example, if 10 folds are used for CV, feature transformation occurs over all objects within partitions (folds) 1, 2, ..., 9, and then feature values for testing objects in fold 10 are transformed using parameters from training folds 1, 2, ..., 9. When normalization is specified, the feature minimum and maximum are based on objects in the training folds and applied to objects in the testing fold. Boundary constraints for the range [0,1] are maintained by setting negative feature values for objects in the testing fold to zero, and values greater than 1 are set to 1. For mean-zero standardization, the feature mean and s.d. are obtained from all objects in the training folds, and then are applied to objects in the testing fold.
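The sketch below shows the general pattern: derive the transform parameters from the training folds only and apply them to the held-out fold, clipping normalized test values back into [0,1]. It is a schematic of the behavior described above, not Explorer's internal code, and all names are ours.

```python
import numpy as np

def transform_train_test(x_train, x_test, method="zscore"):
    """Fit transform parameters on the training folds; apply them to the test fold."""
    if method == "zscore":
        mean = x_train.mean(axis=0)
        sd = x_train.std(axis=0, ddof=1)
        return (x_train - mean) / sd, (x_test - mean) / sd
    if method == "normalize":
        lo, hi = x_train.min(axis=0), x_train.max(axis=0)
        train = (x_train - lo) / (hi - lo)
        test = np.clip((x_test - lo) / (hi - lo), 0.0, 1.0)   # keep [0,1] bounds
        return train, test
    raise ValueError(f"unknown method: {method}")

# Example: 10 folds, train on folds 1-9, transform fold 10 with their parameters.
folds = np.array_split(np.random.default_rng(1).normal(size=(50, 3)), 10)
train_t, test_t = transform_train_test(np.vstack(folds[:9]), folds[9], method="normalize")
```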
Default options for feature transformations during CV for class prediction runs can be set in the general default settings popup window, which are shown in the image below:
[1] G.J. Klir, B. Yuan. Fuzzy Sets and Fuzzy Logic, Upper Saddle River(NJ), Prentice-Hall, 1995.
[2] D. Dubois, H. Prade. Fundamentals of Fuzzy Sets, Boston(MA), Kluwer, 2000.
[3] S.K. Pal, P. Mitra. Pattern Recognition Algorithms for Data Mining: Scalability, Knowledge Discovery and Soft Granular Computing. Boca Raton(FL), Chapman & Hall, 2004.
Applies To | Hypothesis testing to determine if means are equal in 2 groups |
Origin | Parametric statistics (based on means, variances, etc.) |
Null Hypothesis | $H_0: \mu_1 = \mu_2$, both means are equal |
Alternative Hypothesis | $H_1: \mu_1 \neq \mu_2$, the two means are significantly different |
Variables | One normally distributed continuously-scaled variable, one grouping variable |
Variants | Welch test, Behrens-Fisher problem |
The t-test is one of the most commonly used tests for determining whether two means are the same. Naturally, since the t-test is based on the average and standard deviation of each sample, the test is of the parametric type. In addition, the t-test assumes normality among the data being tested, so the data should also be continuously-scaled. Specific t-tests have been developed for cases in which the variances of the samples being tested are assumed to be equal or unequal, the latter of which is known as the Behrens-Fisher problem. The null and alternative hypotheses for the t-test are $H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \neq \mu_2$.
Before conducting a t-test, it is instructive to know whether the variances for the two samples are equal or unequal. A simple set of hypotheses can be constructed for this scenario, where the null and alternative hypotheses are $H_0: \sigma_1^2 = \sigma_2^2$ and $H_1: \sigma_1^2 \neq \sigma_2^2$.
Tests for equal variance among groups are also known as homogeneity of variance tests. Equal variances are known as homoscedasticity, whereas unequal variances are known as heteroscedasticity. Bartlett's test [1] is used to test for equality of variances between 2 or more groups of normally-distributed samples. The test statistic is calculated as
$$\chi^2 = \frac{(N-k)\ln s_p^2 - \sum_{i=1}^{k}(n_i - 1)\ln s_i^2}{1 + \frac{1}{3(k-1)}\left(\sum_{i=1}^{k}\frac{1}{n_i - 1} - \frac{1}{N-k}\right)}, \qquad s_p^2 = \frac{1}{N-k}\sum_{i=1}^{k}(n_i - 1)s_i^2 \qquad (5.1)$$

where $N$ is the total sample size, $k$ is the number of independent samples, $n_i$ is the sample size of the $i$th group, and $s_i^2$ is the sample variance for the $i$th group. The test statistic is $\chi^2$-distributed with $k-1$ degrees of freedom. If the calculated statistic exceeds the critical value of $\chi^2_{k-1,\alpha}$, the p-value is less than $\alpha$ and the variances are considered to be unequal. Otherwise, if the p-value exceeds 0.05, the variances are considered equal (not significantly different).
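Explorer reports this test automatically before grouped mean comparisons; to reproduce it outside Explorer, SciPy's implementation can be used. The two groups below are made-up numbers.

```python
from scipy import stats

group_1 = [4.1, 5.0, 6.2, 5.5, 4.8, 5.3]
group_2 = [3.9, 7.5, 2.2, 8.0, 5.1, 6.6]

chi2, p = stats.bartlett(group_1, group_2)     # Bartlett chi-square and p-value
print(f"Bartlett chi-square = {chi2:.3f}, p = {p:.4f}")
print("equal variances assumed" if p >= 0.05 else "unequal variances assumed")
```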
Assume that two samples of size $n_1$ and $n_2$ were drawn from two infinitely large populations. Calculate the sample means $\bar{x}_1$ and $\bar{x}_2$, replace the sample data, resample, recalculate means, resample, etc. The mean differences obtained from multiple samplings would be normally distributed with variance $\sigma_1^2/n_1 + \sigma_2^2/n_2$. A typical question is: is the difference ($\bar{x}_1 - \bar{x}_2$) between two sample means significant? The hypotheses are as follows:

$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_1: \mu_1 \neq \mu_2 \qquad (5.2)$$
Because we are limited to only samples from the infinitely large populations, we obtain samples of size $n_1$ and $n_2$, which have averages $\bar{x}_1$ and $\bar{x}_2$, with sample variances $s_1^2$ and $s_2^2$. For the two independent samples, the test statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \qquad (5.3)$$

follows a $t$ distribution whose degrees of freedom depend on the assumptions made about the variances. When sampling during a study or experiment, a priori values of the population variances can either be known or unknown. If they are known, then they can either be equal or unequal. If they are unknown, it is a common practice to assume they are unequal, since this would be the most conservative approach. In the remaining sections, we address the formulation of the t-test when the sample variances are equal or unequal.
Understanding two-tailed and one-tailed tests. There are times when an investigator believes not simply that the two means are different, but rather that one mean is significantly greater or significantly lower than the other. In such cases, one-tailed tests are employed. Recall, the criterion for whether a test is called two-tailed or one-tailed is based on the form of the alternative hypothesis, that is, the belief formed by the investigator.
Let us look at some example descriptions in the table below:
Hypothesis | Null | Alternative | # ways alternative is true | Terminology |
Equality | $H_0: \mu_1 = \mu_2$ | $H_1: \mu_1 \neq \mu_2$ | 2 ($\mu_1 > \mu_2$, $\mu_1 < \mu_2$) | "Two-tailed" |
Inequality | $H_0: \mu_1 \le \mu_2$ | $H_1: \mu_1 > \mu_2$ | 1 ($\mu_1 > \mu_2$) | "One-tailed" |
Inequality | $H_0: \mu_1 \ge \mu_2$ | $H_1: \mu_1 < \mu_2$ | 1 ($\mu_1 < \mu_2$) | "One-tailed" |
The first alternative hypothesis is based on the belief that the means are significantly different, which can occur two ways: first $\mu_1 > \mu_2$, and second $\mu_1 < \mu_2$. A test performed under this alternative hypothesis is called a two-tailed test. On the other hand, when the belief is only that $\mu_1 > \mu_2$, the test is called a one-tailed test. Analogously, if the only belief is that $\mu_1 < \mu_2$, then this is also called a one-tailed test. Quite basically, the way to remember whether a one-tailed or two-tailed test is employed is to think of the number of ways the alternative hypothesis can be true. For a two-tailed test, there are two ways the two means can be significantly different; however, under the one-tailed test, there is only one way the alternative hypothesis can be true.
Formally, there is a set of hypotheses (null and alternative) for each of the alternative hypotheses shown in the table above. Thus, for two means there can only be three sets of hypotheses, and these are:

For the two-tailed test: $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$.

For the first one-tailed test: $H_0: \mu_1 \le \mu_2$ versus $H_1: \mu_1 > \mu_2$.

For the second one-tailed test: $H_0: \mu_1 \ge \mu_2$ versus $H_1: \mu_1 < \mu_2$.
Variances assumed equal ($\sigma_1^2 = \sigma_2^2$). When the population variances are assumed to be equal, it makes sense to combine the variance from each sample. This is known as the common pooled variance. To calculate the common pooled variance, we weight each sample variance by its sample size in the form

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \qquad (5.4)$$

Going back to the general equation for the t-test in Eq. (5.3), a substitution is made replacing the variance for each sample with the common pooled variance, so that the test statistic now becomes

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \qquad (5.5)$$

Significance of the test statistic is compared with tail probabilities for Student's $t$ distribution at $\alpha$ with $n_1 + n_2 - 2$ degrees of freedom.
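A sketch of the equal-variance t-test from Eqs. (5.4)-(5.5), with a two-tailed p-value taken from SciPy's t distribution. The function name and data are ours; the worked problem below uses summary statistics rather than raw data.

```python
import math
from scipy import stats

def pooled_t_test(x1, x2):
    """Two-sample t-test assuming equal variances (pooled variance)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((x - m1) ** 2 for x in x1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in x2) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance, Eq. (5.4)
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))      # Eq. (5.5)
    df = n1 + n2 - 2
    p = 2 * stats.t.sf(abs(t), df)                          # two-tailed p-value
    return t, df, p

print(pooled_t_test([4.1, 5.0, 6.2, 5.5], [6.8, 7.1, 8.0, 7.4]))
```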
Worked problem. A researcher was interested in determining if the mean concentration of a protein in tumor tissue was significantly different from the mean expression in normal tissue. The mean expression in tumors was , with standard deviation , and sample size , while the mean expression in normal tissue was , with standard deviation , and sample size . Are the two means significantly different when assuming equal variances?
For the assumption of equal variances, the pooled variance is:
| (5.6) |
Substituting in the data, we have:
| (5.7) |
Given , we now find that .
| (5.8) |
Significance of the test statistic is compared with cumulative probabilities for Student’s distribution at for degrees of freedom. The tabled critical value, , is equal to 2.05, and since we conclude that the two means are not significantly different. In fact, the p-value is 0.4639, which is not less than .
Variances assumed unequal (). When the sample variances for the populations are unequal, we keep the sample variances intact in the general equation for the t-statistic given as
| (5.9) |
however, the value of no longer represents the distribution. This problem was first recognized by Behrens [2], who introduced one of the first solutions. Behrens’ approach was later verified by Fisher [3]. We implement the approach introduced by Dixon and Massey [4], which involves a correction to the degrees of freedom, , in the form
| (5.10) |
The computed value of is rounded to the nearest integer. The significance of the test statistic is compared with tail probabilities for Student’s t distribution at values of and degrees of freedom.
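A similar check for the unequal-variance case can be made with SciPy's Welch option. Note that SciPy applies the Welch-Satterthwaite degrees of freedom rather than the rounded Dixon-Massey correction described above, so the degrees of freedom may differ slightly. The data below are hypothetical.

```python
# Sketch: two-sample t-test without assuming equal variances (Welch's test);
# the concentration values below are hypothetical.
from scipy import stats

placebo = [5.1, 4.8, 6.3, 5.5, 4.9, 6.0]   # hypothetical placebo concentrations
treated = [7.9, 8.4, 6.8, 9.1, 7.2, 8.8]   # hypothetical treated concentrations

t, p = stats.ttest_ind(placebo, treated, equal_var=False)
print(t, p)
```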
Worked problem. A researcher was interested in determining if the mean concentration levels of a drug in serum plasma of placebo patients (untreated) was significantly different from the mean serum plasma concentrations in treated patients. The mean concentration in placebo patients was , with standard deviation , and sample size , while the mean concentration in treated patients was , with standard deviation , and sample size . Are the two means significantly different when assuming unequal variances?
| (5.11) |
Significance of the test statistic is compared with cumulative probabilities for Student’s distribution at for degrees of freedom. The tabled critical value, , is equal to 1.97, and since we conclude that the two means are significantly different. In fact, the p-value is 0.0008, which is less than .
Applies To | Hypothesis testing to determine if two samples are from the same distribution |
Origin | Non-parametric statistics (based on ranks) |
Null Hypothesis | , both samples are from the same distribution |
Alternative Hypothesis | , both samples are not from the same distribution |
Variables | One test variable that is at least ordinally scaled, and one categorical variable |
Variants | Friedman test |
The Mann-Whitney test is a non-parametric 2-sample test to determine whether two independent samples are from the same distribution [5]. Unlike the t-test and F-test, which assume an underlying normal distribution of the data, the Mann-Whitney test requires only that the data be at least ordinal, not necessarily normally distributed. The asymptotic relative efficiency (A.R.E.) of the Mann-Whitney test approaches 95.5% as increases when applied to data for which parametric tests are appropriate [6]. However, in cases where parametric tests are inappropriate, the Mann-Whitney test will have greater power to reject than the t-test. The Mann-Whitney test should be applied when little is known about the distributional properties of the two samples, or when the assumptions of normality do not hold.
The goal of the Mann-Whitney test is to compare two sets of data to determine if they are from the same distribution. By “same distribution” we mean that values of one distribution do not tend to be less than or greater than values of the other. If values of one distribution tend to be less than or greater than values of the other distribution, then the two distributions are different. Stated formally, the null and alternative hypotheses of the Mann-Whitney test are
: Both samples from the same distribution (null hypothesis) |
: Both samples from different distributions (alternative hypothesis) |
Given the hypotheses, the goal is to discredit or reject the null hypothesis, that is, to show that values from the two distributions are not the same. We begin calculations for the Mann-Whitney test by first determining the statistic for each group in the form:
$$ U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2 \qquad (5.12) $$
where and are the sample sizes for each of the two groups, and and are the sum of the ranks within each group. The statistic is taken as the lowest value among and :
$$ U = \min(U_1, U_2) \qquad (5.13) $$
The test statistic for the Mann-Whitney test is standard normal distributed as
$$ Z = \frac{U - \dfrac{n_1 n_2}{2}}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}} \qquad (5.14) $$
Worked problem. Let’s consider an example that tests whether expression of a single protein in two independent groups of tissues are from the same distribution. A laboratory assay was carried out in which fluorescent intensity levels were generated for expression of protein A in two samples of mice: wild-type (group 1) and knock-out (group 2). The data are shown in the table below:
ID | Protein A | Group | ||||
1 | -0.017 | 5 | 1 | 5 | ||
2 | -0.349 | 1 | 1 | 1 | ||
3 | 0.136 | 8 | 1 | 8 | ||
4 | 0.111 | 7 | 1 | 7 | ||
5 | -0.149 | 4 | 2 | 4 | ||
6 | -0.181 | 3 | 2 | 3 | ||
7 | -0.307 | 2 | 2 | 2 | ||
8 | 0.071 | 6 | 2 | 6 | ||
Ranks are determined by first sorting the data for both samples in ascending order, and then assigning ranks. Tied expression values are handled by averaging their original rank assignments. Substituting the results into the formulation gives us
| (5.15) |
| (5.16) |
| (5.17) |
| (5.18) |
Since is less than 1.96, the test is not significant, and the conclusion is that the two distributions are not significantly different.
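The same conclusion can be checked with SciPy's Mann-Whitney implementation using the wild-type and knock-out values from the table above. For samples this small SciPy may report an exact p-value rather than the normal approximation of Eqs. (5.12)-(5.14), so the p-value can differ slightly from the hand calculation.

```python
# Mann-Whitney test for the wild-type vs. knock-out protein A data above.
from scipy import stats

wild_type = [-0.017, -0.349, 0.136, 0.111]   # group 1
knock_out = [-0.149, -0.181, -0.307, 0.071]  # group 2

u, p = stats.mannwhitneyu(wild_type, knock_out, alternative='two-sided')
print(u, p)   # not significant, consistent with the worked result
```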
Wilcoxon Rank Sum Test. We now show the similarity between the Wilcoxon rank sum test and the Mann-Whitney test. Assume the following data:
ID | Protein A | Group | ||||
1 | -0.062 | 6 | 1 | 6 | ||
2 | 0.052 | 7 | 1 | 7 | ||
3 | -0.096 | 4 | 1 | 4 | ||
4 | -0.175 | 2 | 1 | 2 | ||
5 | -0.166 | 3 | 2 | 3 | ||
6 | 0.089 | 8 | 2 | 8 | ||
7 | -0.092 | 5 | 2 | 5 | ||
8 | -0.293 | 1 | 2 | 1 | ||
While both the Mann-Whitney and Wilcoxon rank tests use the standard normal -variate as the test statistic, the Wilcoxon rank sum test uses a slightly different form of the numerator of (the denominator is the same). In spite of the similarity between these two non-parametric tests, NXG Explorer employs the Mann-Whitney test for all 2-sample non-parametric tests. Computation of the Wilcoxon rank sum test goes as follows:
| (5.19) |
Example 1 - Two-tailed t-tests (). First, select Analysis→Independence→Unpaired 2 and k-sample tests, shown as
Next, click on sex as the grouping feature, and then select all of the continuous features on the left side of the popup window, shown as:
Lastly, leave the default settings that are visible on the popup window for grouped tests:
Using the Retinol dataset, two-tailed t-tests were performed using continuous variables with sex as the grouping variable. Recall, the sex categorical feature has two levels: 1=males, 2=females. The two-tailed tests are based on the belief (alternative hypothesis) . Results are listed in Figure 5.1. For each continuous variable, the appropriate t-test is used, along with the Mann-Whitney test and Bartlett’s variance test. Tail probabilities for the t-test and Mann-Whitney test will be colored either red or blue, depending on the results of Bartlett’s test. NXG Explorer automatically applies the appropriate t-test for each continuous variable: if Bartlett’s test for equality of variances is significant, the Welch t-test assuming unequal variances is used; otherwise, the t-test assuming equal variances is used. The non-parametric Mann-Whitney test is not an alternative to the Welch t-test when variances are unequal between groups, but it is nevertheless performed for each 2-group test.
Example 2 - One tailed t-tests, (). In this example, one-tailed t-tests were performed using the same continuous variables with sex as the grouping variable. The specific alternative hypothesis for the tests is . Results are listed in Figure 5.2.
Example 3 - One tailed t-tests, (). In this example, one-tailed t-tests were performed using the same continuous variables with sex as the grouping variable. The specific alternative hypothesis for the tests is . Results are listed in Figure 5.3.
NXG Logic Explorer can use geometric or harmonic means during 2- and k-sample hypothesis tests for inequality of means. When the geometric mean is specified for a continuous feature, the mean is defined as
| (5.20) |
and the geometric standard deviation is
| (5.21) |
If the harmonic mean is specified, then the mean of the reciprocals is
| (5.22) |
and the harmonic mean is
| (5.23) |
The variance of is
| (5.24) |
and the standard deviation is then defined as
| (5.25) |
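The geometric and harmonic means can be checked with SciPy as sketched below. The standard deviation definitions in Eqs. (5.20)-(5.25) are specific to NXG Explorer's output, so the log-scale geometric standard deviation shown here is only a conventional approximation, and the data are hypothetical.

```python
# Sketch: geometric and harmonic means for a strictly positive feature.
import numpy as np
from scipy import stats

x = np.array([1.2, 3.4, 2.2, 5.6, 4.1, 2.8])   # hypothetical positive values

geo_mean = stats.gmean(x)                    # exp(mean of log x)
geo_sd = np.exp(np.std(np.log(x), ddof=1))   # conventional geometric SD (back-transformed log-scale SD)
har_mean = stats.hmean(x)                    # n / sum(1/x)

print(geo_mean, geo_sd, har_mean)
```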
Applies To | Hypothesis testing to determine if one or more pairs of means are significantly different in unrelated groups |
|
|
Origin | Parametric statistics (based on means, variances, etc.) |
|
|
Null Hypothesis | , all means are equal |
| |
Alternative Hypothesis | , any pair of means are significantly different |
|
|
Variables | One normally distributed continuously-scaled variable, one grouping variable |
|
|
Variants | Welch ANOVA |
|
|
Analysis of variance (ANOVA) is essentially like the 2-sample t-test for inequality of means; however, ANOVA is employed when there are 3 or more groups. Recall, ANOVA is for comparing means of a single continuously-scaled variable across 3 or more groups. The null hypothesis is that all of the group means are equal, and the alternative hypothesis, i.e., the belief, is that one or more pairs of means are significantly different.
Using notation, the null and alternative hypotheses for ANOVA are
Define as a continuously-scaled observation in group of sample size =1,2,…,;= 1,2,…,, where is the number of groups. The ANOVA model states that each random observation can be represented by the model
where is the grand mean, is the effect of treatment , and is the error. Estimates of the model parameters are
A typical data setup for ANOVA is given as follows:
Group 1 | Group 2 | Group | Group | ||
The grand mean is determined as
and the treatment mean for group is
Redefining the error terms, we have
Subtracting from the right side yields
and squaring all terms and summing over and gives us the sum of squares
A more numerically efficient method for calculating the sum of squares is in the form
where the grand and treatment-specific sums are
and
In summary, the sum of squares total is equal to the treatment sum of squares (error between treatments) and the error sum of squares (within treatments), given as
Dividing each of these sums of squares by their individual degrees of freedom, one can obtain the mean squares due to treatment and the mean squares due to error:
| (5.26) |
Worked problem. In the example below, the expression of a single serum plasma protein, called “PSA” (prostate-specific antigen), was measured in laboratory mice treated with Drug A or Drug B. A third group was used for placebo or untreated “Control” mice. Perform an F-ratio test using ANOVA to determine if mean expression is significantly different between the 3 groups.
Table 1. Mouse PSA expression after treatment with Drug A and Drug B.
Control | Drug A | Drug B | ||
8.24 | 16.94 | 2.92 | ||
11.79 | 21.46 | 2.18 | ||
9.44 | 21.48 | 1.80 | ||
12.82 | 18.24 | 2.15 | ||
7.82 | 15.79 | 3.02 | ||
11.36 | 18.34 | 2.37 | ||
9.32 | 2.33 | |||
7.84 | ||||
8 | 6 | 7 | 21 | |
78.63 | 112.25 | 16.76 | 207.64 | |
798.84 | 2127.38 | 41.29 | 2967.52 | |
Substituting in our results, for the sum of squares treatment (SST) we have:
| (5.27) |
The degrees of freedom for are determined next, based on the number of groups
| (5.28) |
The , or “residual error,” captures the error between each observation and its treatment mean, and thus reflects the within-treatment error. When is large, the residual error can outweigh the treatment effects, resulting in a non-significant test result.
| (5.29) |
| (5.30) |
The F-statistic can now be calculated as the ratio of the two mean square errors
| (5.31) |
The tabled is 3.55, suggesting that the means are not all equal, with a p-value less than 0.05.
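The F-ratio for this worked problem can be verified with SciPy's one-way ANOVA using the Table 1 data:

```python
# One-way ANOVA for the mouse PSA data in Table 1.
from scipy import stats

control = [8.24, 11.79, 9.44, 12.82, 7.82, 11.36, 9.32, 7.84]
drug_a  = [16.94, 21.46, 21.48, 18.24, 15.79, 18.34]
drug_b  = [2.92, 2.18, 1.80, 2.15, 3.02, 2.37, 2.33]

f, p = stats.f_oneway(control, drug_a, drug_b)
print(f, p)   # F is compared with the F distribution on (k-1, N-k) = (2, 18) d.f.
```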
If Bartlett’s test for equality of variances is significant, meaning that the variances are significantly different across groups, i.e., heteroscedasticity, then the Welch ANOVA test should be used [7]. Calculation of the F-ratio statistics and adjusted degrees of freedom are provided below. The weight for each th group is defined as
| (5.32) |
and the sum of the weights is
| (5.33) |
A weighted mean is then calculated as
| (5.34) |
The F-ratio statistic is given as
| (5.35) |
where the numerator and denominator degrees of freedom are
| (5.36) |
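A minimal sketch of the Welch ANOVA calculation described above, applied to the Table 1 data for comparison, is shown below. This is an illustrative implementation of the weighting scheme, not NXG Explorer's own code, and the exact form is assumed to follow Welch [7].

```python
# Sketch of the Welch ANOVA using per-group weights w_j = n_j / s_j^2.
import numpy as np
from scipy import stats

groups = [np.array([8.24, 11.79, 9.44, 12.82, 7.82, 11.36, 9.32, 7.84]),
          np.array([16.94, 21.46, 21.48, 18.24, 15.79, 18.34]),
          np.array([2.92, 2.18, 1.80, 2.15, 3.02, 2.37, 2.33])]

k = len(groups)
n = np.array([len(g) for g in groups])
m = np.array([g.mean() for g in groups])
v = np.array([g.var(ddof=1) for g in groups])

w = n / v                                  # per-group weights
W = w.sum()                                # sum of the weights
xbar = (w * m).sum() / W                   # weighted grand mean

num = (w * (m - xbar) ** 2).sum() / (k - 1)
adj = ((1 - w / W) ** 2 / (n - 1)).sum()
den = 1 + 2 * (k - 2) / (k ** 2 - 1) * adj
f_welch = num / den                        # Welch F-ratio

df1 = k - 1
df2 = 1.0 / (3 / (k ** 2 - 1) * adj)       # adjusted denominator d.f.
p = stats.f.sf(f_welch, df1, df2)
print(f_welch, df1, df2, p)
```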
Applies To | Hypothesis testing to determine if three or more samples (i.e., ) are from the same distribution |
|
|
Origin | Non-parametric statistics (based on ranks) |
|
|
Null Hypothesis | , variates from all samples are from the same distribution |
|
|
Alternative Hypothesis | , samples are not from the same distribution |
|
|
Variables | One test variable that is at least ordinal, and one categorical variable |
|
|
Variants | Friedman test |
|
|
The Kruskal-Wallis test is used to determine whether data for a single variable in k groups are from the same distribution.
Worked problem. A research team performed a study to explore concentrations of a serum marker in laboratory mice 24 hours after treatment with Drug A, Drug B, or no treatment (Control). Perform a Kruskal-Wallis test using the data listed in the table below, report the calculated test statistic, and comment on the p-value and your conclusion.
Table: Mouse serum plasma concentration of protein marker 24 hours after treatment.
Control | Rank | Drug A | Rank | Drug B | Rank |
8.24 | 10 | 16.94 | 17 | 2.92 | 6 |
11.79 | 14 | 21.46 | 20 | 2.18 | 3 |
9.44 | 12 | 21.48 | 21 | 1.80 | 1 |
12.82 | 15 | 18.24 | 18 | 2.15 | 2 |
7.82 | 8 | 15.79 | 16 | 3.02 | 7 |
11.36 | 13 | 18.34 | 19 | 2.37 | 5 |
9.32 | 11 | 2.33 | 4 | ||
7.84 | 9 | ||||
8 | 6 | 7 | |||
92 | 111 | 28 | |||
The test statistic used for the Kruskal-Wallis test is compared with the chi-squared distribution with degrees of freedom.
$$ H = \frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(N+1) \qquad (5.37) $$
Since there are groups, the chi-squared statistic is approximately distributed with 2 d.f. The calculated test statistic of 17.73 is significant, since it exceeds the critical value of chi-squared for and 2 d.f., . Therefore, the p-value would be less than 0.05, i.e., .
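The Kruskal-Wallis statistic for this worked problem can be verified with SciPy:

```python
# Kruskal-Wallis test for the serum marker data in the table above.
from scipy import stats

control = [8.24, 11.79, 9.44, 12.82, 7.82, 11.36, 9.32, 7.84]
drug_a  = [16.94, 21.46, 21.48, 18.24, 15.79, 18.34]
drug_b  = [2.92, 2.18, 1.80, 2.15, 3.02, 2.37, 2.33]

h, p = stats.kruskal(control, drug_a, drug_b)
print(h, p)   # H is compared with chi-squared on k-1 = 2 d.f.
```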
Example 4 - ANOVA and Kruskal-Wallis Tests. In this example, we provide NXG Explorer results for ANOVA and Kruskal-Wallis analysis of the continuous variables in the Retinol dataset, with vitamin use status vituse as the grouping variable. Recall, there are 3 levels of the vitamin use category: 1=Yes, 2=Sometimes, 3=No. To begin, import the Retinol.xlsx file. Next, select Analysis→Independence→Unpaired 2 and k-sample tests, shown as
Next, specify vituse as the grouping feature, and select all of the remaining features (left panel) as the test features:
Then click on the Apply button in the next popup window that appears:
When the run is complete, click on the Grouped analysis icon shown in the treenodes:
The panel with the tabular results of the ANOVA test will appear:
The above results were generated with the following default setting in PreferencesGeneral default settings
which will exclude the non-parametric Kruskal-Wallis test results as well as Bartlett’s equality of variance test results. Un-check the box shown, then rerun the ANOVA for the same features, and the following results will be generated:
which shows the Kruskal-Wallis test results, as well as Bartlett’s equality of variance test results.
You can also specify default settings for outputting only parametric or non-parametric test results. Another important point is that NXG Explorer will default to the Welch ANOVA when Bartlett’s equality of variance test is significant, and in this case you will notice that the d.f. for the tests change. This behavior cannot be changed, since it is inappropriate not to use the Welch ANOVA in the presence of heteroscedasticity.
Confounding occurs when variables not being tested (e.g., age, gender) are independently related to both the outcome and the exposure. Matching is used to make subjects in different exposure groups more similar in terms of extraneous variables that are independently associated with the outcome. The goal of matching is therefore to ensure that differences in outcome between the exposure groups are more likely to be due to exposure rather than to confounding variables. The overall effect of matching may be to reduce the standard deviation of the differences rather than the differences themselves. Since the standard deviation is in the denominator of the t-test, the reduced standard deviation increases the test statistic. This reduction in variance causes an increase in statistical efficiency, which is the power to detect a difference for a fixed sample size.
For paired samples, the -test is
$$ t = \frac{\bar{d}}{s_d / \sqrt{n}} \qquad (5.38) $$
where is the mean of the pairwise differences for each record, and is the standard deviation of differences given as
$$ s_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}} \qquad (5.39) $$
Worked problem. Perform a matched paired t-test to determine if concentration of a protein A is significantly different in tissue and plasma. The table below shows the data:
ID | Protein A (plasma) | Protein A (tissue) | Difference | Squared difference |
1 | 0.289 | 1.027 | -0.738 | 0.54 |
2 | -0.243 | 0.592 | -0.835 | 0.70 |
3 | 0.082 | 0.795 | -0.713 | 0.51 |
4 | -0.168 | 1.393 | -1.561 | 2.44 |
5 | 0.007 | 0.814 | -0.808 | 0.65 |
Substituting in the data, we have for the s.d.
| (5.40) |
and for the t-statistic, we get:
| (5.41) |
The tabled critical value, , is equal to 2.78, and since we conclude that the two means are significantly different. In fact, the p-value is 0.0042, which is less than .
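This paired t-test can be verified with SciPy using the plasma and tissue values from the table above; it should reproduce a p-value of roughly 0.004.

```python
# Paired t-test for the plasma vs. tissue protein A data above.
from scipy import stats

plasma = [0.289, -0.243, 0.082, -0.168, 0.007]
tissue = [1.027, 0.592, 0.795, 1.393, 0.814]

t, p = stats.ttest_rel(plasma, tissue)
print(t, p)   # compared with Student's t on n-1 = 4 d.f.
```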
Wilcoxon Signed Rank Test. The Wilcoxon signed rank test is the non-parametric variant of the paired t-test. Let’s perform a Wilcoxon signed rank test for matched pairs data and then show worked equations.
ID | Protein A | Protein B | |||||
1 | 0.122 | 0.004 | 0.117 | 0.117 | 3 | 3 | |
2 | 0.456 | 0.418 | 0.038 | 0.038 | 2 | 2 | |
3 | -0.139 | 0.522 | -0.661 | 0.661 | 8 | 8 | |
4 | 0.155 | -0.048 | 0.203 | 0.203 | 5 | 5 | |
5 | 0.158 | 0.481 | -0.323 | 0.323 | 6 | 6 | |
6 | -0.112 | 0.066 | -0.179 | 0.179 | 4 | 4 | |
7 | 0.374 | 0.380 | -0.006 | 0.006 | 1 | 1 | |
8 | -0.032 | 0.373 | -0.404 | 0.404 | 7 | 7 | |
In the table above, is the difference between the two values in a row, is the absolute value of the difference, is the rank of the absolute difference, is the rank of when the original (raw) difference is negative, and is the rank of when the original (raw) difference is positive.
Next, we substitute in the value of along with the sample size to get
| (5.42) |
Since , we cannot reject the null hypothesis that the two concentrations are equal, and conclude that they are not significantly different.
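The Wilcoxon signed rank test can be checked with SciPy using the matched protein A and protein B values above; for n = 8 pairs SciPy may report an exact p-value rather than the normal approximation in Eq. (5.42), so small differences from the hand calculation are expected.

```python
# Wilcoxon signed rank test for the matched protein A / protein B data above.
from scipy import stats

protein_a = [0.122, 0.456, -0.139, 0.155, 0.158, -0.112, 0.374, -0.032]
protein_b = [0.004, 0.418, 0.522, -0.048, 0.481, 0.066, 0.380, 0.373]

w, p = stats.wilcoxon(protein_a, protein_b)
print(w, p)   # not significant, consistent with the worked result
```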
Example 5 - Paired t-test and Wilcoxon signed rank tests. In this example, both the paired t-test and Wilcoxon signed rank tests results are run for the fat and fiber features of the Retinol dataset. To begin, specify the “Paired mean inequality tests” option of the Independence pull-down menu, shown as follows:
Next, select two continuous variables for testing:
The output for both the paired t-test and the Wilcoxon signed rank tests will be visible after clicking on the output icon, which is shown below
Results of the tests indicate that the mean values for fat and fiber measurements are significantly different.
Tests for equality of 2 independent proportions are commonly used in statistics. The notation used for tests of proportions uses and , where is the proportion of successes (failures) out of the total sample size for sample 1, and is the proportion of successes (failures) out of the total sample size for sample 2. Let be a binary outcome feature taking on values (0,1), which may represent (1=success, 0=failure), (1=tumor, 0=normal), etc., for the th object in the th group . The proportion in each group can be determined either by summing the x-values within each group to obtain the total number of occurrences for which , or by using a logical statement counting the total number of ones in each group.
| (5.43) |
where represents the total number of ones among all values of . Similarly, is defined as
| (5.44) |
NXG Explorer employs a pooled and unpooled version of the test, whose statistic is distributed standard normal, N(0,1). The pooled version is functionally composed as
| (5.45) |
where and . The unpooled variant of the test is
| (5.46) |
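A minimal sketch of the pooled and unpooled z-tests, following Eqs. (5.45) and (5.46), is shown below; the success counts and sample sizes are hypothetical placeholders.

```python
# Sketch: pooled and unpooled z-tests for two independent proportions.
from math import sqrt
from scipy import stats

x1, n1 = 30, 120    # successes and sample size, group 1 (hypothetical)
x2, n2 = 45, 130    # successes and sample size, group 2 (hypothetical)

p1, p2 = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)           # pooled proportion
q_bar = 1 - p_bar

z_pooled = (p1 - p2) / sqrt(p_bar * q_bar * (1 / n1 + 1 / n2))
z_unpooled = (p1 - p2) / sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Two-tailed p-values from the standard normal distribution
print(z_pooled,   2 * stats.norm.sf(abs(z_pooled)))
print(z_unpooled, 2 * stats.norm.sf(abs(z_unpooled)))
```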
Example 6 - Test for 2 independent proportions. For this example, we will use NXG Explorer to create the binary outcome feature. To begin, open the Retinol dataset and select the test of proportions option from the Independence command of the Analysis pull-down menu, shown as
Using the Retinol dataset, select sex as the grouping variable, and smokstat as the outcome variable. Next, we will define a success as subjects that have values of 3 for smoking status (smokstat=3) and failure for subjects with other values of smokstat equal to 1 or 2. (Recall, the categorical levels of smoking status in the Retinol dataset are 1=Never, 2=Former, 3=Current Smoker). In addition, select an alternative hypothesis of in order to specify a one-tailed test. This can be done using options in the popup window for test of proportions:
The results of the one-tailed test of proportions are shown below, which indicate that the proportion of subjects with values of smokstat=3 are not significantly different because the p-values of 0.726 and 0.713 for the pooled and unpooled tests are not less than 0.05.
Data are often collected to identify the number of subjects which fall into categories. Examples are the number of new patients with breast cancer, lung cancer, or no cancer; low-intermediate-high risk; or diagnostic category A, B, C, etc. The test is suitable for analyzing count data arranged into or more categories [9]. The method is a goodness-of-fit test to determine if a significant difference exists between observed and expected counts based on the null hypothesis. In order to test a hypothesis, we need to be able to determine expected frequencies. The null hypothesis () specifies the proportion of counts expected to fall into each category for a sample drawn from the presumed population. Thus, we deduce the expected frequencies from the null hypothesis. The test determines how close the observed frequencies are to the expected frequencies under , and is tested with the chi-squared function
$$ \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \qquad (5.47) $$
where is the number of observed counts and is the number of expected counts in the th cell. When differences between the observed and expected counts are small, the chi-squared test statistic will be small, whereas for large differences the chi-squared test statistic will be large. Squaring the non-zero differences results in positive values, and therefore the chi-squared test statistic is always greater than or equal to zero. For a one-way chi-squared test, the number of degrees of freedom (d.f.) is equal to . For a multiway chi-squared test, the d.f. are equal to the number of rows minus one times the number of columns minus one, that is, .
A two-way table provides a cross-tabulation of counts as a function of population and class. This can be shown in the following configuration:
Class 1 | Class 2 | Total | |
Population 1 | |||
Population 2 |||
Total | |||
If we consider the counts which are used for the chi-squared test, we have
Characteristic B | |||
Characteristic A | Present | Absent | Total |
Present |||
Absent | |||
Total | |||
which can be transformed into proportions according to the tabular notation:
Characteristic B | |||
Characteristic A | Present | Absent | Total |
Present |||
Absent | |||
Total | |||
Thus, the chi-squared test is really a test of equal proportions among the various categories involved
| (5.48) |
which translates to our original formulation
| (5.49) |
The expected number of counts in each cell, , is determined by the relationship
| (5.50) |
where is the total count in row and is the total count in column . Let’s consider an example chi-squared test involving the frequency of breast cancer among mothers whose first age at giving birth was less than or greater than age 30:
First age at giving birth | |||
Cancer status | Total | ||
Case | |||
Control | |||
Total | |||
First, we need to calculate the number of expected cases in each cell. Recall, the expected frequency in each cell is
| (5.51) |
Upon substitution of the data, we have
and finally, substituting the observed and expected frequencies into the equation for chi-squared, we have
| (5.52) |
| (5.53) |
The calculated value of 79.99 far exceeds the lookup critical value of 3.84 for the with 1 d.f. Hence, we conclude that there is a significant association between first age at giving birth and breast cancer based on these data.
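A chi-squared test on a 2x2 table of observed counts can be run with SciPy as sketched below. The counts are hypothetical placeholders (the worked example's cell counts are not repeated here), and correction=False gives the uncorrected statistic of Eq. (5.49).

```python
# Sketch: chi-squared test for a 2x2 contingency table of observed counts.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[120, 80],    # cases:    age < 30, age >= 30 (hypothetical)
                     [200, 300]])  # controls: age < 30, age >= 30 (hypothetical)

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p, dof)
print(expected)   # expected counts = (row total x column total) / grand total
```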
Fisher’s exact test is a statistical test used to determine if there are nonrandom associations between two categorical variables [10]. Let there exist two such variables and , with and observed states, respectively. Now form an matrix in which the entries represent the number of observations in which and .
Marker | |||
Disease | Positive | Negative | Total |
Present | |||
Absent | |||
Total | |||
where the row total is given by
| (5.54) |
and column total by
| (5.55) |
with a grand table total
| (5.56) |
Next, calculate the conditional probability of getting the actual matrix given the particular row and column sums, given by
| (5.57) |
which is a multivariate generalization of the hypergeometric probability function. Next, find all possible matrices of nonnegative integers consistent with the row and column sums and . For each one, calculate the associated conditional probability , where the sum of these probabilities must be 1.
To compute the p-value of the test, the tables must then be ordered by some criterion that measures dependence, and those tables that represent equal or greater deviation from independence than the observed table are the ones whose probabilities are added together. There are a variety of criteria that can be used to measure dependence. In the case, which is the one Fisher looked at when he developed the exact test, either the Pearson chi-squared statistic or the difference in proportions (which are equivalent) is typically used. Other measures of association, such as the likelihood-ratio test (G-squared) or any of the other measures typically used for association in contingency tables, can also be used.
The test is most commonly applied to matrices, and is computationally unwieldy for large or . For tables larger than , the difference in proportions can no longer be used, but the other measures mentioned above remain applicable (and in practice, the Pearson statistic is most often used to order the tables). In the case of the matrix, the p-value of the test can be simply computed by the sum of all .
For an example application of the test, let represent the positive or negative result of a protein marker, and the number of patients with and without a disease. If there are 5 diseased patients and 1 non-diseased patient who test positive, and 0 diseased and 4 non-diseased patients who test negative, then the configuration of these data would be
Marker | |||
Disease | Total | ||
Present | 5 | 0 | 5 |
Absent | 1 | 4 | 5 |
Total | 6 | 4 | 10 |
Computing gives
,
and the other possible matrices and their s are
Marker | |||
Disease | Total | ||
Present | 4 | 1 | 5 |
Absent | 2 | 3 | 5 |
Total | 6 | 4 | 10 |
;
Marker | |||
Disease | Total | ||
Present | 3 | 2 | 5 |
Absent | 3 | 2 | 5 |
Total | 6 | 4 | 10 |
;
Marker | |||
Disease | Total | ||
Present | 2 | 3 | 5 |
Absent | 4 | 1 | 5 |
Total | 6 | 4 | 10 |
;
Marker | |||
Disease | Total | ||
Present | 1 | 4 | 5 |
Absent | 5 | 0 | 5 |
Total | 6 | 4 | 10 |
,
which indeed sum to 1, as required. The sum of values less than or equal to p=0.0238 is then 0.0476 which, because it is less than 0.05, is significant. Therefore, in this case, there would be a statistically significant association between the marker and disease.
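SciPy's implementation of Fisher's exact test reproduces this two-sided p-value for the table above:

```python
# Fisher's exact test for the 2x2 marker-by-disease table above.
from scipy.stats import fisher_exact

table = [[5, 0],
         [1, 4]]
odds_ratio, p = fisher_exact(table, alternative='two-sided')
print(p)   # approximately 0.0476, matching the sum of table probabilities above
```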
When the table being analyzed is too large to consider all possible configurations, permutations are used by re-shuffling categories for one of the two categorical variables, each time recalculating in order to simulate the null distribution. After iterations, the empirical p-value is
| (5.58) |
Tables with more than two rows and columns are called multiway contingency tables. The table below shows the setup of a multiway contingency table:
Class | |||||
1 | 2 | Totals | |||
Population 1 | |||||
Population 2 | |||||
Population | |||||
Totals | |||||
The notation, using for the number of rows and for the number of columns, is the same as for two-way and one-way tables [11]. Using an table, we can test whether cell probabilities differ from sample to sample. We can also use an table for a single sample whose counts have been partitioned into categories according to one criterion and into categories according to another criterion. The application of chi-squared testing is the same for both methods.
Example 7 - Contingency table analysis. Gleason scoring is one of the most reliable pathological scoring systems for staging prostate cancer severity and growth. The higher the Gleason score for a tumor tissue sample, the more severe the prostate cancer. The table below lists the number of prostate cancer patients as a function of Gleason score category and diagnostic grade.
Gleason score | Class 1 | Class 2 | Class 3 | Class 4 | Totals |
(expected) | 7.3 | 30.3 | 38.0 | 5.4 | |
Gleason 1-3 | 23 | 40 | 16 | 2 | 81 |
(expected) | 18.6 | 77.5 | 97.1 | 13.8 | |
Gleason 4-6 | 11 | 75 | 107 | 14 | 207 |
(expected) | 9.1 | 38.2 | 47.9 | 6.8 | |
Gleason 7-9 | 1 | 31 | 60 | 10 | 102 |
Totals | 35 | 146 | 183 | 26 | 390 |
Null Hypothesis: The proportion of cases in Gleason categories is the same for all stages. The alternative hypothesis is that the proportions of cases in Gleason categories are different across stage.
| (5.59) |
The expected number of counts in each cell is
| (5.60) |
where is the total count in row and is the total count in column .
Significance: Let , , the total number of samples.
Sampling Distribution: The distribution for an table has d.f. Specifically, for this example we get d.f.
Rejection Region: will be rejected if the probability of occurrence of with 6 d.f. under the null hypothesis is less than or equal to .
Decision Rule: Reject values of that exceed for 6 d.f.
| (5.61) |
The calculated test statistic of 69.2 far exceeds the critical value of 22.46 for with 6 d.f., so the decision rule is to reject , and the conclusion is that there exists a significant association between Gleason score and tumor grade, with p < 0.001.
Consider the table
1 | 2 | Totals | |||||
1 | |||||||
2 | |||||||
Totals | |||||||
For each row, the multinomial probability of the row total is
| (5.62) |
where is the conditional probability in column () given row (Freeman and Halton, 1951; Agresti, 2002).
The probability of the table is the product of all of the multinomial row probabilities
| (5.63) |
Under independence, we assume that in each column the conditional cell probabilities are equal (including the row total), shown as . Thus, let’s substitute for and for in the above equation and obtain
| (5.64) |
The distribution of the cell counts in each row, however, also depends on . Fisher treated these as nuisance parameters and conditioned on them so that the resulting conditional probability does not depend on them. The contribution of to the product multinomial distribution depends on the data through , which are the sufficient statistics. Each column total has the multinomial distribution
| (5.65) |
The probability of conditional on equals the probability of divided by the probability of , so we invert the above ratio and multiply as
| (5.66) |
above is first calculated from the observed data. Permutations are then used by re-shuffling the categories of one of the two categorical variables being considered, each time recalculating in order to simulate the null distribution. After iterations, the exact p-value is
| (5.67) |
Randomization Test. You will notice that even for small tables, e.g., 3x4, 4x4, 4x5, 5x5, the number of iterations required to fully implement the exact form of the test can exceed configurations. Therefore, NXG Explorer uses the following randomization test:
The p-value for this randomization test is
| (5.68) |
that is, the number of times the chi-squared value for the permuted configurations exceeds or is equal to the chi-squared value for the raw data configuration (table), divided by the number of iterations.
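A minimal sketch of this randomization test is shown below: the categories of one variable are re-shuffled, the chi-squared statistic is recomputed for each permutation, and the empirical p-value is the proportion of permuted statistics that meet or exceed the observed one. The category labels below are hypothetical, and this is an illustration of the approach, not NXG Explorer's own code.

```python
# Sketch: permutation (randomization) test for an r x c contingency table.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Hypothetical paired category labels (one pair of labels per subject)
a = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1, 2])
b = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

def chi2_stat(x, y):
    # Build the contingency table and return the uncorrected chi-squared value
    table = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        table[xi, yi] += 1
    return chi2_contingency(table, correction=False)[0]

observed = chi2_stat(a, b)
n_iter = 5000
count = sum(chi2_stat(a, rng.permutation(b)) >= observed for _ in range(n_iter))
print(observed, count / n_iter)   # empirical p-value, as in Eq. (5.68)
```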
McNemar’s test is a chi-squared test of the significance of change among objects for which the presence (+) or absence (-) of a trait is determined before and after an exposure (treatment) is given. As an example, consider an intervention study to determine whether hearing loss could be significantly improved after cochlear implant therapy among patients with severe tinnitus (ringing in the ears). The number of patients testing positive for severe tinnitus before and after cochlear implant therapy is listed in the table below. McNemar’s test will allow you to determine whether the therapy significantly changed the frequency of tinnitus, and whether the outcome after therapy was better or worse.
The data table for patients undergoing cochlear implant therapy and the presence of tinnitus is given below:
Before | |||
After | Total | ||
330 | 141 | = 471 | |
351 | 383 | = 734 | |
= 681 | = 524 | = 1205 | |
| (5.69) |
The degrees of freedom (d.f.) for this test is equal to 1. The calculated test statistic of 89.63 is significant, since it exceeds (or is equal to) the critical value of chi-squared for and 1 d.f., . Therefore the p-value would be less than 0.05, i.e., .
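The McNemar statistic can be reproduced directly from the discordant (off-diagonal) counts in the table above; the uncorrected form used here gives 89.63.

```python
# McNemar's chi-squared from the discordant counts in the table above.
from scipy.stats import chi2

b, c = 141, 351                      # off-diagonal (discordant) cell counts
stat = (b - c) ** 2 / (b + c)        # McNemar chi-squared, 1 d.f.
p = chi2.sf(stat, df=1)
print(stat, p)
```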
Example 8 - McNemar Test. Open the mcnemar_data.xlsx file that is in the Data folder. Next, select Independence→McNemar in the Analysis pull-down menu:
It is important to realize that both features used for the McNemar test must be binary (0,1) codes. Binary features will have the binary-type icon associated with them. Next, select both binary features:
The results are shown below. While the change is not significant, there were 22% more subjects who changed from positive to negative compared with subjects who changed from negative to positive, so the pattern indicates that, on average, the cochlear implant therapy resulted in an improvement. Obviously, in clinical medicine, for an efficacious treatment you would expect many more subjects to change from positive to negative, and few subjects whose conditions worsened.
[1] M.S. Bartlett. Properties of sufficiency and statistical tests. Proc. Royal Stat. Soc., Series A, 160: 268–282, 1937.
[2] W. Behrens. Ein Beitrag zur Fehlerberechnung bei wenigen Beobachtungen. Landwirtschaftliche Jahrbücher, 68: 807–837, 1929.
[3] R. Fisher. The comparison of samples with possibly unequal variances. Annals of Eugenics, 9: 174–180, 1939.
[4] W. Dixon, F. Massey. Introduction to Statistical Analysis. New York (NY), McGraw-Hill, 1969.
[5] H. Mann, D. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat., 18:50–60, 1947.
[6] A.M. Mood. On the asymptotic efficiency of certain non-parametric two-sample tests. Ann. Math. Stat., 25: 514–522, 1954.
[7] B.L. Welch. On the comparison of several mean values. Biometrika, 38(3/4): 330–336, 1951.
[8] W. Kruskal, W. Wallis. Use of ranks in one-criterion variance analysis. J. Amer. Stat. Assoc., 47: 583–621, 1952.
[9] S. Siegel. Nonparametric Statistics. New York(NY), McGraw-Hill, 1956.
[10] G.H. Freeman, J.H. Halton. Note on an exact treatment of contingency, goodness of fit and other problems of significance. Biometrika, 38: 141–149, 1951.
[11] A. Agresti. Categorical Data Analysis. New York(NY), John Wiley and Sons, 2002.
Association is defined as the presence of a linear relationship between two continuously-scaled features. The classical example of two associated features in clinical medicine is age and systolic blood pressure (mmHg): as age increases, there is a tendency for systolic blood pressure to increase as well. Recall that the variance of a normally-distributed continuously-scaled random variable is defined in the form
$$ s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (6.1) $$
If we consider another continuous feature , then the covariance between and is
$$ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \qquad (6.2) $$
The above definitions apply to the bivariate case, where association between a pair of features is considered. The covariance for multiple features is covered at the end of this chapter.
Pearson Product Moment Correlation. The Pearson product moment correlation coefficient is similar to covariance, but does not include the effect of each feature’s scale. Pearson correlation between features and is the ratio of their covariance to the product of their standard deviations given as
$$ r_{xy} = \frac{s_{xy}}{s_x s_y} \qquad (6.3) $$
The correlation coefficient is scale-independent and dimensionless, with range . When approaches 1, there is a strong linear association between and ; as approaches 0, the linear association between and gets weaker. A negative value of indicates that as one feature goes up in scale, the other feature goes down. Recall that the population-based notation for covariance and standard deviation are and , , respectively, and thus correlation can be written as
$$ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \qquad (6.4) $$
Correlation: Z-score implementation. A much simpler formula for correlation involves the cross-product of the Z-scores for and . Using the Z-scores
$$ z_{x_i} = \frac{x_i - \bar{x}}{s_x} \qquad (6.5) $$
$$ z_{y_i} = \frac{y_i - \bar{y}}{s_y} \qquad (6.6) $$
the correlation can be written in the form
$$ r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} z_{x_i} z_{y_i} \qquad (6.7) $$
We can test the significance of a correlation coefficient using the null hypothesis vs. the alternative hypothesis by use of Fisher’s -transformation. For the sample estimate we use
$$ z_r = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right) \qquad (6.8) $$
whereas for the hypothesized coefficient we use
$$ z_{\rho_0} = \frac{1}{2}\ln\!\left(\frac{1+\rho_0}{1-\rho_0}\right) \qquad (6.9) $$
The two statistics are then used to calculate a standard normal deviate
$$ Z = \frac{z_r - z_{\rho_0}}{\sigma_{z_r}} \qquad (6.10) $$
where the standard error of is
$$ \sigma_{z_r} = \frac{1}{\sqrt{n-3}} \qquad (6.11) $$
Worked problem. A research study measured log(fluorescent intensity) values of expression of Corin (protein) in (a) prostate tumors and (b) serum plasma of =8 prostate cancer patients. [Note: log(fluorescent intensity) implies that expression values can be negative, e.g., log(0.6)=-0.51)]. Determine the covariance and Pearson correlation between the two “tissue compartments”, and test whether correlation is significantly different from .
Table: Covariance and Pearson correlation between expression of Corin in prostate tumor tissue and serum plasma.
Plasma | Tumor | Cov. | Corr. | ||||
log(intensity) | log(intensity) | (A) | (B) | (A)(B) | (C) | (D) | (C)(D) |
-1.69 | -1.65 | -1.07 | -1.12 | 1.20 | -0.77 | -0.81 | 0.62 |
0.10 | 0.98 | 0.71 | 1.51 | 1.08 | 0.51 | 1.09 | 0.56 |
-0.04 | -0.31 | 0.57 | 0.23 | 0.13 | 0.41 | 0.17 | 0.07 |
-0.89 | -0.68 | -0.27 | -0.14 | 0.04 | -0.20 | -0.10 | 0.02 |
-2.25 | -2.22 | -1.64 | -1.68 | 2.76 | -1.17 | -1.21 | 1.42 |
1.94 | 1.29 | 2.55 | 1.82 | 4.65 | 1.82 | 1.32 | 2.39 |
0.00 | 0.47 | 0.61 | 1.01 | 0.62 | 0.44 | 0.73 | 0.32 |
-2.08 | -2.17 | -1.46 | -1.63 | 2.39 | -1.04 | -1.18 | 1.23 |
-0.61 | -0.54 | 12.87 | 6.62 | ||||
1.40 | 1.39 | 1.8384 | 0.9455 | ||||
The calculated value of the correlation coefficient is =0.9455. Let’s now test whether this value of correlation is significantly different from zero, using the null hypothesis 0.00. Substituting in the values, we have
| (6.12) |
For =0.00 we use
| (6.13) |
Substituting in the remaining items, we get
| (6.14) |
| (6.15) |
So we reject the null hypothesis of zero correlation, since .
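The Pearson correlation and the Fisher z-test of zero correlation can be checked with the Corin data from the table above; the z-test below follows the transformation worked through in Eqs. (6.8)-(6.11) and should closely reproduce the worked values.

```python
# Pearson correlation for the Corin plasma/tumor data, plus the Fisher z-test
# of H0: rho = 0.
import numpy as np
from scipy import stats

plasma = np.array([-1.69, 0.10, -0.04, -0.89, -2.25, 1.94, 0.00, -2.08])
tumor  = np.array([-1.65, 0.98, -0.31, -0.68, -2.22, 1.29, 0.47, -2.17])

r, p_r = stats.pearsonr(plasma, tumor)

n = len(plasma)
z_r = 0.5 * np.log((1 + r) / (1 - r))    # Fisher z-transform of the sample r
se = 1 / np.sqrt(n - 3)                  # standard error of z_r
z = z_r / se                             # test of rho = 0 (z_rho0 = 0)
print(r, z, 2 * stats.norm.sf(abs(z)))
```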
The Spearman rank correlation coefficient, , is an alternative method for determining the correlation between two features and . Spearman correlation is a non-parametric method that does not depend on the means and standard deviations of the pair of vectors and . Let be the ranks of and be the ranks of . Spearman correlation is determined as
$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \qquad (6.16) $$
where:
$$ d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i) \qquad (6.17) $$
Significance testing of is based on the standard normal distribution. Using the expected value of as
$$ E(r_s) = 0 \qquad (6.18) $$
and the variance of as
$$ \operatorname{Var}(r_s) = \frac{1}{n-1} \qquad (6.19) $$
the test statistic is
$$ Z = \frac{r_s - E(r_s)}{\sqrt{\operatorname{Var}(r_s)}} = r_s \sqrt{n-1} \qquad (6.20) $$
Worked problem. A research group studied log(expression) of two serum plasma proteins (A & B) in =8 laboratory mice after several weeks of feeding via a high-fat diet. [Note: log(expression) implies that expression values can be negative, e.g., log(0.6)=-0.51)]. Determine the Spearman rank correlation between the two proteins, and test whether correlation is significantly different from
Table: Spearman correlation between expression values of two serum plasma proteins.
Protein A | Protein B | ||||
log(expression) | log(expression) | (A) | (B) | (A)-(B) | ((A)-(B)) |
-0.18 | 0.26 | 5 | 7 | -2 | 4 |
-0.57 | -1.03 | 4 | 2 | 2 | 4 |
0.59 | 0.18 | 7 | 6 | 1 | 1 |
1.78 | 1.50 | 8 | 8 | 0 | 0 |
-1.97 | -1.46 | 1 | 1 | 0 | 0 |
0.53 | -0.26 | 6 | 5 | 1 | 1 |
-0.83 | -0.75 | 2 | 4 | -2 | 4 |
-0.69 | -0.85 | 3 | 3 | 0 | 0 |
14 | |||||
Substituting in the results, we get:
| (6.21) |
| (6.22) |
Significance testing of is based on the standard normal distribution. Using the expected value of as
| (6.23) |
and the variance of as
| (6.24) |
and upon substitution, we have
| (6.25) |
So we reject the null hypothesis, since .
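The Spearman rank correlation for this worked problem can be verified with SciPy; note that SciPy's p-value uses a t approximation, so it may differ somewhat from the normal (Z) approximation used above.

```python
# Spearman rank correlation for the protein A / protein B data above.
from scipy import stats

protein_a = [-0.18, -0.57, 0.59, 1.78, -1.97, 0.53, -0.83, -0.69]
protein_b = [ 0.26, -1.03, 0.18, 1.50, -1.46, -0.26, -0.75, -0.85]

rho, p = stats.spearmanr(protein_a, protein_b)
print(rho, p)   # rho is approximately 0.83 for these ranks
```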
The advantages of Spearman correlation are as follows:
The disadvantages are:
NXG Explorer also provides force plots derived from the minimum spanning trees of the correlation matrices. First, the distance between features (objects) and is determined as , where is the Pearson or Spearman rank correlation. Prim’s algorithm [1] is then applied to the distance matrix to construct the minimum spanning tree.
Example 1 - Pearson and Spearman rank correlation. This run will determine the Pearson and Spearman correlation between all pairs of continuous features in the Retinol data. To begin, first open the Retinol dataset. Next, select Analysis→Association→Correlation, Covariance, or Euclidean distance from the Analysis pull-down menu.
Next, select all the continuous features in the Retinol dataset, then use the default settings in the popup window shown below:
The resulting correlation matrix will be shown after clicking on the treenode icon for “Pearson correlation (html)”:
Using html format, the Pearson and Spearman correlation coefficients are provided along with color coding to distinguish significant negative and positive correlation coefficients. In other words, a test of significance is automatically performed at run-time, and testing results are revealed via color coding:
An additional plot is made showing the p-value results based on the same color coding used for the html output:
Click on the Force plot(Pearson) and Force plot(Spearman) icons in the treenodes, and the following images will appear:
The minimum spanning trees shown above depict the various relationships between the features employed, and differences between force plots constructed from Pearson correlation and Spearman correlation matrices are due to the range and scale of the features used. In cases where outliers are present in the data, Spearman rank-based force plots will give a better picture of the true minimum spanning tree.
Example 2 - Object-by-Object correlation. Correlation is commonly applied to features; however, it can also be applied to objects to identify objects with similar correlation across their feature values. This example will use a dataset based on expression of 23 genes in 4 classes of small round blue cell tumors (SRBCT). First, open the SRBCT_wo_classes.xlsx dataset. Transpose the entire dataset by selecting the following commands:
Then select all the features in the dataset, without selecting the Text-based features.
Then, select all of the objects for correlation analysis. The results of the significance plot for Pearson correlation are shown below:
The 23 genes in the SRBCT dataset were identified as the best class predictors for the four tumor types, and the high co-regulation (co-expression) among genes within the four classes (EWS-Ewing’s sarcoma, BL-Burkitt’s lymphoma, NB-neuroblastoma, and RMS-rhabdomyosarcoma) can be seen in the correlation significance plot, where large red areas indicate statistically significant positive correlation. The significant positive correlation (red) is apparent only within each class, not between classes. This object-by-object approach to correlation can greatly strengthen exploratory analysis by confirming impressions about correlation between objects.
Let be the set of all data features over the field of real numbers, and the set of symmetric -square matrices over the field of real numbers. Define the data matrix with rows and columns as a function on the pairs of integers , with values in Y, in which designates the value of Y at the pair , shown as
| (6.26) |
The sample mean for the th column of Y is
| (6.27) |
so that the sample mean vector is given by
| (6.28) |
The standardized data matrix of Y is
| (6.29) |
with elements
| (6.30) |
The sample covariance matrix of Y is
| (6.31) |
with diagonal elements determined with the form
| (6.32) |
and off-diagonal elements as
| (6.33) |
Below is a plot of the covariance matrix for all features in the Retinol dataset:
The symmetric sample correlation matrix of Y is
| (6.34) |
with elements
| (6.35) |
The proximity matrix D contains the Euclidean distances between each pair of features and appears in the form
| (6.36) |
with elements
| (6.37) |
The Euclidean distance matrix for all Retinol dataset features is shown below:
Columns of Y are usually measured on different scales (units), and it is therefore possible that one or more features will strongly influence the elements in D simply because of their large scale differences. To remove scale differences, the correlation matrix R is often used in place of the covariance matrix because its elements are scale-independent and have potentially much less variation across the features. We advocate use of the correlation matrix R generated from the standardized data matrix Z rather than from Y in order to minimize additive and relative changes across features while removing possible scale effects.
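A short numpy sketch of these matrix quantities is shown below; the small data matrix Y is hypothetical, and the final step illustrates the recommendation above by computing distances on the standardized matrix Z.

```python
# Sketch: covariance, correlation, and feature-by-feature Euclidean distance
# matrices for a data matrix Y (rows = objects, columns = features).
import numpy as np

Y = np.array([[1.2, 3.4, 0.5],
              [2.1, 2.9, 0.9],
              [0.7, 4.1, 0.2],
              [1.9, 3.0, 1.1],
              [1.4, 3.6, 0.4]])   # hypothetical data

C = np.cov(Y, rowvar=False)        # sample covariance matrix (p x p)
R = np.corrcoef(Y, rowvar=False)   # sample correlation matrix (p x p)

# Euclidean distances between pairs of feature columns of Y
cols = Y.T
D = np.sqrt(((cols[:, None, :] - cols[None, :, :]) ** 2).sum(axis=-1))

# Standardized data matrix Z; distances computed on Z are free of scale effects
Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
DZ = np.sqrt(((Z.T[:, None, :] - Z.T[None, :, :]) ** 2).sum(axis=-1))

print(C, R, D, DZ, sep="\n\n")
```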
The definitions of bivariate and multivariate covariance and correlation described in the preceding sections can be further expounded on from the perspective of matrix algebra. The use of covariance and correlation matrices allows us to consider additional concepts involving their determinants, eigenanalysis, and orthogonality.
Design (Data) Matrix Consider the following (data) design matrix:
| (6.38) |
By definition, the population covariance matrix is
| (6.39) |
where is a -tuple for the th object and is a -tuple with means for the features. Because it is impossible to measure , we are limited to working with the sample covariance matrix, given as
| (6.40) |
Let , the sample correlation matrix is
| (6.41) |
During computation, elements of can be determined individually using the relationship
| (6.42) |
If the means of columns of are known, covariance elements of can be directly calculated using the functional form
| (6.43) |
whereas, if the standard deviations and correlations are already known, then the covariance matrix elements can be calculated using
| (6.44) |
Properties of Covariance Matrix.
Properties of Correlation Matrix.
Independence and Multicollinearity. Under independence of features (no covariance), the properties of C and R are:
For multicollinearity:
| (6.45) |
[1] R.C. Prim. Shortest connection networks and some generalizations. Bell System Technical Journal. 36(6): 1389–1401, 1957.
Linear regression focuses on the dependency between a dependent variable (feature) and an independent variable (feature) [1, 2]. Define the following estimators:
| (7.1) |
Next, let the objective function be dependent on the parameters and . The objective function is minimized using the method of least squares, for which the sum of the squared residuals, , is minimized. Recall that the residuals are .
| (7.2) |
The above section introduced linear regression based on a single predictor; however, it is more common to solve regression problems with multiple predictors in order to adjust for the effects of confounders or nuisance factors. Therefore, we focus more on multiple linear regression, introduced below.
Multiple linear regression employs more than one predictor feature, and still only uses one outcome -feature. The calculations used for determining regression coefficients are based on matrix algebra, which is described in the section below.
Matrix Form of Multiple Linear Regression. A standard multiple linear regression model is
| (7.3) |
In matrix form, the model equation is expressed as
| (7.4) |
where is the data matrix with dimensions , is the number of rows in , is the number of columns (# features) in , is the th row , and is the th feature (column) .
In theory, since we cannot access an infinitely large universal population to make inferences, we have to generalize results using a sample. When doing so, we replace the terms with the maximum likelihood estimates . Without loss of generality, we nevertheless use interchangeably with .
By use of calculus (involving derivatives of several matrices) it can be shown that the normal equations can be rearranged to solve for the regression coefficients in the form
$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y} \qquad (7.5) $$
The necessary matrix operations for solving for the regression coefficients are
| (7.6) |
Matrix notation when no constant term is used. When no constant term is used, the matrix notation is:
| (7.7) |
Matrix notation when constant term is used. Using expanded form, when a constant term is introduced into the regression model, a one is “padded” into the left side of the matrix in column one:
| (7.8) |
Once regression coefficients are fitted, we can cross-multiply them against their relevant x-values and add the products to obtain the predicted y-value, also known as “y-hat,” given as
| (7.9) |
where is an estimate of the intercept, and is an estimate of the th regression coefficient for feature . Using the y-hats, the residual for each observation can then be determined as
$$ e_i = y_i - \hat{y}_i \qquad (7.10) $$
Variance Partitioning – Sums of Squares. Multiple linear regression uses matrix algebra to determine the partitioned sum-of-squares. The total sum of squares for multiple regression is
| (7.11) |
where
| (7.12) |
and
| (7.13) |
The sum of squares error is functionally composed as
| (7.14) |
while the sum of squares regression is
| (7.15) |
The sums of squares help us partition the variance attributable to the regression and the resulting error after the regression coefficients are fit. If SSE=0, then the model fits the data perfectly. As SSE increases, the original -values become further from their predicted values, . If there is too much variation in the residuals, i.e., , the SSE will be large; in addition, the straight-line model assumption may be inappropriate. If we assume that the linear straight-line assumption holds, then an estimate of can be made from SSE. We call this estimate the mean square error, or MSE. The square root of the MSE, that is, , is called the root mean square error.
| (7.16) |
Finally, we calculate the RMSE using the notation
| (7.17) |
The mean square due to regression, MSR, is also calculated, but it is only used in the ANOVA table for the global F-test, whereas the MSE is used in multiple later calculations.
After partitioning the variance due to our regression model, we must next construct an ANOVA table, which will allow us to set up the additional calculation for the F-ratio test of the hypothesis =0. The alternative hypothesis (your belief) is that at least one of the coefficients is not equal to zero: . The test statistic for the F-ratio test is equal to . The calculated F statistic approximates a variate from the F-ratio probability distribution, which has two parameters: numerator degrees of freedom, , and denominator degrees of freedom, . In a linear regression problem, the F-statistic’s numerator degrees of freedom, , is always equal to , while the denominator degrees of freedom, , is always equal to . The operation implies subtracting one from the number of regression coefficients, and in our single-predictor model we know that : due to and . The denominator degrees of freedom is simply the sample size minus the number of regression coefficients, which, again, is for a single-predictor regression model. The degrees of freedom are required because SSE and SSR are estimators, which are penalized by a certain number of degrees of freedom during their formation. As the F-statistic increases, it suggests that the regression model explains more of the variance of than the SSE does. The ANOVA table is provided below.
Source of variation | Computational formula | df | MS | ||
Due to regression | |||||
Due to error | |||||
Total | |||||
Example. A researcher was interested in studying the dependency of a dependent feature, , on three independent features, , , and . Determine the percentage of variance of explained by the -predictors. Assume that linear regression is appropriate for determining the relationship between and the independent features. The table below shows the assumed values of the features.
ID | Y | X1 | X2 | X3 |
1 | 1.2 | -1.0 | -0.7 | 0.9 |
2 | 2.0 | -0.9 | 1.5 | 0.4 |
3 | 1.9 | 0.9 | -0.2 | 1.3 |
4 | 1.7 | -1.5 | 0.1 | -0.1 |
5 | 1.6 | 1.4 | 0.9 | -0.9 |
6 | 1.8 | -0.9 | 1.5 | 0.7 |
7 | 0.0 | -1.1 | -3.0 | 0.6 |
8 | 0.6 | -0.3 | 1.0 | -1.5 |
9 | 1.9 | 0.6 | -1.5 | 0.7 |
10 | 1.5 | -1.4 | 1.5 | -1.4 |
11 | 0.9 | 0.1 | -1.4 | -1.6 |
12 | 1.0 | 0.3 | -0.2 | -0.1 |
13 | 1.5 | -0.8 | 0.1 | 1.5 |
14 | 1.5 | -0.3 | 0.3 | -0.3 |
15 | 0.6 | -0.6 | -0.6 | 0.3 |
16 | 1.1 | 1.8 | -1.1 | -1.1 |
17 | 2.2 | 0.9 | -0.9 | 0.2 |
18 | 0.9 | -0.2 | -0.4 | 1.1 |
19 | 1.5 | 0.4 | -0.1 | -0.1 |
20 | 1.0 | -1.1 | 1.5 | 0.2 |
ANOVA Table Results.
Below is the resulting ANOVA table:
Source of variation | SS | df | MS | F |
Regression | 2.35 | 3 | 0.78 | 3.35 |
Error | 3.74 | 16 | 0.23 | |
Total | 6.09 | 19 | ||
In the ANOVA table, the formal statement of the null and alternative hypotheses is
| (7.18) |
If holds, . Large values of lead to conclusion . For this example, since the calculated test statistic F=3.35 exceeded the lookup critical value of F based on , =3, =16, which is 3.24, we conclude that is true, suggesting that at least one of the coefficients is not equal to zero.
Notation for the variance-covariance matrix of coefficients is:
| (7.19) |
Results for the variance-covariance matrix of coefficients:
| (7.20) |
The source of s.e. for the denominator of the t-test for each is:
| (7.21) |
After taking the square root of each diagonal element, we get:
| (7.22) |
Let us now determine if the regression coefficient is statistically significant. If the regression coefficient is a reliable estimate of the change in as a function of a 1-unit change in feature , then its value will be large, either positive or negative. A positive slope suggests that increases with increasing , whereas a negative slope indicates that decreases with increasing . Using the standard error of each regression coefficient, i.e., , we now calculate the t-test for each coefficient, , using the relationship
| (7.23) |
The hypothesis for any regression coefficient is: The formal statement of the null and alternative hypotheses is:
| (7.24) |
If holds, . Large values of the calculated test statistic lead to concluding is true.
Regression coefficient results. Below are the regression coefficients, and t-test results:
Variable | Coefficient | s.e. | t | Prob |
Const | 1.39434 | 0.11144 | 12.5119 | 0.0000 |
X1 | 0.26015 | 0.12386 | 2.1004 | 0.0519 |
X2 | 0.26492 | 0.09878 | 2.6821 | 0.0164 |
X3 | 0.22297 | 0.12244 | 1.8210 | 0.0874 |
We notice that feature is a significant predictor of , since , but that and are not.
Multiple Coefficient of Determination, . One of the overall goals of regression is to estimate the degree of variance explanation of Y by introduction of regression coefficients. This is determined using the coefficient of determination, , and the multiple r-squared, . Notice that reveals the “percentage of variance explanation” of the dependent feature, , by the independent -predictors in the model.
| (7.25) |
| (7.26) |
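For readers who wish to reproduce the worked example outside NXG Explorer, the numpy sketch below follows the matrix formulas in this chapter (normal equations, variance partition, coefficient standard errors and t-tests, the global F-test, and R-squared), using the 20 observations from the data table; small differences from the tabled output may arise from rounding.

```python
# Sketch: multiple linear regression for the worked example, via the normal
# equations and the ANOVA partition described in this chapter.
import numpy as np
from scipy import stats

y = np.array([1.2, 2.0, 1.9, 1.7, 1.6, 1.8, 0.0, 0.6, 1.9, 1.5,
              0.9, 1.0, 1.5, 1.5, 0.6, 1.1, 2.2, 0.9, 1.5, 1.0])
X = np.array([[-1.0, -0.7,  0.9], [-0.9,  1.5,  0.4], [ 0.9, -0.2,  1.3],
              [-1.5,  0.1, -0.1], [ 1.4,  0.9, -0.9], [-0.9,  1.5,  0.7],
              [-1.1, -3.0,  0.6], [-0.3,  1.0, -1.5], [ 0.6, -1.5,  0.7],
              [-1.4,  1.5, -1.4], [ 0.1, -1.4, -1.6], [ 0.3, -0.2, -0.1],
              [-0.8,  0.1,  1.5], [-0.3,  0.3, -0.3], [-0.6, -0.6,  0.3],
              [ 1.8, -1.1, -1.1], [ 0.9, -0.9,  0.2], [-0.2, -0.4,  1.1],
              [ 0.4, -0.1, -0.1], [-1.1,  1.5,  0.2]])

n = len(y)
X1 = np.column_stack([np.ones(n), X])        # pad constant term
k = X1.shape[1]                              # number of coefficients

beta = np.linalg.solve(X1.T @ X1, X1.T @ y)  # Eq. (7.5)
yhat = X1 @ beta
e = y - yhat                                 # residuals

sse = e @ e
sst = ((y - y.mean()) ** 2).sum()
ssr = sst - sse
mse = sse / (n - k)
msr = ssr / (k - 1)

F = msr / mse                                # global F-test
r2 = ssr / sst                               # coefficient of determination

cov_beta = mse * np.linalg.inv(X1.T @ X1)    # variance-covariance matrix of coefficients
se = np.sqrt(np.diag(cov_beta))
t = beta / se
p = 2 * stats.t.sf(np.abs(t), df=n - k)

print("F =", round(F, 2), " R^2 =", round(r2, 3))
for name, b, s, ti, pi in zip(["Const", "X1", "X2", "X3"], beta, se, t, p):
    print(name, round(b, 5), round(s, 5), round(ti, 4), round(pi, 4))
```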
Similarity between regression and correlation. Although regression is similar to correlation, the difference is that correlation essentially measures the tightness of data points around a straight line irrespective of slope, whereas regression focuses on the dependency of the dependent -feature(s) on one or more predictor -features. The correlation coefficient is a standardized regression coefficient for the regression , since
| (7.27) |
The goal of regression is to determine the amount of variance explanation due to the independent features and the slope measured by regression coefficients due to each independent feature. The degree of slope reflects the linear relationship between the dependent feature(s) and independent features.
Centering and estimation of Intercept Term. Prior to each regression run, NXG Explorer centers each predictor feature by subtracting off the mean value from each observation:
| (7.28) |
where is the mean for feature . After a regression run, NXG Explorer determines the -intercept (constant term) as follows:
| (7.29) |
The standard error of the -intercept is determined using the relationship:
| (7.30) |
where is the number of objects, is the number of features in the model, is the variance of , and is element of the variance-covariance matrix of coefficients, .
Regression diagnostics are statistical techniques for detecting conditions which can lead to inaccurate or invalid regression results. Techniques include:
Model Interpretation: DACOD Acronym. We can use the DACOD acronym for assuring valid regression analysis and interpretation:
Most regression analyses include a constant term (y-intercept) in the regression model. The basic data setup for regression diagnostics starts with the residuals, which are based on subtracting the predicted y-value, or “y-hat,” from the observed y-value. Recall that the subscript i represents the ith object or subject (human, mouse, rabbit, etc.). Let’s look at the many ways the regression model can be shown:
Recall, for multiple linear regression the vector and matrix notation is
| (7.31) |
which is equal to
| (7.32) |
To calculate the predicted y-hat for the ith object, we simply cross-multiply the x-values for the ith object with the corresponding fitted coefficients and add the constant term, given as
| (7.33) |
where the weighted sum of the x-values and fitted coefficients is called the linear predictor.
Residuals, . Residuals are defined as
| (7.34) |
and reflect the disagreement between the predicted and observed -values. Residuals are distributed with mean zero and common variance, .
A data setup including the y-hats and residuals, would be listed as
y | x1 | x2 | xp | yhat | resid | |
Once the residuals are determined, the error sum of squares can also be estimated as follows:
| (7.35) |
| (7.36) |
| (7.37) |
The initial focus in regression diagnostics involves determination of residuals for identifying outliers. An outlier is an observation (record) that is overly influential on the regression coefficients, leading to biased coefficients. In the sections below, we define the most commonly used residuals for multiple linear regression.
Standardized residuals, . Standardized residuals are defined as
| (7.38) |
where s is the square root of the residual mean square. Standardized residuals are distributed N(0,1). Obviously, if a standardized residual for a single observation is below -2 or above 2 (very close to 1.96), we know that the residual is more than two standard deviations from the mean. Any observation whose standardized residual is more than two standard deviations from the mean of all residuals is viewed as an outlier.
Studentized Residuals, . Studentized residuals are defined as
| (7.39) |
where s is the square root of the residual mean square and h is the leverage of the ith observation (leverages are discussed below). Studentized residuals are t-distributed with the appropriate degrees of freedom.
Deletion Variance, . Another measure of the residual is the change in variance resulting from deletion of the th observation from a regression analysis. Deletion variance is functionally defined as
| (7.40) |
where h is the leverage of the ith observation.
Jackknife Residuals, . Jackknife (“deletion”) residuals are defined as
| (7.41) |
Jackknife residuals are t-distributed with the appropriate degrees of freedom. Thus, if the absolute value of a jackknife residual exceeds the 95th percentile of the relevant t-distribution, then it is reasonable to say that the ith observation is an outlier.
Cook’s Distance, . Cook’s distance reflects the change in regression coefficients when an observation is deleted from a regression model.
| (7.42) |
Observations for which Cook’s distance exceeds 1 should be scrutinized.
Leverages. The leverage, , is a measure of extremeness of an observation and is derived from the diagonal elements of the Hat matrix, given as
| (7.43) |
Another interpretation is that leverage is equal to the distance between the ith observation and the mean of all observations. The average leverage value is
| (7.44) |
Hoaglin and Welsch [3] recommended scrutinizing observations for which the leverage exceeds 2p/n. Some useful properties of the leverage are that each leverage lies between 0 and 1 and that the leverages sum to the number of fitted parameters.
DFFITS. DFFITS is a scaled measure of the change in the predicted value for the ith observation, and is calculated by deleting the ith observation. DFFITS are in the form
| (7.45) |
Large values of DFFITS indicate influential observations. A general cutoff to consider is 2; a size-adjusted cutoff recommended by Belsley, Kuh, and Welsch [4] is 2√(p/n).
DFBETAS. DFBETAS are a measure of the change in the jth regression coefficient as a result of removing the ith observation from a regression run.
| (7.46) |
where is an element of the matrix
| (7.47) |
Large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch [4] recommend 2 as a general cutoff value to indicate influential observations and 2/√n as a size-adjusted cutoff.
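The residual diagnostics defined above (leverage, standardized, studentized, jackknife, Cook’s distance, DFFITS, DFBETAS) can be sketched with numpy using their standard textbook formulas; the design matrix below is hypothetical rather than the worked-example data, and p counts the constant term:

import numpy as np

# Hypothetical design matrix Xd (constant column included) and response y
rng = np.random.default_rng(1)
n, p = 20, 4
Xd = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = Xd @ np.array([1.4, 0.26, 0.26, 0.22]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(Xd.T @ Xd)
H = Xd @ XtX_inv @ Xd.T                          # Hat matrix
h = np.diag(H)                                   # leverages
b = XtX_inv @ Xd.T @ y                           # fitted coefficients
e = y - Xd @ b                                   # raw residuals
s2 = e @ e / (n - p)                             # residual mean square
std_resid = e / np.sqrt(s2)                      # standardized residuals
stud_resid = e / np.sqrt(s2 * (1 - h))           # studentized residuals
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)   # deletion variance
jack = e / np.sqrt(s2_del * (1 - h))             # jackknife (deletion) residuals
cooks = (stud_resid**2 / p) * h / (1 - h)        # Cook's distance
dffits = jack * np.sqrt(h / (1 - h))             # DFFITS
c = XtX_inv @ Xd.T                               # p x n matrix used for DFBETAS
dfbetas = (c * e / (1 - h)).T / np.sqrt(s2_del[:, None] * np.diag(XtX_inv))
print(np.column_stack([h, stud_resid, jack, cooks, dffits]))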
Multicollinearity is a problem caused by correlation between predictor features. Multicollinearity is not really associated with the dependency between the dependent feature and the independent features, but rather with the correlation among the independent features. Multicollinearity mostly biases the coefficients and their standard errors, causing the regression model to become too sensitive to the data. Using our input data matrix X with n rows and p columns, i.e., p features, we first determine the correlation coefficient for all possible pairs of predictor features. We then arrange these correlation coefficients in a correlation matrix denoted by R. We also know that R is a square symmetric matrix, which is given in the form
| (7.48) |
where the diagonal consists of ones, and the off-diagonals are comprised of the between-feature correlation coefficients. In the next section, we introduce variance inflation factors (VIFs), which are used for detecting multicollinearity. Rather than exploiting the correlation matrix R, the VIFs are based on regressing each predictor feature on the remaining predictor features.
Variance Inflation Factors (VIFs). Collinearity involves association or correlation between predictor features. Variance inflation factors [5] represent the degree of collinearity between an independent feature of interest and the other independent features.
| (7.49) |
where R² is the coefficient of determination obtained when the feature is regressed on the remaining independent features. Independent features whose VIF is greater than 10 represent a severe collinearity problem.
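A minimal sketch of the VIF calculation, regressing each predictor on the remaining predictors exactly as described above; the vifs helper and the example data are hypothetical, not an NXG Explorer function:

import numpy as np

def vifs(X):
    """Variance inflation factors: regress each column on the remaining columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        b = np.linalg.lstsq(Z, y, rcond=None)[0]
        resid = y - Z @ b
        r2 = 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Example: three moderately correlated hypothetical predictors
rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + rng.normal(scale=2, size=50), rng.normal(size=50)])
print(vifs(X))   # values above 10 would flag severe collinearity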
Residual Criteria Based on n and p.
The table below lists the residual screening criteria based on the values of n and p for this worked example with n=20 objects (records) and p=3 predictor features:
Residual | Outlier criteria | Calculated results |
StdResid | 2 | 2 |
StudResid | 1.753 | |
DelResid | 1.761 | |
CooksD | 1 | 1 |
Leverage | 0.400 | |
DFFITS | 0.894 | |
DFBETAS | 0.447 | |
VIFs | 10 | 10 |
Residual Analysis Results.
The table below lists the calculated residuals, with outliers highlighted in red, for the example 3-predictor model:
Object | Resid | StdResid | StudResid | DelResid | CooksD | Leverage | DFFITS |
1 | 0.06714 | 0.13884 | 0.15009 | 0.14543 | 0.00095 | 0.14431 | 0.05972 |
2 | 0.35000 | 0.72372 | 0.79182 | 0.78215 | 0.03089 | 0.16461 | 0.34719 |
3 | 0.01340 | 0.02772 | 0.03244 | 0.03141 | 0.00010 | 0.26989 | 0.01910 |
4 | 0.70871 | 1.46544 | 1.60404 | 1.69539 | 0.12742 | 0.16534 | 0.75459 |
5 | -0.17228 | -0.35624 | -0.42753 | -0.41634 | 0.02012 | 0.30569 | -0.27626 |
6 | 0.15012 | 0.31042 | 0.34535 | 0.33564 | 0.00709 | 0.19204 | 0.16363 |
7 | -0.47882 | -0.99009 | -1.41193 | -1.46115 | 0.51516 | 0.50827 | -1.48553 |
8 | -0.65117 | -1.34646 | -1.52667 | -1.59925 | 0.16640 | 0.22214 | -0.85464 |
9 | 0.61214 | 1.26576 | 1.38208 | 1.42602 | 0.09180 | 0.16124 | 0.62525 |
10 | 0.40670 | 0.84095 | 1.02515 | 1.02689 | 0.12770 | 0.32707 | 0.71591 |
11 | 0.25318 | 0.52351 | 0.64305 | 0.63084 | 0.05260 | 0.33723 | 0.44999 |
12 | -0.44711 | -0.92453 | -0.95618 | -0.95346 | 0.01592 | 0.06511 | -0.25161 |
13 | -0.03844 | -0.07949 | -0.08852 | -0.08573 | 0.00047 | 0.19366 | -0.04201 |
14 | 0.19589 | 0.40506 | 0.41841 | 0.40735 | 0.00293 | 0.06275 | 0.10541 |
15 | -0.53488 | -1.10602 | -1.14977 | -1.16231 | 0.02667 | 0.07467 | -0.33017 |
16 | -0.17943 | -0.37103 | -0.45101 | -0.43949 | 0.02429 | 0.32324 | -0.30373 |
17 | 0.77230 | 1.59694 | 1.71366 | 1.83629 | 0.11123 | 0.13158 | 0.71477 |
18 | -0.55056 | -1.13844 | -1.21376 | -1.23337 | 0.05035 | 0.12026 | -0.45601 |
19 | 0.10185 | 0.21059 | 0.21832 | 0.21170 | 0.00089 | 0.06953 | 0.05787 |
20 | -0.57873 | -1.19668 | -1.30675 | -1.33871 | 0.08215 | 0.16138 | -0.58726 |
DFBETAS Results.
The table below lists the resulting DFBETAS for the example 3-predictor model:
Obs | X1 | X2 | X3 |
1 | -0.03176 | -0.02308 | 0.02208 |
2 | -0.04873 | 0.24620 | 0.11188 |
3 | 0.01269 | 0.00493 | 0.01434 |
4 | -0.62583 | -0.13159 | -0.18269 |
5 | -0.20924 | -0.13755 | 0.04991 |
6 | -0.02122 | 0.11440 | 0.07088 |
7 | 0.79397 | 1.31937 | 0.11380 |
8 | 0.11546 | -0.23569 | 0.63815 |
9 | 0.25813 | -0.29100 | 0.21920 |
10 | -0.38005 | 0.18566 | -0.45673 |
11 | -0.09139 | -0.27079 | -0.36516 |
12 | -0.11312 | -0.01085 | 0.01138 |
13 | 0.00446 | -0.00729 | -0.03367 |
14 | -0.01739 | 0.01565 | -0.03729 |
15 | 0.13180 | 0.14106 | -0.03280 |
16 | -0.20027 | 0.06009 | 0.11454 |
17 | 0.46860 | -0.14184 | 0.15433 |
18 | -0.06106 | 0.00665 | -0.33761 |
19 | 0.02980 | 0.00643 | 0.00023 |
20 | 0.17956 | -0.37680 | -0.06967 |
Below is the table of VIFs:
Feature | VIF |
X1 | 1.138 |
X2 | 1.133 |
X3 | 1.093 |
Because none of the VIFs exceeded a value of 10, there does not appear to be a multicollinearity problem.
This section introduces computational methods used for determining partial F-tests for the predictor features X1, X2, and X3.
Extra Sum-of-Squares for a Variable. When regressing Y separately on the predictor sets (X1), (X1, X2), and (X1, X2, X3), we can determine the extra sum-of-squares, defined as follows:
The results of regressing Y separately on the predictor sets (X1), (X1, X2), and (X1, X2, X3) are listed below:
Regression of Y=X1
Source of variation | SS | df | MS | F |
Regression | 0.31 | 1 | 0.31 | 0.96 |
Error | 5.78 | 18 | 0.32 | |
Total | 6.09 | 19 | ||
Variable | Coef. | s.e. | t | Prob |
Const | 1.35401 | 0.12936 | 10.4669 | 0.0000 |
X1 | 0.13323 | 0.13607 | 0.9791 | 0.3405 |
Regression of Y=X1,X2
Source of variation | SS | df | MS | F |
Regression | 1.57 | 2 | 0.79 | 2.96 |
Error | 4.52 | 17 | 0.27 | |
Total | 6.09 | 19 | ||
Variable | Coef. | s.e. | t | Prob |
Const | 1.38915 | 0.11875 | 11.6979 | 0.0000 |
X1 | 0.20609 | 0.12818 | 1.6078 | 0.1263 |
X2 | 0.22356 | 0.10247 | 2.1818 | 0.0435 |
Regression of Y=X1,X2,X3
Source of variation | SS | df | MS | F |
Regression | 2.35 | 3 | 0.78 | 3.35 |
Error | 3.74 | 16 | 0.23 | |
Total | 6.09 | 19 | ||
Variable | Coef. | s.e. | t | Prob |
Const | 1.39434 | 0.11144 | 12.5119 | 0.0000 |
X1 | 0.26015 | 0.12386 | 2.1004 | 0.0519 |
X2 | 0.26492 | 0.09878 | 2.6821 | 0.0164 |
X3 | 0.22297 | 0.12244 | 1.8210 | 0.0874 |
The extra sum of squares for the first regression run is simply the SSR for that model, i.e., SS(X1) = SSR(X1). However, for the second and third regression runs, we need to use the following formulation to determine the extra sum of squares:
| (7.50) |
where the extra sum of squares for a given feature is the difference between the SSR from the full model containing that feature and the SSR from the reduced model without it. Using the SSR from the reduced model for the regression of Y on X1, and the SSR from the full model for the regression of Y on X1 and X2, we calculate the extra sum-of-squares for X2 using the following relationship:
| (7.51) |
Analogously, the extra sum of squares for X3 is determined as follows:
| (7.52) |
Partial F-test for a Variable. Once the extra sum of squares is determined for all the predictor x-features, the partial F-test for each feature is determined using the relationship
| (7.53) |
As an example, the partial F-test for feature X3 is determined as
| (7.54) |
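The extra sums of squares and the partial F-test for X3 can be checked directly from the SSR and MSE values reported in the tables above; a small Python sketch (small differences from the tabled results are due to rounding of the printed SSR values):

from scipy import stats

# SSR values from the three nested regression runs above
ssr_x1 = 0.31          # regression of Y on X1
ssr_x1x2 = 1.57        # regression of Y on X1, X2
ssr_x1x2x3 = 2.35      # regression of Y on X1, X2, X3

ss_x2_given_x1 = ssr_x1x2 - ssr_x1            # extra SS for X2 given X1
ss_x3_given_x1x2 = ssr_x1x2x3 - ssr_x1x2      # extra SS for X3 given X1, X2

# Partial F for X3: extra SS (1 d.f.) divided by the full-model MSE
mse_full, df_err = 0.23, 16
F_x3 = ss_x3_given_x1x2 / mse_full
print(F_x3, stats.f.sf(F_x3, 1, df_err))      # ~3.4 with p ~ 0.08 (tabled: 3.32, 0.0874)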
Partial F-test Results.
The table below lists the partial F-test results:
Regression | Extra SS | d.f. | MSE | Partial F | F(0.95) | P-value |
Y=X1 | 0.3080 | 18 | 0.32 | 0.96 | 4.41 | 0.3405 | |
Y=X1,X2 | 1.2650 | 17 | 0.27 | 4.76 | 4.45 | 0.0435 | |
Y=X1,X2,X3 | 0.7756 | 16 | 0.23 | 3.32 | 4.49 | 0.0874 | |
Example 1 - Multiple linear regression of Retinol data. This example will use NXG Explorer to perform multiple linear regression of the serum plasma retinol concentration (retplasma) on several predictor features to identify significant predictors. Retinol is one of the A vitamins, and is important for night vision, tooth and bone formation, and skin health. Therefore, it would be desirable clinically to know which features may be predictive of serum retinol in order to make public health recommendations.
Before we begin this example run, we make sure the appropriate criteria for model building strategies are set in Preferences → General default settings,
which specifies that only predictors whose univariate regression p-value is less than 0.05 are carried into the full multiple linear regression model. The second entry in the default settings specifies that we want all of the coefficients shown in the multivariable model (which implies all features are retained).
To begin, import the Retinol dataset.
Next, select the command Analysis → Regression → Multiple linear from the Analysis pull-down menu, shown as:
Then specify the dependent feature, categorical features, and continuously-scaled features, as follows:
Then accept the default values for the options popup window:
The univariate regression coefficient results, based on each predictor regressed separately, are listed first:
Next, the ANOVA table and multiple regression coefficients based on a full model with all predictors regressed simultaneously are presented:
Coefficients of partial determination are listed next:
For regression diagnostics, the residual criteria are presented first, followed by the residuals, with outliers (overly influential records) identified by red highlighting in the output:
DFBETAS are listed next:
Lastly, VIFs are listed:
We now introduce a linear multivariate regression model for which there are multiple dependent features and multiple independent features [6]. Let be a vector of values for features for the th object and be the vector of dependent features for the same object, where is the total number of -dependent features. The regression equation for each object and each dependent feature is
| (7.55) |
where is the intercept, or constant term, is the regression coefficient for feature , and is the random error term or residual. Regression coefficients are used for predicting the -values for each object, using the relationship
| (7.56) |
Coefficients are determined directly from the matrix operation
| (7.57) |
The matrix form of the dependent features is given as
| (7.58) |
The typical matrix of -values is
| (7.59) |
Regression coefficients are defined using the following matrix notation
| (7.60) |
and the matrix of random error terms is in the form
| (7.61) |
Rearranging the matrices, we can solve for the regression coefficients using the one-step operation
| (7.62) |
The inverse of the dispersion matrix can be obtained using the Jacobi method, singular value decomposition, or some other matrix inversion technique. Once the coefficient matrix is obtained, we obtain the predicted -values by applying to the data matrix using the matrix operation .
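A minimal numpy sketch of this one-step solution, using hypothetical X and Y matrices; the point is only to show that the same (X'X)^-1 X'Y operation handles a whole matrix of dependent features at once:

import numpy as np

# Hypothetical data: n objects, p predictors, m dependent features
rng = np.random.default_rng(3)
n, p, m = 30, 4, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # constant + predictors
B_true = rng.normal(size=(p + 1, m))
Y = X @ B_true + rng.normal(scale=0.3, size=(n, m))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # one-step solution (X'X)^-1 X'Y
Y_hat = X @ B_hat                              # predicted Y-values
E = Y - Y_hat                                  # matrix of residuals
print(B_hat.shape)                             # (p+1) x m coefficient matrix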
The variance-covariance matrix for the coefficients is equal to the direct matrix product of the residual sums of square and the inverse of the variance-covariance matrix of the coefficients, given as
| (7.63) |
The null hypothesis which applies to all of the regression coefficients is determined by comparing statistics derived from the eigenvalues of the non-symmetric matrix , where the hypothesis matrix is
| (7.64) |
and the sum-of-squares matrix is
| (7.65) |
The first test is based on Wilks’ using the eigenvalues of , given in the form
| (7.66) |
Let
| (7.67) |
| (7.68) |
| (7.69) |
| (7.70) |
and
| (7.71) |
Then
| (7.72) |
The Pillai trace statistic is also computed as
| (7.73) |
Let
| (7.74) |
| (7.75) |
and
| (7.76) |
Here, under
| (7.77) |
The Lawley-Hotelling statistic is
| (7.78) |
Then
| (7.79) |
Lastly, Roy’s test is based on the greatest root , and the test statistic is
| (7.80) |
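The four statistics above can be sketched from the eigenvalues of the matrix product described earlier; the helper below uses the standard textbook definitions of Wilks’ lambda, the Pillai trace, the Lawley-Hotelling trace, and Roy’s greatest root, and the H and E matrices are small hypothetical examples rather than NXG Explorer output:

import numpy as np

def manova_stats(H, E):
    """Standard multivariate test statistics from the eigenvalues of inv(E) @ H."""
    eig = np.linalg.eigvals(np.linalg.solve(E, H)).real
    eig = np.clip(eig, 0, None)                      # guard against small negative roundoff
    wilks = np.prod(1.0 / (1.0 + eig))               # Wilks' lambda
    pillai = np.sum(eig / (1.0 + eig))               # Pillai trace
    lawley = np.sum(eig)                             # Lawley-Hotelling trace
    roy = eig.max()                                  # Roy's greatest root
    return wilks, pillai, lawley, roy

# Tiny hypothetical hypothesis (H) and error (E) SSCP matrices
H = np.array([[4.0, 1.0], [1.0, 3.0]])
E = np.array([[10.0, 2.0], [2.0, 8.0]])
print(manova_stats(H, E))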
Example 2 - Multivariate regression of Retinol data. For this example, we will regress the four outcome measurements reflecting subject-specific values of beta-carotene and retinol from dietary assessment and serum plasma measurements on continuously-scaled features to identify significant predictors. First, open the Retinol dataset, and select Analysis → Regression → Multivariate from the Analysis pull-down menu, shown as
Next, specify the four continuous dependent (“target”) features betadiet, retdiet, betaplasma, and retplasma, and then select the continuous independent features age, quetelet, calories, fat, fiber, alcohol, and cholesterol, as shown below:
The results of the regression run will be listed in the treenodes on the left side of the output window:
Click on the Full model Tests icon, and the following table will appear:
Results of the full model tests indicate that all of the tests suggest that one or more of the independent features is a significant predictor of one or more of the dependent features. To find out which independent features are significant predictors of one or more of the dependent features, we need to look at the feature-specific test results, which are shown below. Click on the Feature Tests icon, and the following table will appear:
In the table above for feature-specific tests, we notice that age, quetelet, fiber, and cholesterol were significant. To see the specific contribution of each independent feature to each dependent feature, we look at the tables of regression coefficients by clicking on the Coefficients icon, which will display the following output:
In the table above, for which betadiet is the dependent feature, we notice that fiber was the only significant predictor. In addition, betadiet increased 138.3 units for each one-unit increase in fiber. For the dependent feature retdiet, we notice that cholesterol was the only significant predictor, with a slight increase in retdiet per unit increase in cholesterol. Coefficient tables for the dependent features betaplasma and retplasma are shown below. While both quetelet and fiber were significant predictors of betaplasma, only age was a significant predictor of retplasma.
Logistic regression is a type of categorical regression for which the outcome features (dependent features) are discrete, and input features (independent features) are discrete or continuous. There are several types of logistic regression, and these are listed in the table below:
Type of logistic regression | Dependent feature | Scale | Example |
Unconditional | Binary | Case: , Control: | |
Ordinal | Ordinal | : subject in 2nd group | |
Conditional | Matched-pairs | : record with 1 case and 3 controls | |
Polytomous | Class membership | Subject in class 4: | |
When there is a binary (yes/no) outcome for each object, the dependent (outcome) feature is coded as a binary feature, consisting of zeroes and ones, i.e., (0,1), to represent failure or success. In failure analysis, objects that failed are commonly assigned a -value of 1, while items that didn’t fail are assigned a -value of 0. In clinical medicine and epidemiology, this binary coding scheme is amenable for analysis of case-control studies, where cases (diseased) are assigned while controls (non-diseased) are assigned . Among the various types of logistic regression, NXG Explorer implements two variants: unconditional and polytomous. The theoretical developments for each of these types of logistic regression models are described in the following sections.
Let represent disease and non-disease for subject having covariate vector . The probability of disease for subject is
| (7.81) |
and the probability of not having disease is
| (7.82) |
where is the logit for response category . The general definition for the logit is given as
| (7.83) |
Using the general definition, the logit for non-disease is
| (7.84) |
and thus, . The logit for disease is
| (7.85) |
Through simple substitution of the logit, we get the conditional probabilities
| (7.86) |
and
| (7.87) |
Let x=1 represent exposure and x=0 represent non-exposure to a risk factor, and let y=1 and y=0 represent disease and no disease. The logit is then β0 + β1x. Hence, our model has one x-feature and two regression coefficients: β0 and β1.
Cross-tabulation of disease and exposure status, exposed (x=1) vs. non-exposed (x=0):
Disease () | No disease() | Total | |
Exposed () | 1.0 | ||
Non-exposed () | 1.0 | ||
The odds ratio can now be determined as
| (7.88) |
For the OR, the null and alternative hypotheses are:
Since the OR is equal to the exponentiated regression coefficient for exposure, we know that when the coefficient is zero the OR is 1. Therefore, in logistic regression we have
The significance of each regression coefficient is based on a standard normal variate equal to
| (7.89) |
and if the calculated test statistic is sufficiently large we reject the null hypothesis. The p-value is determined using the standard normal distribution. The CI for the OR is
| (7.90) |
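Given a fitted coefficient and its standard error, the Wald statistic, OR, and 95% CI follow directly from the relationships above; a minimal sketch with hypothetical beta and se values:

import numpy as np

# Hypothetical logistic coefficient and standard error for an exposure feature
beta, se = 1.0253, 0.3945

z = beta / se                                            # Wald statistic
odds_ratio = np.exp(beta)                                # OR = exp(beta)
ci_low, ci_high = np.exp(beta - 1.96 * se), np.exp(beta + 1.96 * se)
print(z, odds_ratio, (ci_low, ci_high))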
We now discuss how the coefficients are determined via the iterative maximum likelihood method. The logistic likelihood function is defined as
| (7.91) |
Taking the natural logarithm gives the log-likelihood in the form
| (7.92) |
Once the log-likelihood function is formed, we can use the Newton-Raphson method to perform a gradient ascent approach to find maximum likelihood values of the solution parameters . This entails calculation of the scores or first derivatives of w.r.t parameters and the information (Hessian) matrix of second partial derivatives. The score is
| (7.93) |
and element of the information matrix
| (7.94) |
Similarly, maximum likelihood estimates of each can be found by determining the vector iteratively with the matrix manipulation
| (7.95) |
until convergence is reached when . Values of are typically in the range .
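A minimal Python sketch of this Newton-Raphson iteration for binary logistic regression, using the score and information matrix defined above; the data are simulated and the convergence tolerance is an illustrative choice, not the NXG Explorer default:

import numpy as np

def logistic_newton_raphson(X, y, tol=1e-8, max_iter=50):
    """Fit binary logistic regression by Newton-Raphson (X includes a constant column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))         # fitted probabilities
        score = X.T @ (y - p)                          # first derivatives
        info = X.T @ (X * (p * (1 - p))[:, None])      # information (Hessian) matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:                 # convergence on the update size
            break
    return beta

# Hypothetical example with two predictors plus a constant
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (rng.random(100) < 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0, -0.8]))))).astype(float)
print(logistic_newton_raphson(X, y))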
Recall, the residual for a sample is essentially . The fitted regression value for the th subject and the th class is , is determined as
| (7.96) |
| (7.97) |
The Pearson goodness-of-fit (GOF) statistic is
| (7.98) |
The deviance residual is
| (7.99) |
The deviance goodness-of-fit (GOF) statistic is
| (7.100) |
The log-likelihood ratio statistic for a feature is minus twice the difference between the log-likelihood of a model without the feature and the log-likelihood of the model with the feature:
| (7.101) |
and is chi-square distributed with 1 d.f.; that is, it should exceed 3.84 as each feature is added to the model. Values exceeding 3.84 suggest that the additional feature in the full model is statistically significant in terms of its contribution to the likelihood. Assuming the model fits, the Pearson and deviance GOF statistics are chi-square distributed with the appropriate degrees of freedom. For deviance, the GOF statistic is based on the likelihood ratio of the fitted model and a saturated model. Pearson or deviance GOF values that are less than the degrees of freedom suggest an adequate fit, that is, the model used and data appropriately describe sample membership in the outcome classes.
Example 3 - Unconditional logistic regression on Low Birth Weight. This example will employ unconditional logistic regression using the low birth weight data. The feature named low is a binary feature (1-Yes, 0-No) representing the birth of an infant with low birth weight, and will be used as the dependent feature. The remaining features age (age of the mother in years), lwt (weight in pounds at the last menstrual period), race (1=white, 2=black, 3=other), smoke (smoking status during pregnancy: 1=Yes, 0=No), ht (history of hypertension: 1=Yes, 0=No), and ui (presence of uterine irritability: 1=Yes, 0=No) will serve as the independent predictors.
First, open the Low Birth Weight dataset. Select Analysis → Regression → Binary Logistic from the Analysis pull-down menu:
You will then see the following popup window:
Select low as the discrete output feature, race as a discrete input feature, and age, lwt, smoke, ht, and ui as continuous input features. A popup window with options will appear, so accept the default values and click Apply. The logistic regression analysis will then run. The treenodes after the run will appear as follows:
Select the Coefficients icon to display the results. At the top of the output you will notice results for all possible univariate logistic regression models, including univariate odds ratios (OR) and their 95% confidence intervals (CI). This will appear as follows:
During feature input, NXG Explorer automatically recoded the 3-level race feature into two binary features representing the two race groups (2-Black and 3-Other), respectively. This type of recoding is called corner-point constraints recoding of categorical features, for which the effect of the lowest category appears in the Constant term (or y-intercept). An alternative approach would involve use of sum-to-zero constraints recoding, which would force all three binary features into the model. For the univariate models, the features lwt, smoke, ht, and ui were significant predictors of low birth weight. The univariate ORs and 95% CIs are also listed in the coefficient table. Toward the bottom of the Coefficient output, we can see the multiple logistic regression runs, for which all the specified independent features are used in a single model. The multiple logistic regression coefficients are followed by a tabular listing of the ORs and 95% CIs.
Notice that in the multiple logistic regression model, in addition to lwt, smoke, ht, and ui being significant, the two binary features (0,1) for the black and other race categories were significant (these categories were not significant in the univariate models). Looking at the ORs, we notice that the OR for smoking (smoke) was 2.79 (95% CI, 1.29-6.05), the OR for hypertension (ht) was 6.41 (95% CI, 1.66-24.72), the OR for uterine irritability (ui) was 2.45 (95% CI, 1.02-5.90), the OR for black race (b_race_2) was 3.60 (95% CI, 1.28-10.10), and the OR for other race (b_race_3) was 2.46 (95% CI, 1.05-5.77), which are all quite elevated. Based on these results, a public health recommendation could be reported as follows: “given the observed results from empirical data, these findings indicate that smoking and hypertension during pregnancy as well as race category were significant predictors of having a low birth weight infant.” Lastly, the class membership predictions are listed in tabular notation:
Next, click on the Residuals icon, and you will see a listing of records (objects) that are considered to be outliers due to their higher-than-expected residual values. For each outlier observation, the criterion which didn’t pass is highlighted in red, as shown below:
Polytomous logistic regression was fundamentally developed for multiple-class problems, unlike the dichotomous 2-class nature of unconditional logistic regression described in the previous section. To begin, assume a multiclass problem in which the dependent feature for each sample is coded 0 or 1 to represent membership in each class, and the number of logits is equal to the number of classes minus one. Let’s assume that the reference class (corner point) is the first class. The general definition of the logit for polytomous logistic regression remains the same, as in
| (7.102) |
However, the conditional probability adds together the exponentiated logits in the denominator, shown in functional form as
| (7.103) |
For a 3-class example the probability that a sample has membership in each of the 3 classes is
| (7.104) |
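These class membership probabilities can be sketched directly from the logits; the helper below assumes the referent class has a logit of zero, as described above, and the two logit values are hypothetical:

import numpy as np

def class_probabilities(logits):
    """Membership probabilities for a K-class polytomous model.
    logits: K-1 logit values for one sample; the referent class has logit 0."""
    expd = np.exp(np.concatenate([[0.0], np.asarray(logits, dtype=float)]))
    return expd / expd.sum()

# Hypothetical 3-class example: two logits relative to the referent class
print(class_probabilities([0.7, -0.4]))   # the three probabilities sum to 1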
The likelihood is
| (7.105) |
Taking the natural logarithm gives the log-likelihood in the form
| (7.106) |
Notice that in unconditional (binary) logistic regression the coefficients were obtained in the form of a single column vector, whereas in polytomous logistic regression the total number of coefficients is equal to the total number of features times the number of classes minus one, resulting in a longer column vector. The score value for a given logit and a given feature occupies the corresponding element in the score vector, given as
| (7.107) |
The information matrix, or Hessian matrix, is a partitioned matrix. Thus, if there are 3 classes, there will be 2 logit functions, each of which is represented with a matrix
| (7.108) |
By definition, the second partial derivative for different features within the same logit , that is, same matrix is
| (7.109) |
whereas the second partial derivative for different features in different logits , that is, different matrices is
| (7.110) |
Similarly, using the Newton-Raphson method, maximum likelihood estimates of each can be found by determining the vector iteratively with the matrix manipulation
| (7.111) |
until convergence is reached when . Values of are typically in the range .
Example 4 - Tumor class prediction with SRBCT gene expression dataset. This example uses the class-specific dataset for SRBCT data. First, import the Excel dataset using class-specific format as shown below:
Next, select Analysis → Regression → Polytomous logistic, as shown below:
Next, specify the Class feature as the class feature, and all continuous features as the predictors, as shown below:
Regarding the output, the treenode will only contain a Coefficients icon (see below):
Once you click on the Coefficients icon, you will first see the following tabular notation with class-specific univariate model results:
The first thing you will notice for a polytomous logistic analysis is that results will be presented for #classes-1 since the first class is treated as a referent group. In the output listed above, notice that each predictor is significant for predicting at least one of the diagnostic classes. This is expected, since the set of 23 gene expression features in this dataset were identified as the best predictors of the four diagnostic classes out of 2,308 features. Following the class-specific univariate results, the log-likelihood values are listed as a function of iteration:
This is followed by the class-specific multivariate logistic regression coefficients:
Lastly, the predicted class membership probabilities are listed:
Poisson regression models are commonly used for modeling risk factors and other covariates when the rates of disease are small, e.g., cancer [7, 8, 9, 10, 11]. An area of research in which Poisson regression of cancer rates is commonly employed is the investigation of radiation-induced cancer among Hiroshima and Nagasaki atomic bomb survivors [12, 13, 14] and individuals occupationally [15], medically [16], and environmentally [17] exposed to ionizing radiation. Poisson regression has also been used to investigate the effects of information bias on lifetime risks [18], diagnostic misclassification [19, 20], and dose-response curves for radiation-induced chromosome aberrations [21]. This section documents the NXG Explorer Poisson regression approach for fitting additive, multiplicative, and power models, and regression diagnostics using leverages of the Hat matrix, deletion and standardized residuals. The critical data required to perform Poisson regression using NXG Explorer are:
The Poisson assumption states that, for stratified data, the number of deaths in each cell takes on the values 0, 1, 2, ... according to the formula
| (7.112) |
where is the rate parameter, which is equal to the number of deaths in each cell divided by the respective number of person-years () in that same cell. According to McCullagh and Nelder [22], a generalized linear model is composed of the following three parts:
| (7.113) |
| (7.114) |
where is the observed rate parameter and is the predicted rate parameter. It should be pointed out that the rate parameter, , serves as its own link function. Thus, the rate parameter provides a linearization of our distributional models so that we can use multiple linear regression to obtain estimators. In effect, the rate parameter allows us to reduce non-linear (non-Gaussian) models to linear models so that we can solve the system of linear equations. The rate parameter can take on various linear and non-linear functions. These are shown in Table 7.1. The power transformation was adapted from [23].
Parameters are solved through the use of iteratively reweighted least squares, where the parameter update vector at each iteration is
| (7.115) |
with data matrix elements
| (7.116) |
with weights
| (7.117) |
and dependent features
| (7.118) |
where the fitted rates are
| (7.119) |
Iterations start by setting the vector of regression coefficients equal to small near-zero starting values. Successive iterations involve adding the gradient-based update vector to the most recent vector of coefficients in the form
| (7.120) |
until convergence is reached when .
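As an illustration of the iteratively reweighted least squares idea for the multiplicative (log-link) case only, the sketch below fits counts with a log person-years offset; the grouped data are hypothetical, and the additive and power links used by NXG Explorer would require different working weights and responses:

import numpy as np

def poisson_irls(X, deaths, pyears, tol=1e-8, max_iter=50):
    """Multiplicative (log-link) Poisson regression of counts with a person-years offset,
    fitted by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    offset = np.log(pyears)
    for _ in range(max_iter):
        mu = np.exp(X @ beta + offset)               # fitted counts
        z = X @ beta + (deaths - mu) / mu            # working response (offset removed)
        W = mu                                       # working weights
        XtWX = X.T @ (X * W[:, None])
        new_beta = np.linalg.solve(XtWX, X.T @ (W * z))
        if np.max(np.abs(new_beta - beta)) < tol:
            beta = new_beta
            break
        beta = new_beta
    return beta

# Hypothetical grouped data: two age groups crossed with an exposure indicator
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # const, age2, exposed
deaths = np.array([2.0, 10.0, 15.0, 60.0])
pyears = np.array([1000.0, 900.0, 800.0, 700.0])
print(poisson_irls(X, deaths, pyears))    # exp(coef for 'exposed') estimates the RR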
The error residual has mean 0 and measures the lack of concordance of the observed deaths against the fitted value. The greater its value, the worse the fit of the model. We will now introduce a number of residuals which are helpful for determining whether certain observations are overly influential on the fit of a model. First, the Pearson residual for each observation is estimated as
| (7.121) |
and provides a measure of the residual variation. If we take into account the number of fitted parameters (coefficients) in a model, we can determine an adjusted residual as
| (7.122) |
where h is the leverage for the ith observation (described later). Adjusted residual values that exceed 1.96 tend to be outliers at the 0.05 level of significance. The deviance residuals
| (7.123) |
allow the investigator to determine the goodness of fit of the model, i.e., how well the observed data agree with the fitted values.
The leverage [22] for each observation provides information about the geometric distance between a given point and the centroid of all points in the predictor space. Individual leverages are obtained from the diagonal of the “Hat” matrix given by
| (7.124) |
Because leverages are bounded above by one, observations with high leverage can approach unity. A criterion introduced by Hoaglin and Welsch [24] for determining high leverage for an observation states that a point has high leverage if its leverage exceeds 2p/n. Points with high leverage are located at a substantial distance from the fitted regression line, and act to “pull” the fitted line toward their location. Increases in the goodness of fit of a model can be obtained by re-fitting the model after the observation(s) with high leverage have been removed.
Cook [25] introduced a residual which measures the change in the fitted model between when an observation is included in the model and when it is not. The more positive its value, the more influence an observation has when compared with other observations. Individual values are obtained by the relationship
| (7.125) |
The Pearson goodness of fit is
| (7.126) |
and the deviance goodness of fit is
| (7.127) |
which is also chi-square distributed. If a model fits, its Pearson and deviance GOF statistics will be lower than the degrees of freedom.
Example 5 - British Doctors data - Purely Additive Model. This example will use the British doctors data shown in the table below to determine the additive risk (see Table 7.1) of coronary-related deaths due to smoking; the table lists the British doctors data for coronary disease and smoking [26]. The rates for non-smokers are simply the ratio of the number of deaths to person-years within each age group for the non-smoking group, whereas the rates for smokers are the ratio of deaths to person-years within each age group for the smoking group.
Non-smokers | Smokers | |||||
Age group, | ||||||
35-44 | 2 | 18,790 | 0.1064 | 32 | 52,407 | 0.6106 |
45-54 | 12 | 10,673 | 1.1243 | 104 | 43,248 | 2.4047 |
55-64 | 28 | 5,710 | 4.9037 | 206 | 28,612 | 7.1998 |
65-74 | 28 | 2,585 | 10.8317 | 186 | 12,663 | 14.6885 |
75-84 | 31 | 1,462 | 21.2038 | 102 | 5,317 | 19.1838 |
First, import the British doctors coronary death dataset, and you will see the following arrangement in the datagrid:
Next, select Analysis → Regression → Poisson from the Analysis pull-down menu, and select the person-year feature, count feature, and the covariates for age and smoke:
Next, select Additive from the model choices, and click Apply (shown below):
After the run, the treenodes resulting from this run will appear as follows:
Click on the Coefficients icon, and the additive risks will be listed in the display as follows:
Firstly, notice the value of the deviance GOF of 7.43. Because this run used a purely additive model , the predicted baseline rate for each age group is equal to the value of the regression coefficients for the specific age group. The absolute risk (AR) is equal to the regression coefficient for smoke (), and is actually an excess rate that adds to the baseline rate. A key point to be made is that the AR is a rate that serves as a fixed constant which adds to the baseline rate of any group. The table below lists the calculated baseline rates (deaths per 1,000 person-years) for each age group based on the fitted coefficients, as well as the fitted rate among smokers for each age group.
Feature | Coef. | Baseline rate(non-smokers) | Rate among smokers |
age_35-44 | 0.0841 | 0.0841 | 0.0841+0.5907=0.6748 |
age_45-54 | 1.6407 | 1.6407 | 1.6407+0.5907=2.2314 |
age_55-64 | 6.3035 | 6.3035 | 6.3035+0.5907=6.8942 |
age_65-74 | 13.5241 | 13.5241 | 13.5241+0.5907=14.1148 |
age_75-84 | 19.1696 | 19.1696 | 19.1696+0.5907=19.7603 |
smoke | 0.5907 | AR=0.5907 | |
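The arithmetic in the table can be verified directly, since under the purely additive model the fitted rate among smokers is just the age-specific baseline rate plus the AR; a small sketch:

# Under the purely additive model, fitted rate among smokers = baseline rate + AR
baseline = {"35-44": 0.0841, "45-54": 1.6407, "55-64": 6.3035,
            "65-74": 13.5241, "75-84": 19.1696}   # fitted coefficients = baseline rates
ar_smoke = 0.5907                                  # absolute risk (AR) for smoking
for age, rate in baseline.items():
    print(age, round(rate + ar_smoke, 4))          # e.g. 35-44: 0.0841 + 0.5907 = 0.6748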
Next, click on the Residuals icon, and the following table will be shown:
We notice that two observations (#1 and 6) were overly influential on the model’s fit. Next, click on the DFBETAs icon, and the following table will be shown:
Recall that the DFBETAS reveal the influence of each record on each covariate if the record were removed and the model run again. In the DFBETAS output listed above, there appear to be more influential values for the older age groups.
Example 6 - British Doctors data - Purely Multiplicative Model. In this example, we will employ a purely multiplicative model to identify the relative risk (RR) of smoking. Import the British doctors data again, select Analysis → Regression → Poisson from the Analysis pull-down menu, select the features, and select Multiplicative from the model choices (shown below):
Click on the Coefficients icon and the following table will be displayed:
The value of the deviance GOF for this multiplicative model is 12.13, which implies a worse fit when compared with the purely additive model. Using the coefficient values, the fitted baseline rates (deaths per 1,000 person-years) among non-smokers and smokers are listed in the table below:
Feature | Coef. | Baseline rate(non-smokers) | Rate among smokers |
age_35-44 | -1.0116 | exp(-1.0116)=0.3636 | 0.3636 × 1.4255 = 0.5184
age_45-54 | 0.4724 | exp(0.4724)=1.6038 | 1.6038 × 1.4255 = 2.2863
age_55-64 | 1.6159 | exp(1.6159)=5.0324 | 5.0324 × 1.4255 = 7.1737
age_65-74 | 2.3389 | exp(2.3389)=10.3698 | 10.3698 × 1.4255 = 14.7822
age_75-84 | 2.6885 | exp(2.6885)=14.7096 | 14.7096 × 1.4255 = 20.9685
smoke | 0.3545 | RR=exp(0.3545)=1.4255 | |
The relative risk (RR) is determined by exponentiating the coefficient for exposure (i.e., smoke); the fitted rate among smokers is then obtained by multiplying each fitted baseline rate by the RR. Next, click on the Residuals icon, and the following table will be shown:
We notice that four observations were overly influential on the model’s fit, and since there is a greater number of outliers for the multiplicative model, this may imply that the multiplicative model fits worse. Next, click on the DFBETAs icon, and the following table will be shown:
The DFBETAS results indicate fewer issues with the multiplicative model.
Example 7 - British Doctors data - Power Model. NXG Explorer can be used to evaluate power models with varying values of rho. To begin, import the British doctors dataset, select Analysis → Regression → Poisson from the Analysis pull-down menu, select the features, and then specify “Evaluate rho at 0.1 intervals” in the options popup window (shown below):
After the run, you will see the following icons in the treenode:
Click on the Power evaluation icon, and the following output table will appear:
Note the value of rho at which the deviance GOF is lowest, which indicates that a power model with an intermediate value of rho fits better than a purely additive model or a purely multiplicative model. Below is a plot of the deviance values as a function of rho:
Next, let’s run a Poisson regression model using a fixed value of rho. Import the dataset, select the Poisson regression command and features, and now specify the chosen value of rho in the options popup window:
Click on the Coefficients icon, and the following table will appear:
Note the low deviance value of 2.14, which is similar to the interpolated value in the power table listed above and much lower than the deviance of 7.43 for the purely additive model and 12.13 for the purely multiplicative model. Again, using the coefficient values, the fitted baseline rates (deaths per 1,000 person-years) among non-smokers and smokers are listed in the table below:
Feature | Coef. | Baseline rate(non-smokers) | Rate among smokers |
age_35-44 | 0.276 | 0.0963+0.2767=0.373 | |
age_45-54 | 1.1145 | 1.2179+0.2767=1.4946 | |
age_55-64 | 2.4563 | 5.1239+0.2767=5.4006 | |
age_65-74 | 3.8593 | 11.6514+0.2767=11.9281 | |
age_75-84 | 4.7632 | 17.0822+0.2767=17.3589 | |
smoke | 0.4933 | ||
There were no outliers for this model:
Nor were there any overly influential DFBETAs for this power model:
Summary of fitted rates. To summarize the additive, multiplicative, and power models, Table 7.3 shows the basic interpretation of the coefficients and how the AR and RR are estimated and then applied to the baseline rate among non-smokers to obtain the rate among smokers.
[1] F. Galton. Regression towards mediocrity in hereditary stature. J. Anthropol. Inst., 15:246–263, 1886.
[2] M.S. Bartlett. Further aspects on the theory of multiple regression. Proc. Cambridge Phil. Soc, 34:33–40, 1938.
[3] D.C. Hoaglin, R.E. Welsch. The hat matrix in regression and ANOVA. Am. Stat. 32:17-22, 1978.
[4] D.A. Belsley, E. Kuh, R.E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York (NY), John Wiley and Sons, 1980.
[5] J. Neter, W. Wasserman, M.H. Kutner. Applied Linear Statistical Models. 2nd Ed. Homewood (IL), Richard D. Irwin, 1985.
[6] M.S. Bartlett. Multivariate analysis. J. Royal Stat. Soc., 9:176–197, 1947.
[7] E.L. Frome, M.H. Kutner, J.J. Beauchamp. Regression analysis of Poisson-distributed data. JASA. 68:935–940, 1973.
[8] E.L. Frome. The analysis of rates using Poisson regression methods. Biometrics. 39:665–674, 1983.
[9] E.L. Frome. PREG: A computer program for Poisson regression analysis. Oak Ridge Associated Universities Report ORAU-178 (1981). Oak Ridge National Laboratory. Oak Ridge (TN), ORAU.
[10] C.R. Muirhead, S.C. Darby. Modelling the relative and absolute risks of radiation-induced cancers. J. R. Stat. Soc. A. 150:83–118, 1987.
[11] N.E. Breslow, N.E. Day. Statistical Methods in Cancer Research. Volume II: The design and anaylsis of cohort studies. Chapter 4. IARC Scientific Publication No. 82. Lyon, IARC, 1987.
[12] D.A. Pierce, Y. Shimizu, D.L. Preston, M. Vaeth, K. Mabuchi. Studies of mortality of atomic bomb survivors. Report 12, Part I. Cancer: 1950-1990. Radiat. Res. 146:1–27, 1996.
[13] D.E. Thompson, K. Mabuchi, E. Ron, M. Soda, M. Tokunaga, S. Ochikubo, S. Sugimoto, S. Ikeda, M. Terasaki, S. Izumi. Cancer Incidence in Atomic Bomb Survivors. Part II: Solid Tumors, 1958-1987. Radiat. Res. 137:S17-S67, 1994.
[14] D.L. Preston, S. Kusimi, M. Tomonaga, S. Izumi, E. Ron, A. Kuramoto, N. Kamada, H. Dohy, T. Matsui, H. Nonaka, D.E. Thompson, M. Soda, K. Mabuchi. Cancer Incidence in Atomic Bomb Survivors. Part III: Leukemia, Lymphoma and Multiple Myeloma, 1950-87. Radiat. Res. 137, S68–S97, 1994.
[15] E. Cardis, E.S. Gilbert, L. Carpenter, G. Howe, I. Kato, B.K. Armstrong, V. Beral, G. Cowper, A. Douglas, J. Fix, S.A. Fry, J. Kaldor, C. Lavé, L. Salmon, P.G. Smith, G.L. Voelz, L.D. Wiggs. Effects of low doses and low dose rates of external ionizing radiation: Cancer mortality among nuclear industry workers in three countries. Radiat. Res. 142:117–132, 1995.
[16] E. Ron, J.H. Lubin, R.E. Shore, K. Mabuchi, B. Modan, L.M. Pottern, A.B. Schneider, M.A. Tucker, J.D. Boice, Jr. Thyroid cancer after exposure to external radiation: a pooled analysis of seven studies. Radiat. Res. 141:259-277, 1995
[17] R.A. Kerber, J.E. Till, S.E. Simon, J.L. Lyon, D.C. Thomas, S. Preston-Martin, M.L. Rallison, R.D. Lloyd, W.S. Stevens. A cohort study of thyroid disease in relation to fallout from nuclear weapons testing. J. Am. Med. Assoc. 270:2076–2082, 1993.
[18] L.E. Peterson, W.J. Schull, B.R. Davis, P.A. Buffler. Information Bias and Lifetime Mortality Risks of Radiation-Induced Cancer. U. S. NRC NUREG Report GR-0011. Nuclear Regulatory Commission, Washington, D.C., 1994.
[19] R. Sposto, D.L. Preston, Y. Shimizu, K. Mabuchi. The effect of diagnostic misclassification on non-cancer and cancer mortality dose-response in A-bomb survivors. Biometrics. 48:605-617, 1992.
[20] A.S. Whittemore, G. Gong. Poisson regression with misclassified counts: application to cervical cancer mortality rates. Appl. Stat. 40:81–93, 1991.
[21] E.L. Frome, R.J. DuFrain. Maximum likelihood estimation for cytogenetic dose-response curves. Biometrics. 42:73–84, 1986.
[22] P. McCullagh, J.A. Nelder. Generalized Linear Models. London, Chapman and Hall, 1983.
[23] F.J. Aranda-Ordaz. An extension of the proportional-hazards model for grouped data. Biometrics. 39:109–117, 1983.
[24] D.C. Hoaglin, R.E. Welsch. The Hat matrix in regression and ANOVA. Am. Statistician. 32:17–22, 1978.
[25] R.D. Cook. Detection of influential observations in linear regression. Technometrics. 19:15–18, 1977.
[26] R. Doll, A.B. Hill. Mortality of British doctors in relation to smoking: Observations on coronary thrombosis. Nat’l. Cancer Inst. Monogr. 19:205–268, 1966.
Survival data analysis involves use of time-to-event data for both failures and censored outcomes. In order to perform survival analysis, there are two outcome measures required from each object (animal, human subject, plant, etc.): (i) the time under observation from the beginning of the study until the object fails (“failure”) or the object withdraws or survives until study end without failure (“censored”), and (ii) whether or not the object failed (0-No, 1-Yes). NXG Explorer can handle two kinds of survival data analyses, which are Kaplan-Meier group-based analysis and Cox proportional hazards regression analysis.
The primary goal of Kaplan-Meier (KM) analysis is to determine the product limit estimate of the survivorship function, based on the survival times and censoring status of objects which have been observed for failure over a given follow-up period [1]. The survivorship function is defined as the probability of surviving (not failing) beyond time t, and is only determined at failure times, not at censoring times for objects which have dropped out of follow-up monitoring (i.e., lost to follow-up, missing). Objects that have been censored, that is, have not failed throughout the entire study or dropped out from follow-up, are assigned a censoring time equal to the time they left the study early (without failing) or the time until the end of the study. If a human research subject withdrew consent from a clinical trial before the end of the trial at 4.5 months after enrollment, then the subject’s survival (censoring) time is 4.5+ months, where “+” denotes censored (did not fail). Otherwise, if a subject failed at 13.7 months after treatment by having a primary outcome, for example, tumor progression, stroke, myocardial infarction, etc., then the survival (failure) time for this subject would be 13.7 months.
The table below lists subjects who were monitored during a 65-day study (6 who failed and 4 who were censored) along with their survival times (failure times or censoring times) and product limit estimates of the survivorship function values, calculated at the unique failure times.
Time, | Rank, | |||
5 | 1 | 0.9 | ||
13+ | 2 | - | - | |
21+ | 3 | - | - | |
28 | 4 | |||
31 | 5 | |||
33+ | 6 | - | - | |
45 | 7 | |||
53+ | 8 | - | - | |
58 | 9 | |||
63 | 10 | 0 | 0 | |
“+” denotes a censored object, i.e., did not fail.
To construct the table, we first sort the censoring and failure times of objects in ascending order, so that the shortest time is first and the longest time is last. Next, assign a rank to each observation using the values 1, 2, ..., n. At each unique failure time, calculate the hazard rate
| (8.1) |
where is the rank of the observation at failure time . The survivorship function value at each failure time is then calculated as the product of all hazard rates up to time using the relationship
| (8.2) |
The Nelson-Aalen baseline cumulative hazard function is then determined as
| (8.3) |
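A minimal sketch of the product-limit and Nelson-Aalen calculations, assuming at most one failure per failure time; the times and censoring indicators reproduce the 10-subject table above (the first failure at t=5 gives S=0.9 and the last failure gives S=0):

import numpy as np

def kaplan_meier(times, failed):
    """Product-limit estimate of S(t) and Nelson-Aalen cumulative hazard,
    evaluated at the unique failure times (single failure per time assumed)."""
    order = np.argsort(times)
    times, failed = np.asarray(times)[order], np.asarray(failed)[order]
    n = len(times)
    surv, na_hazard = 1.0, 0.0
    out = []
    for i, (t, d) in enumerate(zip(times, failed)):
        at_risk = n - i                        # subjects still at risk just before t
        if d:                                  # failure time (censored times contribute only to the risk set)
            hazard = 1.0 / at_risk
            surv *= (1.0 - hazard)
            na_hazard += hazard
            out.append((t, surv, na_hazard))
    return out

# The 10-subject worked example: '+' (censored) coded as failed=0
times  = [5, 13, 21, 28, 31, 33, 45, 53, 58, 63]
failed = [1,  0,  0,  1,  1,  0,  1,  0,  1,  1]
for row in kaplan_meier(times, failed):
    print(row)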
Let be the greatest observed study time among the subjects being studied. Let be the ordered unique failure (death) times for the pooled sample. At time we observe deaths in the th sample out of subjects at risk.
The hypotheses applied to test for inequality of hazard functions for each group is
In order to develop a hypothesis test, we first obtain differences between the observed and expected number of failures, using the form
| (8.4) |
Following Lawless [2], the variance-covariance matrix of has elements
| (8.5) |
where if . Note that , so the test statistic with d.f. is constructed by selecting any of the . The variance-covariance matrix of these statistics is given by the matrix , and again, only of the variances and covariances are needed. The test statistic is based on the quadratic form
| (8.6) |
An α-level test of the null hypothesis rejects when the test statistic is greater than the upper α percentage point of a chi-square random variable with the appropriate degrees of freedom.
Example 1 - Kaplan-Meier Group Test (Edema) of Primary Biliary Cirrhosis. In this example, we will run a KM analysis on the edema categorical feature in the Primary Biliary Cirrhosis (PBC) data. To begin, import the pbc.xlsx Excel file. Next, select Analysis → Survival analysis → Kaplan-Meier group test from the Analysis pull-down menu, shown as follows:
Select edema as the grouping feature, fudays as the survival time feature, and dead as the failure feature, shown as follows:
Following the run, the treenodes will appear as follows:
Click on the Kaplan-Meier Survival By Group icon, and the survival plot will appear as follows:
Click on the Logrank Test icon, and the results of the logrank test for equality of the hazard (survival) functions among the 3 edema groups are provided:
The p-value indicates the hazard functions for the 3 groups are significantly different.
Kaplan-Meier Group Test (Drug) of Primary Biliary Cirrhosis. In this example, we will perform KM analysis of the drug effect, with two options for recoding. Recall that the drug feature is coded as 1 for drug and 2 for placebo. We now show how to create a new feature called treated (0-placebo, 1-drug) based on recoding and labeling of the drug and placebo groups. To get started, select Preferences → Labels and transforms, and in the editor open the script file called pbclabel.nxgs. Select Run, so that the new treated feature is generated. Next, select Analysis → Survival analysis → Kaplan-Meier group test, and specify dead as the failure feature, fudays as the survival time, and treated as the group feature. In addition, repeat these steps while using drug (1-drug, 2-placebo). The results below illustrate how changing the coding resulted in a change in the plot and in the KM logrank observed and expected counts. Essentially, the groups are simply swapped, but the point of this example is to remind the user to pay close attention to coding schemes.
The logrank test results indicate that there is no significant difference in the hazard (survivorship) functions between the two groups. Furthermore, you can notice how the survival curves overlap one another, providing visual confirmation that there is no significant drug effect on survival.
Cox proportional hazards (PH) regression is a multiplicative maximum-likelihood technique to derive parameters for modeling “time-to-event” failures or survival times [3]. Each object used for fitting the model requires a failure (censoring) time and a binary outcome response indicating whether the object failed or was censored, that is, either left (migrated out of) the study or did not fail before the study ended. When an object fails, the time value is set to the duration of time under observation until the failure time (i.e., survival time), and its failure status is assigned a one (1) to indicate failure. However, an object that doesn’t fail under study will have a failure status of zero (0) and a survival time set equal to the time it was observed until the end of the study or the time it dropped out or migrated out of the study. Either way, every object under study receives two outcome codes: one for the survival time and one for the binary code indicating failure or no failure (censoring).
To begin, assume the ordered unique failure times . We recall that subjects in the risk set at time , i.e., , are those subjects with survival times that are at least . In other words, the group of subjects at risk at time are subjects which survived up to time and subjects who failed or were censored during or after (since they too were also at risk before failure time ). The probability contribution of each failure at time conditional on those subjects at risk, , is
| (8.7) |
For total failures, the conditional likelihood becomes
| (8.8) |
We first take the log of the likelihood above. Taking the log of the numerator gives us
| (8.9) |
The denominator term in the likelihood cannot be reduced by taking the log and remains as
| (8.10) |
Looking at the likelihood kernel , we also notice that it is a ratio, so we have to use the property for logarithms . Combining the logarithms of the numerator and denominator of the kernel yields the log likelihood function
| (8.11) |
Next, let’s only work with partial derivatives for each th risk set, so for now we can drop the summation symbol . Therefore, we are dealing with the log likelihood for each risk set in the form
| (8.12) |
Now we take the first partial derivative of the left term in the log likelihood function w.r.t. each term to get the score vector. We already know that for the function
| (8.13) |
the partial derivative of the function with respect to is
| (8.14) |
Therefore, the first partial derivative for each in the left term of the log likelihood is
| (8.15) |
At this point, we know that the first part of the score function is
| (8.16) |
We are left with the right term of the log likelihood in the form
| (8.17) |
Notice that this is a log function. We know that the derivative of the logarithm of some function , is the inverse of the function, . This gives us a hint that part of the right term of the score contains the inverse
| (8.18) |
Since we have already handled the logarithm, we can now address the function itself. Our target is the sum of exponentiated linear predictors over the risk set, and we note that the coefficients are not a function of each subject in the risk set considered. Thus, their values don’t change across risk sets, so their contribution across risk sets (and anywhere they appear in the summation) has to be recognized and left intact. Given this, we can now work with the terms for a single individual in the risk set, who contributes the factor
| (8.19) |
Taking the first partial derivative of this function w.r.t. gives
| (8.20) |
For the entire group of subjects in the risk set, we can now include the first partial derivative into the summation and rewrite this as
| (8.21) |
Going back to our original log function on the right side of the log likelihood, we now have
| (8.22) |
For the risk set considered, we now have the score
| (8.23) |
Now that the first partial derivative (score) of the log-likelihood function w.r.t. is solved, we can calculate the second partial derivative of the log-likelihood function w.r.t. . First, we drop the term on the left of the first partial derivative since it is independent of . Looking at the remaining terms, we notice that the first partial derivative is a quotient in the form
| (8.24) |
so we use the quotient rule for differentiation. The quotient rule is
| (8.25) |
so we have
| (8.26) |
and after rearranging gives
| (8.27) |
and finally
| (8.28) |
Cancelling out terms, we get
| (8.29) |
which forms element in the information matrix (second partial derivative “Hessian” matrix). Each element in the information matrix is calculated using the equation above and by substituting in appropriate values for subjects in all risk sets. A running total of elements in the score vector and information matrix is kept as the covariates (e.g., ) for each subject are looped through within each risk set and over each term.
A gradient ascent approach is employed using the Newton-Raphson method to find maximum likelihood estimates of the solution parameters in the form

$$\boldsymbol{\beta}^{new} = \boldsymbol{\beta}^{old} + \mathbf{I}^{-1}(\boldsymbol{\beta}^{old})\,\mathbf{U}(\boldsymbol{\beta}^{old}) \tag{8.30}$$

where $\mathbf{U}(\boldsymbol{\beta})$ is the score vector and $\mathbf{I}(\boldsymbol{\beta})$ is the observed information matrix. Convergence is reached when the change in the coefficient estimates (or in the log likelihood) between successive iterations is less than a small tolerance $\epsilon$. Values of $\epsilon$ are typically in the range $10^{-4}$ to $10^{-8}$.
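To make the mechanics concrete, the following is a minimal Python sketch, not NXG Explorer's internal code, of a Newton-Raphson fit driven by the score (8.23) and information (8.29). It assumes numpy arrays, no tied failure times, and a hypothetical function name.

```python
# Minimal sketch (hypothetical helper, not NXG Explorer's implementation) of
# fitting the Cox PH partial likelihood with Newton-Raphson, assuming no ties.
import numpy as np

def cox_newton_raphson(time, event, X, tol=1e-6, max_iter=25):
    """Return (beta, var_beta) maximizing the Cox partial likelihood."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        U = np.zeros(p)                                # score vector, eq. (8.23)
        I = np.zeros((p, p))                           # observed information matrix
        w = np.exp(X @ beta)                           # exp(beta'x) for all subjects
        for i in np.where(event == 1)[0]:              # loop over failures
            at_risk = time >= time[i]                  # risk set R(t_(i))
            wr, Xr = w[at_risk], X[at_risk]
            s0 = wr.sum()                              # sum of exp(beta'x)
            s1 = Xr.T @ wr                             # sum of x * exp(beta'x)
            s2 = (Xr * wr[:, None]).T @ Xr             # sum of x x' exp(beta'x)
            U += X[i] - s1 / s0                        # score contribution
            I += s2 / s0 - np.outer(s1, s1) / s0**2    # information (minus the Hessian in 8.29)
        step = np.linalg.solve(I, U)                   # I^{-1} U
        beta = beta + step                             # Newton-Raphson update (8.30)
        if np.max(np.abs(step)) < tol:                 # convergence on coefficient change
            break
    return beta, np.linalg.inv(I)                      # estimates and var-cov matrix
```

The returned variance-covariance matrix is the same quantity used later by the scaled Schoenfeld residuals and the zero-slope tests of the PH assumption.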
This section covers diagnostics for assessing the fit of a Cox proportional hazards regression run, as well as identifying overly influential observations in an analysis.
Cox-Snell (CS) residuals. The Cox-Snell (CS) residual [4] for the $i$th observation is

$$r_{CS_i} = \hat{H}_0(t_i)\exp\!\left(\sum_{j=1}^{p}x_{ij}\hat{\beta}_j\right) \tag{8.31}$$

where $x_{ij}$ is the covariate value for the $i$th object and $j$th feature, $\hat{\beta}_j$ is the regression coefficient for the $j$th feature, and $\hat{H}_0(t)$ is Breslow's cumulative hazard function [4], defined as

$$\hat{H}_0(t) = \sum_{t_{(i)}\le t}\frac{d_i}{\sum_{l\in R(t_{(i)})}\exp(\boldsymbol{\beta}^{\top}\mathbf{x}_{l})} \tag{8.32}$$

where $d_i$ is the number of failures at time $t_{(i)}$.
Once the CS residuals are determined for each object, their values are employed as the survival times in a new KM analysis, with the same censoring information for each object. The values of $\hat{S}(t)$ from the KM analysis are then transformed to new values of the Nelson-Aalen (NA) baseline cumulative hazard using (8.3). A plot of the CS residuals vs. the Nelson-Aalen baseline cumulative hazard derived from using the CS residuals as survival times is then generated. If the Cox PH model fits well, the CS residuals will fall closely on a line with a slope of one, since under a well-fitting model the CS residuals follow an exponential distribution with parameter unity, i.e., exp(1).
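As an illustration of this check, here is a minimal Python sketch (assumed helper names, not Explorer's code) that computes the CS residuals from (8.31)-(8.32) and then the Nelson-Aalen cumulative hazard treating those residuals as survival times. Plotting the two against each other should give roughly a unit-slope line when the model fits.

```python
# Minimal sketch of the Cox-Snell goodness-of-fit check (hypothetical helpers).
import numpy as np

def breslow_cumhaz(time, event, risk_score):
    """Breslow baseline cumulative hazard evaluated at each subject's time (8.32)."""
    H0 = np.zeros(len(time))
    for i, t in enumerate(time):
        fails = time[(time <= t) & (event == 1)]       # failure times up to t
        H0[i] = sum(1.0 / risk_score[time >= tf].sum() for tf in fails)
    return H0

def cox_snell_check(time, event, X, beta):
    risk_score = np.exp(X @ beta)
    r_cs = breslow_cumhaz(time, event, risk_score) * risk_score   # eq. (8.31)
    # Nelson-Aalen cumulative hazard using the CS residuals as "survival times"
    order = np.argsort(r_cs)
    r, d = r_cs[order], event[order]
    n_at_risk = np.arange(len(r), 0, -1)
    na_cumhaz = np.cumsum(d / n_at_risk)
    return r, na_cumhaz      # plot na_cumhaz vs. r; expect slope ~1 through the origin
```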
Martingale (M) residuals. The Martingale residual [4] is determined for each risk set (for objects having unique failure times) using the relationship

$$r_{M_i} = \delta_i - r_{CS_i} \tag{8.33}$$

where $\delta_i$ is the censoring indicator for the $i$th risk set ($\delta_i=1$ for a failure, $\delta_i=0$ for a censored observation).
Deviance (D) residuals. The deviance residual [4] is also determined for each risk set using the relationship

$$r_{D_i} = \operatorname{sign}(r_{M_i})\sqrt{-2\left[r_{M_i} + \delta_i\log(\delta_i - r_{M_i})\right]} \tag{8.34}$$

where $r_{M_i}$ is the Martingale residual and $\delta_i$ is the censoring indicator for the $i$th risk set.
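A minimal sketch of (8.33) and (8.34), assuming the Cox-Snell residuals and censoring indicators are already available as numpy arrays:

```python
# Martingale (8.33) and deviance (8.34) residuals from Cox-Snell residuals.
import numpy as np

def martingale_residuals(delta, r_cs):
    return delta - r_cs

def deviance_residuals(delta, r_cs):
    r_m = delta - r_cs
    # the term delta*log(delta - r_m) is taken as 0 for censored observations
    term = np.where(delta == 1, np.log(np.clip(delta - r_m, 1e-300, None)), 0.0)
    return np.sign(r_m) * np.sqrt(-2.0 * (r_m + term))
```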
Schoenfeld (SCH) residuals. Schoenfeld residuals [4] are determined for each observation and each feature, based on the functional form

$$r_{SCH_{ij}} = \delta_i\left(x_{ij} - \bar{x}_{w_{ij}}\right) \tag{8.35}$$

where $\bar{x}_{w_{ij}}$ is the risk-weighted mean covariate

$$\bar{x}_{w_{ij}} = \frac{\sum_{l\in R(t_i)}x_{lj}\exp(\boldsymbol{\beta}^{\top}\mathbf{x}_{l})}{\sum_{l\in R(t_i)}\exp(\boldsymbol{\beta}^{\top}\mathbf{x}_{l})} \tag{8.36}$$

and $R(t_i)$ is the risk set of individuals at risk up to time $t_i$ for the $i$th observation.
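A minimal sketch of (8.35)-(8.36), assuming numpy arrays and a hypothetical function name:

```python
# Schoenfeld residuals: observed covariate minus its risk-weighted mean.
import numpy as np

def schoenfeld_residuals(time, event, X, beta):
    w = np.exp(X @ beta)
    resid = np.zeros_like(X, dtype=float)
    for i in range(len(time)):
        at_risk = time >= time[i]                                              # risk set R(t_i)
        xbar = (X[at_risk] * w[at_risk, None]).sum(axis=0) / w[at_risk].sum()  # eq. (8.36)
        resid[i] = event[i] * (X[i] - xbar)                                    # eq. (8.35), zero if censored
    return resid
```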
Scaled Schoenfeld (SCA) residuals. Scaled Schoenfeld residuals [4], on the other hand, are determined as

$$\mathbf{r}^{*}_{SCH_i} = \hat{\boldsymbol{\beta}} + D\,\widehat{\operatorname{Var}}(\hat{\boldsymbol{\beta}})\,\mathbf{r}_{SCH_i} \tag{8.37}$$

where $\hat{\boldsymbol{\beta}}$ is the vector of regression coefficients, $D$ is the total number of failures, $\widehat{\operatorname{Var}}(\hat{\boldsymbol{\beta}})$ is the variance-covariance matrix of the coefficients, and $\mathbf{r}_{SCH_i}$ is the vector of Schoenfeld residuals for the $i$th observation.
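A one-line sketch of the scaling in (8.37), assuming the Schoenfeld residual matrix, the fitted coefficients, and their variance-covariance matrix are available:

```python
# Scaled Schoenfeld residuals (8.37): beta-hat + D * Var(beta-hat) * r_SCH,i.
import numpy as np

def scaled_schoenfeld(r_sch, beta, var_beta, n_failures):
    # r_sch: (n, p) Schoenfeld residuals; rows for censored observations are zero
    # and are normally dropped before scaling and plotting.
    return beta + n_failures * r_sch @ var_beta.T
```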
The table below lists, for each of the residuals and hazards described above, how it is represented and where it exists. The Cox-Snell, Martingale, and deviance residuals, along with the Nelson-Aalen cumulative hazard, are only determined for observations with unique failure times, whereas the Schoenfeld and scaled Schoenfeld residuals are determined for each observation and each feature.
Residual | Representation | Existence
Cox-Snell | Risk sets | Unique failure times
Martingale | Risk sets | Unique failure times
Deviance | Risk sets | Unique failure times
Nelson-Aalen cum hazard | Risk sets | Unique failure times
Schoenfeld | Observations | Each observation and feature
Scaled Schoenfeld | Observations | Each observation and feature
Plots for proportional hazards assumption. Plots of $-\log\{-\log[\hat{S}(t)]\}$ vs. $\log(t)$ for categorical features such as treatment group or diagnostic group can be constructed using the stratum-specific baseline cumulative hazard, defined as

$$\hat{H}_{0g}(t) = \sum_{t_{(i)g}\le t}\frac{d_{ig}}{\sum_{l\in R(t_{(i)g})}\exp(\boldsymbol{\beta}^{\top}\mathbf{x}_{l})} \tag{8.38}$$

where $g$ represents the $g$th category, $t_{(i)g}$ is a failure time in the $g$th stratum, $d_{ig}$ is the number of failures at time $t_{(i)g}$, and $R(t_{(i)g})$ is the risk set of censored and failed objects exclusively in the $g$th category at risk up to time $t_{(i)g}$. This calculation requires use of only the covariate values, unique failure times, and risk sets containing objects in the $g$th group, which results in a stratum-specific baseline cumulative hazard. The stratum-specific survivorship function value for each observation in the $g$th category and $i$th risk set is then determined as

$$\hat{S}_{ig}(t_{(i)g}) = \exp\left[-\hat{H}_{0g}(t_{(i)g})\exp(\boldsymbol{\beta}^{\top}\mathbf{x}_{i})\right] \tag{8.39}$$
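The sketch below (a hypothetical helper, evaluated at the baseline covariate pattern x = 0 for simplicity) shows how the stratum-specific cumulative hazard (8.38) and survivorship (8.39) translate into the -log(-log[S(t)]) vs. log(t) coordinates used for these plots.

```python
# Stratum-specific -log(-log S) vs. log(t) coordinates for PH assumption plots.
import numpy as np

def loglog_plot_data(time, event, X, beta, group):
    w = np.exp(X @ beta)
    curves = {}
    for g in np.unique(group):
        in_g = group == g
        t_g, e_g, w_g = time[in_g], event[in_g], w[in_g]
        fail_times = np.sort(t_g[e_g == 1])
        # stratum-specific Breslow cumulative hazard (8.38), risk sets within stratum g
        H0g = np.array([sum(1.0 / w_g[t_g >= tf].sum() for tf in fail_times[fail_times <= t])
                        for t in fail_times])
        S0g = np.exp(-H0g)                       # baseline survivorship, (8.39) with x = 0
        curves[g] = (np.log(fail_times), -np.log(-np.log(S0g)))
    return curves                                # roughly parallel curves support the PH assumption
```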
Tests of proportional hazards assumption. Grambsch and Therneau [5] introduced a test based on scaled Schoenfeld residuals to determine if the PH assumption holds for each feature, and for the global collection of features used in a model. Under the null hypothesis, the PH assumption holds if the slope of the scaled Schoenfeld residuals over some function of time, $g(t)$, is zero. The effect of a specific feature on violation of the PH assumption can be tested using the following zero-slope test statistic

$$\chi^{2}_{j} = \frac{\left\{\sum_{i=1}^{D}\left[g(t_i)-\bar{g}\right]r^{*}_{SCH_{ij}}\right\}^{2}}{D\,\widehat{\operatorname{Var}}(\hat{\beta}_j)\sum_{i=1}^{D}\left[g(t_i)-\bar{g}\right]^{2}} \tag{8.40}$$

where $g(t_i)$ is either the survival time, the log survival time, etc., $\bar{g}$ is the mean of $g(t_i)$ over the $D$ failures, and $\widehat{\operatorname{Var}}(\hat{\beta}_j)$ is the $j$th diagonal element of the variance-covariance matrix of the coefficients. Any feature yielding a significant slope test should either be used as a strata term for stratified Cox PH regression, or be removed from the model. In addition, an overall global zero-slope test for all features can be evaluated using the following formulation

$$\chi^{2}_{global} = \mathbf{e}^{\top}\mathbf{Y}\left[D\,\widehat{\operatorname{Var}}(\hat{\boldsymbol{\beta}})\sum_{i=1}^{D}\left[g(t_i)-\bar{g}\right]^{2}\right]^{-1}\mathbf{Y}^{\top}\mathbf{e} \tag{8.41}$$

where each element of the vector $\mathbf{e}$ is $g(t_i)-\bar{g}$ and each element of the matrix $\mathbf{Y}$ is a scaled Schoenfeld residual $r^{*}_{SCH_{ij}}$. If the global zero-slope test is significant, then the features employed in the model result in violation of the PH assumption.
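A minimal sketch of the per-feature and global zero-slope tests as given in (8.40) and (8.41), assuming the scaled Schoenfeld residuals for the D failures are held in a (D, p) array (the function name is hypothetical):

```python
# Grambsch-Therneau zero-slope tests of the PH assumption.
import numpy as np
from scipy import stats

def ph_tests(g_time, r_star, var_beta):
    """g_time: g(t_i) for the D failures; r_star: (D, p) scaled Schoenfeld residuals."""
    D, p = r_star.shape
    e = g_time - g_time.mean()                       # centered time transform g(t_i) - g-bar
    scale = D * (e @ e)                              # D * sum[(g(t_i) - g-bar)^2]
    # per-feature tests, each compared with chi-squared on 1 d.f. (8.40)
    chi_each = (e @ r_star) ** 2 / (scale * np.diag(var_beta))
    p_each = stats.chi2.sf(chi_each, df=1)
    # global test, compared with chi-squared on p d.f. (8.41)
    ey = e @ r_star                                  # e'Y
    chi_global = ey @ np.linalg.solve(scale * var_beta, ey)
    p_global = stats.chi2.sf(chi_global, df=p)
    return chi_each, p_each, chi_global, p_global
```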
Example 2 - Cox PH regression of Primary Biliary Cirrhosis. In this example, we will be using the Mayo Clinic data for Primary Biliary Cirrhosis again to perform Cox PH regression.
First, import the PBC survival dataset. From the Analysis pull-down menu, select Survival analysis→Cox proportional hazards regression, shown as:
Once the following popup window appears, specify dead as the censoring feature, fudays as the survival time feature, select edema and stage as categorical features, and finally select the rest of the continuous and binary variables as covariates for the model:
Once the run is complete, the following treenode icons will appear:
Click on the Coefficients icon, and the univariate regression coefficients and their hazard ratios (95% CI) will be listed at the top of the output:
The univariate regression coefficient for treatment reveals that there is no significant treatment effect on survival, which coincides with the KM example performed on these data in the previous section. Many of the other covariates are significantly related to survival, and these should be included in the multiple regression model. Next, the multiple regression coefficients are listed in the output:
We notice that treatment is not significant when adjusting for potential nuisance and medical history factors among the subjects. Finally, the hazard ratios (95% CI) for the multiple regression model are listed:
Among the hazard ratios, we can observe quite high values for the edema_3 and stage_4 levels of edema and stage of disease, further attesting to the fact that these covariates should be included with the treatment feature (treatment) in the multiple regression model.
The proportional hazard assumption tests are shown below:
and indicate that the group 2 category for edema violates the PH assumption. See the -log(-log[S(t)]) vs. log(t) plot later in this section to observe this violation.
The predicted survival for each edema group is shown below:
The predicted cumulative hazard for each edema category is shown below:
The -log(-log[S(t)]) vs. log(t) plot for the 3 edema groups for this model is shown below:
and it is apparent that the group 2 survival curve crosses the other curves, suggesting that the proportional hazard assumption is violated for the categorical edema feature. The Nelson-Aalen cumulative hazard vs. Cox-Snell residual plot is shown below:
The Martingale residuals are as follows:
The deviance residuals are as follows:
[1] E.T. Lee. Statistical Methods for Survival Data Analysis. Belmont(CA), Lifelong Learning Publications, 1980. (ISBN: 0-534-97987-4).
[2] J.F. Lawless. Statistical Models and Methods for Lifetime Data, 2nd Edition. New York(NY), John Wiley and Sons, 2002. (ISBN: 978-0-471-37215-8).
[3] J.D. Kalbfleisch, R.L. Prentice. The Statistical Analysis of Failure Time Data. New York(NY), John Wiley and Sons, 1980. (ISBN: 978-0-471-05519-0).
[4] J.P. Klein, M.L. Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. New York(NY), Springer-Verlag Press, 1997. (ISBN: 0-387-94829-5).
[5] P.M. Grambsch, T.M. Therneau. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika. 81:515-526, 1994.
The Retinol dataset is based on results of a cross-sectional study to investigate the relationship between personal characteristics and dietary factors, and plasma concentrations of retinol, beta-carotene and other carotenoids [1, 2, 3]. Study subjects were patients who had an elective surgical procedure during a three-year period to biopsy or remove a lesion of the lung, colon, breast, skin, ovary or uterus that was found to be non-cancerous. Plasma beta-carotene levels were log-transformed prior to the analyses due to severe asymmetry of the residuals on the original scale. For log beta-carotene, dietary intake, regular use of vitamins, and intake of fiber were associated with higher plasma concentrations, while Quetelet index (defined as weight/height^2 in units of kg/m^2) and cholesterol intake were associated with lower plasma levels, adjusting for the other factors. There was one extremely high leverage point in alcohol consumption that was deleted prior to the analyses.
Plasma concentrations of retinol and beta-carotene were not correlated. The authors conclude that there is wide variability in plasma concentrations of these micronutrients in humans, and that much of this variability is associated with dietary habits and personal characteristics. A better understanding of the physiological relationship between some personal characteristics and plasma concentrations of these micronutrients will require further study.
Authorization: This data set can be used to demonstrate multiple regression, transformations, categorical variables, outliers, pooled tests of significance and model building strategies.
Feature | Description | Scale |
age | Age (years) | Continuous |
sex | Sex (1=Male, 2=Female) | Categorical(1,2) |
smokstat | Smoking status (1=Never, 2=Former, 3=Current Smoker) | Categorical(1,2,3) |
quetelet | Quetelet (weight/(height^2)) | Continuous |
vituse | Vitamin Use (1=Yes, 2=Sometimes, 3=No) | Categorical(1,2,3) |
calories | Number of calories consumed per day | Continuous |
fat | Grams of fat consumed per day | Continuous |
fiber | Grams of fiber consumed per day | Continuous |
alcohol | Number of alcoholic drinks consumed per week | Continuous |
cholesterol | Cholesterol consumed (mg per day) | Continuous |
betadiet | Dietary beta-carotene consumed (mcg per day) | Continuous |
retdiet | Dietary retinol consumed (mcg per day) | Continuous |
betaplasma | Plasma beta-carotene (ng/ml) | Continuous |
retplasma | Plasma Retinol (ng/ml) | Continuous |
SIZE: Excel with 314 objects, 14 features.
Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman’s behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.
The goal of this study [4] was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.
The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.
Feature | Description | Scale |
id | Identification Code | NA |
low | Low Birth Weight (0=Birth Weight ≥ 2500g, 1=Birth Weight < 2500g) | Binary(0,1)
age | Age of the Mother in Years | Continuous |
lwt | Weight in Pounds at the Last Menstrual Period | Continuous |
race | Race (1=White, 2=Black, 3=Other) | Categorical(1,2,3) |
smoke | Smoking Status During Pregnancy (1=Yes, 0=No) | Binary(0,1) |
ptl | History of Premature Labor (0=None, 1=One, etc.) | Binary(0,1) |
ht | History of Hypertension (1=Yes, 0=No) | Binary(0,1) |
ui | Presence of Uterine Irritability (1=Yes, 0=No) | Binary(0,1) |
ftv | Number of Physician Visits During the First Trimester (0=None, 1=One, 2=Two, etc.) | Continuous |
bwt | Birth Weight in Grams | Continuous |
SIZE: Excel with 189 objects, 11 features.
The purpose of this study [5] was to develop a method of classifying cancers to specific diagnostic categories based on their gene expression signatures using artificial neural networks (ANNs). ANNs were trained using the small, round blue-cell tumors (SRBCTs) as a model. These cancers belong to four distinct diagnostic categories (Ewing’s sarcoma, Burkitt’s lymphoma, neuroblastoma, and rhabdomyosarcoma) and often present diagnostic dilemmas in clinical practice. The ANNs correctly classified all samples and identified the genes most relevant to the classification. Expression of several of these genes has been reported in SRBCTs, but most have not been associated with these cancers. To test the ability of the trained ANN models to recognize SRBCTs, the study team analyzed additional blinded samples that were not previously used for the training procedure, and correctly classified them in all cases. This study demonstrates the potential applications of these methods for tumor diagnosis and the identification of candidate targets for therapy.
The 23 feature genes listed in tabular notation below were identified from the full set of 2,308 genes using greedy PTA feature selection.
Feature | Description | Scale |
fc fragment of igg | Gene expression | Continuous |
growth arrest-specific 1-36582 | Gene expression | Continuous |
fgfr4 | Gene expression | Continuous |
antigen-monoclonal antibodie | Gene expression | Continuous |
apj receptor | Gene expression | Continuous |
sarcoglycan | Gene expression | Continuous |
histone deacetylase | Gene expression | Continuous |
actin alpha 2 | Gene expression | Continuous |
eh domain containing | Gene expression | Continuous |
ests-66552 | Gene expression | Continuous |
n-acetylglucosamine receptor 1 | Gene expression | Continuous |
mitogen-activated protein kina | Gene expression | Continuous |
non-metastatic cells 4 | Gene expression | Continuous |
ribonucleaseangiogenin inhibit | Gene expression | Continuous |
glycine amidinotransferase | Gene expression | Continuous |
translocation protein 1 | Gene expression | Continuous |
recoverin | Gene expression | Continuous |
igf-2 | Gene expression | Continuous |
transmembrane 4 superfamily | Gene expression | Continuous |
myh-1c | Gene expression | Continuous |
plasminogen activator | Gene expression | Continuous |
ccaat | Gene expression | Continuous |
n-cadherin 2 | Gene expression | Continuous |
class | Diagnostic class (1-Ewing’s sarcoma(EWS), 2-Burkitt’s lymphoma(BL), 3-neuroblastoma(NB), 4-rhabdomyosarcoma(RMS) | Categorical(1,2,3,4) |
SIZE: Excel with 63 objects, 24 features.
This dataset is well suited to Poisson regression because the data are presented as the number of cases (coronary deaths) and person-years of follow-up among British doctors as a function of smoking (yes/no). Table 9.1 lists the number of coronary deaths as a function of age among non-smoking and smoking doctors.
Feature | Description | Scale |
Cases | # cases | Continuous |
PY | Person-years of follow-up | Continuous |
Age 35-44 | Age category 35-44 (1-Yes,0-No) | Binary(0,1) |
Age 45-54 | Age category 45-54 (1-Yes,0-No) | Binary(0,1) |
Age 55-64 | Age category 55-64 (1-Yes,0-No) | Binary(0,1) |
Age 65-74 | Age category 65-74 (1-Yes,0-No) | Binary(0,1) |
Age 75-84 | Age category 75-84 (1-Yes,0-No) | Binary(0,1) |
Smoke | Smoking history (1-Yes, 0-No) | Binary(0,1) |
SIZE: Excel with 10 rows, 8 features.
The data is from the Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984 [7]. A total of 424 PBC patients, referred to Mayo Clinic during that ten-year interval, met eligibility criteria for the randomized placebo controlled trial of the drug D-penicillamine. The first 312 cases in the data set participated in the randomized trial and contain largely complete data. The additional 112 cases did not participate in the clinical trial, but consented to have basic measurements recorded and to be followed for survival. Six of those cases were lost to follow-up shortly after diagnosis, so the data here are on an additional 106 cases as well as the 312 randomized participants. Missing data items are denoted by a period.
This dataset is nearly identical to the data set found in Appendix D of Fleming and Harrington [7]. The differences are that age is in days, status is coded with 3 levels, and the sex and stage variables are not missing for observations 313-418. Table 9.2 lists the data format.
Feature | Description | Scale |
id | case number | NA |
treatment | 0-Placebo, 1=D-penicillamine | Binary(0,1) |
age | age in years | Continuous |
sexfem | sex (0=Male, 1=Female) | Binary(0,1) |
ascites | presence of ascites (0=No, 1=Yes) | Binary(0,1) |
hepatomegaly | presence of hepatomegaly (0=No, 1=Yes) | Binary(0,1) |
spiders | presence of spiders (0=No, 1=Yes) | Binary(0,1) |
edema | presence of edema (1=No edema and no diuretic therapy for edema; 2=edema present without diuretics, or edema resolved by diuretics; 3 = edema despite diuretic therapy | Categorical(1,2,3) |
bilirubin | serum bilirubin in mg/dl | Continuous |
albumin | albumin in gm/dl | Continuous |
copper | urine copper in ug/day | Continuous |
alkphos | alkaline phosphatase in U/liter | Continuous |
sgot | SGOT in U/ml | Continuous |
platelets | platelets per cubic ml / 1000 | Continuous |
cholesterol | serum cholesterol in mg/dl | Continuous |
prothrombin | prothrombin time in seconds | Continuous |
stage | histologic stage of disease (1,2,3,4) | Categorical(1,2,3,4) |
fudays | number of days between registration and the earlier of death, transplantation, or study analysis time in July, 1986 | Continuous |
dead | failure status (0=Alive, 1=Dead) | Binary(0,1) |
SIZE: Excel with 312 objects, 19 features.
[1] D.W. Nierenberg, T.A. Stukel, J.A. Baron, B.J. Dain, E.R. Greenberg. Determinants of plasma levels of beta-carotene and retinol. Am. J. Epidemiol. 130:511–521, 1989.
[3] http://staff.pubhealth.ku.dk/~tag/Teaching/share/data/Plasma.html
[4] D.W. Hosmer, S. Lemeshow, R.X. Sturdivant. Applied Logistic Regression: Third Edition. New York (NY), John Wiley and Sons, 2013.
[5] J. Khan, J.S. Wei, M. Ringnér, L.H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, P.S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine. 7:673–679, 2001.
[6] R. Doll, A.B. Hill. Mortality of British doctors in relation to smoking: Observations on coronary thrombosis. Nat’l. Cancer Inst. Monogr. 19:205–268, 1966.
[7] T.R. Fleming, D.P. Harrington. Counting Processes and Survival Analysis. New York (NY), John Wiley and Sons, 1991.
Behrens-Fisher problem, 52
Binary features, 5
Categorical features, 5
Chi-squared test, 71
Classifier
Polytomous logistic regression (PLOG), 130
Clipboard
Copying HTML output, 15
Copying output images, 15
Coding effects
Corner-point constraints, 127
Sum-to-zero constraints, 127
Color
Background of images, 11
Cluster heat maps, 11
Datagrid selected items, 11
Filtered records, 11
Histogram bars, 11
Lines in cluster images, 11
Preferences, 5
Text in cluster images, 11
Contingency table analysis, 71
Contingency tables
Multiway, 75
Continuous features, 5
Copying
HTML output, 15
images, 15
Correlation
Matrix, 94
Pearson product moment, 83
Spearman rank, 83
Covariance, 83
Datasets
British doctors, 171
Low birth weight, 170
Primary Biliary Cirrhosis, 172
Retinol, 169
Small round blue-cell tumors, 170
Default settings, 5
Categorical feature criterion, 7
Categorical table iterations, 8
Class discovery run outputs, 8
Class prediction run outputs, 9
Convert input text to numeric, 7
Data size control, 7
Feature names, 8
Filtered record color, 11
Graphic image format, 12
Graphics, 11
Graphics resolution, 11
Histogram bar color, 11
Image background color, 11
Image line color, 11
Image save format, 7
Input binary and categorical as continuous, 7
Input string manipulation, 8
Minimum sample size, 8
Missing data codes, 6
Model building strategy, 8
Monte Carlo uncertainty analysis, 9
Output options, 6
Regression n-2-p ratio, 8
Replace missing data, 7
Report test results, 8
Script editor font, 8
Table fonts and colors, 9
Text color, 11
Dependency
Logistic regression, 122
Multiple linear regression, 101
ANOVA table, 103
Cook’s distance, 108
DFBETAS, 109
DFFITS, 109
Diagnostics, 106
Jackknife residuals, 108
Multicollinearity, 109
Partial F-tests, 111
Residuals, 108
Standardized residuals, 108
Variance inflation factors, 110
Variance partitioning, 102
Multivariate linear regression, 116
Poisson regression, 136
Polytomous logistic regression, 130
Distance
Euclidean, 83
Distribution
normal, 18
Standard normal, 37
Editor, Filtering records, 33
Editor, Labels for categories, 30
Editor, Mathematical transforms, 31
Eigenvalues, 117
Euclidean distance, 83
Feature types
Binary, 5
Categorical, 5
Continuous, 5
Text, 5
File
Category code map file, 16
Commands, 15
Export
Comma-delimited (.csv), 16
Tab-delimited text (.txt), 16
Import
Category codes, 16, 33
Comma-delimited (.csv), 16
Excel (xlsx, xls), 16
Excel file (xlsx, xls), 20
Tab-delimited text (.txt), 16
Merge Excel files, 16
Open Explorer project, 15
Save Explorer project, 16
Font
Preferences, 5
Fonts
Script editor, 8
Geometric mean, 58
Geometric standard deviation, 58
Harmonic mean, 58
Harmonic standard deviation, 58
Heteroscedasticity, 49, 65
Homogeneity of variance test, 49
Homoscedasticity, 49
Hypothesis test
Bartlett’s equality of variances, 49
Chi-squared, 71
Homogeneity of variance, 49
Kruskal-Wallis, 62
Mann-Whitney U, 53
Proportions, two independent, 69
T-test, 49
T-test, assuming equal variances, 51
T-test, assuming unequal variances, 52
T-test, independent samples, 50
T-test, paired samples, 66
T-test, variances assumed unequal, 51, 52
Welch ANOVA, 65
Welch ANOVA test, 62
Import
Category code file, 16, 33
Comma-delimited file (.csv), 16
Excel file (xlsx, xls), 16, 20
Tab-delimited text file (.txt), 16
Jacobi method, 117
Kruskal-Wallis Test, 62
Kurtosis, 19
Likert scale, 13
Linear predictor, 107
Link function, 136
Logistic regression, 122
Mann-Whitney U Test, 53
Mathematical transforms, 31
Matrix
Correlation, R, 94, 96
Covariance, , 95
Covariance, S, 94
Data, multivariate, Y, 93
Data, X, 17
Determinant, 96
Distance, D, 95
Hessian, 125, 131, 157
Identity, 97
Regression
Coefficient, , 117
Data, X, 101
Hat, H, 108
Variance-covariance of coefficients, , 104
Variance-covariance of coefficients, , 117
Standardized data, Z, 94
Variance-covariance, 106
Variance-covariance, S, 94
Matrix setup of data, 17
Mean, 18
Mean-zero standardization, 37
Median, 19
Newton-Raphson method, 125, 132, 157
Normal distribution, 18
Normal variates, 18
Normality test, see Shapiro-Wilk, 19
Normalization, 37
Paired t-test, 66
Percentiles, 40
Polytomous logistic regression, 130
Preferences, 5
Quantile, 19
Range, 19
Recoding features
Fuzzification, 41
Logarithm, 38
Nominal-to-binary, 41
Normalization, 37
Percentiles, 40
Quantiles, 39
Ranks, 39
Standardization, 37
van der Waerden scores, 40
Z-scores, 37
Regression
Cox proportional hazards, 155
Deviance, 126
Diagnostics, 106, 157
Goodness-of-fit, 125
Logistic (polytomous), 130
Logistic regression, 122
Logistic, unconditional (binary), 123
Multicollinearity, 109
Multiple linear regression, 101
Multivariate linear regression, 116
Newton-Raphson method, 125, 132, 157
Partial F-tests, 111
Poisson, 136
Polytomous logistic regression, 130
Residuals, 108, 125
Simple linear, 101
Screenshots
Copying HTML output to clipboard, 15
Copying images to clipboard, 15
Shapiro-Wilk normality test, 19
Singular value decomposition, 117
Skew-zero transform, see van der Waerden transform, 41
Skewness, 19
Standard deviation, 18
Standard normal distribution, 37
Standardization, 37
Survival analysis
Breslow’s cumulative hazard, 158
Cox proportional hazards regression, 155
Cox-Snell residuals, 158
Deviance residuals, 158
Hazard function, 151
Kaplan-Meier analysis, 151
Log-rank test, 151
Martingale residuals, 158
Nelson-Aalen baseline cumulative hazard, 152
Product limit estimate of survival, 151
Scaled Schoenfeld residuals, 158
Schoenfeld residuals, 158
Stratum-specific baseline hazard, 159
Survivorship function, 151
Tests of PH assumption, 159
Tests of hypothesis
2-by-2 table frequencies, 71
ANOVA (one-way), 59
Bartlett’s equality of variances, 49
Categorical data, 71
Chi-squared, 71
Contingency tables, 71
Kruskal-Wallis, 62
Mann-Whitney U test, 53
McNemar paired binary outcomes, 77
Paired t-test, 66
Permutation, 77
Proportions, two independent, 69
r-by-c table frequencies, 71
Randomization, 77
T-test, 49
T-test, assuming equal variances, 51
T-test, assuming unequal variances, 52
T-test, independent samples, 50
T-test, paired, 66
Wilcoxon rank sum, 55
Wilcoxon signed rank, 67
Text features, 5
Transformations
Categorical-to-Class, 45
Continuous-to-categorical, 42
Fuzzification, 41
Inside CV folds, 45
Logarithm, 38
Nominal-to-binary, 41
Normalization, 37
Percentiles, 40
Quantiles, 39
Ranks, 39
Standardization, 37
Text-to-categorical, 43, 44
van der Waerden scores, 40
Z-scores, 37, 38
van der Waerden scores, 40
Variance, 18
Welch ANOVA test, 62