Abstract
Simulated data in the form of sine waves with noise were analyzed as time series using principal component analysis (PCA). Multiple replicates of each underlying signal were used so that the number of features exceeded the number of extracted principal components (PCs). Both the covariance matrix and correlation matrix were used during eigendecomposition, and the standardized (scaled) and non-standardized (unscaled) PC scores were obtained from each run. A total of eight patterns were evaluated, each of which consisted of mixtures of amplitudes, wavelengths, and phase shifts. Results indicate that use of the correlation matrix for eigendecomposition typically only introduced small phase shifts among the extracted PC scores (over time, or interval); however, correlation was able to extract a PC vector which correlated strongly with the sum of the data, representing the composite cycle. While the composite cycle is not the dominant cycle in the sense of the greatest power spectrum peak from FFT, the composite nevertheless represents the entire dataset. It is important to realize that PCA cannot de-convolute a dataset to extract PC score vectors that reflect the underlying signals; rather, PCA can only extract shared information (correlation, covariance) between multiple time series. In conclusion, it is important that researchers know specifically whether covariance or correlation is used for eigendecomposition, and whether the standardized or non-standardized PC scores are generated by the algorithms employed for running PCA on time series. The choices made surrounding these options for PCA can have a profound effect on results, especially if dimension reduction is being performed to collapse data for further analyses.
Principal components analysis (PCA) is one of the most popular linear dimension reduction techniques. During PCA, PC scores are based on the summed (linear) components extracted during eigendecomposition of a covariance matrix or correlation matrix.
By definition, the population covariance matrix is

$\boldsymbol{\Sigma} = \mathrm{E}\left[(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top\right]$   (1)

where $\mathbf{x}_i$ is a $p$-tuple for the $i$th object and $\boldsymbol{\mu}$ is a $p$-tuple with means for the $p$ features. Because it is impossible to measure $\boldsymbol{\Sigma}$, we are limited to working with the sample covariance matrix, given as

$\mathbf{S} = \dfrac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top$   (2)
Let $\mathbf{D} = \mathrm{diag}(s_1, s_2, \ldots, s_p)$ be the diagonal matrix of sample standard deviations; the sample correlation matrix is

$\mathbf{R} = \mathbf{D}^{-1}\,\mathbf{S}\,\mathbf{D}^{-1}$   (3)

During computation, elements of $\mathbf{R}$ can be determined individually using the relationship

$r_{jk} = \dfrac{s_{jk}}{s_j s_k}$   (4)

If the means of the columns of $\mathbf{X}$ are known, covariance elements of $\mathbf{S}$ can be directly calculated using the functional form

$s_{jk} = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$   (5)

whereas, if the standard deviations and correlations are already known, then the covariance elements are calculated using

$s_{jk} = r_{jk}\, s_j\, s_k$   (6)
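As a minimal sketch (not part of the original text), the relationships in equations (2) through (6) can be verified numerically; the NumPy code below uses illustrative data and variable names.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))          # n objects (rows) by p features (columns)

# Sample covariance matrix S, equation (2), with the n-1 denominator
xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / (n - 1)

# Sample correlation matrix R, equation (3): R = D^-1 S D^-1
s = np.sqrt(np.diag(S))              # sample standard deviations
D_inv = np.diag(1.0 / s)
R = D_inv @ S @ D_inv

# Element-wise checks: r_jk = s_jk / (s_j s_k), equation (4),
# and s_jk = r_jk s_j s_k, equation (6)
assert np.allclose(R, S / np.outer(s, s))
assert np.allclose(S, R * np.outer(s, s))

# Agreement with NumPy's built-ins (features in columns, so rowvar=False)
assert np.allclose(S, np.cov(X, rowvar=False))
assert np.allclose(R, np.corrcoef(X, rowvar=False))
```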
It is assumed that the sample covariance matrix $\mathbf{S}$ is an accurate and reliable estimate of the population covariance matrix $\boldsymbol{\Sigma}$. When the sample size $n$ is infinitely large, there are fewer problems with this assumption, but as $p \to n$, the eigenvalues of $\mathbf{S}$ become biased: the small eigenvalues are too small and the large eigenvalues are too large. Moreover, when $p \geq n$, several of the eigenvalues become zero, so $\mathbf{S}$ becomes singular by losing full rank and therefore is not positive definite and cannot be inverted.
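A brief illustrative check of this rank argument, under assumed dimensions ($n = 20$, $p = 30$), shows the sample covariance matrix losing full rank when $p > n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 30                         # more features than observations
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)           # p x p sample covariance matrix

eigvals = np.linalg.eigvalsh(S)       # eigenvalues in ascending order
rank = np.linalg.matrix_rank(S)

# With p > n, at most n - 1 eigenvalues are nonzero, so S is singular
print(f"rank(S) = {rank} (at most n - 1 = {n - 1})")
print("smallest eigenvalues ~ 0:",
      np.allclose(eigvals[: p - (n - 1)], 0.0, atol=1e-10))
```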
The non-standardized principal component score, "PC score," for the $i$th object and $j$th principal component when the covariance matrix $\mathbf{S}$ is used for eigendecomposition is

$f_{ij} = \sum_{k=1}^{p}(x_{ik} - \bar{x}_k)\, e_{kj}$   (7)

which translates to the matrix operation

$\mathbf{F} = (\mathbf{X} - \bar{\mathbf{X}})\,\mathbf{E}$   (8)

where $\mathbf{E}$ is the $p \times p$ matrix whose columns are the eigenvectors of $\mathbf{S}$. Thus, for an $n \times p$ input data matrix $\mathbf{X}$, there will be an $n \times p$ matrix $\mathbf{F}$, which contains principal components (PCs) in columns and objects in rows. For each object there will be $p$ PC scores in each row of $\mathbf{F}$, and in each column of $\mathbf{F}$ there will be $n$ PC scores for all of the objects. We have observed that non-standardized PC scores are equivalent to the PCA output of most software packages, and are reported in the majority of textbooks.
When the correlation matrix is used, the non-standardized PC scores are based on the relationship

$f_{ij} = \sum_{k=1}^{p} z_{ik}\, e_{kj}$   (9)

which in matrix notation is

$\mathbf{F} = \mathbf{Z}\,\mathbf{E}$   (10)

where $\mathbf{Z}$ is the matrix of standardized (mean-zero, unit-variance) $z$-scores of $\mathbf{X}$. Note that, although the mean-zero standardized $z$-scores are used when eigendecomposition is performed on the correlation matrix, this does not result in standardized PC scores. Standardizing PC scores depends on rescaling with the eigenvalues, which is done through the loadings introduced below.
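The following sketch, using simulated data of assumed size, computes the non-standardized PC scores of equations (7)-(10) from both the covariance and correlation matrices; the variance check at the end is a standard property of PC scores, not a result from the original study.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # induce some correlation

# Covariance route: F = (X - Xbar) E, equations (7)-(8)
Xc = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)
eval_S, E_S = np.linalg.eigh(S)                          # ascending order
order = np.argsort(eval_S)[::-1]                         # sort descending
eval_S, E_S = eval_S[order], E_S[:, order]
F_cov = Xc @ E_S                                         # non-standardized PC scores

# Correlation route: F = Z E, equations (9)-(10)
Z = Xc / X.std(axis=0, ddof=1)                           # column z-scores
R = np.corrcoef(X, rowvar=False)
eval_R, E_R = np.linalg.eigh(R)
order = np.argsort(eval_R)[::-1]
eval_R, E_R = eval_R[order], E_R[:, order]
F_cor = Z @ E_R                                          # non-standardized PC scores

# Variance of each score column equals the corresponding eigenvalue
assert np.allclose(F_cov.var(axis=0, ddof=1), eval_S)
assert np.allclose(F_cor.var(axis=0, ddof=1), eval_R)
```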
After removing eigenvectors (columns) of $\mathbf{E}$ that are null or whose eigenvalues satisfy $\lambda_j < 1$, the final number of valid columns will be $m$, and the PC score matrix is now determined as

$\mathbf{F}_{n \times m} = \mathbf{Z}_{n \times p}\,\mathbf{E}_{p \times m}$   (11)

The loading matrix is defined element-wise as

$l_{jk} = e_{jk}\sqrt{\lambda_k}$   (12)

which, in terms of matrix operations, is

$\mathbf{L} = \mathbf{E}\,\boldsymbol{\Lambda}^{1/2}$   (13)

It is important to remember that the loading matrix is a $p \times p$ square non-symmetric matrix which has the same dimensions as the covariance matrix $\mathbf{S}$ or correlation matrix $\mathbf{R}$. However, if principal components are only used for the eigenvalues that are greater than 1, i.e., $\lambda_j > 1$, then $\mathbf{L}$ will have dimensions $p \times m$. Operationally, this is typically the case, as most of the correlation loads on the first few PCs when there is moderate to strong multicollinearity. However, if most of the features are not correlated, then $m \to p$.
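To make the loading definition concrete, the hedged example below builds $\mathbf{L} = \mathbf{E}\boldsymbol{\Lambda}^{1/2}$ from the correlation matrix of simulated data, filters PCs with $\lambda_j > 1$, and checks that the loadings equal the feature-PC correlations; the data and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 300, 6
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))    # correlated features

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
evals, E = np.linalg.eigh(R)
order = np.argsort(evals)[::-1]
evals, E = evals[order], E[:, order]

# Loading matrix, equations (12)-(13): L = E Lambda^(1/2), a p x p matrix
L = E * np.sqrt(evals)

# Keep only PCs with eigenvalues > 1 -> L becomes p x m
keep = evals > 1.0
m = keep.sum()
L_m = L[:, keep]
print(f"p = {p}, retained PCs m = {m}, loading matrix shape = {L_m.shape}")

# For correlation-based PCA, loadings equal the correlations between
# each feature and each non-standardized PC score
F = Z @ E
corr_feat_pc = np.corrcoef(np.hstack([Z, F]), rowvar=False)[:p, p:]
assert np.allclose(corr_feat_pc, L, atol=1e-6)
```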
If we assume that eigendecomposition was performed on the covariance matrix $\mathbf{S}$, once the loading matrix is obtained, the standardized PC scores are determined using the relationship

$\hat{\mathbf{F}} = (\mathbf{X} - \bar{\mathbf{X}})\,\mathbf{S}^{-1}\mathbf{L}$   (14)

where $\mathbf{L}$ is the loading matrix that reveals the correlation between each feature and each PC. This is really a regression problem, for which the goal is to estimate $\mathbf{B}$ in $\hat{\mathbf{F}} = \mathbf{X}\mathbf{B}$, where $\mathbf{F}$ are the PC scores and $\mathbf{X}$ is the observable data. We have

$\mathbf{B} = \mathrm{Cov}(\mathbf{X})^{-1}\,\mathrm{Cov}(\mathbf{X}, \mathbf{F})$   (15)

For normally distributed data the mean vector drops out, and since the covariance between factor scores and data is simply the loadings, for which $\mathrm{Cov}(\mathbf{X}, \mathbf{F}) = \mathbf{L}$, we get

$\mathbf{B} = \mathbf{S}^{-1}\mathbf{L}$   (16)

Now, since we know the loading matrix is $\mathbf{L} = \mathbf{E}\boldsymbol{\Lambda}^{1/2}$, replacing covariance with correlation, $\mathbf{R}$, and substituting $\mathbf{B}$ back into the regression $\hat{\mathbf{F}} = \mathbf{Z}\mathbf{B}$, we have

$\hat{\mathbf{F}} = \mathbf{Z}\,\mathbf{R}^{-1}\mathbf{E}\,\boldsymbol{\Lambda}^{1/2}$   (17)

Therefore, while most textbooks and software packages define $\mathbf{F} = \mathbf{Z}\mathbf{E}$ as the non-standardized PC scores, the use of $\hat{\mathbf{F}} = \mathbf{Z}\,\mathbf{R}^{-1}\mathbf{L}$ to obtain standardized PC scores equates to using $\hat{\mathbf{F}} = \mathbf{Z}\,\mathbf{E}\,\boldsymbol{\Lambda}^{-1/2}$, essentially the non-standardized PC scores divided (scaled) by $\sqrt{\lambda_j}$. By default, NXG Explorer uses standardized PC scores for 2D score plots of PC1 vs. PC2, since the scales are somewhat smaller and distributed more like the standard normal distribution.
| Eigendecomposition performed on | PC scores | Matrix operation |
| --- | --- | --- |
| Covariance matrix, $\mathbf{S}$ | Non-standardized | $\mathbf{F} = (\mathbf{X} - \bar{\mathbf{X}})\,\mathbf{E}$ |
| Covariance matrix, $\mathbf{S}$ | Standardized | $\hat{\mathbf{F}} = (\mathbf{X} - \bar{\mathbf{X}})\,\mathbf{E}\,\boldsymbol{\Lambda}^{-1/2}$ |
| Correlation matrix, $\mathbf{R}$ | Non-standardized | $\mathbf{F} = \mathbf{Z}\,\mathbf{E}$ |
| Correlation matrix, $\mathbf{R}$ | Standardized | $\hat{\mathbf{F}} = \mathbf{Z}\,\mathbf{E}\,\boldsymbol{\Lambda}^{-1/2}$ |
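The table entries can be checked numerically; the sketch below (illustrative data, assumed dimensions) confirms that the regression form $\mathbf{Z}\mathbf{R}^{-1}\mathbf{L}$ reproduces the non-standardized scores scaled by $1/\sqrt{\lambda_j}$, and that the standardized scores have unit variance.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 250, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

Xc = X - X.mean(axis=0)
Z = Xc / X.std(axis=0, ddof=1)

def eig_desc(M):
    """Eigendecomposition with eigenvalues sorted in descending order."""
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

# Correlation route
R = np.corrcoef(X, rowvar=False)
lam_R, E_R = eig_desc(R)
L_R = E_R * np.sqrt(lam_R)                       # loadings, equation (13)
F_ns = Z @ E_R                                   # non-standardized, equation (10)
F_std_reg = Z @ np.linalg.inv(R) @ L_R           # regression form, equation (17)
F_std = F_ns / np.sqrt(lam_R)                    # non-standardized scaled by 1/sqrt(lambda)
assert np.allclose(F_std_reg, F_std)

# Covariance route: (X - Xbar) E and (X - Xbar) E Lambda^(-1/2)
S = np.cov(X, rowvar=False)
lam_S, E_S = eig_desc(S)
G_ns = Xc @ E_S
G_std = G_ns / np.sqrt(lam_S)

# Standardized scores have unit variance in every column
assert np.allclose(G_std.var(axis=0, ddof=1), 1.0)
assert np.allclose(F_std.var(axis=0, ddof=1), 1.0)
```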
There are a few rules-of-thumb which must be taken into consideration when applying PCA to time series. Firstly, each time series must represent a single feature or variable; that is, the objects represent the values of a given time series at each time interval, and the feature itself represents the entire time series. PCA will not return anything meaningful if a single time series is spread across multiple features. With regard to data formatting, what this means is that, if columns represent features or variables and rows represent objects, then each time series will form a single column. So essentially, each time series represents an entire feature whose values are spread out over the rows.
After setting up the data, the data matrix will have dimensions $n \times p$, where $n$ is the number of time periods (rows) and $p$ is the number of time series being evaluated (columns).
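As a small illustration of this layout (the series and their parameters here are hypothetical), each time series is placed in its own column so that the data matrix is $n \times p$:

```python
import numpy as np

# Hypothetical example: three monthly time series observed over 120 months.
# Each series occupies one COLUMN (a feature); rows are the time intervals.
n_intervals = 120
t = np.arange(n_intervals)

series_a = np.sin(2 * np.pi * t / 12)            # seasonal cycle
series_b = 0.01 * t + np.random.default_rng(5).normal(0, 0.1, n_intervals)
series_c = np.cos(2 * np.pi * t / 6)

# Correct layout for PCA on time series: n x p (time intervals x series)
X = np.column_stack([series_a, series_b, series_c])
print(X.shape)                                   # (120, 3): n = 120 rows, p = 3 columns
```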
Readers should also note that in financial engineering, PCA has been performed on both the assets and the days. For example, assume the 30 stocks in the DOW-30 and 1000 daily price returns (4 years of trading, i.e., 250 days/year). The correlation (covariance) matrix can be run on the 30 stocks (30 features) with daily prices used as the objects (time series); or, the correlation (covariance) matrix can be run on the 1000 days (1000 features), with the 30 stocks used as objects – but this is not PCA on time series. The point is that in financial engineering, researchers use PCA on assets to understand how each asset correlates with ("loads on") each PC, as well as PCA on days to identify hidden groupings of days, which can be viewed as a type of clustering of days.
For simulation purposes, we initially chose to use straightforward sine waves for PCA analysis. We assume that the time series, $x(t)$, can be strictly based on a sine wave defined as

$x(t) = A \sin(2\pi f t + \phi)$   (18)

where $t$ is the time or interval, $A$ is the amplitude, $f$ is the frequency, or $f = 1/\lambda$ where $\lambda$ is the wavelength, and $\phi$ is the phase shift. For a fixed frequency, it is easier to use the wavelength (period) directly in the equation for $x(t)$ by simply dividing $2\pi t$ by $\lambda$, then adding the phase shift $\phi$ before taking the sine, and then multiplying the result by the amplitude $A$. A small amount of noise was also added to each value of $x(t)$ so that we could simulate a number of features having the same underlying sine wave, but with a small amount of noise added. This was accomplished by simply adding a small noise value based on $U$ to each value of $x(t)$, where $U$ is a random uniform variate in the range [0,1].
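A minimal sketch of this signal generator is shown below; since the exact noise magnitude used in the simulations is not specified here, the noise scale in the code is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

def noisy_sine(t, amplitude, wavelength, shift, noise_scale=0.1):
    """Sine wave per equation (18), using the wavelength (period) directly,
    plus uniform noise; the noise_scale value is an illustrative assumption."""
    clean = amplitude * np.sin(2 * np.pi * t / wavelength + shift)
    return clean + noise_scale * rng.uniform(0.0, 1.0, size=t.shape)

t = np.arange(1, 501)                  # 500 time intervals
x = noisy_sine(t, amplitude=1.0, wavelength=50, shift=0.0)
```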
For each experimental setup, we used a set of four underlying sine waves (without noise), each having a different amplitude, wavelength, or phase shift. For each of the four underlying sine waves, we simulated a set of 25 time series (features), each of which had the same parameters (amplitude, wavelength, shift) but with a small amount of noise added. Therefore, within each set of 25 time series based on the same underlying sine wave, the time series were not exactly the same because of the added noise. The total number of features (time series, or sine waves) was therefore $p = 4 \times 25 = 100$. We also set $n = 500$, so that $t = 1, 2, \ldots, 500$, and the final dataset consisted of 100 time series each having 500 intervals. It was important not to use a wholly different sine wave for each simulated time series, as this would not enable comparison between the PC scores derived from the simulated data and the unique underlying pattern shared by each set of 25 time series.
The following notation was used to represent each set of 25 similar time series: X1 represented time series 1-25, X2 represented time series 26-50, X3 represented time series 51-75, and X4 represented time series 76-100. For plotting purposes, we also added together and multiplied together all of the time series, and plotted the sum and product with labels SUM(X1:X4) and PROD(X1:X4) in each plot. (Note that the sum and product of all signals were not used in PCA, but rather only in plotting.) The various simulation setups and their results are presented in the following paragraphs.
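Before turning to the individual patterns, the sketch below assembles a dataset of the same shape (4 underlying sine waves × 25 noisy replicates, $n = 500$) and runs correlation-based PCA; the wavelengths, amplitudes, shifts, and noise scale are illustrative assumptions, not the published pattern parameters.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 500, 25
t = np.arange(1, n + 1)

# Illustrative parameters for the four underlying sine waves (the exact
# amplitudes/wavelengths/shifts of each evaluated pattern are not reproduced here)
params = [dict(A=1.0, wavelength=50,  shift=0.0),    # X1: series 1-25
          dict(A=1.0, wavelength=100, shift=0.0),    # X2: series 26-50
          dict(A=1.0, wavelength=150, shift=0.0),    # X3: series 51-75
          dict(A=1.0, wavelength=200, shift=0.0)]    # X4: series 76-100

cols = []
for prm in params:
    clean = prm["A"] * np.sin(2 * np.pi * t / prm["wavelength"] + prm["shift"])
    for _ in range(reps):                            # 25 noisy replicates per signal
        cols.append(clean + 0.1 * rng.uniform(size=n))
X = np.column_stack(cols)                            # 500 x 100 data matrix

# PCA via eigendecomposition of the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
evals, E = np.linalg.eigh(R)
order = np.argsort(evals)[::-1]
evals, E = evals[order], E[:, order]
keep = evals > 1
F_std = (Z @ E[:, keep]) / np.sqrt(evals[keep])      # standardized PC scores

print("eigenvalues > 1:", np.round(evals[keep], 2))
print("standardized PC score matrix:", F_std.shape)
```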
(19)
When all 100 signals were similar and had the same underlying wavelength, there was only 1 PC extracted, which seemed to exactly represent the signal. This is expected when all features are highly correlated: all of the features load on one PC, and the variance explanation is based only on the first principal eigenvalue, $\lambda_1$. Quite naturally, the single PC also coincided with the composite sum of the signals, which is expected.
As shown below, the non-standardized PC scores are much larger than the standardized, since they are not scaled by $\sqrt{\lambda_j}$:
Use of the correlation matrix for eigendecomposition resulted in similar PC scores when compared with covariance:
Again, the non-standardized PC scores display too large a scale when reflecting patterns among the signals:
(20)
This resulted in a finding similar to the previous experiment: the features were still highly correlated after the introduction of changing amplitude with the same wavelength. Obviously, with such high multicollinearity there would be only 1 PC extracted from the covariance matrix.
(21)
The PCA results when there was only a phase shift for the same wavelength
indicate that only two PCs were extracted, as shown in the image below:
(22)
Results indicate that still only 2 PCs were extracted, as shown in the image below:
(23)
These results suggest that when only the wavelength is changed, there are 4 PCs extracted, implying a greater degree of orthogonality (non-correlation) between the sets of 25 simulated signals. In addition, the PC score vectors appear as composite signals representing the summation of the 4 cycles. It is also interesting to note that, even though the signals did not differ in scale (amplitude), the PCs do introduce scale differences.
The non-standardized PC scores display too large a scale when reflecting patterns
among the signals:
Use of the correlation matrix for eigendecomposition resulted in a PC score vector (i.e., PC3) which tracked with the composite sum of signals. Although PCs cannot de-convolute the data into the constituent individual cycles, one of them does nevertheless reflect the composite cycle (sum) of the data:
Again, the non-standardized PC scores display too large a scale when reflecting
patterns among the signals:
(24)
When amplitude and wavelength changes are introduced, it seems that the resulting PCs remove the scale and tend to represent the original signals more closely when compared with the previous results, in which the signals had no change in scale (amplitude).
Like before, the non-standardized PC scores display too large a scale:
When amplitude is introduced to the varying wavelengths, unlike before, correlation
was not able to extract a PC score vector which tracked with the composite sum of
signals:
The non-standardized PC scores, again, displayed too large a scale:
(25)
Under negative correlation, when the signals cancel each other out, there is only 1 PC vector obtained.
(26)
When the correlation is zero and the two signals are orthogonal, as expected, there were 2 PC vectors obtained.
It is important to realize that PCA will not de-convolute a signal the way FFT will, or at least pull out something like power spectrum values to reveal the major wavelengths of the dominant cycles. When there are mixtures of various signals in a dataset, PCA will determine the degree of correlation between the signals and return orthogonal signals (PC score vectors) based on the unique correlation patterns identified. But PCA will not return PC score vectors for the original underlying signals in the data the way FFT will; rather, the returned signals will be based on collapsed information hinged to correlation. For example, if there are two underlying signals, one based on sine and one on cosine, with a small degree of correlation between them, then the PC score vectors that PCA returns will be based on the shared information between the two signals, whereas FFT will provide the wavelengths of the original two signals independent of their correlation. In summary, the important issue with PCA is that it can only return PC score vectors based on shared information between signals, and not anything regarding the original unique signals in a dataset.
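The contrast can be illustrated with a small sketch: FFT recovers each signal's dominant wavelength individually, while PCA only returns mixtures weighted by the correlation structure. The signal parameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 512
t = np.arange(n)

# Two underlying signals with different wavelengths and a small shared component,
# so they are weakly correlated (parameter choices are illustrative)
shared = 0.2 * np.sin(2 * np.pi * t / 32)
x1 = np.sin(2 * np.pi * t / 64) + shared + 0.05 * rng.normal(size=n)
x2 = np.cos(2 * np.pi * t / 16) + shared + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# FFT recovers the dominant wavelength of each original signal individually
for name, x in [("x1", x1), ("x2", x2)]:
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    freqs = np.fft.rfftfreq(n)
    peak = np.argmax(power[1:]) + 1                  # skip the zero-frequency bin
    print(f"{name}: dominant wavelength ~ {1.0 / freqs[peak]:.1f}")

# PCA only redistributes shared (correlated) information between the two series;
# its PC score vectors are mixtures, not the original constituent signals
R = np.corrcoef(X, rowvar=False)
evals, E = np.linalg.eigh(R)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
F = Z @ E[:, ::-1]                                   # PC scores, descending eigenvalue order
print("correlation(x1, x2) =", round(R[0, 1], 3))
print("PC weights (each PC is a mixture of x1 and x2):", np.round(E[:, ::-1], 3))
# PC1 tends to track the composite sum (sign of eigenvectors is arbitrary)
print("|corr(PC1, x1 + x2)| =",
      round(abs(np.corrcoef(F[:, 0], x1 + x2)[0, 1]), 3))
```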
While some reports have suggested that the covariance matrix must always be used for eigendecomposition of time series, our results indicate that use of correlation typically only introduces a small phase shift in the PC vectors, which may or may not be important for more complex data. Irrespective of the phase shift in PC vector values introduced by correlation, when performing dimension reduction the difference between using covariance and correlation for eigendecomposition would likely not cause a strong bias. An important finding when using correlation for eigendecomposition was that one of the extracted PC vectors tracked closely with the composite (sum) of all the cyclical data. Although the composite cycle (sum of cycles) is not the dominant cycle in terms of an FFT power spectrum, this observation is important since the composite signal represents the entire dataset.
Lastly, the standardized (scaled) PC scores based on a least squares fit with the loading matrix should be used, because their scale is much smaller than that of the non-standardized PC scores, which are not scaled by the eigenvalues. One of the advantages of using the standardized PC scores based on the loading matrix is that columns of the loading matrix will be automatically zeroed out for any eigenvalues which are zero, since the loadings are defined as $\mathbf{L} = \mathbf{E}\boldsymbol{\Lambda}^{1/2}$. Whereas when raw PC scores are used, there can be residual non-informative content still remaining in the eigenvector matrix, which has to be zeroed out manually for any zero eigenvalues or when the PCs are filtered for, e.g., $\lambda_j > 1$.