Abstract
Simulated data in the form of sine waves with noise were analyzed as time series using principal component analysis (PCA). Multiple replicates of each underlying signal were used so that the number of features exceeded the number of extracted principal components (PCs). Both the covariance matrix and the correlation matrix were used for eigendecomposition, and the standardized (scaled) and non-standardized (unscaled) PC scores were obtained from each run. A total of eight patterns were evaluated, each consisting of mixtures of amplitudes, wavelengths, and phase shifts. Results indicate that use of the correlation matrix for eigendecomposition typically introduced only small phase shifts among the extracted PC scores (over time, or interval); however, correlation was able to extract a PC vector that correlated strongly with the sum of the data, representing the composite cycle. While the composite cycle is not the dominant cycle in the sense of the greatest power spectrum peak from the FFT, the composite nevertheless represents the entire dataset. It is important to realize that PCA cannot de-convolute a dataset to extract PC score vectors that reflect the underlying signals; rather, PCA can only extract shared information (correlation, covariance) between multiple time series. In conclusion, researchers should know specifically whether covariance or correlation is used for eigendecomposition, and whether the standardized or non-standardized PC scores are generated by the algorithms employed for running PCA on time series. The choices made surrounding these options can have a profound effect on results, especially if dimension reduction is being performed to collapse data for further analyses.
Principal components analysis (PCA) is one of the most popular linear dimension reduction techniques. During PCA, the PC scores are linear combinations of the input features, weighted by the eigenvectors extracted during eigendecomposition of a covariance matrix or correlation matrix.
By definition, the population covariance matrix is

$$\boldsymbol{\Sigma} = E\!\left[(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top\right] \qquad (1)$$

where $\mathbf{x}_i$ is a $p$-tuple for the $i$th object and $\boldsymbol{\mu}$ is a $p$-tuple with means for the $p$ features. Because it is impossible to measure $\boldsymbol{\Sigma}$, we are limited to working with the sample covariance matrix, given as

$$\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top. \qquad (2)$$
Let $z_{ij} = (x_{ij} - \bar{x}_j)/s_j$ be the standardized value of the $j$th feature for the $i$th object; the sample correlation matrix is

$$\mathbf{R} = \frac{1}{n-1}\,\mathbf{Z}^\top\mathbf{Z}. \qquad (3)$$
During computation, elements of $\mathbf{R}$ can be determined individually using the relationship

$$r_{jk} = \frac{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k)}{\sqrt{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}\,\sqrt{\sum_{i=1}^{n}(x_{ik}-\bar{x}_k)^2}}. \qquad (4)$$
If the means of the columns of $\mathbf{X}$ are known, covariance elements of $\mathbf{S}$ can be calculated directly using the functional form

$$s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k), \qquad (5)$$
whereas, if the standard deviations and correlations are already known, the covariance elements can be calculated using

$$s_{jk} = r_{jk}\, s_j\, s_k. \qquad (6)$$
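The relationships in Eqs. (2)–(6) can be checked numerically. The sketch below uses NumPy on an arbitrary illustrative data matrix; it is not the authors' NXG Explorer code, and the array names and dimensions are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))              # n objects (rows) x p features (columns)

xbar = X.mean(axis=0)                    # feature means
s = X.std(axis=0, ddof=1)                # feature standard deviations

# Eq. (2)/(5): sample covariance matrix S
S = (X - xbar).T @ (X - xbar) / (n - 1)

# Eq. (3)/(4): sample correlation matrix R from the z-scores Z
Z = (X - xbar) / s
R = Z.T @ Z / (n - 1)

# Eq. (6): covariance elements recovered from correlations and standard deviations
S_from_R = R * np.outer(s, s)

print(np.allclose(S, np.cov(X, rowvar=False)))        # True
print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # True
print(np.allclose(S, S_from_R))                       # True
```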
It is assumed that the sample covariance matrix $\mathbf{S}$ is an accurate and reliable estimate of the population covariance matrix $\boldsymbol{\Sigma}$. When the sample size is infinitely large, there are fewer problems with this assumption, but as $p$ approaches $n$, the eigenvalues of $\mathbf{S}$ become biased: the small eigenvalues are too small and the large eigenvalues are too large. Moreover, when $p > n$, several of the eigenvalues become zero, so $\mathbf{S}$ becomes singular by losing full rank and therefore is not positive definite and cannot be inverted.
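A small illustration of this rank problem, under assumed dimensions ($p = 50$ features, $n = 20$ objects), is sketched below.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                       # more features than objects
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)         # 50 x 50 sample covariance matrix

evals = np.linalg.eigvalsh(S)       # eigenvalues in ascending order
print(np.linalg.matrix_rank(S))     # at most n - 1 = 19, far less than p = 50
print(np.sum(evals > 1e-10))        # count of non-null eigenvalues (also ~19)
```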
Eigendecomposition of the correlation matrix, $\mathbf{R}$. Correlation-based PCA (RPCA) using the correlation matrix, $\mathbf{R}$, is the traditional form of PCA. Let $\mathbf{R}$ be the $p \times p$ feature-by-feature correlation matrix. By the Principal Axis Theorem, there exists a rotation matrix $\mathbf{E}$ and a diagonal matrix $\boldsymbol{\Lambda}$ such that $\mathbf{R} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^\top$. Columns of $\mathbf{E}$ are the eigenvectors $\mathbf{e}_j$ and the diagonal entries of $\boldsymbol{\Lambda}$ are the eigenvalues $\lambda_j$. NXG Explorer uses the QR algorithm for eigendecomposition of a real square symmetric matrix. The QR algorithm returns the eigenvalues of $\mathbf{R}$ sorted in descending order ($\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$) as well as the accompanying eigenvector matrix $\mathbf{E}$ with its columns sorted in accordance with the eigenvalues. We also enforce the sum of each $j$th column of $\mathbf{E}$ to be positive, that is, $\sum_k e_{kj} > 0$, by simply reversing the sign of each element if $\sum_k e_{kj} < 0$. The non-standardized principal component score, "PC score," for the $i$th object and $j$th principal component when the covariance matrix is used for eigendecomposition is
$$f_{ij} = \sum_{k=1}^{p}\left(x_{ik} - \bar{x}_k\right) e_{kj}, \qquad (7)$$

which translates to the matrix operation

$$\mathbf{F} = (\mathbf{X} - \bar{\mathbf{X}})\,\mathbf{E}. \qquad (8)$$
Thus, for an $n \times p$ input data matrix $\mathbf{X}$, there will be an $n \times p$ matrix $\mathbf{F}$, which contains the principal components (PCs) in columns and the objects in rows. For each object there will be $p$ PC scores in each row of $\mathbf{F}$, and in each column of $\mathbf{F}$ there will be $n$ PC scores, one for each object. We have observed that the non-standardized PC scores are equivalent to the PCA output of most software packages and to the scores reported in the majority of textbooks.
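A minimal sketch of Eqs. (7)–(8) follows, assuming an $n \times p$ NumPy array `X`. NumPy's `eigh` stands in here for the QR algorithm mentioned above; the descending sort and positive-column-sum convention follow the description in the text.

```python
import numpy as np

def pc_scores_cov(X):
    """Non-standardized PC scores from the covariance matrix (Eqs. 7-8)."""
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                    # Eq. (2)
    evals, E = np.linalg.eigh(S)                   # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]                # re-sort in descending order
    evals, E = evals[order], E[:, order]
    signs = np.where(E.sum(axis=0) < 0, -1.0, 1.0)
    E = E * signs                                  # enforce positive column sums
    F = (X - xbar) @ E                             # Eq. (8): F = (X - Xbar) E
    return F, E, evals
```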
When the correlation matrix is used, the non-standardized PC scores are based on the relationship

$$f_{ij} = \sum_{k=1}^{p} z_{ik}\, e_{kj}, \qquad (9)$$

which in matrix notation is

$$\mathbf{F} = \mathbf{Z}\,\mathbf{E}. \qquad (10)$$
Note that, although the mean-zero standardized $z$-scores are used when eigendecomposition is performed on the correlation matrix, this does not result in standardized PC scores. Standardizing PC scores depends on rescaling with the eigenvalues, which is done through the loadings, introduced below.
After removing eigenvectors (columns) of $\mathbf{E}$ that correspond to null eigenvalues or to eigenvalues with $\lambda_j < 1$, the final number of valid columns will be $m \leq p$, and the PC score matrix is now determined as

$$\mathbf{F} = \mathbf{Z}\,\mathbf{E}^{*}, \qquad (11)$$

where $\mathbf{E}^{*}$ is the $p \times m$ matrix of retained eigenvectors.
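A sketch of Eqs. (9)–(11), including the eigenvalue filter, is shown below; again, NumPy's `eigh` is an assumption standing in for the QR algorithm, and the threshold of 1 follows the text.

```python
import numpy as np

def pc_scores_corr(X, eig_min=1.0):
    """Non-standardized PC scores from the correlation matrix (Eqs. 9-11)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # z-scores
    R = np.corrcoef(X, rowvar=False)
    evals, E = np.linalg.eigh(R)
    order = np.argsort(evals)[::-1]
    evals, E = evals[order], E[:, order]
    signs = np.where(E.sum(axis=0) < 0, -1.0, 1.0)
    E = E * signs
    keep = evals > eig_min                        # drop null eigenvalues and lambda < 1
    E_star, evals_star = E[:, keep], evals[keep]  # E* has dimensions p x m
    F = Z @ E_star                                # Eq. (11): F = Z E*
    return F, E_star, evals_star
```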
Loading matrix, $\mathbf{L}$. Standardized PC scores are scaled by the inverse square root of the eigenvalue, $1/\sqrt{\lambda_j}$, for each eigenvector. In order to introduce the eigenvalues into the calculations, the loadings are used, which represent the correlation between each input feature and the given PC. The loading matrix $\mathbf{L}$ is a square non-symmetric correlation matrix with elements $\ell_{jk}$ representing the correlation between each of the $p$ features and each PC in the columns of $\mathbf{F}$. The loading for the $j$th feature and $k$th PC is given by

$$\ell_{jk} = e_{jk}\sqrt{\lambda_k}, \qquad (12)$$

which, in terms of matrix operations, is

$$\mathbf{L} = \mathbf{E}\,\boldsymbol{\Lambda}^{1/2}. \qquad (13)$$
It is important to remember that the loading matrix $\mathbf{L}$ is a square non-symmetric $p \times p$ matrix which has the same dimensions as the covariance matrix $\mathbf{S}$ or correlation matrix $\mathbf{R}$. However, if principal components are only retained for the eigenvalues that are greater than 1, i.e., $\lambda_j > 1$, then $\mathbf{L}$ will have dimensions $p \times m$. Operationally, this is typically the case, as most of the correlation loads on the first few PCs when there is moderate to strong multicollinearity. However, if most of the features are not correlated, then $m$ approaches $p$.
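Eqs. (12)–(13) amount to scaling each eigenvector column by the square root of its eigenvalue; a one-function sketch (assuming `E` and `evals` come from an eigendecomposition such as the one above) is:

```python
import numpy as np

def loadings(E, evals):
    """Loading matrix L = E Lambda^(1/2), per Eqs. (12)-(13)."""
    # each column k of E is scaled by sqrt(lambda_k); tiny negatives clipped for safety
    return E * np.sqrt(np.clip(evals, 0.0, None))
```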
If we assume that eigendecomposition was performed on the covariance matrix $\mathbf{S}$, once the loading matrix is obtained, the standardized PC scores are determined using the relationship

$$\hat{\mathbf{F}} = (\mathbf{X} - \bar{\mathbf{X}})\,\mathbf{S}^{-1}\mathbf{L}, \qquad (14)$$

where $\mathbf{L}$ is the loading matrix that reveals the correlation between each feature and each PC. This is really a regression problem, for which the goal is to estimate the coefficients $\mathbf{B}$ in $\mathbf{F} = \mathbf{X}\mathbf{B}$, where $\mathbf{F}$ are the PC scores and $\mathbf{X}$ is the observable data. We have

$$\hat{\mathbf{F}} = \bar{\mathbf{F}} + (\mathbf{X} - \bar{\mathbf{X}})\,\mathrm{Cov}(\mathbf{X})^{-1}\,\mathrm{Cov}(\mathbf{X},\mathbf{F}). \qquad (15)$$

For normally distributed data the mean vector drops out, and since the covariance between the factor scores and the data is simply the loadings, for which $\mathrm{Cov}(\mathbf{X},\mathbf{F}) = \mathbf{L}$, we get

$$\hat{\mathbf{F}} = (\mathbf{X} - \bar{\mathbf{X}})\,\mathbf{S}^{-1}\mathbf{L}. \qquad (16)$$

Now, since we know the loading matrix is $\mathbf{L} = \mathbf{E}\boldsymbol{\Lambda}^{1/2}$, and by replacing covariance with correlation, $\mathbf{R}^{-1} = \mathbf{E}\boldsymbol{\Lambda}^{-1}\mathbf{E}^\top$, and moving $\boldsymbol{\Lambda}^{-1}$ on the left to the right side we have

$$\hat{\mathbf{F}} = \mathbf{Z}\,\mathbf{R}^{-1}\mathbf{L} = \mathbf{Z}\,\mathbf{E}\boldsymbol{\Lambda}^{-1}\mathbf{E}^\top\mathbf{E}\boldsymbol{\Lambda}^{1/2} = \mathbf{Z}\,\mathbf{E}\,\boldsymbol{\Lambda}^{-1/2}. \qquad (17)$$
Therefore, while most textbooks and software packages define $\mathbf{F} = \mathbf{Z}\mathbf{E}$ as the non-standardized PC scores, the use of the loading matrix $\mathbf{L}$ to obtain standardized PC scores equates to using $\hat{\mathbf{F}} = \mathbf{Z}\mathbf{E}\boldsymbol{\Lambda}^{-1/2}$, essentially the non-standardized PC scores divided (scaled) by $\sqrt{\lambda_j}$. By default, NXG Explorer uses the standardized PC scores for 2D score plots of PC1 vs. PC2, since their scales are somewhat smaller and distributed more like the standard normal distribution.
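The equivalence between the regression form and Eq. (17) can be verified numerically; the sketch below uses arbitrary full-rank illustrative data so that $\mathbf{R}$ is invertible.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))                        # illustrative full-rank data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

evals, E = np.linalg.eigh(R)
order = np.argsort(evals)[::-1]
evals, E = evals[order], E[:, order]

L = E * np.sqrt(evals)                               # Eq. (13): L = E Lambda^(1/2)
F_reg = Z @ np.linalg.inv(R) @ L                     # regression form, Z R^-1 L
F_std = Z @ E / np.sqrt(evals)                       # Eq. (17): Z E Lambda^(-1/2)
print(np.allclose(F_reg, F_std))                     # True
```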
| Eigendecomposition performed on | PC scores | Matrix operation |
| --- | --- | --- |
| Covariance matrix, $\mathbf{S}$ | Non-standardized | $\mathbf{F} = (\mathbf{X} - \bar{\mathbf{X}})\,\mathbf{E}$ |
| | Standardized | $\hat{\mathbf{F}} = (\mathbf{X} - \bar{\mathbf{X}})\,\mathbf{E}\,\boldsymbol{\Lambda}^{-1/2}$ |
| Correlation matrix, $\mathbf{R}$ | Non-standardized | $\mathbf{F} = \mathbf{Z}\,\mathbf{E}$ |
| | Standardized | $\hat{\mathbf{F}} = \mathbf{Z}\,\mathbf{E}\,\boldsymbol{\Lambda}^{-1/2}$ |
There are a few rules of thumb which must be taken into consideration when applying PCA to time series. Firstly, each time series, $\mathbf{x}_j$, must represent a single feature or variable; that is, the objects represent the values of a given time series at each time interval, and the feature itself represents the entire time series. PCA will not return meaningful results if a single time series is spread across multiple features. With regard to data formatting, this means that, if columns represent features or variables, and rows represent objects, then each time series forms a single column. Essentially, each time series represents an entire feature whose values are spread down the rows.
After setting up the data, the data matrix $\mathbf{X}$ will have dimensions $n \times p$, where $n$ is the number of time periods (rows) and $p$ is the number of time series being evaluated (columns).
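As a concrete layout sketch (with hypothetical wavelengths), each series occupies one column of the data matrix:

```python
import numpy as np

n = 500
t = np.arange(1, n + 1)
# four hypothetical series, one per wavelength; each becomes a column of X
series_list = [np.sin(2 * np.pi * t / wl) for wl in (50, 100, 125, 250)]
X = np.column_stack(series_list)
print(X.shape)          # (500, 4): n time periods (rows) x p time series (columns)
```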
Readers should also note that in financial engineering, PCA has been performed on both the assets and the days. For example, assume the 30 stocks in the Dow 30 and 1,000 daily price returns (4 years of trading, i.e., 250 trading days/year). The correlation (covariance) matrix can be computed on the 30 stocks (30 features), with the daily returns used as the objects (time series); or, the correlation (covariance) matrix can be computed on the 1,000 days (1,000 features), with the 30 stocks used as objects, but this is not PCA on time series. The point is that in financial engineering, researchers use PCA on assets to understand how each asset correlates with ("loads on") each PC, and they use PCA on days to identify hidden groupings of days, which can be viewed as a type of clustering of days.
For simulation purposes, we initially chose to use straightforward sine waves for the PCA analysis. We assume that the time series, $x(t)$, can be strictly based on a sine wave defined as

$$x(t) = A\sin\!\left(\frac{2\pi t}{\lambda} + \phi\right), \qquad (18)$$

where $t$ is the time or interval, $A$ is the amplitude, $f = 1/\lambda$ is the frequency, where $\lambda$ is the wavelength (period), and $\phi$ is the phase shift. For a fixed frequency, it is easier to use the wavelength (period) directly in the equation for $x(t)$ by simply dividing $2\pi t$ by $\lambda$, then adding the phase shift $\phi$ before taking the sine, and then multiplying the result by the amplitude $A$. A small amount of noise was also added to each value of $x(t)$ so that we could simulate a number of features having the same underlying sine wave, but with a small amount of noise added. This was accomplished by simply adding a small noise value based on $u$ to each $x(t)$ value, where $u$ is a random uniform variate in the range [0, 1].
For each experimental setup, we used a set of four underlying sine waves (without noise), each having a different amplitude, wavelength, or phase shift. For each of the four underlying sine waves, we simulated a set of 25 time series (features), each of which had the same parameters (amplitude, wavelength, shift) but with a small amount of noise added. Therefore, within each set of 25 time series based on the same underlying sine wave, the time series were not exactly the same because of the added noise. The total number of features (time series, or sine waves) was therefore $p = 4 \times 25 = 100$. We also set $n = 500$, so that $t = 1, 2, \ldots, 500$, and the final dataset consisted of 100 time series each having 500 intervals. It was important not to use a wholly different sine wave for each simulated time series, as this would not enable comparison between the PC scores derived from the simulated data and the unique underlying pattern within each set of 25 time series.
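A sketch of this simulation layout is given below; the specific amplitudes, wavelengths, phases, and noise scale are illustrative placeholders rather than the values used for the eight patterns.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 500, 25
t = np.arange(1, n + 1)

# (amplitude A, wavelength lam, phase phi) for each of the 4 underlying waves;
# these values are placeholders, not the settings used for the eight patterns
params = [(1.0, 50, 0.0), (1.0, 100, 0.0), (1.0, 125, 0.0), (1.0, 250, 0.0)]
noise_scale = 0.1                                    # assumed noise scale

columns = []
for A, lam, phi in params:
    base = A * np.sin(2 * np.pi * t / lam + phi)     # Eq. (18), noise-free
    for _ in range(reps):
        columns.append(base + noise_scale * rng.uniform(size=n))
X = np.column_stack(columns)
print(X.shape)                                       # (500, 100)
```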
The following notation was used to represent each set of 25 similar time series: X1 represented time series 1–25, X2 represented time series 26–50, X3 represented time series 51–75, and X4 represented time series 76–100. For plotting purposes, we also added together and multiplied together all of the time series, and plotted the sum and product with the labels SUM(X1:X4) and PROD(X1:X4) in each plot. (Note that the sum and product of all signals were not used in PCA, but only in plotting.) The various simulation setups and their results are presented in the following paragraphs.
Same wavelength, no amplitude or phase. What happens when all 100 simulated time series are similar, with high multicollinearity? This was simulated using the same underlying wavelength, without amplitude changes or a phase shift:

$$x_k(t) = \sin\!\left(\frac{2\pi t}{\lambda}\right), \qquad k = 1, \ldots, 4. \qquad (19)$$
When all 100 signals were similar and had the same underlying wavelength, only 1 PC was extracted, which appeared to represent the signal exactly. This is expected when all features are highly correlated: all of the features load on one PC, and the variance explained is based on only the first principal eigenvalue, $\lambda_1$. Quite naturally, the single PC also coincided with the composite sum of the signals, which is expected.
As shown below, the non-standardized PC scores are much larger than the standardized scores, since they are not scaled by $\boldsymbol{\Lambda}^{-1/2}$:
Use of the correlation matrix for eigendecomposition resulted in similar PC scores when compared with covariance:
Again, the non-standardized PC scores display too large a scale when reflecting patterns among the signals:
Same wavelength, different amplitude, no phase. Next, we used the same underlying wavelength for all 100 simulated time series, but used a changing amplitude for each set of 25:

$$x_k(t) = A_k \sin\!\left(\frac{2\pi t}{\lambda}\right), \qquad k = 1, \ldots, 4. \qquad (20)$$
This resulted in a finding similar to the previous experiment, where the features were still highly correlated after the introduction of changing amplitude, but with the same wavelength. Obviously, with such high multicollinearity there would be only 1 PC extracted from the covariance matrix.
Same wavelength, different phase. For this simulation, for each set of 25 time series with noise, we varied the phase but used the same wavelength:

$$x_k(t) = \sin\!\left(\frac{2\pi t}{\lambda} + \phi_k\right), \qquad k = 1, \ldots, 4. \qquad (21)$$
The PCA results when there was only a phase shift for the same wavelength indicate that only two PCs were extracted, as shown in the image below:
Same wavelength, different phase, with amplitude change. Next, we added an amplitude change for each set of 25 time series and retained the phase shift:

$$x_k(t) = A_k \sin\!\left(\frac{2\pi t}{\lambda} + \phi_k\right), \qquad k = 1, \ldots, 4. \qquad (22)$$
Results indicate that still only 2 PCs were extracted, as shown in the image below:
Different wavelength, no phase or amplitude. Here we only employed a different wavelength for each set of 25 simulated time series:

$$x_k(t) = \sin\!\left(\frac{2\pi t}{\lambda_k}\right), \qquad k = 1, \ldots, 4. \qquad (23)$$
These results suggest that when only the wavelength is changed, 4 PCs are extracted, implying a greater degree of orthogonality (non-correlation) between the sets of 25 simulated signals. In addition, the PC score vectors appear as composite signals representing a summation of the 4 cycles. It is also interesting to note that, even though the signals themselves have no difference in scale (amplitude), the PCs do introduce scale differences.
The non-standardized PC scores display too large a scale when reflecting patterns among the signals:
Use of the correlation matrix for eigendecomposition resulted in a PC score vector (i.e., PC3) which tracked with the composite sum of the signals. Although PCs cannot de-convolute the data into the constituent individual cycles, one of them does nevertheless reflect the composite cycle (sum) of the data:
Again, the non-standardized PC scores display too large a scale when reflecting patterns among the signals:
Different wavelength, different amplitude, without phase. Using different wavelengths, we now add amplitude changes:

$$x_k(t) = A_k \sin\!\left(\frac{2\pi t}{\lambda_k}\right), \qquad k = 1, \ldots, 4. \qquad (24)$$
When amplitude and wavelength changes are introduced, the resulting PCs seem to remove the scale and tend to represent the original signals more closely when compared with the previous results, in which the signals had no change in scale (amplitude).
Like before, the non-standardized PC scores display too large a scale:
When amplitude is introduced to the varying wavelengths, unlike before, correlation was not able to extract a PC score vector which tracked with the composite sum of the signals:
The non-standardized PC scores, again, displayed too large a scale:
Purely negative correlation ($r = -1$), phase shift = $\pi$. Purely negative correlation, that is, a correlation coefficient of $-1$, was invoked by introducing a phase shift of $\pi$ in half of the time series:

$$x_1(t) = x_2(t) = \sin\!\left(\frac{2\pi t}{\lambda}\right), \qquad x_3(t) = x_4(t) = \sin\!\left(\frac{2\pi t}{\lambda} + \pi\right). \qquad (25)$$
Under negative correlation, when the signals cancel each other out, only 1 PC vector is obtained.
Zero correlation ($r = 0$), phase shift = $\pi/2$. Zero correlation was introduced by using a phase shift of $\pi/2$ for half of the time series:

$$x_1(t) = x_2(t) = \sin\!\left(\frac{2\pi t}{\lambda}\right), \qquad x_3(t) = x_4(t) = \sin\!\left(\frac{2\pi t}{\lambda} + \frac{\pi}{2}\right). \qquad (26)$$
When the correlation was zero and the two signals were orthogonal, as expected, 2 PC vectors were obtained.
It is important to realize that PCA will not de-convolute a signal the way the FFT will, or at least pull out something like power spectrum values that reveal the major wavelengths of the dominant cycles. When there are mixtures of various signals in a dataset, PCA will determine the degree of correlation between the signals and return orthogonal signals (PC score vectors) based on the unique correlation patterns identified. But PCA will not return PC score vectors for the original underlying signals in the data the way the FFT will; rather, the returned signals will be based on collapsed information tied to correlation. For example, if there are two underlying signals, one based on sine and one on cosine, and they have a small degree of correlation between them, then the PC score vectors that PCA returns will be based on the shared information between the two signals, whereas the FFT will provide the wavelengths of the original two signals independent of their correlation. In summary, the important issue with PCA is that it can only return PC score vectors based on shared information between signals, and nothing regarding the original unique signals in a dataset.
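This contrast can be illustrated numerically. The sketch below (with hypothetical wavelengths of 50 and 125 and an illustrative noise scale) shows the FFT recovering the individual wavelengths from the composite, while the correlation-matrix eigenvalues only show the variance collapsing onto roughly two orthogonal score vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
t = np.arange(n)
sig1 = np.sin(2 * np.pi * t / 50)            # hypothetical wavelength 50
sig2 = np.cos(2 * np.pi * t / 125)           # hypothetical wavelength 125
X = np.column_stack([sig1 + 0.1 * rng.uniform(size=n) for _ in range(25)] +
                    [sig2 + 0.1 * rng.uniform(size=n) for _ in range(25)])

# FFT of the composite (sum of all series): peaks fall at the underlying wavelengths
power = np.abs(np.fft.rfft(X.sum(axis=1)))**2
freqs = np.fft.rfftfreq(n, d=1.0)
top = freqs[np.argsort(power[1:])[::-1][:2] + 1]     # skip the DC component
print(1.0 / top)                                     # approx. 50 and 125

# PCA on the same matrix: the eigenvalues show variance collapsing onto ~2 PCs,
# but the score vectors themselves are shared-information composites, not sig1/sig2
R = np.corrcoef(X, rowvar=False)
evals = np.linalg.eigvalsh(R)
print(np.sort(evals)[::-1][:3])
```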
While some reports have suggested that the covariance matrix must always be used for eigendecomposition of time series, our results indicate that the use of correlation typically only introduces a small phase shift in the PC vectors, which may or may not be important for more complex data. Irrespective of the phase shift in PC vector values introduced by correlation, when performing dimension reduction the difference between using covariance and correlation for eigendecomposition would likely not cause a strong bias. An important finding when using correlation for eigendecomposition was that one of the extracted PC vectors tracked closely with the composite (sum) of all the cyclical data. Although the composite cycle (sum of cycles) is not the dominant cycle in terms of an FFT power spectrum, this observation is important since the composite signal represents the entire dataset.
Lastly, the standardized (scaled) PC scores based on a least squares fit with the loading matrix should be used, because their scale is much smaller than that of the non-standardized PC scores, which are not scaled by the eigenvalues. One of the advantages of using the standardized PC scores based on the loading matrix is that columns of the loading matrix will be automatically zeroed out for any eigenvalues which are zero, since the loadings are defined as $\ell_{jk} = e_{jk}\sqrt{\lambda_k}$. In contrast, when raw PC scores are used, non-informative columns can remain in the eigenvector matrix, and these have to be zeroed out manually for any zero eigenvalues or when the PCs are filtered for, e.g., $\lambda_j > 1$.
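A short sketch of this point (assuming `E` and `evals` come from an eigendecomposition as above) shows that loading columns for zero eigenvalues vanish automatically, while eigenvector columns must be filtered by hand:

```python
import numpy as np

def filtered_loadings(E, evals, eig_min=1.0):
    """Loadings zero out automatically for null eigenvalues; eigenvectors do not."""
    L = E * np.sqrt(np.clip(evals, 0.0, None))   # columns with lambda = 0 become zero
    keep = evals > eig_min                       # manual filter needed for raw E
    return L, E[:, keep]
```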