Transformation of Data Science into Data Services - The Assembly of Parts and Manufacture of Nothing
If you look into the latest developments in deep convolutional neural networks (CNN), you will observe that the leading groups have been winners of the annual ILSVRC (ImageNet Large Scale Visual Recognition Challenge). Artificial neural network (ANN) design among the most recent winners incorporated padding, filters, stride, dropout regularization, etc. in order to classify more than a million high-resolution RGB images of cats, pumpkins, airplanes, cars, mushrooms, and so on into 1,000 classes using a CNN with millions of connection weights optimized on GPU servers. So essentially, over the last decade, computer scientists at the University of Toronto, Stanford, Google Brain, and elsewhere have been going down a rabbit hole of trying to develop a better "mouse trap" that yields fewer false positives when predicting classes for millions of images. While such work is potentially commercializable for music and voice recognition, facial recognition, hand-written character recognition, and even identification of geophysical oil reserves, the workflow has nothing to do with classifier diversity.
First and foremost, data scientists are losing sight of the fact that, in the universe of all cost functions, there is no one best classifier (the No Free Lunch Theorem). In spite of this, the assumption has been that there is no alternative to using a deep CNN with more than a dozen input, convolutional, ReLU, pooling, and output layers, essentially building another "Great Pyramid at Giza" in order to optimize the objective function. ANNs are notoriously "feature hungry" and won't learn unless there are ample informative features, and compared with other classifiers they are quite expensive in Big-O time complexity. (SVMs, by the way, are "object hungry" and won't learn unless there is an adequate supply of objects.) The CNNs employed for ILSVRC, however, extend much further than the typical multiple-hidden-layer feed-forward back-propagation architecture and employ more than a dozen independent learning networks that draw from various regions of each masked input image. In fact, deep CNNs go far beyond the typical efficient-backprop weight initialization on the scale of $1/\sqrt{n}$, where $n$ is the number of fan-in nodes into a hidden node, and the use of an initial learning rate of $\eta = 1/\lambda_{\max}$, where $\lambda_{\max}$ is the principal eigenvalue of the covariance matrix of the input features. There is, however, a tendency to constrain the weights leading into each hidden node by the max-norm, i.e., $\|\mathbf{w}\|_2 \le c$, which has been shown to improve the results of dropout regularization (a minimal sketch of these heuristics follows the layer list below). Overall, the deep CNNs used for image classification stack network upon network, chaining 2D-by-2D-by-3D arrays through 10-15 layers to arrive at class-prediction results. Examples of the types of layers used in CNNs include:
1. Data layer
2. Gaussian blur layer
3. Bed of nails layer
4. Image resizing layer
5. RGB to YUV layer
6. RGB to L*a*b* layer
7. Local response/contrast normalization layers
8. Block sparse convolution layer
9. Random sparse convolution layer
10. Convolution layer
11. Locally-connected layer with unshared weights
12. Fully-connected layer
13. Local pooling layer
14. Neuron layer
15. Elementwise sum layer
16. Elementwise max layer
17. Softmax layer
18. Logistic regression cost layer
19. Sum-of-squares cost layer
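To make the initialization and constraint heuristics cited above concrete, here is a minimal sketch in plain Python/numpy of the three conventions: fan-in weight initialization at a scale of $1/\sqrt{n}$, an initial learning rate taken as the reciprocal of the principal eigenvalue of the input covariance matrix, and a max-norm constraint on the weights feeding each hidden node. The toy data, the array shapes, the bound c = 3.0, and the helper name apply_max_norm are illustrative assumptions, not part of any particular CNN library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1,000 objects with 64 input features.
X = rng.normal(size=(1000, 64))

# 1. Efficient-backprop-style initialization: weights into each hidden node
#    are drawn at a scale of 1/sqrt(n), where n is the number of fan-in nodes.
n_in, n_hidden = X.shape[1], 32
W = rng.uniform(-1.0, 1.0, size=(n_in, n_hidden)) / np.sqrt(n_in)

# 2. Initial learning rate taken as 1 / lambda_max, where lambda_max is the
#    principal eigenvalue of the covariance matrix of the input features.
cov = np.cov(X, rowvar=False)
lambda_max = np.linalg.eigvalsh(cov)[-1]   # eigvalsh returns ascending order
eta = 1.0 / lambda_max

# 3. Max-norm constraint: after each weight update, rescale any column of W
#    (the weights leading into one hidden node) whose L2 norm exceeds c.
def apply_max_norm(W, c=3.0):
    norms = np.linalg.norm(W, axis=0)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

W = apply_max_norm(W)
print(f"initial learning rate eta = {eta:.4f}, "
      f"max column norm = {np.linalg.norm(W, axis=0).max():.3f}")
```

The value c = 3.0 is simply a typical choice from the dropout literature; in practice it is tuned like any other hyperparameter.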
With regard to regularization, you don't see much interest in MacKay-type Bayesian regularization, mostly because the Hessian is nearly degenerate and computing its trace for a network with millions of weights is impractical. In summary, deep CNNs are not really a single classifier, but rather a highly structured series of hierarchical, class-predictive gradient-descent machines.
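For perspective on why the Hessian is the bottleneck, the sketch below is a hypothetical illustration of a MacKay-style evidence-framework update: the effective number of parameters gamma is estimated from the eigenvalues of the scaled Hessian of the data error, and the hyperparameters alpha and beta are re-estimated from gamma. All names, shapes, and values are made up for illustration; the point is that the eigendecomposition (or trace) step is trivial for 50 weights and hopeless for the tens of millions of weights in a deep CNN.

```python
import numpy as np

def mackay_update(H_data, E_D, E_W, alpha, beta, n_obs):
    """One evidence-framework update of the regularization hyperparameters
    (alpha, beta), given the Hessian H_data of the data-error term, the
    current data error E_D, and the weight penalty E_W.  Feasible only when
    H_data is small enough to eigendecompose, which is the step that breaks
    down for deep CNNs."""
    lam = np.linalg.eigvalsh(beta * H_data)   # eigenvalues of the scaled Hessian
    gamma = np.sum(lam / (lam + alpha))       # effective number of parameters
    alpha_new = gamma / (2.0 * E_W)
    beta_new = (n_obs - gamma) / (2.0 * E_D)
    return alpha_new, beta_new, gamma

# Tiny illustrative call with a random positive-semidefinite "Hessian"
# for a hypothetical 50-weight network.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 50))
H = A @ A.T / 50.0
alpha, beta, gamma = mackay_update(H, E_D=5.0, E_W=2.0,
                                   alpha=0.01, beta=1.0, n_obs=1000)
print(f"gamma = {gamma:.1f}, alpha = {alpha:.4f}, beta = {beta:.4f}")
```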
Back to classifier diversity, i.e., kappa vs. error. If CNNs are not straightforward classifiers, then their votes cannot be fused with the results of other classifiers in an ensemble. This makes them more like random forest (RF) and mixture-of-experts (MOE) classifiers, whose calculations are not amenable to fusion with other classifiers during cross validation (CV). RF develops its own bootstraps and involves (typically) 5,000 or more trained trees (Breiman said: "don't be stingy") down which the out-of-bag objects are dropped for testing. Gated MOE involves too many matrix computations, which precludes its fusion with other classifiers used for CV. Hence RF and MOE are not CV-friendly, since by design they operate quite differently from the usual CV-based classifier. In consideration of the above, when fusing together the majority votes of classifiers in an ensemble, you would need to leave out RF, MOE, and CNNs, since they are much too expensive because of their architectural design. Classifiers which cannot be fused together in an ensemble cannot be employed to address the unsolved problem of classifier diversity.
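To make the fusion being discussed concrete, here is a minimal sketch, assuming nothing beyond numpy, of majority-vote fusion across base classifiers and pairwise Cohen's kappa as a diversity measure (the quantity plotted against error in kappa-vs-error diagrams). The predictions, class counts, and helper names are hypothetical.

```python
import numpy as np
from itertools import combinations

def majority_vote(preds):
    """Column-wise majority vote over a (n_classifiers, n_objects) array
    of integer class labels."""
    n_classes = preds.max() + 1
    votes = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return votes.argmax(axis=0)

def cohens_kappa(a, b, n_classes):
    """Cohen's kappa between two label vectors: agreement corrected for chance."""
    conf = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        conf[i, j] += 1
    conf /= conf.sum()
    p_o = np.trace(conf)                          # observed agreement
    p_e = conf.sum(axis=1) @ conf.sum(axis=0)     # chance agreement
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical predictions from three base classifiers on 12 objects, 3 classes.
preds = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
                  [0, 1, 2, 0, 1, 1, 0, 2, 2, 0, 1, 2],
                  [0, 1, 0, 0, 1, 2, 1, 1, 2, 0, 2, 2]])
truth = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2])

fused = majority_vote(preds)
print("fused error rate:", np.mean(fused != truth))
for i, j in combinations(range(len(preds)), 2):
    print(f"kappa({i},{j}) = {cohens_kappa(preds[i], preds[j], 3):.2f}")
```

Low pairwise kappa with low individual error is the diverse-and-accurate corner that ensemble fusion is after; classifiers whose votes cannot be extracted per object, as argued above, simply cannot enter this calculation.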
The advantages of undertaking deep CNN development are clear: (a) tremendous commercialization potential; (b) the results inform class prediction of millions of images into 1,000 classes; (c) the work results in more complete methods; and (d) the networks are automatons, since humans could not possibly perform such large-scale image classification work. A disadvantage of deep CNNs is that everyone "and their uncle" is slapping together open source Python and numpy to build a better mouse trap. Since so many open source interpretive languages (as opposed to natively compiled languages such as C/C++ or Fortran) are being used for their development, the groups developing CNNs tend to be more like the PC "assembler" companies of the 1990s that merely assembled PC parts and sold the product as a PC. The majority of the original assembler companies were either acquired or failed, and we are now left with HP (Compaq), Dell, Alienware, Acer, Asus, MSI, etc. Aside from lucrative patents and contracts, such as the touch-pad technology on early music iPods or the hand-written character recognition used for mail sorting at the US Postal Service, assemblers of open source interpretive-language code don't develop anything. Rather, they provide services for the mining of corporate or government data, which is not data science. Another drawback of deep CNNs is that, in their current form, requiring several days of run time on cloud GPUs to classify hundreds of thousands of images, they are not very portable. They depend on large-scale local or cloud-based GPU hardware and don't scale well to single-chip laptops (that is, performing all the calculations on a 9th-generation Intel i7).
Eventually, the service providers who are experienced "assemblers" of open source interpretive languages will take on a role akin to that of an accountant or architect. Accountants have formal training in fiduciary and financial-reporting law and employ spreadsheets to provide corporate services. Architects provide design services, assemble various resources, and oversee construction of their designs, but they don't manufacture the individual components such as steel beams, toilets, light switches, and glass windows. Thousands of data science jobs are now available, and in the end these positions will merely entail the assembly of open source interpretive languages to provide corporate services. Eventually, data science will not involve the creation or manufacture of anything, but rather the assembly of open source interpretive-language code applied in the cloud, resulting in a transformation into "Data Services." Data Services will likely become the largest industry in the history of commercialization and will very likely become its own market sector. However, with explosive man-made growth comes a high risk of local and global economic avalanches.