Machine Learning
Machine Learning (ML) is a scientific discipline that is concerned with the design and development of algorithms that allow computers to learn based on data, such as from sensor data or databases. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.
By following the definition of ML, the Data Mining based on Machine Learning approach has as one of the most important feature the technique of learning. There are basically two main learning techniques: supervised and unsupervised (shown below).

In supervised machine learning we have a set of data points or observations for which we know the desired output, class, target variable or outcome. The outcome may take one of many values called classes or labels. The target variable is providing some level of supervision in that it is used by the learning algorithm to adjust parameters or make decisions that will allow it to predict labels for new data. Finally of note, when the algorithm is predicting labels of observations we call it a classifier. Some classifiers are also capable of providing a probability of a data point belonging to class in which case it is often referred to a probabilistic model or a regression - not to be confused with a statistical regression model.
The supervised learning process is:
- 1. Scale and prepare training data: First we build input vectors that are appropriate for feeding into our supervised learning algorithm.
- 2. Create a training set and a validation set by randomly splitting the universe of data. The training set is the data that the classifier uses to learn how to classify the data, whereas the validation set is used to feed the already trained model in order to get an error rate (or other measures and techniques) that can help us identify the classifier's performance and accuracy. Typically you will use more training data (maybe 80% of the entire universe) than validation data. Note that there is also cross-validation), but that is beyond the scope of this article.
- 3. Train the model. We take the training data and we feed it into the algorithm. The end result is a model that has learned (hopefully) how to predict our outcome given new unknown data.
- 4. Validation and tuning: After we've created a model, we want to test its accuracy. It is critical to do this on data that the model has not seen yet - otherwise you are cheating. This is why on step 2 we separated out a subset of the data that was not used for training. We are indeed testing our model's generalization capabilities. It is very easy to learn every single combination of input vectors and their mappings to the output as observed on the training data, and we can achieve a very low error in doing that, but how does the very same rules or mappings perform on new data that may have different input to output mappings? If the classification error of the validation set is very big compared to the training set's, then we have to go back and adjust model parameters. The model will have essentially memorized the answers seen in the training data, losing its generalization capabilities. This is called over fitting, and there are various techniques for overcoming it.
- 5. Validate the model's performance. There are numerous techniques. The model's accuracy can be improved by changing its structure or the underlying training data. If the model's performance is not satisfactory, change model parameters, inputs and or scaling, go to step 3 and try again.
- 6. Use the model to classify new data. In production. Profit!
The kinds of problems that are suited for unsupervised algorithms may seem similar, but are very different to supervised learners. Instead of trying to predict a set of known classes, we are trying to identify the patterns inherent in the data that separate like observations in one way or another. Viewed from 20 thousand feet, the main difference is that we are not providing a target variable like we did in supervised learning. This marks a fundamental difference in how both types of algorithms operate. On one hand, we have supervised algorithms which try to minimize the error in classifying observations, while unsupervised learning algorithms don't have such luxuries because there are no outcomes or labels. Unsupervised algorithms try to create clusters of data that are inherently similar. In some cases we don't necessarily know what makes them similar, but the algorithms are capable of finding these relationships between data points and group them in significant ways. While supervised algorithms aim to minimize the classification error, unsupervised algorithms aim to create groups or subsets of the data where data points belonging to a cluster are as similar to each other as possible, while making the difference between the clusters as high as possible. Another main difference is that in a clustering problem, the concept of "Training Set" does not apply in the same way as with supervised learners. Typically we have a dataset that is used to find the relationships in the data that buckets them in different clusters. We could of course apply the same clustering model to new data, but unless it is too unpractical to do so (perhaps for performance reasons), we will most certainly want to rerun the algorithm on new data as it will typically find new relationships within the data that may surface up given the new observations.
The unsupervised learning process is:
- 1. Scale and prepare raw data: As with supervised learners, this step entails selecting features to feed into our algorithm, and scaling them to build a suitable data set.
- 2. Build model: We run the unsupervised algorithm on the scaled dataset to get groups of like observations.
- 3. Validate: After clustering the data, we need to verify whether it cleanly separated the data in significant ways. This includes calculating a set of statistics on the resulting clusters (such as the within group sum of squares), as well as analysis based on domain knowledge, where you may measure how certain attributes behave when aggregated by the clusters.
- 4. Once we are satisfied with the clusters created there is no need to run the model with new data (although you can). Profit!
Moreover, in the data mining scenario, the machine learning model choice should always be accompanied by the functionality domain. To be more precise, some machine learning models can be used in a same functionality domain, because it represents the functional context in which it is performed the exploration of data. Examples of such domains are:
- Dimensional reduction;
- Classification;
- Regression;
- Clustering;
- Segmentation;
- Statistical data analysis;
- Forecasting;
- Data Mining Model Filtering;
- etc...
Functionalities: Classification
In the DAMEWARE web application resource the following data mining models are available for classification experiments:
- MLP with BP learning rule
- MLP with GA learning rule
- MLP with QNA learning rule
- SVM
What's the meaning of Classification?
Statistical classification is a procedure in which individual items are placed into groups based on quantitative information on one or more features inherent to the items (referred to as features) and based on a training set of previously labelled items. A classifier is a system that performs a mapping from a feature space X to a set of labels Y.
Basically a classifier assigns a pre-defined class label to a sample.
Formally, the problem can be stated as follows: given training data {(x_1,y_1),...,(x_n, y_n)} (where x_i are vectors) a classifier h:X->Y maps an object x in X to its classification label y in Y.
Different classification problems could arise:
- a) crispy classification: given an input pattern x (vector) the classifier returns its computed label y (scalar).
- b) probabilistic classification: given an input pattern x (vector) the classifier returns a vector y which contains the probability of y_i to be the "right" label for x. In other words in this case we seek, for each input vector, the probability of its membership to the class y_i (for each y_i).
- training, by means of a training set (INPUT: patterns and target vectors, or labels; OUTPUT: an evaluation system of some sort);
- testing, by means of a test set (INPUT: patterns and target vectors, requiring a valid evaluation system from point 1; OUTPUT: some statistics about the test, confusion matrix, overall error, bitfail error, as well as the evaluated labels);
- evaluation, by means of an unlabelled dataset (INPUT: patterns, requiring a valid evaluation systems; OUTPUT: the labels evaluated for each input pattern).
The overall error somehow integrates information about the classification goodness. However, when a data set is unbalanced (when the number of samples in different classes varies greatly) the error rate of a classifier is not representative of the true performance of the classifier.
A confusion matrix can be calculated to easily visualize the classification performance: each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is the simple way to see if the system is mixing two classes.
Optionally (some classification methods does not require it by its nature or simply as a user choice), one could need a validation procedure. Validation is the process of checking if the classifier meets some criterion of generality when dealing with unseen data. It can be used to avoid over-fitting or to stop the training on the base of an objective criterion. With “objective” we intend a criterion which is not based on the same data we have used for the training procedure. If the system does not meet this criterion it can be changed and then validated again, until the criterion is matched or a certain condition is reached (for example, the maximum number of epochs). There are different validation procedures. One can use an entire dataset for validation purposes (thus called validation set); this dataset can be prepared by the user directly or in an automatic fashion. In some cases (e.g. when the training set is limited) one could want to apply a cross validation procedure, which means partitioning a sample of data into subsets such that the analysis is initially performed on a single subset, while the other subset(s) are retained for subsequent use in confirming and validating the initial analysis. Different types of cross validation may be implemented, e.g. k-fold, leave-one-out, etc.
Summarizing we can safely state that a common classification training task involves:
- the training set to compute the model;
- the validation set to choose the best parameters of this model (in case there are "additional" parameters that cannot be computed based on training);
- the test data as the final "judge", to get an estimate of the quality on new data that are used neither to train the model, nor to determine its underlying parameters or structure or complexity of this model;
Functionalities: Regression
In the DAMEWARE web application resource the following data mining models are available for regression experiments:
- MLP with BP learning rule
- MLP with GA learning rule
- MLP with QNA learning rule
- SVM
What's the meaning of Regression?
Regression methods bring out relations between variables, especially whose relation is imperfect (i.e it has not one y for each given x).
Just as an example, the relation in a DM design team between brain weight and working capability of the members is a typical imperfect relationship (any reference is purely casual...).
The term regression is historically coming from biology in genetic transmission through generations, where for example it is known that tall fathers have tall sons, but not as tall on the average as the fathers. The trend to transmit on average genetic features, but not exactly in the same quantity, was what the scientist Galton defined as regression, more exactly regression toward the mean. This is the first item you can found through a short immersion in the topic.
But, strictly speaking it is very difficult to find a precise definition. It seems the existence of two meanings for regression, that can be addressed as data table statistical correlation (usually column averages) and as fitting of a function.
About the data table statistical correlation, let start with a very generic example: suppose to have two variables x and y, where for each small interval of x there is a distribution of corresponding y. We can always compute a summary of the y values for that interval. The summary might be for example the mean, median or even the geometric mean. Let fix the points (x_i,mean(y_i )), where x_i is the center of the ith interval and mean(y_i ) the average y for that interval.
Then the fixed points will fall close to a curve that could summarize them, possibly close to a straight line. Such a smooth curve approximates the regression curve called the regression of y on x. By generalizing the example, the typical application is when the user has a table (let say a series of input patterns coming from any experience or observation) with some correspondences between intervals of x (table raws) and some distributions of y (table columns), representing a generic correlation not well known (i.e. imperfect as introduced above) between them. Once we have such a table, we want for example to clarify or accent the relation between the specific values of one variable and the corresponding values of the other. If we want an average, we might compute the mean or median for each column. Then to get a regression, we might plot these averages against the midpoints of the class intervals.
In the practical astrophysical cases, we usually do not have continuous populations with known functional forms. But the data may be very extensive. In these cases it is possible to break one of the variables into small intervals and compute averages for each of them.
Then, without severe assumptions about the shape of the curve, essentially get a regression curve. What the regression curve does is essentially to give a, let say, big summary for the averages for the distributions corresponding to the set of x. One can go further and compute several different regression curves corresponding to the various percentage points of the distributions and thus get a more complete picture of the input data set. Of course often it is an incomplete picture for a set of distributions! But in this first meaning of regression, when the data are more sparse, we may find that sampling variation makes impractical to get a reliable regression curve in the simple averaging way described. From this assumption, it descends the second meaning of regression.
About the fitting of a function, usually it is possible to introduce a smoothing procedure, applying it either to the column summaries or to the original values of y (of course after an ordering of y values in terms of increasing x).
In other words we assume a shape for the curve describing the data, for example linear, quadratic, logarithmic or whatever. Then we fit the curve by some statistical method, often least-squares.
In practice, we do not pretend that the resulting curve has the perfect shape of the regression curve that would arise if we had unlimited data, but simply we obtain an approximation.
We intend the regression of data in terms of forced fitting of a functional form.
The real data present intrinsic conditions that make this second meaning as the official regression use case, instead of the first, i.e. curve connecting averages of column distributions.
We ordinarily choose for the curve a form with relatively few parameters and then we have to choose the method to fit it. In many manuals sometimes it might be founded a definition probably not formally perfect, but very clear: by regressing one y variable against one x variable means to find a carrier for x.
This introduce possible more complicated scenarios in which more than one carrier of data can be founded. In these cases it has the advantage that the geometry can be kept to three dimensions (with two carriers) up to n-dimensional spaces (n>3, with more than two carriers regressing input data). Clearly, both choosing the set of carriers from which a final subset is to be drawn and choosing that subset can be most disconcerting processes.
In substance we can declare a simple, important use of regression, consisting in what follows.
To get a summary of data, i.e. to locate a representative functional operator of the data set, in a statistical sense (first meaning) or via an approximated trend curve estimation (second meaning).
And a more common use of regression:
- For evaluation of unknown features hidden into the data set;
- For prediction, as when we use information from several weather or astronomical seeing stations to predict the probability of rain or the turbulence growing in the atmosphere;
- For exclusion. Usually we may know that x affects y, and one could be curious to know whether z affects y too. In this case one approach would take the effects of x out of y and see if what remains is associated with z. In practice this can be done by an iterative fitting procedure by evaluating at each step the residual of previous fitting.
Functionalities: Clustering
In the DAMEWARE web application resource the following data mining models are available for clustering experiments:
- CSOM with Multi Layer Clustering option
- GSOM with Multi Layer Clustering option
- K-MEANS through KNIME engine
What's the meaning of Clustering?
Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details (data compression), but achieves simplification.
From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system could represent a data concept in the KDD (Knowledge Discovery in Databases).
From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others.
Data mining in Astrophysics adds to clustering the complications of very large datasets with very many attributes of different types (high dimensionality). This imposes unique computational requirements on relevant clustering algorithms.
What are the properties of clustering algorithms we are concerned with in data mining?
These properties include:
- Type of attributes that the algorithm can handle;
- Scalability to large datasets;
- Ability to work with high dimensional data (multi-D parameter space, multi-wavelength, multi-epoch etc…);
- Ability to find clusters of irregular shape;
- Handling outliers;
- Time complexity (when there is no confusion, we use the term complexity);
- Data order dependency;
- Labeling or assignment (hard or strict vs. soft of fuzzy);
- Reliance on a priori knowledge and user defined parameters;
- Interpretability of results;
What's the meaning of Multi Layer Clustering?
The customized versions of SOFM (Self Organizing Feature Maps) as implemented in the DAMEWARE Suite have a typical learning which can produce a sub-dimensioning of the network, revealing an unsufficient generalization capability.
This case occurs often when the number of neurons of the SOM network (the so-called Kohonen layer) corresponds to the number of desired clusters and this number results too low to obtain a right clustering effect.
In such conditions, at the end of the training, the main effect is to obtain output reference patterns too close to each other, causing an overlap of classes (clusters).
A solution to this problem is to apply SOFM training through a MLC (Multi Layer Clustering) profile architecture. The MLC scheme has the following properties:
- it is a hierarchical structure with an upper limit of three layers;
- It offers the same basic clustering functionalities, but overcoming the single layer network sub-dimensioning problem;
- each layer can be configured to host a different clustering algorithm (but they also be made by the same one). The important structural requirement is that each upper layer (starting from the bottom) must have a number of neurons strictly less than the previous (as shown below).

Functionalities: Dimensional Reduction
In the next release of DAMEWARE web application resource the following data mining models will be available for dimensional reduction experiments:
What's the meaning of Dimensional Reduction?
Traditional statistical methods break down partly because of the increase in the number of observations, but mostly because of the increase in the number of variables associated with each observation. The dimension of the data is the number of variables that are measured on each observation.
High-dimensional datasets present many mathematical challenges as well as some opportunities, and are bound to give rise to new theoretical developments. One of the problems with high-dimensional datasets is that, in many cases, not all the measured variables are “important” for understanding the underlying phenomena of interest.
While certain computationally expensive novel methods can construct predictive models with high accuracy from high-dimensional data, it is still of interest in many applications to reduce the dimension of the original data prior to any modeling of the data.
In mathematical terms, the problem we investigate can be stated as follows: given the p-dimensional random variable x = (x1...xp), and a lower dimensional representation of it, s = (s1,..., sk) with k less or equal to p, that captures the content in the original data, according to some criterion.
The components of s are sometimes called the hidden components. Different fields use different names for the p multivariate vectors: the term variable is mostly used in statistics, while feature and attribute are alternatives commonly used in the data mining and machine learning literature.
Generally speaking, in statistics, dimensional reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.
Feature selection approaches try to find a subset of the original variables (also called features or attributes). Two strategies are filter (e.g. information gain) and wrapper (e.g. search guided by the accuracy) approaches.
Feature extraction transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in Principal Component Analysis (PCA), but many non-linear techniques also exist, like Principal Probabilistic Surfaces (PPS).
In some cases, data analysis such as regression or classification can be done in the reduced space more accurately than in the original space.
Models: Multi Layer Perceptron general architecture
This model is available in the current release of the DAMEWARE suite, associated to different learning rules, respectively:
The Multi Layer Perceptron (MLP) architecture is one of the most typical feed-forward neural network model.The term feed-forward is used to identify basic behavior of such neural models, in which the impulse is propagated always in the same direction, e.g. from neuron input layer towards output layer, through one or more hidden layers (the network brain), by combining weighted sum of weights associated to all neurons (except the input layer).

As easy to understand, the neurons are organized in layers, with proper own role.
The input signal, simply propagated throughout the neurons of the input layer, is used to stimulate next hidden and output neuron layers.
The output of each neuron is obtained by means of an activation function, applied to the weighted sum of its inputs. Different shape of this activation function can be applied, from the simplest linear one up to sigmoid, arctan or tanh (or a customized function ad hoc for the specific application).
The number of hidden layers represents the degree of the complexity achieved for the energy solution space in which the network output moves looking for the best solution.
As an example, in a typical classification problem, the number of hidden layers indicates the number of hyper-planes used to split the parameter space (i.e. number of possible classes) in order to classify each input pattern.
The activation function can be either linear or non-linear, depending on whether the network must learn a regression problem or should perform a classification.
Activation functions, for the hidden units, introduce the non-linearity into the network. Without non linearity, the hidden units would not render the NN more powerful than just the perceptrons with only input and output units (the linear function of linear functions is again a linear function). In other words, it is the non-linearity (i.e., the capability to represent nonlinear functions) that makes multilayer networks so powerful.

For the hidden units, sigmoid activation functions (for binary problems) or softmax (for multi class problem), are usually better to use instead of the threshold activation functions.
The base of the MLP is the perceptron, a type of artificial neural network invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. It can be seen as the simplest kind of feedforward neural network: a linear classifier. The perceptron is a binary classifier which maps its input x (a real-valued vector) to an output value f(x) (a single binary value) across the matrix.
The perceptron is considered the simplest kind of feed-forward neural network.
The earliest kind of neural network is a single-layer perceptron (SLP) network (see picture below), which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. In this way it can be considered the simplest kind of feed-forward network. The sum of the products of the weights and the inputs is calculated in each node, and if the value is above some threshold (typically 0) the neuron fires and takes the activated value (typically 1); otherwise it takes the deactivated value (typically -1).

Neurons with this kind of activation function are also called artificial neurons or linear threshold units, as described by Warren McCulloch and Walter Pitts in the 1940s.
A perceptron can be created using any values for the activated and deactivated states as long as the threshold value lies between the two. Most perceptrons have outputs of 1 or -1 with a threshold of 0 and there is some evidence that such networks can be trained more quickly than networks created from nodes with different activation and deactivation values.
SLPs are only capable of learning linearly separable patterns. In 1969 in a famous monograph entitled Perceptrons Marvin Minsky and Seymour Papert showed that it was impossible for a single-layer perceptron network to learn an XOR function.
Although a single threshold unit is quite limited in its computational power, it has been shown that networks of parallel threshold units can approximate any continuous function from a compact interval of the real numbers into the interval [-1,1]. So far, it was introduced the model Multi Layer Perceptron (see picture below), able to solve nonlinear problems (like XOR function).

This class of networks consists of multiple layers of computational units, usually interconnected in a feed-forward way. Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications the units of these networks apply a continuous activation function.
The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, e.g. for the sigmoidal functions.
An extension of the universal approximation theorem states that the two layers architecture is capable of universal approximation and a considerable number of papers have appeared in the literature discussing this property.
An important corollary of these results is that, in the context of a classification problem, networks with sigmoidal non-linearities and two layer of weights can approximate any decision boundary to arbitrary accuracy. Thus, such networks also provide universal non-linear discriminate functions.
More generally, the capability of such networks to approximate general smooth functions allows them to model posterior probabilities of class membership. Since two layers of weights suffice to implement any arbitrary function, one would need special problem conditions or requirements to recommend the use of more than two layers.
Furthermore, it is found empirically that networks with multiple hidden layers are more prone to getting caught in undesirable local minima.
Astronomical data do not seem to require such level of complexity and therefore it is enough to use just a double weights layer, i.e a single hidden layer.
Models: Multi Layer Perceptron trained by Back Propagation
This model, functionally related to Classification and Regression, is one of the data mining algorithms already available in the current release of the DAMEWARE suite.
Multi-layer networks (see MLP architecture) use a variety of learning techniques, the most popular being back-propagation (BP).
Here, the output values are compared with the correct answer to compute the value of some predefined error-function. By various techniques, the error is then fed back through the network. Using this information, the algorithm adjusts the weights of each connection in order to reduce the value of the error function by some small amount.
After repeating this process for a sufficiently large number of training cycles, the network will usually converge to some state where the error of the calculations is small. In this case, one would say that the network has learned a certain target function.
To adjust weights properly, one applies a general method for non-linear optimization that is called gradient descent. For this, the derivative of the error function with respect to the network weights is calculated, and the weights are then changed such that the error decreases (thus going downhill on the surface of the error function). For this reason, back-propagation can only be applied on networks with differentiable activation functions.
In general, the problem of teaching a network to perform well, even on samples that were not used as training samples, is a quite subtle issue that requires additional techniques. This is especially important for cases where only very limited numbers of training samples are available.
The risk is that the network overfits the training data and fails to capture the true statistical process generating the data.
Computational learning theory is concerned with training classifiers on a limited amount of data. In the context of neural networks a simple heuristic, called early stopping, often ensures that the network will generalize well to examples not in the training set.
Other typical problems of the back-propagation algorithm are the speed of convergence and the possibility of ending up in a local minimum of the error function. But, fortunately there are practical solutions to above problems that make back-propagation in multi-layer perceptrons the solution of choice for many machine learning tasks.

The BP algorithm (see above picture) is a supervised learning method, and is an implementation of the Delta rule. It requires a teacher that knows, or can calculate, the desired output for any given input. It is most useful for feed-forward networks (networks that have no feedback, or simply, that have no connections that loop).
The term BP is an abbreviation for "backwards propagation of errors".
Back Propagation requires that the activation function used by the artificial neurons (or "nodes") is differentiable.
Models: Multi Layer Perceptron trained by Genetic Algorithms
This model, functionally related to Classification and Regression, is one of the data mining algorithms already available in the current release of the DAMEWARE suite.
The Genetic Algorithms (GA) are computational methods inspired by the natural evolution law discovered by Darwin. They are particularly powerful to solve problems where the solution space is not well defined.
The algorithm hence consists mainly in the cyclic exploration of the parameter space, carefully going towards the best solution. In a GA each element of a population is the so-called chromosome, composed by a set of genes (features), that represents its DNA.
Each DNA is in practice one possible solution to the problem.
The starting point of the method consists in the random generation of a population of chromosomes, for example by using normal or uniform statistical distributions.
The method proceeds by performing cyclic variation and combination of the initial population, looking for the best population (best problem solution).
At each evolution stage the output chromosomes are obtained by applying several genetic operators to the input population and by evaluating through a specific fitness function the goodness of the new generated population.
The fitness function has the basic role to give a method to discard worst chromosomes from the population, achieving the evolution to the next generation of the best candidates only (exactly like Nature works with its species, following the Darwin law). Typical genetic operators are cross-over and mutation.
In the design of a GA to solve an optimization problem, three steps are considered as strategic to obtain better results in the better time:
- 1. the chromosome representation (codified in some way);
- 2. the choice of the fitness function;
- 3. the method to choose chromosome reproduction.
The choice must be driven in some way. Usual rules are the so-called roulette wheel selection and tournament selection.
The former mainly consists into the assignment of a probability to be selected for the reproduction for each chromosome. This probability is directly proportional to its fitness.
The latter is alternatively based on the random choice of a number N of chromosomes from the current population and to compare their fitness: the element with the best fitness is the winner and is selected for reproduction.
The tournament selection seems quite flexible, because it permits to introduce more variability in the selection criterion: higher N, lower the possibility to select chromosomes candidates with worst fitness!
Anyway, not always the elements with worst fitness must be considered bad choices for reproduction. They in fact, introduce more variety in the population (exactly like the jumping out from a local minima for the gradient descendent learning algorithm in the Back Propagation), and can speed up the convergence towards best solution. But their use must be taken under control in the GA.
For example by monitoring population evolution, where some chromosome has a static trend (i.e. it has always associated a good fitness but not sufficient to become a winner during reproduction, so remaining always the same through several generations). In this case, by applying any reproduction with a chromosome randomly chosen between those with poor fitness, can introduce gain in the next generations.
Usually (but not always) in the tournament selection two winners are chosen at a time (i.e. from two tournaments) and recombined to generate two sons put in the next generation. As said before, two are typical genetic operators employed in the chromosome recombination:
- The Cross-over (or unbiased chromosome cross-over) criterion, on two m-length chromosomes A and B, brakes original parents at a fixed position j (j is the number of genes or sub-string length arbitrarily fixed a priori). So the m-j genes of chromosome A are queued to the first j genes of chromosome B, while the m-j genes of chromosome B are queued to the first j genes of chromosome A.
Example:
A = 10010001, B = 00110111, m = 8 and j = 5 => after cross-over => new A = 10010111, new B = 00110001 After the cross-over application, the two sons new A and new B are obtained from their parents by reciprocally reverting last m-j (3) genes. - The mutation consists into the roughly change in only one gene inside the chromosome. Depending on the code used to represent the chromosome genes (typical is the binary code), the original value of a gene is replaced by the other character.
There are two different mutation criteria: unbiased and biased. Both criteria can be applied to an entire chromosome as well as to a single gene.
- In the single gene unbiased mutation (standard mutation type), one position inside the chromosome string is randomly selected. Then the gene related to the selected chromosome string position is replaced.
- In the unbiased chromosome mutation, the entire selected chromosome is completely replaced by a new chromosome, randomly generated. In the biased chromosome mutation, the selected chromosome is totally replaced by a new chromosome, obtained by the sum of original chromosome and a new one randomly generated (in this sense biased).

This procedure (serial application of selected genetic operators) is applied until the new population has exactly the same member number of the previous one. And the entire cycle of population generation is iterated until the chromosome with desired good fitness is founded (see picture above).
Models: Multi Layer Perceptron trained by Quasi Newton rule
This model, functionally related to Classification and Regression, is one of the data mining algorithms already available in the current release of the DAMEWARE suite.
Quasi-Newton methods (also known as variable metric methods) are algorithms (hereinafter called QNA) for finding local maxima and minima of functions.
QNA are based on Newton's method to find the stationary point of a function, where the gradient is 0. Newton's method assumes that the function can be locally approximated as a quadratic in the region around the optimum, and use the first and second derivatives (gradient and Hessian) to find the stationary point.
In QNA the Hessian matrix of second derivatives of the function to be minimized does not need to be computed. The Hessian is updated by analyzing successive gradient vectors instead.
QNA are a generalization of the secant method to find the root of the first derivative for multidimensional problems. In multi-dimensions the secant equation is under-determined, and quasi-Newton methods differ in how they constrain the solution, typically by adding a simple low-rank update to the current estimate of the Hessian.
The most common quasi-Newton algorithm is the widespread BFGS method, that was suggested independently by Broyden, Fletcher, Goldfarb, and Shanno, in 1970.
As known, the classical Newton method uses the Hessian of a function. The step of the method is defined as a product of an inverse Hessian matrix and a function gradient. If the function is a positive definite quadratic form, we can reach the function minimum in one step. In case of an indefinite quadratic form (which has no minimum), we will reach the maximum or saddle point. In short, the method finds the stationary point of a quadratic form.
In practice, we usually have functions which are not quadratic forms. If such a function is smooth, it is sufficiently good described by a quadratic form in the minimum neighborhood. However, the Newton method can converge both to a minimum and a maximum (taking a step into the direction of a function increasing).
QNA solve this problem as follows:
- they use a definite positive approximation instead of a Hessian. If Hessian is definite positive, we make the step using the Newton method. If Hessian is indefinite, we modify it to make it definite positive, and then perform a step using the Newton method. The step is always performed in the direction of the function decrement. In case of a positive definite Hessian, we use it to generate a quadratic surface approximation. This should make the convergence better. If Hessian is indefinite, we just move to where function decreases.
Up to here it seems quite simple…but it is not!
The Hessian of a function isn't always available and in many cases is too much complicated; more often we can only calculate the function gradient. But let us proceed with order.
As shown, the Back Propagation learning rule tries to adapt weights in order to minimize the error function E(w). For networks without hidden layers, the error function will be squared and will assume a multi-dimensional parabolic shape, with only one absolute minimum. But for a generic MLP, the error function will be much more complex, with more than one minimum (local minima) in which the error gradient is zero.

In these last cases it is important to distinguish between absolute and local minima. When, during the learning process, the error founds a local minimum, with the above adaption rule, the error function will not move anymore, resulting in a wrong (not absolute) minimization state.
Moreover, the descent gradient is not fast to converge. Fortunately there exist several methods to overcome these limits:
- Online Descent Gradient Optimization, in which the network weights are updated after each input pattern presentation;
- Batch Descent Gradient Optimization, in which the network weights are updated after each presentation of the whole training dataset.

Between the two approaches, the first is preferable if there is an high degree of redundancy in the training set information, otherwise the second is the best.
Both versions require that the learning rate is a priori defined in a static way. This is another important point, because an high value of learning rate causes a more instable convergence of the learning process (the error function jumps along the error surface without convergence assurance). For a learning rate too small, the convergence will result extremely slow.
The good compromise is to gradually reduce, at each learning step, the value of the learning rate (for example by following any law of
type 1/t), obtaining a faster convergence of the algorithm.
If we use a local squared approximation of error function, we can obtain an expression of minimum position.

The Quasi Newton Algorithm (QNA) which does not calculate the Hessian matrix and then its inverse, but an approximation in a series of steps.
The problem to have only one Hessian matrix not positive is solved starting from a unitary matrix (positive by definition) and by maintaining further approximations positive as well.
By using the identity matrix to initialize the procedure is equivalent to consider, step by step, the direction of the negative gradient while, at each next step, the direction is for sure a descent one. The above expression could carry the search out of the interval of validity for the squared approximation. The solution is hence to use the line search to found the minimum of function along the search direction.
There is also a slightly modified version of the QNA, known as L-QNA, or Limited memory-QNA, which has been implemented, as said above, through the L-BFGS algorithm.
The L letter in the scheme name comes from the words "Limited memory". Unlike the original BFGS method which stores a dense nxn approximation, L-BFGS stores only a few vectors that represent the approximation implicitly.
Due to its moderate memory requirement, L-BFGS method is particularly well suited for optimization problems with a large number of variables.
L-BFGS never explicitly forms or stores Hessian approximations. Instead, it maintains a history of the past m updates of the position and gradient, where generally the history m can be short, often less than 10.
These updates are used to implicitly do operations requiring the Hessian approximations vector product. While strictly, a straight-forward BFGS implementation at the i-th iteration would represent the inverse Hessian approximation as informed by all updates on 0,...,i-1, L-BFGS does quite well using updates from only the most recent iterations i-m...i-1.
In case of big dimensions, the amount of memory required to store a Hessian (NxN) is too big, along with the machine time required to process it. Therefore, instead of using N gradient values to generate a Hessian we can use a smaller number of values, which requires a memory capacity of order of NxM. In practice, M is usually chosen from 3 to 7, in difficult cases it is reasonable to increase this constant to 20. Of course, as a result we'll get not the Hessian but its approximation. On the one hand, the convergence slows down. On the other hand, the performance could even grow up.
At first sight, this statement is paradoxical. But it contains no contradictions: the convergence is measured by a number of iterations, whereas the performance depends on the number of processor's time units spent to calculate the result.
As a matter of fact, this method was designed to optimize the functions of a number of arguments (hundreds and thousands), because in this case it is worth having an increasing iteration number due to the lower approximation precision because the overheads become much lower. This is particularly useful in astrophysical data mining problems, where usually the parameter space is dimensionally huge and confused by a low signal-to-noise ratio. But we can use these methods for small dimension problems too. The main advantage of the method is scalability, because it provides high performance when solving high dimensionality problems, and it allows to solve small dimension problems too.
Models: Support Vector Machine
This model, functionally related to Classification and Regression, is one of the data mining algorithms already available in the current release of the DAMEWARE suite.
The Support Vector Machines (SVM) are a set of supervised learning methods useful in both classification and regression functional cases. Since their first definition they have been largely employed in many physical and biology real problems.
The SVM were originally defined for the classification of linearly separable object classes. For each group of objects, divided into in two classes, a SVM is able to identify the hyperplane with the highest level of separation.

In the above picture the hyperplane H1 defines the edge of +1 class, while the H2 defines the class -1. In the first case two objects are used, while for H2 three objects are needed. These circled objects are called support vectors. The core of the SVM method is exactly to found these support vectors in the data hyperplane.
Clearly the SVM are useful in those cases where a linear classifier is not able to separate classes. In these cases the object coordinates are mapped into a space called feature space, by using not linear functions, features functions phi.
The feature space is strongly multi-dimensional, in which the two classes can be separable by a linear classifier. Then the initial space is re-mapped into the new space and it is identified the classifier that is re-located in the initial space, as shown in the picture below.

So far, the function phi combines the initial space (original object features) into a feature space potentially having infinite dimension. Cause the fact that this space has many dimensions, it could not practice to use a generic function to individuate the separation hyperplane.
Special kernel functions are instead applied in several combinations to found the phi function. In the picture below the most famous kernel functions are reported.

The SVM were early extended to regression cases.
The parameter set used to train the SVM supervised network identifies a regression model represented by an hypertube with radius epsilon (called hyper parameter), fitting the data. In the ideal case, the regression performed by SVM founds a function mapping input data with a maximum standard deviation equal to epsilon, from the target value. In this case all training objects are located into the regression hypertube (picture below).

Usually it is not possible to fit all objects internally to the hypertube and maintaining a model with some realistic meaning.
Hence in the general case the regression SVM consider zero error for all objects inside the tube, while all other external objects have an error proportional to the distance from the tube edge.
Recall that SVM arises as a supervised network model to perform two-class classification (actually all implementations work up to 3-class problems).
They also have been adapted to work for regression problems. Currently there are many multi-language implementations (C++, Python, Java, Matlab and R), but we used the most famous LibSVM.
The main advantage of the SVM is that, given a generic problem and a defined kernel function, it is always solvable. But the downside is that it badly scales with the data dimension D (as D2).
Moreover it is crucial to apply the best choice for the kernel function, with the best parameters. In this case any experiment done with SVM can be considered as a never ending work.
Models: K-MEANS (through KNIME engine)
This model, functionally related to Clustering, is one of the data mining algorithms already available in the current release of the DAMEWARE suite.
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well known clustering problem.
The k-means algorithm is by far the most popular clustering tool used in scientific and industrial applications.
The name arises from representing each of k clusters Cj by the mean (or weighted average) cj of its points, the so-called centroid.
While this obviously does not work well with categorical attributes, it has the good geometric and statistical sense for numerical attributes.
The sum of discrepancies between a point and its centroid expressed through appropriate distance is used as the objective function. For example, the L2-norm based objective function, the sum of the squares of errors between the points and the corresponding centroids, is equal to the total intra-cluster variance:
The sum of the squares of errors can be rationalized as (a negative of) log-likelihood for normally distributed mixture model and is widely used in statistics (SSE). Therefore, k-means algorithm can be derived from general probabilistic framework. Note that only means are estimated. A simple modification would normalize individual errors by cluster radii (cluster standard deviation), which makes a lot of sense when clusters have different dispersions. An objective function based on L2-norm has many unique algebraic properties.
Two versions of k-means iterative optimization are known.

The first version consists of two-step major iterations that
- (1) reassign all the points to their nearest centroids
- (2) recompute centroids of newly assembled groups
- It easily works with any Lp-norm;
- It allows straightforward parallelization;
- It is insensitive with respect to data ordering;
If a move has a positive effect, the point is relocated and the two centroids are recomputed.
It is not clear that this version is computationally feasible, because the outlined analysis requires an inner loop over all member points of involved clusters affected by centroids shifts.
There are experimental results demonstrating that compared with Forgy algorithm, the second (classic) version frequently yields better results.
In particular it was noticed that a Forgy spherical k-means (using cosine similarity instead of Euclidean distance) has a tendency to get stuck when applied to document collections. They noticed that a version reassigning points and immediately recomputing centroids works much better. Figure above illustrates both implementations.
The wide popularity of k-means algorithm is well deserved. It is simple, straightforward, and is based on the firm foundation of analysis of variances.
The k-means algorithm also suffers from all the usual suspects:
- The result strongly depends on the initial guess of centroids (or assignments)
- Computed local optimum is known to be a far from the global one
- It is not obvious what is a good k to use
- The process is sensitive with respect to outliers
- The algorithm lacks scalability
- Only numerical attributes are covered
- Resulting clusters can be unbalanced (in Forgy version, even empty)
Models: Self Organizing Feature Maps
This model, functionally related to Clustering, is one of the data mining algorithms already available in the current release of the DAMEWARE suite.
There are currently two customized versions of SOFM available in the web application:
- CSOM: Clustering Self Organizing Feature Map, dedicated to FITS image-only clustering experiments;
- GSOM: Gated Self Organizing Feature Map, dedicated to generic FITS image/table clustering experiments. Here the term gated derives from the SOFM model, in which the number of clusters to be identified must be submitted as input parameter to the algorithm;
The Self Organizing Map (SOM) or Self Organizing Feature Map (SOFM) is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. SOMs are different from other artificial neural networks in the sense that they use a neighborhood function to preserve the topological properties of the input space.
This makes SOM useful for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. The model was first described as an artificial neural network by T. Kohonen, and is sometimes called a Kohonen map.
Like most artificial neural networks, SOMs operate in two modes: training and mapping. Training builds the map using input examples. It is a competitive process, also called vector quantization. Mapping automatically classifies a new input vector.
A SOM consists of components called nodes or neurons. Associated with each node is a weight vector of the same dimension as the input data vectors and a position in the map space. The usual arrangement of nodes is a regular spacing in a hexagonal or rectangular grid. The SOM describes a mapping from a higher dimensional input space to a lower dimensional map space. The procedure for placing a vector from data space onto the map is to find the node with the closest weight vector to the vector taken from data space and to assign the map coordinates of this node to our vector.
While it is typical to consider this type of network structure as related to feedforward networks where the nodes are visualized as being attached, this type of architecture is fundamentally different in arrangement and motivation.
Useful extensions include using toroidal grids where opposite edges are connected and using large numbers of nodes. It has been shown that while self-organizing maps with a small number of nodes behave in a way that is similar to K-means, larger self-organizing maps rearrange data in a way that is fundamentally topological in character.
It is also common to use the U-matrix. The U-matrix value of a particular node is the average distance between the node and its closest neighbors. In a square grid for instance, we might consider the closest 4 or 8 nodes, or six nodes in a hexagonal grid.
Large SOMs display properties which are emergent. In maps consisting of thousands of nodes, it is possible to perform cluster operations on the map itself.
The goal of learning in the SOM is to cause different parts of the network to respond similarly to certain input patterns. This is partly motivated by how visual, auditory or other sensory information is handled in separate parts of the cerebral cortex in the human brain.

The weights of the neurons are initialized either to small random values or sampled evenly from the subspace spanned by the two largest principal component eigenvectors. With the latter alternative, learning is much faster because the initial weights already give good approximation of SOM weights.
The network must be fed a large number of example vectors that represent, as close as possible, the kinds of vectors expected during mapping. The examples are usually administered several times as iterations.
The training utilizes competitive learning. When a training example is fed to the network, its Euclidean distance to all weight vectors is computed. The neuron with weight vector most similar to the input is called the best matching unit (BMU). The weights of the BMU and neurons close to it in the SOM lattice are adjusted towards the input vector.
The magnitude of the change decreases with time and with distance from the BMU. Below the typical neighborhood mexican hat activation function of Kohonen map neurons is shown.

Regardless of the functional form, the neighborhood function shrinks with time.
At the beginning when the neighborhood is broad, the self-organizing takes place on the global scale. When the neighborhood has shrunk to just a couple of neurons the weights are converging to local estimates.
This process is repeated for each input vector for a (usually large) number of cycles. The network winds up associating output nodes with groups or patterns in the input data set. If these patterns can be named, the names can be attached to the associated nodes in the trained net.
During mapping, there will be one single winning neuron: the neuron whose weight vector lies closest to the input vector. This can be simply determined by calculating the Euclidean distance between input vector and weight vector.
While representing input data as vectors has been emphasized in this article, it should be noted that any kind of object which can be represented digitally and which has an appropriate distance measure associated with it and in which the necessary operations for training are possible can be used to construct a self-organizing map. This includes matrices, continuous functions or even other self-organizing maps.

There are two ways to interpret a SOM:
- Because in the training phase weights of the whole neighborhood are moved in the same direction, similar items tend to excite adjacent neurons.
Therefore, SOM forms a semantic map where similar samples are mapped close together and dissimilar apart. This may be visualized by a U-Matrix (Euclidean distance between weight vectors of neighboring cells) of the SOM. - The other way is to think of neuronal weights as pointers to the input space. They form a discrete approximation of the distribution of training samples. More neurons point to regions with high training sample concentration and fewer where the samples are poor.
SOM may be considered a nonlinear generalization of Principal components analysis (PCA). It has been shown, using both artificial and real geophysical data, that SOM has many advantages over the conventional feature extraction methods such as Empirical Orthogonal Functions or PCA.
Models: Principal Probabilistic Surfaces
This model, functionally related to Dimensional Reduction (or Feature Extraction), will be one of the data mining algorithms available in the next releases of DAMEWARE suite.
The machine learning method based on the Principal Probabilistic Surfaces (PPS) belongs to the so-called latent variable methods family.
The classical method of such family is the Principal Component analysis (PCA), which involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components.
The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
The PCA method has a strong limit: the linear reduction is not always correct, as shown below (left: single PCA; center: multiple PCA; right: not-linear).

The PPS are able to find a better data aggregation than the PCA method. A PPS is trained to recognize the best projection functions from the N-dimensional parameter space to a spherical surface in a 3-D space.
This surface is covered by a grid of latent variables (points), representing the Gaussian peak in the N-parameter space. It permits to visualize all data with a human compliant 3-D diagram, independently from the number of initial parameters. It is hence possible to individuate the presence of sub-structures in the data.
PPS in this sense can be considered a reliable method for data dimensional reduction.

The above picture shows:
- a) schematic representation of the spherical variety in a 3-D latent space;
- b) the same variety distorted in the parameter space RD with points associated to data;
- c) the projection of the point distribution on the surface of the spherical variety on the latent space R3.
During the training phase a reference variety is created.
In the test phase, a datum, never seen by the network, is attributed to the closest spherical variety. Obviously the concept of closest implies a calculation of a distance between a point and a node in the space. Before that the data must be projected on the space. This basically because a spherical variety consists of squared or triangular areas, each of them defined by 3 or 4 variety nodes.
After this projection of the datum the approximated distance is calculated.
In the PPS exist three main approximation criteria:
- Nearest Neighbor: it founds the minimum square distance from all variety nodes;
- Grid projections: it founds the shortest projection distance on the variety grid;
- Nearest Triangulation: it founds the projection closest to the possible triangulations.
The downside is that it is generally more time-consuming, but more precise than others.
The technique described above makes clear the role of PPS as an efficient method for MDS pre-clustering or dimensional reduction.
Models: Genetic Algorithm Model Experiment
The GAME model is a multi-purpose genetic algorithm, designed and implemented with GPGPU / CUDA parallel computing technology. The model was derived from a multi-core CPU serial implementation, named GAME, already scientifically successfully tested and validated on astrophysical massive data classification problems, through a web application resource (DAMEWARE), specialized in data mining based on Machine Learning paradigms. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm has provided an exploit of the internal training features of the model, permitting a strong optimization in terms of processing performances and scalability.
As known, genetic algorithms are derived from Darwin's evolution law and are intrinsically parallel in its learning evolution rule and processing data patterns. The parallel computing paradigm can indeed provide an optimal exploit of the internal training features of the model, permitting a strong optimization in terms of processing performances.

GAME is a pure genetic algorithm specially designed to solve supervised optimization problems related with regression and classification functionalities, scalable to efficiently manage MDS and based on the usual genetic evolution methods (crossover, genetic mutation, roulette/ranking, elitism). In order to give a level of abstraction able to make simple to adapt the algorithm to the specific problem, a family of polynomial developments was chosen for GAME model. This method-ology makes the algorithm itself easily expandable, but this abstraction requires a set of parameters that allows fitting the algorithm to the specific problem.
From an analytic point of view, a pattern, composed of N features contains an amount of information correlated between the features corresponding to the target value. Usually in a real scientific problem that correlation is masked from the noise (both intrinsic to the phenomenon, and due to the acquisition system); but the unknown correlation function can ever be approximated with a polynomial sequence, in which the degree and non-linearity of the chosen function determine the approximation level. The generic function of a polynomial sequence is based on these simple considerations shown in the next pictures.

In order to evaluate the fitness of the patterns an extension of Mean Square Error (MSE) or Root Mean Square Error (RMSE) may be used. Then we define a GA with the following characteristics:
- The array (a0,...,aM) defines M genes of the generic chromosome (initially they are generated random and normalized between -1 and +1);
- All the chromosomes have the same size (constrain from a standard GA);
- The expression of the error calculation gives the standard error to evaluate the fitness level of the chromosomes;
- The population (genome) is composed by a number of chromosomes imposed from the choice of the function g(x) of the polynomial sequence.

N is the number of features of the patterns, d is the degree of the polynomial and B is a multiplicative factor that depends from the g(x) function, in the simplest case is just 1, but can arise to 3 or 4 in more complex cases (2 in the case of the trigonometric expansion). The parameter B also influences the dimension of each chromosome (number of genes).
In the present project, the idea is to build a GA able to solve supervised crispy classification and regression problems, typically related to an high-complexity pa-rameter space where the background analytic function is not known, except for a limited number of couples of input-target values, representing valid solutions to a physical category of phenomena. A typical case is to classify astronomical objects based on some solution samples (the KB) or to predict new values extracted by further observations. To accomplish such behavior we designed a function (a polynomial expansion) to combine input patterns. The coefficients of such polynomials are the chromosome genes. The goal is indeed to find the best chromosome so that the related polynomial expansion is able to approximate the right solutions to input pattern classification/regression. So far, the fitness function for such representation consists of the training error, obtained as absolute difference between the polynomial output and the target value for each pattern. Due to the fact that we are interested to find the minimum value of the error, the fitness is calculated as the complement of the error (i.e. 1-error) and the problem is reduced to find the chromosome achieving the maximum value of fitness.
GAME implementation on GPU based parallel computing infrastructure
In all execution modes (use case), GAME exploits the polynomial function, consisting in a polynomial expansion in terms of sum of sins and cosines. Specifically in the training use case, corresponding to the GA building and consolidation phase, the polytrigo is used at each iteration as the transformation function applied to each chromosome to obtain the output on the problem input dataset, and indirectly also to evaluate the fitness of each chromosome. It is indeed one of the critical aspects of the serial algorithm to be investigated during the parallelization design process. Moreover, after having calculated the fitness function for all genetic population chromosomes, this information must be back-propagated to evolve the genetic population. This back and forth procedure must be replicated as many times as it is the training iteration number or the learning error threshold, both decided and imposed by the user at setup time of any experiment. The direct consequence of the above issues is that the training use case takes much more execution time than the others (such as test and validation), and therefore is the one we are going to optimize.
Main design aspect approaching the software architecture analysis for the GPU is the partition of work: i.e. which work should be done on the CPU vs. the GPU. We have identified the time consuming critical parts to be parallelized by executing them on the GPU. They are the generation of random chromosomes and the calculation of the fitness function of chromosomes.
The key principle is that we need to perform the same instruction simultaneously on as much data as possible. By adding the number of chromosomes to be randomly generated in the initial population as well as during each generation, the total number of involved elements is never extremely large but it may occur with a high frequency. This is because also during the population evolution loop a variable number of chromosomes are randomly generated to replace older individuals. To overcome this problem we may generate a large number of chromosomes randomly una tantum, by using them whenever required. On the contrary, the evaluation of fitness functions involves all the input data, which is assumed to be massive datasets, so it already has an intrinsic data-parallelism.
Since CUDA programming involves code running concurrently on a host with one or more CPUs and one or more CUDA-enabled GPU, it is important to keep in mind that the differences between these two architectures may affect application performance to use CUDA effectively. The function polytrigo takes about three-quarters of the total execution time of the application, while the total including child functions amounts to about 7/8 of total time execution. This indeed has been our first candidate for parallelization.
To better address the GPU-based design starting from the serial implementation of the application, we choose an ad hoc software development methodology, for instance APOD (Assess, Parallelize, Optimize, and Deploy)(NVIDIA Corp. 2011).
APOD design cycle aims at quickly identify the portions of code that could take more easily the advantages and benefits of GPU acceleration, and begin to exploit the speedups resulting in production as fast as possible. APOD is a cyclical process: initial speedups can be achieved, tested, and deployed quickly, at which point the cycle can start over to identify further optimization opportunities.

Assess
The first step is to evaluate the multi-core application code to identify the parts responsible for most of the execution time. To identify the critical points and start to draw up a list of candidates for parallelization, the developers can use profiler tools. These bottlenecks are evaluated, de facto starting to investigate on parallelizable GPU acceleration. An upper limit of performance improvement can be estimated considering requirements and constraints, and by applying Amdahl's and Gustafson's laws (Gustafson 1988).
Gustafson's Law states that the problem size scales with the number of processors. Practically, for Gustafson the maximum speedup S of a program is:
S=N+(1-P)(1-N)
where P is the fraction of the total serial execution time taken by the portion of code that can be parallelized and N is the number of processors over which the parallel portion of the code runs.
Parallelize
Once identified the hotspots and having established the theoretical speedup achievable, we need to parallelize the code. By exposing the parallelism to improve performance and simply maintain the code of sequential applications, we are able to ensure also the maximum parallel throughput on GPU CUDA-capable. This could be as simple as adding a few preprocessor directives, such as OpenMP as OpenACC, or it can be done by calling an existing GPU-optimized library such as cuBLAS, cuFFT, or Thrust.
Specifically, Thrust (Bell N. et al. 2010) is a parallel C++ template library like C++ STL (Standard Template Library) (Stepanov et al. 1995), it provides a rich collection of data parallel primitives such as scan, sort, and reduce, which can be composed together to implement complex algorithms with concise, readable source code. Thrust is implemented entirely within CUDA C/C++ and maintains interoperability with the rest of the CUDA ecosystem. The native interoperability with CUDA C is a powerful feature. Interoperability ensures that Thrust always complements CUDA C and that a Thrust plus CUDA C combination is never worse than either Thrust or CUDA C alone.
Optimize
After the parallelization step is complete, we can move to optimize the outcome to improve performance. As well APOD, optimization is an iterative process (identify an opportunity for optimization, apply and test optimization, verify the speedup achieved, and repeat), which means that it is not necessary to spend large amounts of time trying all possible optimization strategies. Instead, strategies can be applied incrementally and using profiling tools can come in handy once again for guiding this process. Performance optimization is based on:
- Maximizing parallel execution;
- Optimizing memory usage to achieve maximum memory bandwidth;
- Optimizing instruction usage to achieve maximum instruction throughput.
Maximizing parallel execution starts with structuring the algorithm in order to expose as much data parallelism as possible. Once the parallelism of the algorithm has been exposed, should be mapped to the hardware as efficiently as possible. This is usually done by carefully choosing the running configuration of each kernel launch and maximizing competition between host and the device.
Optimizing memory usage starts by minimizing the host-to-device data transfers because they have much lower bandwidth than the device-to-device transfers. Sometimes, the best memory optimization could be simply to avoid any transfer of data by recalculating them whenever needed.
As for optimizing instruction usage, the use of arithmetic instructions that have low throughput should be avoided. This suggests trading precision for speed when it does not affect the end result, such as using single precision instead of double precision. Finally, particular attention must be paid to control flow instructions due to the SIMT (Single Instruction Multiple Thread) nature of the device.
Deploy
After completed an acceleration cycle, we can compare the result with the original implementation. Before tackling other critical points, the current partially parallelized implementation is deployed. This allows us to profit from the improvements as fast as possible (the speed increase may be partial, but it is still valid).
With each generation of NVIDIA processors, new features are added to the GPU that CUDA can leverage. Consequently, it's important to understand the characteristics of the architecture. The computing capability describes the features of the hardware and reflects the set of instructions supported by the device as well as other specifications, such as the maximum number of threads per block and the number of registers per multiprocessor. Higher computing capability versions are backward compatible.
When in doubt about the computing capability of the hardware that will be present at runtime, it is best to assume a computing capability of 1.0 or 1.3, depending on the required double-precision arithmetic.

Drawing of an empty pyramid
Leonardo da Vinci
De Divina Proportione, Luca Pacioli, Milan, 1497
