
GCS

Globular Cluster Search

This page is the entry point to the GCS science case, a data mining application of a Neural Network technique (a Multi Layer Perceptron trained by the Quasi-Newton learning rule) showing that Globular Clusters in external galaxies can be identified effectively using single-band photometry and marginally resolved images. The experiments are performed through the DAME Web Application Suite, DAMEWARE.
On this page users can find news, information, documentation and results.



The Scientific Problem

The study of Globular Cluster (GC) populations in external galaxies requires wide-field, multi-band photometry. However, to minimize contamination problems and to measure some of the GC properties, such as sizes and structural parameters (core radius, concentration, binary formation rates), high-resolution data are required as well, and these are only available through space facilities (i.e. HST).

The use of HST data is however challenging since the optimal dataset should be deep, multi-band and with wide-field coverage in order to minimize projection effects, as well as to study the overall properties of the GC populations, which often differ from those inferred from observations of the central region of a galaxy.

The use of single-band HST data reduces the cost of such studies in terms of observing time, and the data can later be combined with ground-based photometry in other bands to obtain the required color information.

In this project we intend to show that even single-band photometry can yield very complete datasets with low contamination, through the use of a Neural Network algorithm (a Multi Layer Perceptron trained by a Quasi Newton rule). This approach minimizes the observing time requirements, making it possible to extend such studies to large areas and to the outskirts of nearby galaxies, and thus reduces the observational biases in studies where a very complete dataset is required, such as the study of Low Mass X-ray Binaries in GCs.




The Data Set

The dataset used in this experiment consists of wide-field HST observations of the giant elliptical NGC1399 in the Fornax cluster. This galaxy is an ideal test case since, due to its distance (20 Mpc), a large fraction of its GC system (out to >5 Re) can be covered with a limited number of observations. Furthermore, at this distance GCs are only marginally resolved even by HST, which allows us to verify our experiment in a worst-case scenario.

The optical data were taken with the HST Advanced Camera for Surveys (ACS, program GO-10129), in the F606W filter, with an integration time of 2108 seconds for each field. The observations were arranged in a 3x3 ACS mosaic, and combined into a single image using the MultiDrizzle routine (Koekemoer et al. 2002). The final scale of the images is 0.03"/pix, providing Nyquist sampling of the ACS PSF. The field of view of the ACS mosaic covers ~100 square arcmin and extends out to a projected galactocentric distance of ~55 kpc, i.e. 4.9 r_e of the GC system (~5.7 r_e^gal).
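To see why GCs are only marginally resolved in these images, the quoted pixel scale can be converted into a physical size at the quoted distance. The sketch below uses the small-angle approximation; the function name is an illustrative choice, and the only inputs are the 0.03"/pix scale and the 20 Mpc distance given above.

```python
import math

# Number of arcseconds in one radian (about 206265).
ARCSEC_PER_RADIAN = 180.0 * 3600.0 / math.pi

def arcsec_to_parsec(angle_arcsec, distance_pc):
    """Physical size subtended by a small angle at a given distance."""
    return distance_pc * angle_arcsec / ARCSEC_PER_RADIAN

# One 0.03" pixel at the 20 Mpc distance of NGC1399.
pixel_pc = arcsec_to_parsec(0.03, 20e6)
print(f"one pixel spans about {pixel_pc:.1f} pc")
```

Under these assumptions a single pixel spans roughly 3 pc, which is comparable to the parsec-scale sizes of globular clusters.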

Source catalogs were generated with SExtractor, requiring a minimum area of 20 pixels and reaching a 7 sigma depth of m_V=27.5, i.e. ~4 mag below the GC luminosity function turnover, thus sampling the entire GC population. The catalog astrometric solution was registered to the USNO-B1 reference frame, obtaining a final accuracy of 0.2" r.m.s. Since no complete color catalog was available for the whole field, GC candidates were selected based on their magnitude and morphology, choosing sources with stellarity index >0.9 and m_V<26 (i.e. brighter than 26 mag). The observed magnitude distribution indeed follows the GC luminosity function down to m_V=26; at fainter magnitudes unresolved background sources dominate the overall distribution.
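The candidate selection described above amounts to two cuts on the catalog. A minimal sketch follows, assuming a simple in-memory catalog; the `Source` class, its field names and the three example entries are hypothetical and are not taken from the actual catalog.

```python
from dataclasses import dataclass

@dataclass
class Source:
    """One catalog entry (hypothetical field names)."""
    m_v: float         # V-band magnitude
    stellarity: float  # SExtractor stellarity index, between 0 and 1

def select_gc_candidates(sources, min_stellarity=0.9, max_m_v=26.0):
    """Keep sources with stellarity > 0.9 and m_V < 26, as in the text."""
    return [s for s in sources
            if s.stellarity > min_stellarity and s.m_v < max_m_v]

# Made-up entries, only to exercise the two cuts.
catalog = [
    Source(m_v=23.1, stellarity=0.97),  # bright and compact: kept
    Source(m_v=26.8, stellarity=0.95),  # fainter than m_V=26: rejected
    Source(m_v=24.0, stellarity=0.40),  # low stellarity index: rejected
]
candidates = select_gc_candidates(catalog)
```

Keeping the thresholds as keyword arguments makes it easy to test how sensitive the candidate list is to the chosen cuts.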

The final catalog used for this experiment thus contains photometric and morphological parameters for 2100 sources:
  • isophotal magnitude;
  • Kron radius;
  • aperture magnitudes within 2, 6 and 20 pixel (0.06", 0.18" and 0.6") diameters;
  • ellipticity;
  • position angle;
  • FWHM;
  • SExtractor stellarity index.
In addition, for these sources we were able to measure structural parameters by fitting King surface brightness profile models with the Galfit software (Peng et al. 2002), deriving:
  • tidal radius;
  • core radius;
  • effective radius;
  • central surface brightness.
The accuracy of these measurements was estimated by simulating artificial GCs with the Multiking code (available at this page), specifically written to account for field distortion, PSF variation and the dithering pattern.

In addition, we use two multi-band datasets to obtain color information for part of our sources, which is needed to train our algorithms and validate the results: an HST/ACS dataset covering the central region of the galaxy in the g and z filters (Kundu et al. 2005), and a lower resolution ground-based dataset in C and R covering the entire galaxy (Bassino et al. 2006).




The Data Mining Model (MLP-QNA)

As is typical in the DAME Program, we decided to investigate the above scientific case as a data mining problem on Massive Data Sets (MDS), i.e. by using a mathematical model based on automatic learning of information, correlations and significant features from huge datasets (bases of knowledge) related to processes of widely different nature. This approach is motivated by the need to analyze and understand complex phenomena that are often described by only partial, or not at all explicit, information in their parameter space.

We selected five models based on the supervised machine learning paradigm:
  • MLPBP (Multi Layer Perceptron trained by Back Propagation);
  • MLPGA (Multi Layer Perceptron trained by Genetic Algorithm);
  • SVM (Support Vector Machine);
  • MLPQNA (Multi Layer Perceptron trained by Quasi Newton rule);
  • GAME (Genetic Algorithm Model Experiment);

The MLPQNA was the best model in terms of some performance indicators, shown in the next picture.


Quasi-Newton methods of this kind were designed to optimize functions of many arguments (hundreds to thousands): in that regime the larger number of iterations caused by the less precise approximation is worthwhile, because the overhead of each iteration becomes much lower. This is particularly useful in astrophysical data mining problems, where the parameter space is usually of very high dimension and confused by a low signal-to-noise ratio. The main advantage of MLPQNA is therefore scalability: it provides high performance on high-dimensional problems while remaining applicable to low-dimensional ones.
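The overhead argument above can be made concrete by counting stored numbers. The sketch below compares a dense n x n approximation of the inverse Hessian with a limited-memory scheme that keeps only the last m update pairs (two vectors of length n per pair). The function names and the values of n and m are illustrative assumptions, not the settings used in the experiment.

```python
def dense_storage(n):
    """Floats needed for a full n x n inverse-Hessian approximation."""
    return n * n

def limited_memory_storage(n, m):
    """Floats needed to keep the last m update pairs (two n-vectors each)."""
    return 2 * m * n

# Illustrative sizes only: 1000 free parameters, 10 stored pairs.
n_params, n_pairs = 1000, 10
dense = dense_storage(n_params)
limited = limited_memory_storage(n_params, n_pairs)
print(f"dense: {dense} floats, limited memory: {limited} floats")
```

For these assumed sizes the limited-memory scheme stores fifty times fewer numbers, and the gap grows linearly with the number of parameters.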




Results

In this section we briefly outline the results of an experiment performed with the evaluation release of the DAMEWARE platform, concerning the identification of globular clusters in external galaxies. Further details can be found in [Brescia et al 2010]. For the benefit of non-astronomer readers, we point out that Globular Clusters (GCs) are almost spherical, massive stellar systems orbiting in the external halo of galaxies. The study of GC populations in external galaxies requires wide-field, multi-band photometry; in addition, high angular resolution data are required to minimize contamination from fore/background objects and to measure some of the GC properties (size, core radius, concentration, binary formation rates) [Jordán et al 2009].

The detection of GCs relies on two aspects: the shape of the image (which differs from the instrumental Point Spread Function, or PSF) and the colors (i.e. the ratio of observed fluxes at different wavelengths). The shape makes it possible to separate extended systems from stars (which are PSF-like), while the colors are needed to separate GCs from other extended systems such as background galaxies (Figure below).
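Since a color is a flux ratio expressed on the logarithmic magnitude scale, it can be computed directly from two band fluxes. The sketch below uses the standard magnitude relation; the function name is illustrative, and both fluxes are assumed to be already normalized to their band zero points.

```python
import math

def color_index(flux_a, flux_b):
    """Color m_a - m_b from the fluxes measured in bands a and b.

    Assumes both fluxes are expressed relative to their band zero points.
    """
    return -2.5 * math.log10(flux_a / flux_b)

# Equal fluxes give zero color; a 100x flux ratio corresponds to 5 mag.
same = color_index(1.0, 1.0)
redder = color_index(1.0, 100.0)
```

A single-band dataset provides only one flux per source, so no such ratio can be formed, which is why the experiment has to rely on shape and structural parameters instead.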

FIGURE - A sub-section of the HST (Hubble Space Telescope) image used to build the dataset for the experiment. It was obtained with the ACS (Advanced Camera for Surveys) and used to detect Globular Clusters (GCs) around NGC1399. GCs (in yellow) are difficult to distinguish from background galaxies (in green) based only on single-band images.

The supervised learning experiment presented in what follows is an attempt to identify GCs in single-band, wide-field images of the galaxy NGC1399 obtained with the Hubble Space Telescope, using the base of knowledge (true GCs) provided in [Paolillo et al 2010a], [Paolillo et al 2010b]. The advantage is that single-band data are much less expensive in terms of observing time, and thus easier to obtain than multi-band data.

TABLE 1 - SUMMARY OF THE EXPERIMENT SETUP. It lists all the scientific dataset parameters used as the Base of Knowledge for the experiment.

The input features (see Table 1) were of two types: optical (measured fluxes and moments of the light distribution) and structural (derived from a King model fit, commonly used to describe GC profiles). Optical parameters were measured for 12915 objects from a single-band deep image of the galaxy NGC1399, while structural parameters were measured for a subsample of 4590 sources [Paolillo et al 2010a], [Paolillo et al 2010b]. The Base of Knowledge (BoK) used to train the model was obtained by using multi-wavelength information (color selection). The BoK contains 2100 objects in total, each with optical, color and structural information: 1219 true GCs and 881 false GCs.

The following table shows a direct comparison of the five machine learning models in terms of overall performance.


The supervised machine learning model that achieved the best recognition performance was the Multi Layer Perceptron (MLP) trained by the Quasi Newton Approximation (QNA) learning rule [Shanno 1970], [Sherman 1949], implemented with the optimized L-BFGS algorithm (Limited-memory Broyden Fletcher Goldfarb Shanno) [Byrd et al 1994]. The model, which will be integrated into the next release of the DAME web application, is configured with the typical hierarchical layers (input-hidden-output). More rigorously, the QNA is an optimization of the learning rule, because the implementation is based on a statistical approximation of the Hessian by cyclic gradient calculation, the same gradient calculation that is at the base of the Back Propagation method.

The classical Newton method uses the Hessian of a function. The step of the method is defined as the product of the inverse Hessian matrix and the function gradient. If the function is a positive definite quadratic form, the minimum is reached in one step. For an indefinite quadratic form (which has no minimum), the step leads to a maximum or a saddle point. In short, the method finds the stationary point of a quadratic form.

Some variants of Quasi-Newton methods perform an exact line search along the chosen direction, but it has been proved that a sufficient decrease of the function value is enough, and that finding the exact minimum along the line is not necessary. The L-BFGS algorithm first tries a step using the Newton method. If that step does not decrease the function value, it shortens the step length until a lower function value is found. This method was designed to optimize functions of many arguments (hundreds to thousands): in that regime the larger number of iterations caused by the less precise approximation is worthwhile, because the overhead of each iteration becomes much lower.
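The two ingredients described above, a Newton-like step built from an approximate inverse Hessian and a step length that is shortened until the function value decreases sufficiently, can be sketched in a few lines. For brevity the sketch uses the full-memory BFGS update rather than the limited-memory variant used by MLPQNA, and it minimizes a toy quadratic rather than a network error function. All names, the test function and the constants are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

def bfgs_minimize(f, grad, x0, max_iter=100, tol=1e-8):
    """Minimize f with BFGS steps and a backtracking line search."""
    n = len(x0)
    x = list(x0)
    # Initial inverse-Hessian approximation: the identity matrix.
    H = [[float(i == j) for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        # Newton-like direction: approximate inverse Hessian times -gradient.
        p = [-c for c in mat_vec(H, g)]
        # Try the full step first, then halve it until the function value
        # shows a sufficient decrease (no exact line minimum is sought).
        step, fx, slope = 1.0, f(x), dot(g, p)
        while (step > 1e-12 and
               f([a + step * b for a, b in zip(x, p)]) > fx + 1e-4 * step * slope):
            step *= 0.5
        x_new = [a + step * b for a, b in zip(x, p)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = dot(s, y)
        if sy > 0.0:  # curvature condition: keeps H positive definite
            Hy = mat_vec(H, y)
            yHy = dot(y, Hy)
            # BFGS update of the inverse-Hessian approximation.
            H = [[H[i][j] + (sy + yHy) * s[i] * s[j] / sy ** 2
                  - (Hy[i] * s[j] + s[i] * Hy[j]) / sy
                  for j in range(n)] for i in range(n)]
        x, g = x_new, g_new
    return x

# Toy positive definite quadratic with its minimum at (1, -2).
def f(v):
    return (v[0] - 1.0) ** 2 + 10.0 * (v[1] + 2.0) ** 2

def grad_f(v):
    return [2.0 * (v[0] - 1.0), 20.0 * (v[1] + 2.0)]

x_min = bfgs_minimize(f, grad_f, [0.0, 0.0])
```

The limited-memory variant replaces the explicit matrix `H` with a short history of the `s` and `y` vectors, which is what makes the approach practical for networks with many weights.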

This is particularly useful in statistical data mining problems, where the parameter space is usually of very high dimension and confused by a low signal-to-noise ratio. The main advantage of the method is scalability: it provides high performance on high-dimensional problems while remaining applicable to low-dimensional ones. With this method we performed the series of experiments summarized in Table 2 below.

TABLE 2 - SUMMARY OF THE EXPERIMENT SETUP. It lists all the MLPQNA model parameters used for the experiment.

Using all features, the best result was a performance of 98.33%. It should be stressed, however, that a feature significance analysis performed by rejecting one feature at a time (pruning) showed that excluding feature 11 does not significantly degrade the performance (97.95%) [Brescia et al 2011].

In more detail, for the best-performing case (the dataset with 2100 samples, including both optical and structural features), the reported performance of 98.33% refers to the following model output:
  • 1203 TRUE GCs correctly identified;
  • 862 FALSE GCs correctly identified.
The results can also be expressed in terms of the completeness and purity of the experiment:
  • 1203 TRUE GCs recognized out of 1219 samples imply a completeness of 98.69%;
  • 19 FALSE GCs were wrongly classified as TRUE, thus contaminating the output dataset. The resulting purity is 98.44% (1.56% contamination).
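The figures above follow from the four counts quoted in this section. The sketch below recomputes them, assuming the usual definitions: overall performance as the fraction of correctly classified objects, completeness as the fraction of true GCs recovered, and purity as the fraction of true GCs among the objects classified as GCs. The function name is an illustrative choice.

```python
def classification_summary(true_pos, true_neg, n_true, n_false):
    """Accuracy, completeness and purity from the classification counts."""
    false_pos = n_false - true_neg          # FALSE GCs classified as TRUE
    accuracy = (true_pos + true_neg) / (n_true + n_false)
    completeness = true_pos / n_true
    purity = true_pos / (true_pos + false_pos)
    return accuracy, completeness, purity, false_pos

# Counts quoted in the text: 1203 of 1219 true GCs and 862 of 881 false GCs
# were correctly identified.
accuracy, completeness, purity, false_pos = classification_summary(
    true_pos=1203, true_neg=862, n_true=1219, n_false=881)
print(f"accuracy {accuracy:.2%}, completeness {completeness:.2%}, "
      f"purity {purity:.2%}, contaminants {false_pos}")
```

With these definitions the counts give 19 contaminants, an overall performance of 98.33%, a completeness of 98.69% and a purity of about 98.4%.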
Finally, as previously mentioned, a complete pruning phase was performed on all 11 features of the BoK (7 optical plus 4 structural features), obtaining a relevance percentage (in terms of the correlation contribution in each sample) for every feature. The fact that feature 11 (TIDAL RADIUS) carries almost no relevant information is easily understood: the tidal radius of globular clusters is not well determined, and therefore contributes very little correlation information [Brescia et al 2011].
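The pruning procedure, rejecting one feature at a time and measuring how much the performance degrades, can be sketched generically. In the sketch below `evaluate` stands for "train the model on this feature subset and return its performance". The toy scoring function, the feature names and the weights are purely hypothetical placeholders, not results from the experiment.

```python
def prune_one_at_a_time(features, evaluate):
    """Return the baseline score and the drop caused by removing each feature."""
    baseline = evaluate(features)
    drops = {}
    for name in features:
        reduced = [f for f in features if f != name]
        drops[name] = baseline - evaluate(reduced)
    return baseline, drops

# Hypothetical stand-in for training and testing a model: each feature
# contributes a fixed, made-up amount to the score.
TOY_WEIGHTS = {"feature_a": 0.5, "feature_b": 0.25, "feature_c": 0.0}

def toy_evaluate(features):
    return sum(TOY_WEIGHTS[f] for f in features)

baseline, drops = prune_one_at_a_time(list(TOY_WEIGHTS), toy_evaluate)
least_relevant = min(drops, key=drops.get)
```

A feature whose removal causes almost no drop, as reported for the tidal radius, is a candidate for exclusion from the input set.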




Bibliography and References

  • Brescia, M.; Longo, G.; Djorgovski, G. S.; Cavuoti, S.; D'Abrusco, R.; Donalek, C.; Di Guido, A.; Fiore, M.; Garofalo, M.; Laurino, O.; Mahabal, A.; Manna, F.; Nocella, A.; d'Angelo, G.; Paolillo, M., DAME: A Web Oriented Infrastructure for Scientific Data Mining & Exploration, 2010arXiv1010.4843B, 16 pages, 9 figures, 2010
  • Jordan, Andres et al., The ACS Virgo Cluster Survey XVI. Selection Procedure and Catalogs of Globular Cluster Candidates, The Astrophysical Journal Supplement, Volume 180, Issue 1, pp. 54-66, 2009
  • Paolillo, Maurizio et al., Probing the GC-LMXB Connection in NGC 1399: A Wide-Field Study with HST and CHANDRA, Draft version September 3, 2010a
  • Paolillo, M. et al., Probing the Low Mass X-ray Binaries/Globular Cluster connection in NGC1399, American Institute of Physics Conference Series, 1248, 243, 2010b
  • Shanno, D. F., Conditioning of Quasi-Newton methods for function minimization, Math. Comput., 24, 647-656, 1970
  • Sherman, J., Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix, Annals of Mathematical Statistics, 20, 621, 1949
  • Byrd, R. H. et al., Representations of Quasi-Newton Matrices and their use in Limited Memory Methods, Mathematical Programming, 63, 4, pp. 129-156, 1994
  • Brescia, M.; Cavuoti, S.; Paolillo, M.; Longo, G.; Puzia, T., The detection of Globular Clusters in galaxies as a data mining problem, accepted by MNRAS (in press), 11 pages, available at arXiv:1110.2144v1, 2011
  • Brescia, M., MLP with QNA model design and user manual, DAME Technical Documentation, mlpGP_DAME-MAN-NA-0008-Rel1.0, September 02, 2010




Who is who in the GCS project

  • Maurizio Paolillo (Science Management)
  • Massimo Brescia (Data Mining Model Design and Development, Project Management)
  • Stefano Cavuoti (PhD Science and Engineering Support)
  • Sandro Riccardi (Model Integration in DAMEWARE web application)
  • Giuseppe Longo (PI & Science Support)





