
DAME Program and Technology Overview

This page gives an overview of the main technology and management aspects adopted in the DAME Program.



Introduction

DAME is a SOUP (Service Oriented Utilities Program).
The term Program comes from the language of project management and indicates that DAME is a suite (ensemble) of projects sharing a single technological platform, based on a Service Oriented Application architecture, and a common goal: to provide a general-purpose, cross-disciplinary, distributed computing environment for knowledge extraction and data mining in massive data sets. Although specialized for Astrophysics science cases, it can offer a wide variety of e-science communities a broad range of computational facilities to produce science by exploiting the wealth of available massive astronomical data sets together with powerful machine learning and statistical algorithms.

The crucial role of the multi-disciplinary expertise needed to deal with the ongoing burst of data complexity, and to perform data mining and exploration on Massive Data Sets (MDS), has recently been recognized by the creation, within the IVOA (International Virtual Observatory Alliance), of an Interest Group on Knowledge Discovery in Data Bases (KDD-IG), which is seen as the main interface between the IVOA technical infrastructure and VO-enabled science.
KDD belongs to the new family of scientific disciplines called X-Informatics, which is regarded worldwide as the fourth paradigm of Science, after theory, experiment and simulation.


In this context, the DAME project intends:

  • (a) to provide the community with an extensible, integrated environment for Data Mining and Exploration, based on Web 2.0 technology;
  • (b) to support the VO standards and formats, especially for application interoperability;
  • (c) to abstract application deployment and execution, so as to provide the VO with an opaque, general-purpose computing platform that takes advantage of modern technologies (e.g. Grid, Cloud, etc.).

Data Mining can be considered the frontier of VObs-enabled science, since it represents the only way to capture and reveal the scientific knowledge (patterns, trends, correlations, etc.) hidden behind the complexity of MDS. The DAME project aims to create a distributed e-infrastructure that guarantees integrated and asynchronous access to algorithms for mining data collected by very different experiments and scientific communities, in order to correlate them and improve their scientific usability. The project consists of a data mining framework with powerful software instruments capable of working on MDS in a distributed computing environment. So far, the VObs effort has focused on the realization of low-level tools and on the definition of standards. Our project extends this fundamental target by integrating it into an infrastructure that joins the service-oriented software and resource-oriented hardware paradigms and includes the implementation of advanced tools for KDD purposes.

Furthermore, the DAME design takes into account the fact that the average scientist cannot, or does not want to, become an expert in Computer Science, algorithms and Information Technology as well. In most cases the r.m.s. (average) scientist, our end user, already has their own algorithms for data processing and analysis and has implemented private routines and pipelines to solve specific problems. The KDD scheme adopted in the DAME package is based on Soft Computing methods, which fall within the typical machine learning dichotomy between supervised and unsupervised methods. All these methods share a common data mining paradigm: the use of AI techniques as a self-adaptive exploration methodology.
The core of the DAME Service Oriented Architecture can be extended in many directions: for example, one can write a custom client, or provide a new driver for a different deployment environment (for both computing and storage). One can also develop low-level Data Mining classes to enrich the low-level Data Mining Models, or develop high-level plugins that implement specific science cases and make use of the underlying Data Mining Models.
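
The extension points listed above (drivers, low-level models, high-level plugins) can be pictured with a minimal sketch. All class and method names below (`DataMiningModel`, `DeploymentDriver`, `SciencePlugin`, `run`, `execute`, `analyse`) are hypothetical and chosen only for illustration; they are not the actual DAME interfaces.

```python
from abc import ABC, abstractmethod
from typing import List


class DataMiningModel(ABC):
    """Hypothetical low-level model interface (not the real DAME API)."""

    @abstractmethod
    def run(self, data: List[float]) -> float:
        ...


class MeanModel(DataMiningModel):
    """Toy model: returns the arithmetic mean of the input."""

    def run(self, data: List[float]) -> float:
        return sum(data) / len(data)


class DeploymentDriver(ABC):
    """Hypothetical driver hiding where a model is actually executed."""

    @abstractmethod
    def execute(self, model: DataMiningModel, data: List[float]) -> float:
        ...


class LocalDriver(DeploymentDriver):
    """Runs the model in the current process; a Grid or Cloud driver
    would implement the same method in a different way."""

    def execute(self, model: DataMiningModel, data: List[float]) -> float:
        return model.run(data)


class SciencePlugin:
    """Hypothetical high-level plugin: a science case built on a model."""

    def __init__(self, model: DataMiningModel, driver: DeploymentDriver):
        self.model = model
        self.driver = driver

    def analyse(self, data: List[float]) -> float:
        return self.driver.execute(self.model, data)


plugin = SciencePlugin(MeanModel(), LocalDriver())
result = plugin.analyse([1.0, 2.0, 3.0])
```

In this sketch a new deployment environment only requires a new `DeploymentDriver` subclass, while models and plugins stay unchanged, which is the kind of separation the paragraph describes.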




Work Breakdown Structure

In the project management context, extensions are sometimes requested during the development of a single project. An extension means the introduction of a new project, technically connected and related to the original one, but requiring a parallel organization and supervision. In such cases, Project Management evolves into Program Management, i.e. the management of an ensemble (suite) of projects. In particular, Program Management requires:

  • a program for the development of a series of related projects;
  • periodic review of the open projects connected by the program;
  • parallel projects oriented towards a single main goal at the end of the program.

[Figure: program management cyclic process]

The above cyclic management process has been associated with a specific Work Breakdown Structure (WBS), shown below.

[Figure: DAME Work Breakdown Structure (WBS)]

To make the described management scheme effective, the Extreme Programming (XP) software development methodology has been followed. XP is intended to improve software quality and responsiveness to changing customer requirements. As a type of agile software development, it advocates frequent "releases" in short development cycles (timeboxing), which are intended to improve productivity and to introduce checkpoints where new customer requirements can be adopted. Like other agile software development methods, XP attempts to reduce the cost of change by having multiple short development cycles rather than one long one. In this doctrine, changes are a natural, inescapable and desirable aspect of software development projects and should be planned for, instead of attempting to define a stable set of requirements. The XP-Agile flow is shown below.

[Figure: XP-Agile development flow]

The PM-O (Project Management-Oriented) strategy foresees people involved full-time as administrators and managers. This implies:

  • great freedom of action for the Project Manager within the pyramidal manpower structure;
  • a hierarchical structure of responsibilities for the manager and the area sub-managers;
  • a careful control and evaluation policy for performance, with restricted operating procedures;
  • strict correspondence between the Work Breakdown Structure and the manpower hierarchy;
  • a correct balance between power, responsibility and needs, in order to ensure fast decisions and well-weighed choices.

The P-O (Project-Oriented) approach is exactly the opposite. It implies that:

  • manpower is frequently redistributed according to the needs of the project;
  • each WG member reports directly to the single Project Manager;
  • any divisions are usually dedicated to particular services and, in any case, overall responsibility remains with the Project Manager.

A hybrid P-O solution (shown below) has been adopted in the DAME program, essentially for logistical and other specific reasons.

[Figure: hybrid P-O organization adopted in DAME]

Last but not least, programming is only the last part of the overall software development project; documenting it is a burden, but it is what makes the project feasible. The following is the technical documentation structure adopted in DAME.

[Figure: DAME technical documentation structure]

A complete view of the entire technical documentation produced by the DAME Team since January 2007 is available here.





Technology

[Figure: DAME technology overview]





Computing and Deployment Cloud Infrastructure

The DAME Program is designed to be hosted on several different computing platforms. From the functional point of view it consists of a series of web services and applications. In other words, the DAME Suite is a service-oriented infrastructure that matches the Cloud Computing paradigm.

Cloud computing describes a new supplement, consumption and delivery model for IT services based on the Internet, and it typically involves the provision of dynamically scalable and often virtualized resources as a service over the Internet.
It is a byproduct and consequence of the ease-of-access to remote computing sites provided by the Internet. The term cloud is used as a metaphor for the Internet, based on the cloud drawing used in the past to represent the telephone network and later to depict the Internet in computer network diagrams as an abstraction of the underlying infrastructure it represents. Typical cloud computing providers deliver common scientific or business applications online which are accessed from another web service or software like a web browser, while the software and data are stored on servers.

[Figure: main services of the DAME Cloud]

In the current configuration, the DAME Cloud includes the main services shown in the picture above. It is based on three different Internet domains, a detail that is completely transparent to the end user.

The DAME Cloud also hosts, in a hybrid configuration, a Grid farm (provided by the S.Co.P.E. Project).

Grid computing concerns the application of the resources of many computers in a network to a single problem at the same time - usually to a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data.

Grid computing requires the use of software that can divide and farm out pieces of a program to as many as several thousand computers. Grid computing can be thought of as distributed and large-scale cluster computing and as a form of network-distributed parallel processing. It can be confined to the network of computer workstations within a corporation or it can be a public collaboration (in which case it is also sometimes known as a form of peer-to-peer computing).
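
The "divide and farm out" idea described above can be illustrated with a minimal sketch. It is only an analogy: local threads stand in for remote Grid nodes, and the function names (`split`, `work`, `farm_out`) are hypothetical, not part of any Grid middleware.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_pieces):
    """Divide a list into n_pieces contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_pieces)
    chunks, start = [], 0
    for i in range(n_pieces):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def work(chunk):
    """The piece of the program each worker runs: a partial sum of squares."""
    return sum(x * x for x in chunk)


def farm_out(items, n_workers=4):
    """Send one chunk to each worker, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(work, split(items, n_workers)))
    return sum(partials)


data = list(range(1000))
total = farm_out(data)
```

On a real Grid the chunks would be shipped to separate machines and the partial results collected over the network; the split, compute and combine structure stays the same.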





Parallel Computing with GPGPU and CUDA technology

GPGPU is an acronym standing for General Purpose Computing on Graphics Processing Units. The term was coined by Mark Harris in 2002 (Harris 2003), in recognition of the trend to employ GPU technology for non-graphics applications. It denotes all techniques for developing algorithms that go beyond computer graphics while running on graphics chips. Up to 2006 these chips were difficult to use, mainly because programmers were forced to access graphics devices through specific APIs (Application Programming Interfaces), that is, through the methods made available by libraries like OpenGL and Direct3D. These APIs often severely limited application design and development. In general, graphics chips, being many-core processors based on hundreds of specialized floating-point processing units, allow many algorithms to achieve higher performance (by one or two orders of magnitude) than conventional CPUs (Central Processing Units). They are also cheaper, owing to the relatively low price of graphics chip components. The GPGPU paradigm is particularly useful for super-computing applications, which often require several days of execution on large computing clusters: it may drastically decrease execution times, thereby promoting research in a large variety of scientific and social fields (for instance astrophysics, biology, chemistry, physics, finance and video encoding).

For over two decades, before the advent of multi-core architectures, general-purpose CPUs were characterized, at each generation, by a steady increase in performance together with a decrease in costs, a trend commonly associated with Moore's Law (Moore 1965) and shown in the next figure.

[Figure: CPU performance growth and Moore's Law]

As a result, low-cost desktop PCs able to execute tens of Giga floating-point Operations per Second (GFLOPS) and server clusters with hundreds of GFLOPS are now available. This performance growth set in motion a fundamental virtuous cycle in Computer Science:

  • Users, having rapidly become accustomed to the growth in computer performance, especially in terms of execution speed, processing reliability and multi-tasking capability, continuously ask for better software systems;
  • Developers, observing the constant increase in software performance that accompanies processor technology, always ask for better hardware performance to optimize application speed.

This virtuous mechanism has a downside. The physical constraints of thermodynamics started to cause serious problems of power consumption and heat dissipation inside modern CPUs, slowing this evolutionary trend and forcing computer manufacturers into a drastic revolution in processor architecture design. To sustain the performance trend while keeping thermal effects under control, the new strategy was to reduce clock frequencies and to distribute the workload over several processing units (cores) located on the same chip. From the architectural point of view, this new roadmap has inevitably changed the design approach adopted until then in software development, which moved away from the traditional sequential structure. That methodology appeared obsolete on the new multi-core infrastructure, essentially because a sequential program can run on a single core only, leaving the remaining cores unexploited. Furthermore, without an effective growth in performance, developers would not be able to introduce new features into software products, de facto blocking the evolution of the entire computer science business. Therefore, in order to maintain the cyclic hardware/software trend, software applications had to change their perspective and move towards parallel computing, which is able to fully exploit the availability of parallel architectures. The first systems on which parallel programming started were indeed HPC mainframes.

However, these machines (or infrastructures, in the Grid/Cloud case) have some critical points:

  • large dimensions;
  • high costs for equipment and management;
  • difficult access for external developers and users.

Because of these problems, many applications could not justify such high costs, which in practice severely limited the spread of parallel programming.
Nowadays multi-core technology has reached such high sales volumes that a parallel programming approach can be considered common practice. This caused a trend inversion in the software development field.

At the beginning of the 2000s every chip manufacturer faced an important question: which roadmap to follow in processor development to reach its business goals? Multi-core processors were chosen by many companies, such as Advanced Micro Devices Inc. (AMD), ARM Ltd., Broadcom Corp., Intel Corp. and VIA Technologies. Examples of recent multi-core architectures can be found in the AMD Phenom X4 and Intel Core i7 families. More specifically, these multi-core processors are based on an integrated circuit in which two or more processor cores are attached to the same socket, in order to increase their connection speed. Each core implements the full x86 instruction set; the design enhances performance, reduces power consumption and allows more efficient multi-tasking. The first models were dual-core, comparable to dual-processor systems. Ideally, a dual-core processor would be about twice as powerful as a single-core processor, but in practice the gain is about one and a half times. The evolution of this architecture proceeds through gradual enhancement, with the number of cores doubling at every new semiconductor generation. The basic idea is to increase the number of cores while keeping the execution speed of pre-existing sequential programs unchanged. The critical points of this architecture emerge with serial programs. In this case, in the absence of a parallel approach, processes are scheduled so as to balance the overall CPU load, by assigning them each time to the least busy cores. However, many software products are not designed to fully exploit multi-core features, so microprocessors are still designed to optimize the execution speed of sequential programs. Graphics device manufacturers, like ATI Technologies Inc. (acquired by AMD in 2006) and NVIDIA Corp., chose instead the many-core technology (the term many-core usually denotes multi-core systems with more than 32 cores).
The many-core paradigm is based on increasing the execution speed of parallel applications. Starting with tens of cores, each smaller than a CPU core, this kind of architecture reached hundreds of cores per chip in a few years. An example of many-core architecture is the NVIDIA GeForce GTX 560 graphics device, with 336 cores, also named Streaming Processors (SP). These cores are grouped into units called Streaming Multiprocessors (SM); on this Fermi-generation device each SM holds 48 cores (hence 336/48 = 7 SMs), whereas earlier NVIDIA generations had 8 cores per SM. Each SP is an in-order scalar processor and shares both control logic and instruction cache with the others in its SM. Many-core processors, GPUs in particular, have led the race for floating-point computation performance since 2004, as shown in the next figure (Kirk et al. 2010).

[Figure: floating-point performance of GPUs versus CPUs (Kirk et al. 2010)]
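
The dual-core estimate quoted earlier (a gain of about one and a half times instead of two) can be read through Amdahl's law. The text does not name this law, so the sketch below is an interpretation under that assumption, and the function names are illustrative only.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal speedup when only parallel_fraction of the work scales with cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)


def parallel_fraction_for(speedup, n_cores):
    """Invert Amdahl's law: which parallel fraction yields this speedup?"""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cores)


# Parallel fraction implied by a 1.5x gain on 2 cores (about two thirds).
p = parallel_fraction_for(1.5, 2)

# Predicted gain for the same workload on 4 cores (about 2x, not 4x).
s4 = amdahl_speedup(p, 4)
```

Under this reading, a workload that gains 1.5 times on two cores is about two thirds parallel, and its speedup can never exceed 1 / (1 - p), i.e. about 3, however many cores are added. This is one way to see why sequential portions of a program dominate on many-core hardware.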

As of 2009, the peak throughput ratio between GPU (many-core) and CPU (multi-core) was about 10:1. It should be stressed that these values refer mainly to the theoretical peak speed of the chips, i.e. 1 TeraFLOPS against 100 GFLOPS. Such a large difference has pushed many developers to move the most computationally expensive parts of their programs to the GPU. The large performance gap between GPU and CPU lies essentially in their different design philosophies, as shown in the next figure.

[Figure: GPU versus CPU design philosophy]

GPGPU environment: CUDA

CUDA (Compute Unified Device Architecture) is a general purpose parallel computing architecture introduced by NVIDIA in November 2006 that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. It includes the CUDA Instruction Set Architecture (ISA) and the parallel compute engine in the GPU. CUDA comes with a software environment that allows developers to use C as a high-level programming language. Other languages and APIs are supported, such as CUDA FORTRAN, OpenCL, and DirectCompute. From the hardware perspective, NVIDIA devoted silicon area to facilitating parallel programming, so CUDA did not represent a change in software alone: additional hardware was added to the chip. CUDA programs no longer go through the graphics interface at all. Instead, a new general-purpose parallel programming interface on the silicon chip serves the requests of CUDA programs.

Unlike previous generations that partitioned computing resources into vertex and pixel shaders, the CUDA Architecture included a unified shader pipeline, allowing each and every arithmetic logic unit (ALU) on the chip to be marshaled by a program intending to perform general-purpose computations. Because NVIDIA intended this new family of graphics processors to be used for general-purpose computing, these ALUs were built to comply with IEEE requirements for single-precision floating-point arithmetic and were designed to use an instruction set tailored for general computation rather than specifically for graphics. Furthermore, the execution units on the GPU were allowed arbitrary read/write access to memory as well as access to a software-managed cache known as shared memory. All of these features of the CUDA Architecture were added in order to create a GPU that would excel at computation in addition to performing well at traditional graphics tasks.

[Figure: architecture of a CUDA-capable GPU]

A typical CUDA-capable GPU is organized into an array of highly threaded streaming multiprocessors (SMs). In the above figure, two SMs form a building block; however, the number of SMs in a building block can vary from one generation of CUDA GPUs to another. Each SM has a number of streaming processors (SPs) that share control logic and instruction cache. Each GPU currently comes with up to 4 GB of graphics double data rate (GDDR) DRAM, referred to as global memory. This memory functions as very-high-bandwidth, off-chip memory, though with somewhat more latency than typical system memory. For massively parallel applications, the higher bandwidth makes up for the longer latency. Each SP has a Multiply-Add (MAD) unit and an additional multiply unit. In addition, special-function units perform floating-point functions such as square root (SQRT), as well as transcendental functions. Because each SP is massively threaded, it can run thousands of threads per application.

A CUDA program is a unified source code encompassing both host and device code. While the host code is plain ANSI C, the device code is written in ANSI C extended with keywords for labeling data-parallel functions, called kernels, and their associated data structures. Kernels typically generate a large number of threads to exploit data parallelism. Thanks to efficient hardware support, CUDA threads are much lighter weight than CPU threads, which typically require thousands of clock cycles to generate and schedule.

[Figure: execution of a CUDA program]

The execution of a typical CUDA program starts with host (CPU) execution. When a kernel function is invoked, the execution is moved to a device (GPU), where a large number of threads are generated to take advantage of abundant data parallelism. All the threads that are generated by a kernel during an invocation are collectively called a grid. When all threads of a kernel complete their execution, the corresponding grid terminates, and the execution continues on the host until another kernel is invoked.
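
The execution model just described (host code, kernel launch, grid of threads, return to the host) can be mimicked with a minimal sequential sketch. This is plain Python, not CUDA: the helper `launch` and the kernel signature are hypothetical stand-ins, and the threads run one after another rather than in parallel. The grouping of the grid into blocks, and the index formula block index × block size + thread index, follow the standard CUDA convention.

```python
def vec_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Body executed by every thread; mirrors a CUDA vector-add kernel."""
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(out):                         # guard threads past the end
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a kernel launch: run every thread of the grid."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


# Host code: prepare data, launch the grid, continue once it has finished.
n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover n
launch(vec_add_kernel, grid_dim, block_dim, a, b, out)
```

Here the grid has 3 blocks of 4 threads, i.e. 12 threads for 10 elements, so the bounds check inside the kernel keeps the two surplus threads idle. The same guard is needed in real CUDA kernels whenever the data size is not a multiple of the block size.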

A traditional parallel computing environment is very difficult and expensive to set up. This can be circumvented by resorting to graphics hardware, which is inexpensive, powerful, and fully comparable with more complex HPC mainframes in terms of computing power (many GPU-based systems are already included in the TOP500 worldwide supercomputer ranking). The DAME Program has already started to investigate the design and implementation of a hierarchical parallel genetic algorithm (GAME model) on the many-core Graphics Processing Unit (GPU) technology provided by NVIDIA, using the Compute Unified Device Architecture (CUDA) parallel programming SDK. CUDA is a platform for massively parallel high-performance computing on the company's powerful GPUs (Zhang et al. 2009). At its core are three key abstractions: (a) a hierarchy of thread groups, (b) shared memories, and (c) barrier synchronization, which are simply exposed to the programmer as a minimal set of language extensions. These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism.
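
How shared memory and barrier synchronization cooperate inside one thread group can be sketched with a tree reduction. This is an emulation in plain Python, not CUDA and not the GAME implementation: a Python list stands in for shared memory, `threading.Barrier` stands in for CUDA's `__syncthreads()`, and the function name `block_sum` is hypothetical.

```python
import threading


def block_sum(values):
    """Sum `values` with one thread per element, emulating one thread block.

    len(values) must be a power of two.
    """
    n = len(values)
    shared = list(values)              # stands in for on-chip shared memory
    barrier = threading.Barrier(n)     # stands in for __syncthreads()

    def thread_body(tid):
        stride = n // 2
        while stride > 0:
            if tid < stride:
                # Active threads fold the upper half onto the lower half.
                shared[tid] += shared[tid + stride]
            barrier.wait()             # every thread finishes this step first
            stride //= 2

    threads = [threading.Thread(target=thread_body, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]


block_total = block_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

Within each step the active threads read only from the upper half of the range and write only to the lower half, so they never conflict; the barrier guarantees that a step is complete before the next one reads its results. On a GPU, many such blocks would run independently, which corresponds to the coarse-grained level of parallelism mentioned above.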





