What is gCube?

Different perspectives can be used to describe what gCube is.

The Data Infrastructure enabler perspective

A way to turn infrastructures and technologies into a utility

Delivering an e-Infrastructure service to large organizations is a complex task that requires the integration of several technologies. The complexity of this task stems from several factors:

  • very rich applications and data collections are maintained by a multitude of authoritative providers;
  • different problems call for different execution paradigms: batch, map-reduce, synchronous call, message queue, etc.;
  • several key distributed-computation middleware exist: gLite, Globus, and Unicore for grid-based wide resource sharing; Condor for site resource sharing; Hadoop and Cassandra for cluster resource sharing; etc.;
  • multiple standards coexist, even for applications in the same domain.

gCube, the technology empowering D4Science, offers solutions that abstract over differences in location, protocol, and model. It scales at least as well as the resources it interfaces, keeps failures partial and temporary, and behaves autonomically, reacting to and recovering from a large number of potential issues.

gCube does not hide the underlying infrastructures, middleware, and technologies, nor is it yet another layer. Rather, it turns infrastructures and technologies into a utility by offering single submission, monitoring, and access facilities. It provides a common framework for programming in the large and in the small. It allows private virtualized resources, organized in sites, to be exploited concurrently with resources provided by IaaS and PaaS cloud providers, and it supports transparent utilization of resources managed through OpenNebula.

The data and process enabling framework perspective

An abstraction layer over data, process, and resource management technologies

gCube is a large software framework designed to abstract over a variety of technologies for data, process, and resource management on top of Grid/Cloud-enabled middleware. By exposing them through a comprehensive and homogeneous set of APIs and services, it globally provides:

  • access to different storage back-ends serving a variety of crucial areas. The storage layers offer facilities for handling: (a) multi-versioned software packages and the resolution of dependencies among them; (b) large scientific datasets storable as tables (to be released in the upcoming minor release); (c) Time Series, exposed through an OLAP interface; (d) structured document objects storable as trees of correlated information objects; (e) geo-coded datasets compliant with OGC-related standards; and, finally, (f) plain files;
  • management of metadata in any format and schema, consumable by the same application within the same Virtual Organization;
  • a process execution engine, named PE2ng. PE2ng manages the execution of software elements in a distributed infrastructure under the coordination of a composite plan that defines the data dependencies among its actors. It provides a powerful, flow-oriented processing model that supports several computational middleware without compromising performance. A task can thus be designed as a workflow of invocations of code components (services, binary executables, scripts, map-reduce jobs, etc.), with the engine ensuring that prerequisite data are prepared and delivered to their consumers by controlling the flow of data;
  • a transformation engine that tackles the transformation of data among different manifestations. The engine is manifestation- and transformation-agnostic, offering an intelligent, object-driven operation workflow. It relies on PE2ng and is extensible through transformation-program plugins that can be added as PE2ng components. Each transformation program is registered in the transformers registry and then used at run-time to serve both large (batch) and small (real-time) transformation scenarios;
  • management of Virtual Research Environments (VREs). Through VREs, groups of users have controlled access to distributed data, services, storage, and computational resources integrated under a personalised interface. A VRE supports cooperative activities such as: metadata cleaning, enrichment, and transformation by exploiting mapping schemas, controlled vocabularies, thesauri, and ontologies; process refinement and showcase implementation (restricted to a set of users); data assessment (required to make data exploitable by VO members); expert-user validation of products generated through data elaboration or simulation; and sharing of data and processes with other users.
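
The flow-oriented model described for PE2ng above can be illustrated with a minimal sketch: a plan declares named steps and the data dependencies among them, and the executor runs each step only after its prerequisites have produced their outputs. This is an illustrative toy in Python, not the actual gCube/PE2ng API; the `Plan` class and step names are hypothetical.

```python
# Toy sketch of a flow-oriented composite plan (hypothetical API, not PE2ng's).
from graphlib import TopologicalSorter

class Plan:
    """A composite plan: named steps plus the data dependencies among them."""
    def __init__(self):
        self.steps = {}  # name -> (callable, list of prerequisite step names)

    def add(self, name, func, needs=()):
        self.steps[name] = (func, list(needs))
        return self

    def run(self):
        # Order steps so every prerequisite runs before its consumers.
        order = TopologicalSorter({n: deps for n, (_, deps) in self.steps.items()})
        results = {}
        for name in order.static_order():
            func, deps = self.steps[name]
            # Deliver the prerequisite data to the consuming step.
            results[name] = func(*(results[d] for d in deps))
        return results

# Hypothetical task: fetch two datasets, merge them, then report on the result.
plan = (Plan()
        .add("fetch_a", lambda: [1, 2])
        .add("fetch_b", lambda: [3, 4])
        .add("merge", lambda a, b: a + b, needs=["fetch_a", "fetch_b"])
        .add("report", lambda m: f"{len(m)} records", needs=["merge"]))
print(plan.run()["report"])  # -> 4 records
```

In a real engine each step would be a service call, script, or map-reduce job dispatched to remote middleware rather than a local function, but the dependency-driven ordering is the same idea.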
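
The transformers-registry idea mentioned above can likewise be sketched: transformation programs are registered against (source, target) manifestation pairs, and at run-time the engine finds a chain of registered programs connecting the requested manifestations. This is a hypothetical illustration in Python, not the gCube API; the manifestation names and programs are invented for the example.

```python
# Toy sketch of a manifestation-agnostic transformers registry (hypothetical,
# not gCube's actual API). Programs are keyed by (source, target) pairs.
import json
from collections import deque

registry = {}  # (source_manifestation, target_manifestation) -> program

def register(source, target, program):
    registry[(source, target)] = program

def transform(obj, source, target):
    """Find and apply a chain of registered programs from source to target (BFS)."""
    queue, seen = deque([(source, obj)]), {source}
    while queue:
        manifestation, value = queue.popleft()
        if manifestation == target:
            return value
        for (s, t), program in registry.items():
            if s == manifestation and t not in seen:
                seen.add(t)
                queue.append((t, program(value)))
    raise LookupError(f"no transformation path {source} -> {target}")

# Hypothetical programs: CSV text -> row records -> JSON text.
register("csv", "records", lambda text: [line.split(",") for line in text.splitlines()])
register("records", "json", lambda rows: json.dumps(rows))
print(transform("a,b\nc,d", "csv", "json"))  # -> [["a", "b"], ["c", "d"]]
```

Registering a new plugin for an additional manifestation pair automatically makes new transformation paths available, which is the extensibility property the registry design aims at.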

The scientific application perspective

A way to deliver scientific applications on the cloud

Several scientific applications have been implemented on top of these facilities and delivered to the Fishery and Aquaculture Resource Management communities. After a long incubation phase, those facilities are now an integral part of the gCube release. Among others, the following applications deserve a note:

  • A collaboration-oriented suite providing seamless access and organisation facilities over a rich array of objects (e.g. Information Objects, Queries, Files, Templates, Time Series). It offers mediation capabilities between external-world objects, systems, and infrastructures (import/export/publishing) and supports common file-manager features (drag & drop, contextual menus);
  • A Time Series (TS) framework offering a set of tools to manage large datasets throughout the complete TS lifecycle (validation, curation, analysis, and reallocation). It operates on multi-dimensional statistical data and supports filtering, grouping, aggregation, union, mining, and plotting;
  • An Ecological Niche Modeling suite that predicts the global distribution of marine species. Initially designed for marine mammals and subsequently generalised to marine species, it generates color-coded species range maps in half-degree latitude and longitude blocks by interfacing several scientific species databases and repository providers. It extrapolates known species occurrences to determine environmental envelopes (species tolerances) and predicts future distributions by matching those tolerances against local environmental conditions (e.g. climate change and sea pollution).