Inside Data 69
by Graham Keitch
Graham Keitch examines the data integration solutions offered by Oracle and Microsoft.
HardCopy Issue: 69 | Published: June 1, 2016
Data integration technologies help combine disparate data sets to provide a unified and more complete view, perhaps to meet the requirements of a business application or for analytical and visualisation purposes. Although improvements in database technology have made it easier to adopt a single platform to support a wider range of data types, most businesses have to deal with multiple platforms due to performance, legacy or operational circumstances. The Internet of Things is creating extra multiplicity, as is the increasing trend to work with contextual information such as finance or weather data. The distributed architecture of today’s systems, whether on-premises, in the Cloud or hybrid in nature, creates an even greater need for data integration tools.
Distributed data assets create complexity when you need to run queries across information silos. The traditional approach to this problem is to pull data from the source and load it into a centralised data warehouse having first prepared it for a single view schema, a process known as extract, transform and load (ETL). This data subset is known as a ‘data mart’ and is designed to provide a view of the queried information for a specific purpose. This might be for analytics or to serve the needs of a business unit that doesn’t need to work with the entire database. Warehouses are generally designed to hold summary data of a ‘snapshot in time’ when the ETL was executed, but increasingly businesses are needing to base decisions on real time events and information, some of which will be unstructured.
A more sophisticated approach to data warehousing makes use of data virtualisation which allows the data to be retrieved and manipulated in real time without the need to know anything about its format or location. The source data is left in place which helps reduce network traffic and the likelihood of error. Virtual ETL uses abstracted representations of objects or entities gathered from the various relational, semi-structured and unstructured sources that need to be integrated. The models are stored in a metadata repository which can be persistent or reside in memory.
So-called ‘data lakes’ and data hubs are examples of new architectures that use these techniques, and they are being deployed to house large collections of disparate information. Unlike a data mart which is a subset, a data lake is a central repository for all the data. A further refinement is the data hub which adds value to the store by homogenising the information, providing security and facilitating the creation of standardised queries. The hub derives its name from its hub and spoke architecture which allows line of business users to dip in and out along a defined spoke or channel.
Oracle’s adherence to open standards puts it in a good position to deal with integration requirements and the company provides a range of tools for this. Oracle Data Integrator (ODI) Enterprise Edition 12c is their flagship integration product designed for warehousing, BI and other tasks that need to pull data from disparate sources. It also solves the real time issue when used alongside Oracle GoldenGate which handles replication to ensure the warehouse is synchronised with the live data set. The Knowledge Modules included with ODI implement the data flows and define templates for code generation across the multiple sources.
This extensible framework will help reduce development costs. Big Data creates one of the greatest integration challenges because of its unstructured nature. For many organisations, this means tapping into completely different and unfamiliar database technologies. ODI provides prebuilt ETL integration with today’s leading Big Data platforms.
Integration tools are also provided by Microsoft for its SQL Server and SQL Azure platforms. SQL Server Integration Services (SSIS) is a component of the Standard, Business Intelligence and Enterprise editions of SQL Server 2014, providing tools for building data integration solutions. SQL Server Data Tools (SSDT) lets you develop Integration Services packages which are managed and executed by SQL Server Management Studio. Each package contains information related to configuration, connections, control and data flow.
As with Oracle ODI, SSIS is an ETL tool and Microsoft provides third party software that allows data to be loaded or unloaded from other databases in the context of Microsoft SSIS. Adapters are available for Oracle, Teradata and SAP BI.
By its nature, the Cloud is creating a more distributed and complex environment for data, but it can also be an ideal environment for pulling it all together. Most database vendors and cloud platforms have introduced or are introducing tools that allow businesses to access and work with disparate data sources previously beyond reach.