ETL Framework

Adastra ETL Framework

Adastra ETL Framework is a set of industry standards, best practices and maintainability principles for data integration based on Microsoft SQL Server technologies. It is not only a tool: it translates key DWH integration standards into code that is independent of the individual developer, thereby enforcing these standards.

Adastra ETL Framework takes advantage of SSIS for effective management of the data workflows of individual ETL/ELT processes (see Figure 3 - Centrally managed metadata used for definition of ETL processes), in combination with stored procedures generated from stored metadata, thus achieving the best possible performance of data transformations. The Framework is based on the ELT principle, which executes the individual transformations directly within the database engine.

Another difference between the classic ETL and the ELT approach is the role the ETL/ELT engine plays in the maintenance of ETL metadata, BI governance and data governance.

  • When using the Adastra ETL Framework, the metadata is stored in a database and the ETL transformation code is generated from it. The same metadata is used for workflow management in SSIS. Changes in ETL transformations are represented by changes to the metadata in the database. The metadata also documents the transformations and can be used for further documentation and analysis (e.g. impact analysis for new development).
  • When using the classic ETL approach, the metadata is only an input for manual ETL development done by the developer, either in an SSIS package or directly in T-SQL code. Changes in ETL transformations then require lengthy analysis of the existing ETL code and SSIS packages.
  • At the same time, Adastra ETL Framework provides uniform logging, audit and monitoring. The Framework always logs the same operations in the same way and creates consistent audit trails. It also provides the flexibility to adapt the engine to the individual requirements of the customer.

For these reasons, the fit of the Adastra ETL Framework is defined by the role the customer expects from the solution rather than by the size of the solution alone. Our experience shows that all projects we have delivered in the last decade have been based on the Framework, owing to the effectiveness, reliability and flexibility of development.

All transformations are executed based on metadata descriptions. The metadata describes the source and target structures and defines the way the data is retrieved and transformed. The resulting product is a generated T-SQL script which is then executed in regular loads. The metadata is easy to manage, can be versioned and can also be used to generate documentation of the DWH solution (data dictionary).
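
For illustration, a simplified sketch of how such metadata might be stored and how a load script could be generated from it is shown below. The meta.ColumnMapping table and its columns are illustrative assumptions only, not the Framework's actual metadata model.

    -- Illustrative metadata table; names are assumptions, not the Framework's actual model
    CREATE TABLE meta.ColumnMapping
    (
        MappingId        int IDENTITY(1,1) PRIMARY KEY,
        TargetTable      sysname        NOT NULL,   -- e.g. 'L1.DimCustomer'
        TargetColumn     sysname        NOT NULL,
        SourceTable      sysname        NOT NULL,   -- e.g. 'L0.Customer'
        SourceExpression nvarchar(4000) NOT NULL    -- e.g. 'LTRIM(RTRIM(src.CustomerName))'
    );

    -- Sketch of a generator assembling an INSERT ... SELECT statement from the metadata
    -- (STRING_AGG requires SQL Server 2017 or later)
    DECLARE @sql nvarchar(max);

    SELECT @sql = N'INSERT INTO ' + MAX(TargetTable)
                + N' (' + STRING_AGG(TargetColumn, N', ') + N')'
                + N' SELECT ' + STRING_AGG(SourceExpression, N', ')
                + N' FROM ' + MAX(SourceTable) + N' AS src;'
    FROM meta.ColumnMapping
    WHERE TargetTable = N'L1.DimCustomer';

    EXEC sys.sp_executesql @sql;   -- the generated script is what the regular load executes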

ETL processes managed by metadata (see Figure 3 - Centrally managed metadata used for definition of ETL processes) have many advantages compared to the classic ETL approach:

  • Defined ETL patterns – each customer has specific requirements for the treatment of dates, times, text and numeric values, etc. Reusing these patterns in all ETL pumps brings distinct added value: it decreases development time and significantly increases the reliability of ETL pump development.
  • Generation of ETL pumps – in the event of a logic change, e.g. a change of a single pattern, it is possible to regenerate all ETL pumps without opening tens of ETL packages individually. This significantly reduces the time needed for implementation.
  • Logging and audit – when the ETL pumps are generated, logging is automatically added to the generated code at the chosen level of detail. When ETL pumps are coded manually, developers usually write insufficient log entries, and flaws or erroneous code are very difficult to discover until users notice unexpected results. A change in the logging requirement can be made once and propagated to all ETL pumps.
  • Business rules and validation – as part of the generated ETL pumps it is easy to add decision workflows and validations which require an intervention from a user or a system.
  • Support for all DWH layers – the ETL Framework is used for all DWH layers, specifically the L0, L1 and L2 layers. Consistency checks comparing input and output data can also be integrated with the engine. Each layer has its specifics.
  • Addition of new modules – the ETL Framework can be easily extended with further functionality using the existing metadata or by adding additional metadata to the existing model. An example is the generation of automated exports for integration with other applications.
  • Support of historization – in DWH terminology, SCD types 0, 1, 2 and 4 are supported, i.e. types of versioning of dimension members over time (an SCD 2 sketch follows this list).
  • Single place of maintenance – all changes, extensions and amendments are done in a single place, so it is easy to verify correct functionality, discover errors or add new features.
  • Multi-database support – the ETL Framework supports multiple databases in parallel, i.e. you only need one instance of the Framework to support all DWH functions for one physical infrastructure environment. Multiple databases can represent multiple DWHs, Data Hubs, etc.
  • Easy documentation – because all transformations are defined by metadata, it is easy to document them for users and administrators via reports and documents.
  • Monitoring – thanks to enforced logging, solution metadata and system metadata, the Framework provides an extensive information base for monitoring DWH loads. This function provides the key data for managing the performance of DWH loads.
  • Support for OLAP processing – the ETL workflow includes support for processing Multidimensional OLAP. Only changed dimensions and partitions are recognized and processed.
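
As an illustration of the historization support above, generated code for an SCD type 2 dimension typically closes the current version of a changed member and inserts a new one. A minimal sketch of such generated T-SQL follows; the table and column names are illustrative assumptions, not the Framework's actual objects.

    -- Minimal SCD type 2 sketch: close the current version of changed members,
    -- then insert new versions and brand-new members
    -- (dbo.DimCustomer, stage.Customer and the validity columns are illustrative names)
    DECLARE @LoadDate date = CAST(GETDATE() AS date);

    UPDATE d
    SET    d.ValidTo   = @LoadDate,
           d.IsCurrent = 0
    FROM   dbo.DimCustomer AS d
    JOIN   stage.Customer  AS s ON s.CustomerCode = d.CustomerCode
    WHERE  d.IsCurrent = 1
      AND (d.CustomerName <> s.CustomerName OR d.Segment <> s.Segment);

    INSERT INTO dbo.DimCustomer (CustomerCode, CustomerName, Segment, ValidFrom, ValidTo, IsCurrent)
    SELECT s.CustomerCode, s.CustomerName, s.Segment, @LoadDate, '9999-12-31', 1
    FROM   stage.Customer AS s
    LEFT JOIN dbo.DimCustomer AS d
           ON d.CustomerCode = s.CustomerCode AND d.IsCurrent = 1
    WHERE  d.CustomerCode IS NULL;   -- no current version exists: new member or new version of a changed member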

Smart parallelism – the maximum number of execution threads that can run in parallel can be set at the level of the DWH load. The ETL Framework workflow watches the list of ETL processes that are ready to run with respect to their dependencies and, based on the history of the last "x" ETL runs, evaluates which of them will be triggered first (by the rule "the longest-running first"). A sketch of this selection follows.
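
The selection could be pictured as the query below; it assumes an illustrative queue of processes whose dependencies are already satisfied and a log table with the durations of past runs (etl.ReadyQueue and etl.ProcessRunLog are assumptions, not the Framework's actual objects).

    -- Pick the next ready ETL process: the one with the longest average duration
    -- over the history of the last "x" runs goes first (here x = 5)
    SELECT TOP (1) q.ProcessId
    FROM etl.ReadyQueue AS q
    OUTER APPLY
    (
        SELECT AVG(DATEDIFF(SECOND, h.StartTime, h.EndTime)) AS AvgDurationSec
        FROM (
            SELECT TOP (5) StartTime, EndTime
            FROM etl.ProcessRunLog
            WHERE ProcessId = q.ProcessId
            ORDER BY StartTime DESC
        ) AS h
    ) AS hist
    ORDER BY hist.AvgDurationSec DESC;   -- longest-running processes are triggered first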


Figure 3 - Centrally managed metadata used for definition of ETL processes

Our experience shows that the reliability and quality of results generated by the ETL Framework are several times higher than with the classic ETL approach.

Data quality management, validation and notification framework

Data quality management is an activity that spans the whole company. Adastra ETL Framework has built-in support for this critical function.

From the perspective of data quality, we distinguish two main types of activities:

  • Handling of data errors encountered during DWH transformations, so that poor-quality data does not affect the quality of data already present in the DWH or of DWH outputs (reports, exports).
  • Evaluation of data quality as a result of business validations defined for attributes, entities or relationships, which may or may not be related to the DWH transformations. These can serve the general improvement of processes and data quality in the company’s source systems.

As standard, the Framework supports two types of validations: technical validations in the L0 layer, which are automated checks of data types during the L0 load from data sources that do not provide reliable data types; and logical and user validations in all layers (L0/L1/L2), which check the values for their business meaning. Typical logical and user validations include:

  • Checks of existence of dimension members – a check whether a dimension member being loaded into a fact table exists in the related dimension (a sketch of such a check follows the priority list below).
  • Checks of mandatory values which can never be left empty or zero.
  • User-defined checks, e.g. an attribute position cannot change from “Node” to “Leaf”, an attribute cannot change at all, an attribute can have only specific values, checks of patterns, formulas, totals, row count comparisons of source data with the target, etc. Validation commands can be defined in a general way and then reused in many places during the load.

All validations can have an error priority assigned to them:

  • Warning (soft validation) – the system writes a warning to the log and the ETL run continues in the standard way.
  • Validation error (hard validation) – the system writes an error to the log, the ETL run is stopped and no changes are propagated to the target dimension or fact table.
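
A simplified sketch of the dimension-member existence check mentioned above, treated as a hard validation, is shown below. The table names and the etl.ValidationLog structure are illustrative assumptions, not the Framework's actual objects.

    -- Sketch of a hard validation: fact rows referencing a non-existent dimension member
    -- (stage.FactSales, dbo.DimProduct and etl.ValidationLog are illustrative names)
    DECLARE @ErrorRows int;

    SELECT @ErrorRows = COUNT(*)
    FROM stage.FactSales AS f
    LEFT JOIN dbo.DimProduct AS d ON d.ProductCode = f.ProductCode
    WHERE d.ProductCode IS NULL;

    IF @ErrorRows > 0
    BEGIN
        INSERT INTO etl.ValidationLog (CheckName, Severity, AffectedRows, LoggedAt)
        VALUES (N'FactSales: missing product dimension member', N'Error', @ErrorRows, SYSDATETIME());

        -- Hard validation: stop the ETL run so that no changes reach the target fact table
        THROW 50001, N'Validation error: missing dimension members in FactSales.', 1;
    END;
    -- A soft validation would log Severity = N'Warning' instead and let the load continue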

Notifications

Information about data quality check failures, business validations or errors is communicated in two different ways:

  • Via the Warning Report/Error Report


Figure 4 - Notifications

  • Via email, if the user is set up to receive the notification

Figure 5 - Email notification

All error messages can be translated for users using an error translation dictionary. This functionality is useful mainly for system errors which users may not understand.

For example, the system error:

Procedure ended with error:207 - Invalid column name "Code"

which was caused by a manually created data file, can be translated into a more user-friendly message:

Obsolete structure of the uploaded file. Missing column "Code". Please upgrade to the latest version.

Validations can have a priority set. It is then possible to create a general low-priority notification in case the error does not correspond to any defined dictionary entry (e.g. “Unmanaged error. Please contact system administrator at dwhadmins@APS.com”).
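
A sketch of how such a dictionary lookup could work, including the low-priority fallback message, follows; the etl.ErrorTranslation table and its columns are illustrative assumptions, not the Framework's actual model.

    -- Illustrative error translation dictionary with a generic low-priority fallback
    CREATE TABLE etl.ErrorTranslation
    (
        ErrorPattern nvarchar(400)  NOT NULL,   -- LIKE pattern matched against the raw system error
        UserMessage  nvarchar(1000) NOT NULL,
        Priority     tinyint        NOT NULL    -- e.g. 1 = high, 3 = low
    );

    INSERT INTO etl.ErrorTranslation (ErrorPattern, UserMessage, Priority)
    VALUES (N'%Invalid column name "Code"%',
            N'Obsolete structure of the uploaded file. Missing column "Code". Please upgrade to the latest version.',
            1);

    DECLARE @RawError    nvarchar(1000) = N'Procedure ended with error:207 - Invalid column name "Code"',
            @UserMessage nvarchar(1000),
            @Priority    tinyint;

    SELECT TOP (1) @UserMessage = t.UserMessage, @Priority = t.Priority
    FROM etl.ErrorTranslation AS t
    WHERE @RawError LIKE t.ErrorPattern
    ORDER BY t.Priority;

    IF @UserMessage IS NULL   -- no dictionary entry matched: use the generic low-priority notification
        SELECT @UserMessage = N'Unmanaged error. Please contact system administrator at dwhadmins@APS.com',
               @Priority    = 3;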

Logging and monitoring

Each run of a transformation routine is properly logged. Each row in a target database carries information about which routine created it and which routine updated it last. Such logging allows backward analysis even of non-standard changes in primary systems.
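
The row-level lineage can be pictured as a pair of audit columns on every target table, for example as in the sketch below (the table and the InsertedByProcessId / UpdatedByProcessId names are illustrative, not names mandated by the Framework).

    -- Illustrative audit columns carried by every row in a target table
    CREATE TABLE dbo.DimAccount
    (
        AccountKey          int IDENTITY(1,1) PRIMARY KEY,
        AccountCode         nvarchar(50)  NOT NULL,
        AccountName         nvarchar(200) NOT NULL,
        InsertedByProcessId int NOT NULL,   -- run of the routine that created the row
        UpdatedByProcessId  int NULL        -- run of the routine that last updated the row
    );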

Logging is done at three levels:

  • Package
    • Loaded files/tables
    • Process
      • Detail of process

Package

Refers to a single run of a set of ETL processes (e.g. the load of accounting data).

Each run of a package logs the start and end time of the package (phase L0, L1 etc.).

The main monitoring screen shows all package runs with information on whether the package is still running, when it was last run, etc.

The total duration is available, including the duration of individual phases and information about the number of validation or system errors and warnings.
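
The monitoring screen itself is a Framework report; purely for illustration, such an overview could be derived from the illustrative log tables sketched above roughly as follows.

    -- Sketch of a monitoring query: last run, running flag and average duration per package
    SELECT p.PackageName,
           MAX(p.StartTime) AS LastStart,
           MAX(CASE WHEN p.EndTime IS NULL THEN 1 ELSE 0 END) AS IsStillRunning,
           AVG(DATEDIFF(SECOND, p.StartTime, p.EndTime)) AS AvgDurationSec
    FROM etl.PackageRunLog AS p
    GROUP BY p.PackageName;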


Figure 6 - History of package runs

The main screen links to reports showing the average duration of each package (see Figure 7 - Average duration of a single package run for the last month) or the total run time of packages for a chosen time period (see Figure 8 - History of execution of a particular package).


Figure 7 - Average duration of a single package run for the last month

Package processes and loaded files/tables

Each package is monitored at the level of individual ETL processes – see Figure 8 - History of execution of a particular package.

For each process, information about the number of new/changed/deleted rows is recorded. The report also contains a list of loaded files if files are loaded from a file system. It also contains monthly averages of relational layer processing, OLAP processing and both parts together.


Figure 8 - History of execution of a particular package

Process detail

Each process has a detailed log that makes it possible to analyze and tune the routine and to discover errors and poorly performing parts of it.


Figure 9 - Process detail

ETL Framework

  • Price of license: 1,990,000 CZK

Support: 20% of the purchase price per annum.

Prices do not include VAT.