I can’t stress enough the importance of designing the framework for your ETL prior to getting stuck in with individual packages.
I bring this up because I have seen that look of dread on a developer’s face (including my own) when they are about to deploy a package to the live environment and aren’t quite sure what gremlins are hiding there that don’t exist in development. On top of this, with the proliferation of SSIS packages across the corporation, it can become extremely difficult to understand and manage the process flow across all your servers.
It is for this reason that I am a big fan of metadata-controlled ETL. And I’m not just referring to a configuration table set up as part of an SSIS package; I’m talking about modelling a set of tables within a control database that can hold all the data a package might need to run, together with a generic parent package that can use this metadata to control the execution of any packages defined within those tables.
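As a rough sketch of what I mean, a control database might contain tables along these lines. The schema, table and column names here are purely illustrative, not a prescribed design:

```sql
-- Hypothetical control tables; names and columns are illustrative only.
CREATE TABLE etl.Package (
    PackageId    INT IDENTITY(1,1) PRIMARY KEY,
    PackageName  NVARCHAR(260) NOT NULL,  -- name/path the parent package executes
    PhaseName    NVARCHAR(50)  NOT NULL,  -- e.g. 'Staging' or 'Publishing'
    IsEnabled    BIT NOT NULL DEFAULT 1   -- switch a package on or off
);

CREATE TABLE etl.PackageDependency (
    PackageId          INT NOT NULL REFERENCES etl.Package (PackageId),
    DependsOnPackageId INT NOT NULL REFERENCES etl.Package (PackageId),
    PRIMARY KEY (PackageId, DependsOnPackageId)
);

CREATE TABLE etl.PackageExecutionLog (
    ExecutionId INT IDENTITY(1,1) PRIMARY KEY,
    PackageId   INT NOT NULL REFERENCES etl.Package (PackageId),
    StartTime   DATETIME NOT NULL,
    EndTime     DATETIME NULL,
    Status      NVARCHAR(20) NOT NULL     -- e.g. 'Running', 'Succeeded', 'Failed'
);
```

The generic parent package can then loop over the enabled rows (a Foreach Loop container feeding an Execute Package task is the usual SSIS pattern for this) and write a row to the log table for each run.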
Unlike DTS, the column metadata within an SSIS package cannot be set dynamically at runtime. This means (if you are like me and like to organise your packages into modular units of work) that you will probably have one SSIS package per table. Most enterprise data warehouses I’ve worked on have in excess of 100 tables across, say, 6 or 7 data marts, and the transforms might be split into 2 or 3 phases (Staging and Publishing, for example). You are now looking at some 200 packages!
When your SQL Agent activity window becomes this busy, that look of dread I was referring to earlier might become more frequent. The trouble is that by this point, the work involved in reconfiguring all the packages is enormous. It’s far better to invest the time at the start of the project in a framework that can handle this scenario.
There are other benefits to having a metadata driven ETL:
- It encourages conventions and standards in future development.
- It encourages a more modular approach to package development. Imagine a scenario where you have 10 new tables that require data imports and transforms. You might think the easiest thing to do is put all the data flows into a single SSIS package so that you can set up the dependencies and control the order in which the transforms happen. With a metadata approach you can set up dependencies outside the packages, giving you the flexibility to run some data flows and not others when required.
- This also has the added benefit that when a package needs fixing, or updating for table changes say, you only need to swap out one package containing one transform, without the danger of breaking the entire process.
- It has its own cost benefits too, in that it’s much easier for new developers to pick up the work left by others. Given that they can interrogate the control database, they can quickly build up a picture of the batch processes.
- It’s much easier to monitor a SQL table than it is to monitor the SQL Agent (especially if you are not a SQL Server admin). You can easily grant access to the relevant tables so that analysts or report consumers can update them, turning certain packages on or off, say.
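To illustrate the last two points (assuming a hypothetical control schema with a package table carrying an enabled flag, a dependency table and an execution log — the names below are mine, not part of any standard), switching a package off or finding what is ready to run is plain SQL:

```sql
-- Hypothetical table and column names; adjust to your own control schema.
-- Disable a single package without touching SQL Agent or the package itself:
UPDATE etl.Package
SET    IsEnabled = 0
WHERE  PackageName = 'LoadCustomerDim.dtsx';

-- List enabled packages whose dependencies have all succeeded,
-- i.e. packages with no dependency lacking a 'Succeeded' log entry:
SELECT p.PackageId, p.PackageName
FROM   etl.Package AS p
WHERE  p.IsEnabled = 1
  AND  NOT EXISTS (
           SELECT 1
           FROM   etl.PackageDependency AS d
           LEFT JOIN etl.PackageExecutionLog AS l
                  ON  l.PackageId = d.DependsOnPackageId
                  AND l.Status    = 'Succeeded'
           WHERE  d.PackageId   = p.PackageId
             AND  l.ExecutionId IS NULL
       );
```

Queries like these are exactly what a generic parent package, an analyst with read access, or a monitoring report can all share, which is where much of the leverage of this approach comes from.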
Although this post is about the benefits derived from metadata-driven ETL, it’s also important to note that this isn’t the only thing that needs attention prior to any SSIS development. Coding standards, naming conventions, clear annotation, code review and sufficient environments are parts of an ETL framework that are just as important.
I have created a cut-down version of a metadata-controlled solution, which I will publish in my next post.