A data warehouse is a centralized, integrated database containing data from an organization's heterogeneous source systems. The data is transformed to eliminate inconsistencies, aggregated to produce summaries, and loaded into the warehouse. Multiple users can access this database, ensuring that each group in the organization works from valuable, stable data.
To process large volumes of data from heterogeneous source systems effectively, ETL (Extraction, Transformation and Load) software implements parallel processing.
Parallel processing is divided into pipeline parallelism and partition parallelism.
IBM Information Server (DataStage) allows us to use both parallel processing methods.
DataStage pipelines data (where possible) from one stage to the next, and nothing has to be done to make this happen. All the stages in a job operate simultaneously: a downstream stage starts processing as soon as data becomes available from the upstream stage. Pipeline parallelism eliminates the need to store intermediate results on disk.
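As an illustrative sketch (this is plain Python, not DataStage itself), pipeline parallelism can be modeled with generators: each stage yields rows downstream as soon as they are produced, so nothing is landed to disk between stages.

```python
# Sketch of pipeline parallelism using Python generators.
# Each "stage" hands rows to the next as soon as they are ready,
# with no intermediate storage between stages.

def extract():
    # Upstream stage: produce raw rows one at a time.
    for row in [" alice,10 ", " bob,20 ", " carol,30 "]:
        yield row

def transform(rows):
    # Middle stage: begins consuming while extract() is still running.
    for row in rows:
        name, value = row.strip().split(",")
        yield (name, int(value))

def load(rows):
    # Downstream stage: collect results as they arrive.
    return {name: value for name, value in rows}

result = load(transform(extract()))
print(result)  # {'alice': 10, 'bob': 20, 'carol': 30}
```

Because generators are lazy, `load` pulls rows through `transform` and `extract` one at a time, mirroring how a downstream DataStage stage consumes data while the upstream stage is still producing it.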
The aim of most partitioning operations is to end up with a set of partitions that are as near equal in size as possible, ensuring an even load across processors. Partitioning is ideal for handling very large quantities of data: the data is broken into partitions, and each partition is handled by a separate instance of the job's stages.
Combining pipeline and partition parallelism:
A greater performance gain can be achieved by combining pipeline and partition parallelism. The data is divided into partitions, and the partitioned data fills the pipeline so that a downstream stage processes it while the upstream stage is still running. DataStage allows us to use both of these parallel processing methods in parallel jobs.
DataStage can also repartition the partitioned data to meet business requirements, and repartitioned data is not written to disk.
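To make the repartitioning idea concrete, here is a hedged in-memory sketch (plain Python, not DataStage): rows are reassigned to partitions by hashing a business key, so that all rows sharing a key land in the same partition, without any data landing to disk. The column name `region` is purely an example.

```python
# Sketch of hash repartitioning entirely in memory.
# Rows are redistributed so that rows with the same key
# always end up in the same partition.

def hash_repartition(rows, key, n_partitions):
    # Assign each row to a partition based on the hash of its key column.
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[key]) % n_partitions].append(row)
    return partitions

rows = [
    {"region": "EMEA", "amount": 100},
    {"region": "APAC", "amount": 250},
    {"region": "EMEA", "amount": 75},
]
parts = hash_repartition(rows, "region", 2)
# Both EMEA rows land in the same partition, so a downstream
# per-region aggregation can run on each partition independently.
```

Hash partitioning is the natural choice when a downstream stage groups or joins on a key, because it guarantees key locality within each partition.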
Parallel processing environments:
The environment in which you run your DataStage jobs is defined by your system's architecture and hardware resources.
All parallel processing environments can be categorized as:
- SMP (Symmetrical Multi Processing)
- Clusters or MPP (Massive Parallel Processing)
SMP (symmetric multiprocessing), shared memory:
- Some hardware resources may be shared among processors.
- Processors communicate via shared memory and have a single operating system.
- All CPUs share system resources.
MPP (massively parallel processing), shared-nothing:
- An MPP system can be viewed as a set of connected SMPs.
- Each processor has exclusive access to hardware resources.
- MPP systems are physically housed in the same box.
- Clusters are UNIX systems connected via networks.
- Cluster systems can be physically dispersed.
Understanding these parallel processing methods and environments enabled me to understand the overall architecture of parallel jobs in DataStage.