Parallelism -
Ab Initio supports three types of parallelism: component parallelism, pipeline parallelism, and data parallelism.
1) Component parallelism occurs when more than one component is running at the same time on different streams of data.
For example, you could use two FILTER BY EXPRESSION components at the same time, one to find transactions greater than a certain amount and one to find transactions less than a certain amount.
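As a rough analogy outside Ab Initio, the idea can be sketched in Python: two filters run concurrently, each on its own stream of records. The function name, the sample transactions, and the thresholds are all hypothetical, and Python threads only approximate the way graph components run side by side.

```python
from concurrent.futures import ThreadPoolExecutor

def filter_by_expression(records, predicate):
    """Keep only the records for which the predicate is true."""
    return [r for r in records if predicate(r)]

# Two independent streams of data (hypothetical sample transactions).
stream_a = [{"id": 1, "amount": 50}, {"id": 2, "amount": 5000}]
stream_b = [{"id": 3, "amount": 10}, {"id": 4, "amount": 700}]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Both filters are submitted together, so neither waits for the other.
    large = pool.submit(filter_by_expression, stream_a, lambda r: r["amount"] > 1000)
    small = pool.submit(filter_by_expression, stream_b, lambda r: r["amount"] < 100)
    large_transactions = large.result()
    small_transactions = small.result()
```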
2) Pipeline parallelism occurs when two or more components process different parts of the same data stream at the same time. This can happen when the first component produces output records before it has finished reading all its input. This enables the next component to begin processing before the first one has finished.
Any component that must read all its input before producing any output is said to “break” pipeline parallelism because the next component must wait for that component to finish before starting.
For example, a SORT component always breaks pipeline parallelism: it must see all the records before producing any output because the last record it reads might be the first one in the sort order.
Similarly, a ROLLUP component whose sorted-input parameter is set to False breaks pipeline parallelism, as its results cannot be produced until the input stream has been completely consumed.
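The difference between a streaming component and one that breaks the pipeline can be illustrated with Python generators. This is only a sketch of the order in which records flow, not Ab Initio code: the stage names are hypothetical, and generators interleave work rather than run truly in parallel.

```python
def read_records(values, log):
    # Source stage: hands records downstream one at a time.
    for v in values:
        log.append(f"read {v}")
        yield v

def reformat(records, log):
    # Streaming stage: emits each record as soon as it arrives.
    for r in records:
        log.append(f"reformat {r}")
        yield r * 10

def sort_records(records, log):
    # Blocking stage: must consume all input before emitting anything.
    buffered = sorted(records)
    for r in buffered:
        log.append(f"emit {r}")
        yield r

streaming_log = []
list(reformat(read_records([3, 1, 2], streaming_log), streaming_log))

blocking_log = []
list(sort_records(read_records([3, 1, 2], blocking_log), blocking_log))
```

In streaming_log the read and reformat entries alternate, because the downstream stage starts work on each record immediately. In blocking_log all three reads come first, because the sort cannot emit anything until its input is exhausted.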
3) Data parallelism occurs when a graph separates data into multiple divisions, allowing multiple copies of program components to operate on the data in all the divisions simultaneously.
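A minimal Python sketch of the same idea follows, assuming the data is divided by a key and the same logic runs once per division. The partitioning rule, the field names, and the four-way split are illustrative assumptions, not Ab Initio settings.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_by_key(records, n_partitions):
    # Divide the data so that each partition holds a disjoint subset.
    partitions = [[] for _ in range(n_partitions)]
    for r in records:
        partitions[r["customer_id"] % n_partitions].append(r)
    return partitions

def total_amount(partition):
    # The same logic is applied independently to every partition.
    return sum(r["amount"] for r in partition)

records = [{"customer_id": i, "amount": i * 10} for i in range(8)]
partitions = partition_by_key(records, 4)

with ThreadPoolExecutor(max_workers=4) as pool:
    # One copy of the work per partition, all running at the same time.
    partial_totals = list(pool.map(total_amount, partitions))

grand_total = sum(partial_totals)
```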
Max memory - When sorted-input is set to true, the max-memory parameter applies: the maximum memory usage, in bytes, before spilling to disk.
When sorted-input is set to false (that is, in-memory processing of unsorted input), the driving port and max-core parameters come into the picture instead.
Driving port - The largest input; all other inputs are read into memory.
Max-core - Maximum memory usage, in bytes, for the non-driving inputs before spilling to disk.
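The role of the driving port can be sketched in Python, assuming the parameters above describe an in-memory join: the non-driving input is loaded into a lookup table, while the largest (driving) input is streamed past it. The function, field names, and sample data are hypothetical, and the sketch does not model the max-core limit or spilling to disk.

```python
def in_memory_join(driving, non_driving, key):
    # Load the non-driving input fully into memory as a lookup table.
    lookup = {}
    for row in non_driving:
        lookup.setdefault(row[key], []).append(row)
    # Stream the driving input one record at a time; it is never held in full.
    for row in driving:
        for match in lookup.get(row[key], []):
            yield {**match, **row}

# The large input is passed as an iterator to stress that it is only streamed.
transactions = iter([
    {"cust": 1, "amount": 20},
    {"cust": 2, "amount": 35},
    {"cust": 1, "amount": 5},
])
customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]

joined = list(in_memory_join(transactions, customers, "cust"))
```

Choosing the largest input as the driving one keeps the in-memory table as small as possible, which is why the other inputs are the ones counted against the memory limit.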
To print a record whose format or data is corrupt, use the -show-partial argument in the m_dump command.
To count the number of records in the dumped output, use the -print-n-records argument.
Phases -
A phase is a stage of a graph that runs to completion before the start of the next phase. By dividing a graph into phases, we can make the best use of resources such as memory, disk space, and CPU cycles, ensuring, for example, that sufficient resources will be available for an especially demanding part of the job.
The boundary between two phases is called a phase break, and it belongs to the first of the two phases.
In the process of completing one phase before the next begins, the component immediately before a phase break writes all the data passing through it into temporary files in the layout of the component immediately after the phase break. When the first phase finishes running, the components after the phase break read these temporary files to begin the next phase.
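The hand-off at a phase break can be sketched in Python as two functions that communicate only through a temporary file. The file name, the JSON-lines format, and the work done in each phase are illustrative assumptions; the Co>Operating System manages its own temporary files.

```python
import json
import os
import tempfile

def phase_one(records, temp_path):
    # Runs to completion, writing everything that crosses the phase break.
    with open(temp_path, "w") as f:
        for r in sorted(records):
            f.write(json.dumps(r) + "\n")

def phase_two(temp_path):
    # Starts only after phase one has finished, reading the temporary file.
    with open(temp_path) as f:
        return [json.loads(line) * 2 for line in f]

with tempfile.TemporaryDirectory() as workdir:
    temp_path = os.path.join(workdir, "phase_break.tmp")
    phase_one([3, 1, 2], temp_path)
    result = phase_two(temp_path)
```

Because the second phase starts only after the first has released its memory, the two phases never compete for resources, at the cost of writing the intermediate data to disk.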
Checkpoints -
A checkpoint is a point at which the Co>Operating System saves all the information it would need to restore a job to its state at that point. In case of failure, you can recover completed phases of a job up to the last completed checkpoint.
In batch graphs, you can have checkpoints only at phase breaks. When you set a phase in the GDE, by default it has a checkpoint at its phase break. We can remove the checkpoint by clicking the Toggle Checkpoint button to switch it from the on position to the off position.
As the execution of the graph successfully passes the first checkpoint, the Co>Operating System saves all the information it needs to restore the job to its state at that checkpoint.
As the execution of the graph successfully passes each succeeding checkpoint, the Co>Operating System:
Deletes the information it has saved to be able to restore the job to its state at the preceding checkpoint
Deletes the temporary files it has written in the layouts of the components in all phases since the preceding checkpoint
Commits the effects on the filesystem of all phases since the preceding checkpoint.
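The recovery behavior described above can be sketched in Python, assuming a checkpoint is just a small file that records how many phases have completed. The run_job function, its fail_at parameter, and the checkpoint file format are hypothetical and far simpler than what the Co>Operating System actually saves.

```python
import json
import os
import tempfile

def run_job(phases, checkpoint_path, fail_at=None):
    # Resume after the last completed checkpoint, if one was saved.
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["completed_phases"]
    for index in range(done, len(phases)):
        if index == fail_at:
            raise RuntimeError(f"phase {index} failed")
        phases[index]()
        # Overwriting the file discards the preceding checkpoint's state.
        with open(checkpoint_path, "w") as f:
            json.dump({"completed_phases": index + 1}, f)
    os.remove(checkpoint_path)  # job finished, nothing left to recover

log = []
phases = [
    lambda: log.append("phase 0"),
    lambda: log.append("phase 1"),
    lambda: log.append("phase 2"),
]

with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, "job.checkpoint")
    try:
        run_job(phases, checkpoint_path, fail_at=1)  # simulated failure
    except RuntimeError:
        pass
    run_job(phases, checkpoint_path)  # restart: phase 0 is not repeated
```

After the restart, log contains each phase exactly once, because the second run picks up from the checkpoint written after phase 0.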