Tuesday, 30 November 2021

Datasets and Lineage -

Data files, tables, and queues are represented in the EME by logical datasets (also called EME datasets), which are placeholders that contain metadata: the location and record format of the actual, physical data. They do not contain the data itself (physical data is not checked in to the EME). When a graph is checked in, the EME inspects all the file and table components in the graph, comparing them with the logical datasets that already exist in the EME datastore. If no corresponding dataset is found, the EME creates one, by default in one of the following locations:

For a file component, in the location derived from the component URL.

For a table component, in the tables directory of the project where the .dbc file is located.

In the EME datastore, you want to make sure that all data files that are logically the same are represented by the same logical dataset, so that lineage remains intact. You can design for and enable lineage in a number of ways.

Dataset Components -

The dataset components represent records or act on records, as follows:

BLOCK-COMPRESSED LOOKUP TEMPLATE

INPUT FILE

INPUT TABLE

INTERMEDIATE FILE

LOOKUP FILE

LOOKUP TEMPLATE

OUTPUT FILE

OUTPUT TABLE

READ MULTIPLE FILES

WRITE MULTIPLE FILES




Monday, 29 November 2021

Two-stage Routing in Ab Initio -

Two-stage routing is available only if the number of partitions in the source and target is the same.

When you set two-stage routing, the flow symbol changes from a pattern with one X in the middle to a pattern with two Xs in the middle.

Use two-stage routing for all-to-all flows.

When an all-to-all flow connects layouts containing a large number of partitions, the Co>Operating System uses many network resources. If the number of partitions in the source and destination components is N, an all-to-all flow uses resources proportional to N².

To save network resources, you can mark an all-to-all flow for two-stage routing. With two-stage routing, the all-to-all flow uses only about 2N√N resources.

For example, an all-to-all flow with 25 partitions uses 25*25 = 625 resources, but with two-stage routing it uses only 2*25*5 = 250 resources.
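You can check these numbers for any partition count with a quick awk one-liner (sqrt is an awk built-in); N here is the number of partitions:

awk 'BEGIN { N = 25; printf "all-to-all: %d, two-stage: %d\n", N*N, 2*N*sqrt(N) }'

For N = 25 this prints 625 and 250, matching the example above.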

The source component must be one of the following to use two-stage routing - 

PARTITION BY KEY

PARTITION BY RANGE

PARTITION BY ROUND-ROBIN

PARTITION BY DB2EE

PARTITION BY EXPRESSION

Use two-stage routing if the all-to-all flows in a graph have more than 30 partitions.

Setting two-stage routing

To set two-stage routing, right-click the all-to-all flow and select Two-stage Routing from the shortcut menu.


Unix awk command Part 2 -

1. How to run the awk commands specified in a file?

awk -f command_file input_file


2. Write a command to find the sum of bytes (size of file) of all files in a directory.

ls -l | awk 'BEGIN {sum=0} {sum = sum + $5} END {print sum}'


3. In the text file, some lines are delimited by colon and some are delimited by space. Write a command to print the third field of each line.

awk '{ if ($0 ~ /:/) split($0, f, ":"); else split($0, f, " "); print f[3] }' filename.txt

Note: assigning FS inside the action only takes effect from the next record, so split() is used to split the current line instead.


4. Write a command to print the squares of numbers from 1 to 10 using awk command

awk 'BEGIN { for (i = 1; i <= 10; i++) print "square of", i, "is", i*i }'


5. Write a command to print the line number before each line?

awk '{print NR, $0}' filename.txt


6. Write a command to print the second and third line of a file without using NR.

awk 'BEGIN {RS="";FS="\n"} {print $2,$3}' filename.txt
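If sed is allowed, the same result can be had without awk:

sed -n '2,3p' filename.txt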


7. Write a command to print the fields in a text file in reverse order?

awk 'BEGIN {ORS=""} { for(i=NF;i>0;i--) print $i," "; print "\n"}' filename


8. Another way to print the total number of lines in a file is by using NR. The command is

awk 'END{print NR}' filename


9. Write a command to find the total number of lines in a file without using NR

awk 'BEGIN {sum=0} {sum=sum+1} END {print sum}' filename
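Outside awk, wc gives the count directly (reading from stdin keeps the filename out of the output):

wc -l < filename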


10. Write a command to rename the files in a directory with "_new" as postfix?

ls | awk '{print "mv "$0" "$0"_new"}' | sh

Note: drop the final | sh to preview the generated mv commands before executing them.


11. Write a command to print zero byte size files?

ls -l | awk '/^-/ { if ($5 == 0) print $9 }'


Tuesday, 16 November 2021

Unix Sed and awk command - Part 1

1) How to print only the blank lines of a file.

sed -n '/^$/p' data_file.txt


2) To Print all lines Except First Line

sed -n '1!p' data_file.txt


3) To Print First and Last Line using Sed Command

sed -n '1p' data_file.txt

sed -n '$p' data_file.txt


4) How to get only Zero Byte files that are present in the directory

ls -ltr| awk '/^-/ { if($5 ==0) print $9 }'


5) How to add a header record and a trailer record to a file in Linux

sed -i -e '1i Header' -e '$a Trailer' data_file.txt
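A quick way to verify the effect, assuming GNU sed and a throwaway test file:

printf 'line1\nline2\n' > data_file.txt
sed -i -e '1i Header' -e '$a Trailer' data_file.txt
cat data_file.txt    # Header, line1, line2, Trailer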


6) Delete all lines except the first line

sed '1!d' data_file.txt

Note: no -n here; with -n, nothing would be printed at all.


7) How to write the even-numbered records of a file into one file and the odd-numbered records into another

awk 'NR % 2 == 0' data_file.txt > even_records.txt

awk 'NR % 2 != 0' data_file.txt > odd_records.txt


8) Add text at the start of each line

awk '{print "START"$0}' FILE 


9) Add text at the end of each line

awk '{print $0"END"}' FILE 


10) Remove all empty lines:

sed '/^$/d' data_file.txt or sed '/./!d' data_file.txt
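awk offers a third equivalent: 'NF' keeps only lines that contain at least one field, so it also drops whitespace-only lines:

awk 'NF' data_file.txt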


11) To see a particular line

For example, if you just want to see the 180th line in a file

 sed -n '180p' testfile.txt


12) To find a particular column in the file 

awk -F',' '{print $2}' data_file.txt
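cut does the same job with less typing:

cut -d',' -f2 data_file.txt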


13) To rename the file with the current date 

mv test test_`date +%Y-%m-%d`


14) Command to extract all lines that have an 8 at the 17th position

grep '^.\{16\}8' data_file.txt > modified_data_file.txt
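The same filter in awk, using substr to test the 17th character:

awk 'substr($0, 17, 1) == "8"' data_file.txt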


15) To remove the nth line without opening the file 

sed '<n>d' file1 > file2 removes the nth line (for example, sed '3d' file1 > file2 removes the 3rd line). To remove multiple lines: sed -e '1d' -e '5d' file1 > file2


16) To find the top 20 files with the most space

ls -l | awk '{print $5, $9}' | sort -n | tail -20


17) To find the records that are in the first file but not in the second

comm -23 data_file.txt data_file2.txt

(comm -13 would give the opposite: records only in the second file.)
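Note that comm requires both inputs to be sorted; if they are not, bash process substitution handles it inline:

comm -23 <(sort data_file.txt) <(sort data_file2.txt)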


18) If you are looking for something that is contained in a file but you don't know which directory it is in, do the following:

find . -type f | xargs grep -i something

This will find all of the files in the directory and below and grep for the string 'something' in those files.


19) Delete all the files starting with the name data_file

find . -type f -name "data_file*" -exec rm -f {} \;


20) Remove blank spaces from the file

sed -e "s/ *//g" data_file.txt >data_file.txt_wo_space


21) How to do grep on a large number of files in a directory 

grep -rl "Search Text" /tmp


22) How to count the number of columns in a file (assuming a pipe-delimited file)?

head -n 1 file.name | awk -F'|' '{print NF; exit}'

Friday, 12 November 2021

m_dump

1) How to print records from a file where cust_id starts with 1234, using the m_dump command?

m_dump dml data -select 'starts_with(cust_id, "1234")'


2) How to get null records using the m_dump command?

m_dump dml data -select 'is_null(column_name)'


3) How to find 5 corrupted records in a file of 500,000 records from Unix, so that only the 5 corrupted records appear in the output?

m_dump dml data_file -show-partial

Note: -show-partial shows incomplete/corrupted records.


4) How to find corrupted records in a multifile, and how to know which partition's data is corrupted?

m_dump dml data_file -validate


5) How to read the first 10 records from a multifile?

In Filter by Expression, use (next_in_sequence() - 1) * number_of_partitions() + this_partition() + 1 <= 10

or 

use the m_dump command with the -records and -partition arguments
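To see which records the expression above selects, here is a quick simulation in awk (i stands in for next_in_sequence(), p for this_partition(); it assumes a 4-way multifile and round-robin ranking):

awk 'BEGIN { N = 4; for (i = 1; i <= 5; i++) for (p = 0; p < N; p++) if ((i - 1) * N + p + 1 <= 10) printf "partition %d, record %d\n", p, i }'

It prints exactly 10 partition/record pairs.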


6) How to find and remove corrupted records in an MFS data file?

m_dump with -validate will give the record numbers that have data that is incorrect as per the defined data types.

Use m_reformat and m_select to filter out those records.


7) How to view the data of a particular partition of a multifile?

m_dump dml_file_name.dml mfile:mfs8/tempfile.dat -partition 3


8) How to get the 2nd record of each partition of a 4-way multifile?

m_dump dml_filename.dml data_file | awk 'NR == 2 {print}'
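Combining this with the -partition option from question 7, a sketch that walks all four partitions (assuming, as in the answer above, that the second output line corresponds to the second record):

for p in 0 1 2 3
do
  m_dump dml_filename.dml data_file -partition $p | awk 'NR == 2 {print}'
done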







output-index and output-indexes functions in Reformat -

Output-index

The component calls the specified transform function for each input record. The transform function must return a zero-based numeric value. Reformat uses this value to direct the input record to the output port that has the same number as the value, and it executes the transform function, if any, associated with that port.

When you specify a value for this parameter, each input record goes to exactly one transform-output port pair.

Note - If you specify a value for the output-index parameter, you cannot also specify the output-indexes parameter. Use output-index when you want to direct a single record to a single transform-output port; use output-indexes to direct a single record to multiple transform-output ports.


Output-indexes

The component calls the specified transform function for each input record. The transform function uses the value of the input record to direct that input record to particular transform-output ports.

The expected output of the transform function is a vector of numeric values. The component treats each element of this vector as an index into the output transforms and ports. The component directs the input record to the identified output ports and executes the transform functions, if any, associated with those ports.

If an index is out of range (less than zero or greater than the highest-numbered port), the component ignores that index.

If a port appears multiple times in the vector, the component uses the corresponding transform-port only once.


out :: output_indexes(in) =
begin
  out :1: if (in.kind == 'a') [vector 0, 1, 2];
  out :2: if (in.kind == 'b') [vector 2, 3, 4];
end;

Here a record whose kind is 'a' is directed to transform-output ports 0, 1, and 2, while a record whose kind is 'b' is directed to ports 2, 3, and 4.

Wednesday, 10 November 2021

ICFF Lookup -

When your lookup file is a multifile (MFS) and big, you should go for an ICFF lookup.

The Write Block-Compressed Lookup component creates two files: a block-compressed data file and an index file containing indexes that refer to the blocks in the data file.

This is a kind of dynamic lookup file that loads the data in memory only when it is referenced.

For example, you have an existing graph joining two files, one with around 100 million records and the other with around 50 million. This job will take considerable time to join the two files.

If you are not pulling many fields from one of the files, you can make it a lookup and speed up your process. You can create a block-compressed lookup file from one of the files. This process creates two files: a compressed data file and an index file containing indexes to each block of the data file.

Now you can read the other file as a single flow and, in a Reformat, use the lookup_load function to load only the specific block into memory and perform the lookup against it.

This will save memory and speed up the process.

How does an Indexed Compressed Flat File work?

To create an ICFF, we need presorted data. The WRITE BLOCK-COMPRESSED LOOKUP component compresses and chunks the data into blocks of more or less equal size. The graph then stores the set of compressed blocks in a data file; each file is associated with a separately stored index that contains pointers back to the individual data blocks. Together, the data file and its index form a single ICFF. A crucial feature is that, during a lookup operation, most of the compressed lookup data remains on disk: the graph loads only the relatively tiny index file into memory.

Generation - 

Data can be added to an ICFF even while a graph is using it. Each chunk of added update data is called a generation. Each generation is compressed separately; it consists of blocks, just like the original data, and has its own index, which is simply concatenated with the original index.

How Generations are created - 

As an ICFF generation is being built, the ICFF-building graph writes compressed data to disk as the blocks reach the appropriate size. Meanwhile, the graph continues to build an index in memory. In a batch graph, an ICFF generation ends when the graph or graph phase ends. In a continuous graph, an ICFF generation ends at a checkpoint boundary. Once the generation ends, the ICFF-building graph writes the completed index to disk.


ICFFs present advantages in a number of categories:

Disk requirements - Because ICFFs store compressed data in flat files without the overhead associated with a DBMS, they require much less disk storage capacity than databases - on the order of 10 times less.

Memory requirements - Because an ICFF organizes data in discrete blocks, only a small portion of the data needs to be loaded in memory at any given time.

Speed - ICFFs allow you to create successive generations of updated information without any pause in processing. That means the time between a transaction taking place and the results of that transaction becoming accessible can be a matter of seconds.

Performance - Making a large number of queries against database tables that are continually being updated can slow down a DBMS. In such applications, ICFFs outperform databases.

Data volume - ICFFs can easily accommodate very large amounts of data - so large, in fact, that it can be feasible to take hundreds of terabytes of data from archive tapes, convert it into ICFFs, and make it available for online access and processing.

Tuesday, 9 November 2021

Reformat with lookup file versus JOIN

Should I use a Reformat component with a lookup file, or a Join component?

There are situations where you cannot use a Reformat with a lookup file instead of a JOIN. For example, you cannot do a full outer join using a Reformat and a lookup. However, the two methods offer very similar performance when both of the following are true:

* Either a Reformat or a Join can be used.

* If a join is used, its non-driving input fits into the available memory.


When considering whether to use a JOIN or a lookup file, the flexibility of the graph is also a concern. When you use a lookup file, the file consumes address space, and if the file grows over time, this could become a problem.

For an in-memory join, as much of the non-driving input as can fit in the memory specified by the max-core parameter is loaded into memory, and the rest spills to disk.

* For a JOIN, the memory issue can be addressed by changing the max-core limit; if the limit is exceeded, the component handles it by writing files to disk. Although this can be costly, it makes the graph fairly robust. Using a JOIN also makes it easy to read and understand what the graph is doing.


* If the memory limits are exceeded with a lookup file, the graph fails due to lack of memory. Generally, if the information from the file is being used in other parts of the graph, using a lookup file reduces the amount of memory required.


How to create DML dynamically in Ab Initio -

 $[ begin let int i = 0; let string(int) complete_list = "emp_nm,billable,age,designation,location"; let string(int) file_content ...