Wednesday, 4 December 2024

Datasets and data lineage

 

The model that dependency analysis creates must be accurate in order to be useful. When data assets are accurately represented and correctly associated with the applications that read or write them, they are said to line up: they have accurate lineage.

The process of dependency analysis produces two types of lineage:

  • Data lineage occurs among data assets. It refers to the correct association of data assets with the applications that create or access them.

  • Field lineage occurs among data fields. It refers to the ability to trace a data field from its source to its target, correctly and completely.

This chapter focuses on creating and maintaining accurate data lineage. If the data lineage for a project is correct and dependency analysis completes without any warnings, then accurate field lineage tends to follow as a natural consequence.
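To make the relationship between the two kinds of lineage concrete, here is a minimal sketch of tracing a field back to its origin. It is not how the repository stores lineage. The mapping table, the dataset names, and the field names are all hypothetical, and the sketch only assumes that each target field records the source fields it is derived from.

```python
# Hypothetical field-level mappings (illustrative only, not a real repository format):
# each target (dataset, field) maps to the source (dataset, field) pairs it derives from.
FIELD_SOURCES = {
    ("sales_clean.dat", "amount"): [("sales_raw.dat", "amt")],
    ("weekly_report.dat", "total"): [("sales_clean.dat", "amount")],
}


def trace_to_sources(target, field_sources):
    """Walk upstream from target and return the fields that have no recorded source."""
    origins = set()
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        upstream = field_sources.get(node, [])
        if upstream:
            stack.extend(upstream)
        else:
            origins.add(node)  # nothing upstream is recorded: treat as an origin
    return origins


print(trace_to_sources(("weekly_report.dat", "total"), FIELD_SOURCES))
# prints {('sales_raw.dat', 'amt')}
```

If the entry linking sales_clean.dat back to sales_raw.dat were missing, the trace would stop at sales_clean.dat. That is the sense in which complete field lineage depends on the dataset-level associations being correct first.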

This section describes how lineage is maintained in the technical repository.

Dependency analysis

Dependency analysis provides valuable information about your graphs and data, how they are created, and how they change. This chapter describes dependency analysis and how to facilitate it successfully in your graphs.

Why use dependency analysis?

Dependency analysis can save you time and resources in a number of ways:

  • Code analysis and data lineage — Dependency analysis produces an up-to-date view of the current project. You can see what data and graphs already exist and how they were created, which may help you avoid duplicating development work or data. You can see how a field gets created and how its value changes within a graph, across graphs, and across projects (lineage). You can also assess the impact of planned changes: for example, if you were to add a field to a particular dataset, which graphs would be affected?

  • Quality control — Dependency analysis helps you assess whether a graph matches its original specifications, and whether it meets its goals. It also provides information about how a graph will run outside the development environment. For example, graphs without analysis warnings are more likely to run smoothly when deployed in the production environment or migrated to a different repository. Dependency analysis even detects certain types of runtime errors, such as DML parsing problems.

  • Transparency and accountability — The results of dependency analysis are useful to people throughout an organization, beyond the development process. Anyone who uses the organization’s data may be interested in discovering, for example, how a particular dataset is derived. Employees may need to know why their latest weekly sales report suddenly looks different, or which fields are used to calculate a particular metric. In addition, the results of dependency analysis may be used to support compliance with regulations, such as Sarbanes-Oxley or Basel II, or quality initiatives, such as ISO 9000.
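The impact question raised in the first point above (if a dataset changes, which graphs are affected?) amounts to a downstream walk over the lineage model. The following is a minimal sketch under stated assumptions: the graph names, dataset names, and the reads/writes dictionary are hypothetical stand-ins for what dependency analysis records, not an actual repository API.

```python
from collections import deque

# Hypothetical lineage model: each graph lists the datasets it reads and writes.
GRAPHS = {
    "load_sales.mp": {"reads": {"sales_raw.dat"}, "writes": {"sales_clean.dat"}},
    "weekly_report.mp": {"reads": {"sales_clean.dat"}, "writes": {"weekly_report.dat"}},
    "load_customers.mp": {"reads": {"customers_raw.dat"}, "writes": {"customers.dat"}},
}


def affected_graphs(dataset, graphs):
    """Return every graph that reads the dataset, directly or through derived datasets."""
    affected = set()
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for name, io in graphs.items():
            if current in io["reads"]:
                affected.add(name)
                # Anything this graph writes may change too, so follow it downstream.
                for out in io["writes"]:
                    if out not in seen:
                        seen.add(out)
                        queue.append(out)
    return affected


print(sorted(affected_graphs("sales_raw.dat", GRAPHS)))
# prints ['load_sales.mp', 'weekly_report.mp']
```

The answer is only as good as the model: a graph whose reads or writes are not associated with the right datasets drops out of the result without any error, which is why accurate lineage matters for impact analysis.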


Running dependency analysis

You can run dependency analysis in any of the following ways:

  • Automatically at check-in — Analysis is automatically performed on a new or changed file when you check it in. In the GDE, the Check-in Wizard reports any dependency analysis warnings prior to check-in. At the command line, the air project import command reports dependency analysis warnings. With the -dry-run argument, you can run dependency analysis without checking in the files.

    Files are only checked in if they are new or changed. Unchanged files are not checked in and therefore are not analyzed. (Forcing a check-in, while possible, is usually to be avoided.) If you make changes to a generic graph but not its associated psets, the psets will not be analyzed at check-in, even if you check in the entire project. You must run dependency analysis on the psets themselves in order to update the lineage.

  • As you build a graph — From the GDE menu bar, choose File > Dependencies > Analyze to analyze the current graph or pset. (Or click the Analyze button on the toolbar.) Warnings are reported on the Dependency Analysis tab of the Application Output window.

  • Explicitly at the command line — Analyze an entire project (or any subset of a project) by running the air project analyze-dependencies command.

    TIP: To analyze all the psets for a particular graph, run air project analyze-dependencies for that graph, using the -referencing-files option.
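The check-in rule described above (only new or changed files are analyzed, so psets of a changed generic graph can be left with stale lineage) can be sketched as a simple set comparison. This is an illustration of the rule only. It does not call the air command, and the file paths and the pset-to-graph mapping are hypothetical.

```python
def psets_needing_analysis(changed_files, pset_to_graph):
    """Return psets whose graph changed while the pset itself did not.

    These psets are skipped by analysis at check-in, so they would need
    to be analyzed explicitly to bring their lineage up to date.
    """
    return sorted(
        pset
        for pset, graph in pset_to_graph.items()
        if graph in changed_files and pset not in changed_files
    )


# Hypothetical project contents: two psets share one generic graph.
pset_to_graph = {
    "pset/load_sales.pset": "mp/load_generic.mp",
    "pset/load_customers.pset": "mp/load_generic.mp",
    "pset/report.pset": "mp/report.mp",
}
changed_files = {"mp/load_generic.mp"}

print(psets_needing_analysis(changed_files, pset_to_graph))
# prints ['pset/load_customers.pset', 'pset/load_sales.pset']
```

In practice, running air project analyze-dependencies for the graph with the -referencing-files option covers this case, as the tip above notes.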

     
