First, read the introduction to pipelines article, which explains the basic concepts and steps to creating pipelines.

This article focuses on the specifics of writing a Normal pipeline as opposed to a Flow pipeline.

A Normal pipeline processes data from one or more sources as a single batch and loads that data into one or more tables. Each batch has an ID, and every data record loaded by a batch is linked to that batch ID.

A simple pipeline

This is a simple pipeline that prompts the user to upload an Excel file, extracts data from a worksheet named "CASES", and loads that data into a user-created xMart table also named "CASES". While being processed by the pipeline, the table is named "data".

<XmartPipeline>
  <Extract>
    <GetExcel TableName="CASES" 
              OutputTableName="data" />
  </Extract>
  <Load>
    <LoadTable SourceTable="data"
               TargetTable="CASES" 			
               LoadStrategy="MERGE">
      <ColumnMappings Auto="true" />
    </LoadTable>
  </Load>
</XmartPipeline>

The Extract section is required; it defines how data is extracted from one or more data sources to be processed by the pipeline. Multiple "Get" commands can be listed here to extract multiple tables from multiple sources, as in the sketch below.
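
For example, a minimal sketch of an Extract section pulling two worksheets from the same uploaded Excel file (the DEATHS worksheet and the output table names are hypothetical):

<Extract>
  <!-- Extract the CASES worksheet into a pipeline table named "cases_data" -->
  <GetExcel TableName="CASES"
            OutputTableName="cases_data" />
  <!-- Extract a second, hypothetical worksheet into its own pipeline table -->
  <GetExcel TableName="DEATHS"
            OutputTableName="deaths_data" />
</Extract>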

The Load section includes one or more LoadTable sections, which define how extracted data are loaded into xMart. Each LoadTable section targets a single xMart table, so multiple LoadTable sections can be present to load data into multiple tables, as sketched below.
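
Continuing the sketch above, a Load section with two LoadTable sections (table names are hypothetical):

<Load>
  <!-- Load the extracted cases into the CASES table -->
  <LoadTable SourceTable="cases_data"
             TargetTable="CASES"
             LoadStrategy="MERGE">
    <ColumnMappings Auto="true" />
  </LoadTable>
  <!-- Load the extracted deaths into a second target table -->
  <LoadTable SourceTable="deaths_data"
             TargetTable="DEATHS"
             LoadStrategy="MERGE">
    <ColumnMappings Auto="true" />
  </LoadTable>
</Load>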

For more detail on each command property, see the summary of each command property.

The LoadStrategy is an important concept in xMart pipelines: see What is LoadStrategy?

ColumnMappings map source columns in a pipeline table to columns in xMart. Using Auto="true" saves time by automatically mapping columns that have the same name in the source as in xMart.

For more detail, see What are ColumnMappings?

To manage differences in columns between origins, you can use Optional="true" and/or define a list of origin codes for which the column mapping is relevant (see What if a source column does not exist?), as sketched below.
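
As an illustrative sketch only: Optional="true" is described above, but the ColumnMapping element and its Source/Target attributes below are assumptions, not verified syntax. Check the ColumnMappings article for the exact form.

<ColumnMappings>
  <!-- Hypothetical explicit mapping; element and attribute names are assumed -->
  <ColumnMapping Source="CNTRY" Target="COUNTRY" />
  <!-- Optional="true" lets the load proceed when this source column is absent -->
  <ColumnMapping Source="REGION" Target="REGION" Optional="true" />
</ColumnMappings>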

Full pipeline structure

Pipeline structure showing every possible section:

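The skeleton below is assembled from the sections documented in the rest of this article; it is indicative only, with attributes omitted and the exact placement of Approvals assumed:

<XmartPipeline>
  <Context> ... </Context>
  <Extract> ... </Extract>
  <Transform> ... </Transform>          <!-- global transforms -->
  <Load>
    <LoadTable>
      <DataScope> ... </DataScope>
      <Transform> ... </Transform>      <!-- per-load transforms -->
      <Validate> ... </Validate>
      <ColumnMappings> ... </ColumnMappings>
      <Approvals> ... </Approvals>      <!-- placement assumed -->
    </LoadTable>
  </Load>
  <PostRun> ... </PostRun>
</XmartPipeline>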

Context

TO DO

Extract

TO DO

Transform

TO DO

Load/LoadTable

TO DO

DataScope

DataScope is a filter defined in a LoadTable section and applied to both the source data (in your file) and the target data (in the database). It defines the range or "scope" of the data to be modified, and prevents data outside of this range from being modified.

It can be used with either REPLACE or MERGE.

Here is an example:

<Load>
    <LoadTable SourceTable="Data" TargetTable="FACT_DEMAND" LoadStrategy="REPLACE">
      <DataScope Operator="AND">
        <Terms>
          <Term Operator="OR" Field="SOURCE_FILE_FK" Rule="LikeExactly">
            <Value>UNDP</Value>
          </Term>
        </Terms>
      </DataScope>
      <Transform>
        <AddColumn Name="SOURCE_CODE" FillWith="UNDP" />
      </Transform>
    </LoadTable>
</Load>

A DataScope filter is a list of filter "terms". Most pipelines only need a single term. If you have multiple terms, they are ANDed by default (that is the Operator="AND" on the DataScope element above). It is like a search form in a user interface: each part of the search form is ANDed.

Each term supports one or more values. Multiple values inside a term are ORed by default (that is the Operator="OR" on the Term element above). This is like selecting multiple options in a dropdown on a search form: the selected items (such as multiple countries or years) are ORed.

To OR multiple values, list them inside the term. The following example filters on data where SOURCE_FILE_FK is UNDP, UNICEF, or GAVI.

<Load>
    <LoadTable SourceTable="Data" TargetTable="FACT_DEMAND" LoadStrategy="REPLACE">
      <DataScope Operator="AND">
        <Terms>
          <Term Operator="OR" Field="SOURCE_FILE_FK" Rule="LikeExactly">
            <Value>UNDP</Value>
            <Value>UNICEF</Value>
            <Value>GAVI</Value>
          </Term>
        </Terms>
      </DataScope>
    </LoadTable>
</Load>

Records outside of the DataScope are highlighted in the data preview as "Outside of DataScope".


Batch input variables can be used in DataScope filters, so filters can be customized per origin for the same pipeline. Since batch input variables can be set by prompting the user at the time of upload, the person uploading can also set the filter at that moment. In the following example, the batch input variable ${SurveyCode} is used to filter on the SURVEY_FK value.

 <Load>
    <LoadTable SourceTable="interviews" TargetTable="RAW_2019_INTERVIEWS" LoadStrategy="REPLACE">
      <DataScope Operator="AND">
        <Terms>
          <Term Operator="OR" Field="SURVEY_FK" Rule="LikeExactly">
            <Value>${SurveyCode}</Value>
          </Term>
        </Terms>
      </DataScope>
    </LoadTable>
 </Load>

Transform

TO DO

Validate

TO DO

ColumnMappings

TO DO

Approvals

TO DO

PostRun

TO DO

DB-Only Pipelines

A DB-only pipeline is a special type of pipeline that operates exclusively in the xMart database. It is recommended for materializing a slow custom SQL view into a fast physical table.

Typically, pipelines process unstructured data outside of the database (i.e., in the web application) before it is sent to the database. A DB-only pipeline only uses SQL statements executed on the database, and is useful when you only need to move data that is already in an xMart table or view into another xMart table. The data is assumed to be structured and to have perfectly correct data types.

DB-only pipelines are very efficient at materialization, but the price is that some pipeline features are not available, most notably transforms.

How to make a DB-only pipeline

To make a DB-only pipeline, add IsDbOnly="true" to the XmartPipeline element at the top:

<XmartPipeline IsDbOnly="true">
  ...
</XmartPipeline>
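
Putting it together, here is a hedged sketch of a DB-only pipeline that materializes a slow custom SQL view into a fast physical table. GetMart is the only extract command allowed in DB-only mode (see below), but the attribute names used here (TableCode, OutputTableName) and the view/table names are assumptions for illustration, not verified syntax:

<XmartPipeline IsDbOnly="true">
  <Extract>
    <!-- GetMart reads from an existing xMart table or view.
         Attribute names are assumed for illustration. -->
    <GetMart TableCode="MY_SLOW_VIEW" OutputTableName="data" />
  </Extract>
  <Load>
    <!-- REPLACE is one of the supported strategies in DB-only mode -->
    <LoadTable SourceTable="data"
               TargetTable="MY_FAST_TABLE"
               LoadStrategy="REPLACE">
      <ColumnMappings Auto="true" />
    </LoadTable>
  </Load>
</XmartPipeline>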

Unsupported pipeline features

Pipeline elements that are not supported when IsDbOnly="true":

  • Any extract command other than GetMart.

  • Transform (both global and in Load). No transforms supported.

  • LoadTable/LookupIds. No foreign key lookups are supported, but if the source object is a custom SQL view you can put your lookups there (i.e., use JOINs).

  • LoadTable/Validate

  • Batch Preview Page: mostly the same, but issue reporting is limited. The final commit is attempted but may fail without detail. There is no validation of each value, which normally happens in the web server.

Supported pipeline features

  • Context and origin input variables
  • Batch variables
  • REPLACE and MERGE strategies
  • Approvals
  • DataScope
  • PostRun
  • Batch preview still reports on NEW, UPDATED, UNCHANGED, etc.