History Tables Load Options

Data can be loaded into Fabric with different options. Full load is the most common, but can take a long time in case of a lot of data. So for those situations, it would be nice to load just a fraction of the data. This can be done in several different ways, often depending on what is possible in the source. This page explains the following situation:

Updating based on one or more column values

Updating Based on One or More Column Values

This is the simplest way to partially update. The column value is usually something related to a period, for example, a year. The source, for instance, provides the current year and the year before. In that case, these rows are stored in the temporary bronze table and if keephistory is enabled, the following is done:

New rows are added to the table (just like in a full load)
Modified rows are applied to the table (just like in a full load)

The crux lies in the deletion. Here, the filter must be taken into account.

Only rows that no longer appear in the delivery and that meet the specified filter are deleted. Everything outside the filter remains intact.

Of course, this is not limited to years; it can also be a value such as a company column, for example. In the configuration of the object, this is recorded as follows:

  Table:
  Connection: exa-example
  SourceTable: example.csv
  DataPlatformObjectname: example
  KeepHistory: true
  HistorySettings: 
    DeleteFilterQuery: "SELECT deliverydate as deliverydate FROM his.exa_example WHERE YEAR(DeliveryDate) = 2025" 
  Columns:
  - SourceColumn: ExampleID
    SourceDataType: int
    IsPrimaryKey: true
  - SourceColumn: ExampleText
    SourceDataType: varchar(250)
  - SourceColumn: DeliveryDate
    SourceDataType: datetime    
  - SourceColumn: Company
    SourceDataType: varchar(100)

This query will result in all the dates that are already in the history table for the year 2025. Normally all rows that are not delivered by the source are expected to be deleted. But if the source only delivers a subset of the data, this will mean rows that should not be set to deleted, are deleted anyway. The DeleteFilterQuery query will return all DeliveryDate values in the current history table. Only the values in this set will be checked by the delete.

This is used as the join condition for the dataframe

df_delete = df_tgt.alias('tgt').join(df_tst.alias('del'), on=expr("tgt.deliverydate = del.deliverydate"), how="left_semi")

This will result in a subset of the history table, on this subset there is a second filter based on the primary keys that don't exist anymore in the delivered dataset.

The columns from the DeleteFilterQuery are automatically mapped as a join condition. The columns in this query must exist in the history table. The source of this query can be any table as long as the values can be compared with the history table. So a query like this one, is also valid:

SELECT date as deliverydate FROM dbo.dates WHERE YEAR(DeliveryDate) = 2025

Filter on multiple columns is also possible.

  Table:
  Connection: exa-example
  SourceTable: example.csv
  DataPlatformObjectname: example
  KeepHistory: true
  HistorySettings: 
    DeleteFilterQuery: "SELECT deliverydate as deliverydate, 'Fav' as company FROM his.exa_example WHERE YEAR(DeliveryDate) = 2025" 
  Columns:
  - SourceColumn: ExampleID
    SourceDataType: int
    IsPrimaryKey: true
  - SourceColumn: ExampleText
    SourceDataType: varchar(250)
  - SourceColumn: DeliveryDate
    SourceDataType: datetime    
  - SourceColumn: Company
    SourceDataType: varchar(100)    

This query will result in all the dates that are already in the history table for the year 2025 and the value 'Fav' as an extra filter on the company field. So in this case all primary keys that don't exist in the source anymore for the company 'Fav' will be deleted. All other companies are out of scope.

This is used as the join condition for the dataframe

df_delete = df_tgt.alias('tgt').join(df_tst.alias('del'), on=expr("tgt.deliverydate = del.deliverydate and tgt.company = del.company"), how="left_semi")

This will result in a subset of the history table, on this subset there is a second filter based on the primary keys that don't exist anymore in the delivered dataset.

SELECT date as deliverydate FROM dbo.dates WHERE YEAR(DeliveryDate) = 2025

Your scenarios

We like to hear other scenarios as well.

Updating Based on One or More Column Values​

Updating Based on One or More Column Values