Home > SSIS > Data cleansing using Fuzzy Lookup -SSIS

Data cleansing using Fuzzy Lookup -SSIS

Fuzzy lookup transformation is one of the useful transformation used in data cleansing purpose.

In previous post we seen how to remove duplicate rows in source data using fuzzy grouping transformation. This Fuzzy Lookup transformation used identify the similar data in columns. For example User may enter data as spelling mistake like user name as “Robin” or “Robins” for the same person. So using reference data we can do fuzzy lookup to get the appropriate name.

Let see how we can use Fuzzy Lookup:-

We may have data which has spell is not consistent like

Fruits Month Cost
Mango Jan 10
Mangoees March 15
Mangoes Feb 20

 

SELECT ‘Mango’ Fruits ,’Jan’ Month,   10 Cost

UNION

SELECT ‘Mangoes’,  ‘Feb’, 20

UNION

SELECT ‘Mangoees’, ‘March’, 15

 

Above data when reference with look up table in ETL process, let say we have Fruit Dimension

DimFruit

FruitId FruitName
101 Mangoes
102 Apples
103 Grapes

 

Let Say we will lookup with the data in Dimension before any cleanup

Open SSIS designer and create one sample package and copy the source query into the OLE DB Source task and then lookup with the Fruit Dimension data.

1SourceQuery

 

Lets drag the normal lookup with having DIMFruit Query as below:-

SELECT 101 FruitId, ‘Mangoes’FruitName

UNION

SELECT 102 FruitId, ‘Apples’FruitName

UNION

SELECT 103 FruitId, ‘Grapes’FruitName

 

2LookupQuery

Go to the Column section in Transformation Editor as shown below screen:-

3.LookupColumn

 

In above screen we lookup on names and getting the value of FruitId which want to place in the Fact table.

4LookupMatchOutput

 

Add one unionall component with dataviewer control to get to know how many rows are matched.

Lets run the package, it will show as in below screen:-

5.Result

 

So we got 1 record as matched, however we want 3 records, but data is not correct due to miss spell.

Let’s use the FuzzyLookup:-

For Fuzzy Lookup we need to create DimFruit table physically as it will not take as query as we did in above lookup.

CREATE TABLE dbo.DimFruit(FruitId INT ,FruitName VARCHAR(100))

insert into dbo.DimFruit

values( 101 , ‘Mangoes’)

,( 102 , ‘Apples’)

,( 103 , ‘Grapes’)

Fuzzy lookup ask for table to reference and it will store as index for better performance during lookup.

6.FuzzylookupRefrence2

 

Go to the Column tab and set which source column should map with the reference column.

7.ColumnsRefrenced

 

Go to the advance tab and set the Similarity threshold to .50 , it depends upon how much similarity we want to check.

8.FuzzylookupRefrence

 

Lets add unionall and data viewer control to know the records.

9.FuzzyLookupResult

 

In above screen we can see we get the Fruitid for all the source data.

Thanks
Prasad

Learn Different BI Tools
Subscribe my YouTube Channel : BI Tools

Advertisements
Categories: SSIS Tags: ,
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: