How to remove special characters from a field

It is a common issue when we are providing the data cleaning the existence of string with special or undesired characters.

An elegant option is using a function based on a Tally table

The tally table is implemented as a view

CREATE VIEW [dbo].[Tally] as 
WITH lv0(N) AS (SELECT 0 FROM (VALUES (1),(1))G(N))
,lv1(N) AS (SELECT 0 FROM lv0 a CROSS JOIN lv0 b) -- 4
,lv2(N) AS (SELECT 0 FROM lv1 a CROSS JOIN lv1 b) -- 16
,lv3(N) AS (SELECT 0 FROM lv2 a CROSS JOIN lv2 b) -- 256
,lv4(N) AS (SELECT 0 FROM lv3 a CROSS JOIN lv3 b) -- 65,536
,lv5(N) AS (SELECT 0 FROM lv4 a CROSS JOIN lv4 b) -- 4,294,967,296
,cteTally(N) AS
(
SELECT  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM lv5
)
SELECT N FROM cteTally;

The function implementation is

CREATE FUNCTION [dbo].[RemoveString] (@Search NVARCHAR(max),@Remove NVARCHAR(max))
returns TABLE
AS
RETURN
WITH temp
AS (SELECT Substring(@Search, n, 1) AS _char,t.n
FROM   tally t
WHERE  n <= Len(@Search)
AND Charindex(Substring(@Search, n, 1), @Remove) = 0)
SELECT DISTINCT Result =
Stuff((SELECT _char + ''
FROM   temp mv2
ORDER  BY mv2.n
FOR xml path(''), type).value('.', 'NVARCHAR(MAX)'), 1, 0,
'')
FROM   temp mv; 

And the use is

DECLARE @table as table (value varchar(20))
DECLARE @table AS TABLE
(
value VARCHAR(20)
)

INSERT @table
VALUES ('maçã'),
('[row]'),
('car')

SELECT *
FROM @table t
CROSS apply dbo.Removestring(t.value,
'ÇÁÉÓÍÚÀÈÌÒÙÂÊÎÔÛÄËÏÖÜÃÕ%[~,@#$%&*()!´:]%`^-'
) x
ORDER BY 1

How it works

The tally view creates a table with numbers from 1 to 4,294,967,296. internally a table is created with 4,294,967,296 lines that contain one column with value 0

WITH lv0(N) AS (SELECT 0 FROM (VALUES (1),(1))G(N))
,lv1(N) AS (SELECT 0 FROM lv0 a CROSS JOIN lv0 b) -- 4
,lv2(N) AS (SELECT 0 FROM lv1 a CROSS JOIN lv1 b) -- 16
,lv3(N) AS (SELECT 0 FROM lv2 a CROSS JOIN lv2 b) -- 256
,lv4(N) AS (SELECT 0 FROM lv3 a CROSS JOIN lv3 b) -- 65,536
,lv5(N) AS (SELECT 0 FROM lv4 a CROSS JOIN lv4 b) -- 4,294,967,296

The cteTally returns the row number of this table.

,cteTally(N) AS
(
SELECT  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM lv5
)

If the view creates a table with 4,294,967,296 lines how it can be so fast?

The secret is the implementation of CTE tables the values are created in memory as the system needs it.

The function receives two parameters the first is the string to seek and the second the characters that will be removed.

The fragment of code below creates a table with to columns, the first is from the Tally view with the index of the character in the string and the second is the character.

The first condition of the where is the limit of rows in the Tally view, its guaranty the returns table is the same length of the seek string. The second condition is to return the no matched characters to remove.

SELECT Substring(@Search, n, 1) AS _char,t.n
FROM   tally t
WHERE  n <= Len(@Search)
AND Charindex(Substring(@Search, n, 1), @Remove) = 0)

The fragment of code below re-creates a string using the Stuff function

SELECT DISTINCT Result =
Stuff((SELECT _char + ''
FROM   temp mv2
ORDER  BY mv2.n
FOR xml path(''), type).value('.', 'NVARCHAR(MAX)'), 1, 0,
'')
FROM   temp mv;

The apply operator allows you to invoke a table-valued function for each row returned by an outer table expression of a query.

select *
from @table t
cross apply dbo.RemoveString(t.value, 'ÇÁÉÓÍÚÀÈÌÒÙÂÊÎÔÛÄËÏÖÜÃÕ%[~,@#$%&*()!´:]%`^-') x order by 1

Attention: Cross Apply is completely different from Cross Join

Advertisements