Help with Replacing characters with text string chops of last character of string

All we need is an easy explanation of the problem, so here it is.

I need assistance with fixing a replace statement that truncates the last character of the string. The original string (Question_Text) contains HTML characters and other erroneous characters that need to be cleaned up. I included a multiple Replace statement within the Function. it is the Function I need assistance with.

As you can see in the sample data, Question_Text2 has been cleaned up, except it cuts of the last character. Along with the removal of erroneous characters in the middle of some of the questions (see ID = 12165). What am I doing wrong?

I have include code to create the sample data results after the function has been used to illustrate the problem.

/****** Object:  Table [dbo].[tblQuestionsSample]    Script Date: 12/10/2021 5:12:16 PM ******/
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[tblQuestionsSample]') AND type in (N'U'))
DROP TABLE [dbo].[tblQuestionsSample]
GO

/****** Object:  Table [dbo].[tblQuestionsSample]    Script Date: 12/10/2021 5:12:16 PM ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[tblQuestionsSample](
    [QuestionsRecID] [int] IDENTITY(1,1) NOT NULL,
    [Question_ID] [int] NULL,
    [Question_Text] [varchar](1000) NULL,
    [Question_Text2] [varchar](1000) NULL
 CONSTRAINT [PK_dbo.[QuestionsRecID] PRIMARY KEY CLUSTERED 
(
    [QuestionsRecID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO [dbo].[tblQuestionsSample]
           ([Question_ID]
           ,[Question_Text]
           ,[Question_Text2])
     VALUES
        (11603,'<p>Date/Time</p>','Date/Tim')
        ,(11632,'<p>Attachments</p>','Attachment')
        ,(12166,'<p>Employee ID</p>','Employee I')
        ,(12166,'<p>Work Related?</p>','Work Related')
        ,(12165,'<p>Date &amp; Time of injury/onset illness?</p>','Date &amp; Time of injury/onset illness')
        ,(12165,'<p>Full Injury/Illness Description</p>','Full Injury/Illness Descriptio')
        ,(12165,'<p>Job Title</p>','Job Titl')
go

Results of query:

| id   |  question_text|                                Question_Text2| 
| ---- | -------- | -------------------------------------------------- |
| 11603| <p>Date/Time</p>|                               Date/Tim|
| 11632| <p>Attachments</p> |                            Attachment|
| 12166| <p>Employee ID</p> |                            Employee I|
| 12166| <p>Work Related?</p>|                            Work Related|
| 12165| <p>Date &amp; Time of injury/onset illness?</p>| Date &amp; Time of injury/onset illnes|
| 12165| <p>Full Injury/Illness Description</p>|       Full Injury/Illness Descriptio|
| 12165| <p>Job Title</p>|                                Job Titl|

As you can see I am using a function [dbo].[udf_StripHTML] to clean up the data:

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT

SET @Start = CHARINDEX('<',@HTMLText) SET @End = 
CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText)) 
SET @Length = (@End - @Start) + 1 WHILE @Start > 0
AND @End > 0
AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText) SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
--RETURN replace(REPLACE((@HTMLText),'&nbsp;',' '), '&#39;', '')
--RETURN replace(replace(LTRIM((@HTMLText)),'&nbsp;',' '), '&#39;','')
RETURN replace(substring(REPLACE(REPLACE(replace(ltrim(rtrim(@htmltext)),'&nbsp;',' '), '&#39',' '), '&amp;',''), charindex('.)',REPLACE(REPLACE(replace(ltrim(rtrim(@htmltext)),'&nbsp;',' '), '&#39;', ''), '&amp;','')), len(REPLACE(REPLACE(replace(ltrim(rtrim(@htmltext)),'&nbsp;',' '), '&#39;', ''), '&amp;',''))

END


Select [Question_ID]
,[Question_Text]
,[dbo].[udf_StripHTML]([question_text2])
From [dbo].[tblQuestionsSample]
WHERE  (question_text NOT LIKE '%Comments%') AND (question_text NOT LIKE '%Archived%')

I appreciate any assistance you may offer,

Karen

How to solve :

I know you bored from this bug, So we are here to help you! Take a deep breath and look at the explanation of your problem. We have many solutions to this problem, But we recommend you to use the first method because it is tested & true method that will 100% work for you.

Method 1

If your HTML is valid XHTML (and it appears to be so from the small sample provided), you can cast it to xml and then use .value with XQuery to parse it. This will do a far better job than mucking around with string manipulation and HTML escaping:

SELECT
  Question_Text_Parsed = TRY_CAST(Question_Text AS xml).value('(/p/text())[1]','nvarchar(1000)')
FROM tblQuestionsSample
Question_ID Question_Text_Parsed
11603 Date/Time
11632 Attachments
12166 Employee ID
12166 Work Related?
12165 Date & Time of injury/onset illness?
12165 Full Injury/Illness Description
12165 Job Title

db<>fiddle

Note: Use and implement method 1 because this method fully tested our system.
Thank you 🙂

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

Leave a Reply