Find unique words in a string using Regular Expressions

Regular expressions provide a flexible and efficient way to process text. Their extensive pattern-matching notation lets you quickly parse large amounts of text to find specific character patterns and to extract, edit, replace, or delete text substrings. Here I show how to locate unique words in a text document. If you are unfamiliar with regular expressions, see my regular expressions page.

One way to locate all unique words in a string or document is to use a hashtable.

    Imports System.Collections
    Imports System.Text.RegularExpressions

    Dim text As String = "one two three two four five three tone"
    Dim re As New Regex("\w+")  ' \w+ matches a run of word characters
    Dim words As New Hashtable

    ' Add each word the first time it is seen; the key is the word itself.
    For Each m As Match In re.Matches(text)
        If Not words.Contains(m.Value) Then
            words.Add(m.Value, Nothing)
        End If
    Next

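When the loop finishes, the hashtable's keys are the six distinct words: "one two three four five tone" (a Hashtable does not guarantee the order in which its keys are enumerated). You can find the same set of words using the following regular expression: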

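    Imports System.Text.RegularExpressions

    Dim text As String = "one two three two four five three tone"
    ' Match a word only if it does not occur again later in the text.
    Dim pattern As String = "(?<word>\b\w+\b)(?!.+\b\k<word>\b)"
    Dim re As New Regex(pattern)

    For Each m As Match In re.Matches(text)
        Console.Write(m.Value & " ")
    Next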

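The pattern "(?<word>\b\w+\b)" matches a sequence of word characters ("\w+": letters, digits and the underscore) that starts and ends on a word boundary ("\b") and assigns the sequence the group name word. The negative lookahead "(?!...)" means that the word just found must not be followed by another occurrence of itself ("\k<word>"), no matter how many characters (".+") lie in between. Because "." does not match a newline by default, create the Regex with RegexOptions.Singleline if the text spans several lines.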

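The regular expression matches each distinct word exactly once, at its last occurrence, so for the sample text it prints "one two four five three tone". The "\b" pattern prevents partial matches ("one" will not match the end of "tone").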
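To see both points in action, here is a minimal console sketch. The module name UniqueWordsDemo and the helper JoinMatches are made up for this illustration; only Regex.Matches and Regex.IsMatch come from the .NET Framework.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module UniqueWordsDemo
    ' Hypothetical helper: joins every match of the pattern with single spaces.
    Function JoinMatches(source As String, pattern As String) As String
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(source, pattern)
            results.Add(m.Value)
        Next
        Return String.Join(" ", results.ToArray())
    End Function

    Sub Main()
        Dim text As String = "one two three two four five three tone"
        Dim pattern As String = "(?<word>\b\w+\b)(?!.+\b\k<word>\b)"

        ' Each distinct word is reported once, at its last occurrence.
        Console.WriteLine(JoinMatches(text, pattern)) ' prints: one two four five three tone

        ' "one" inside "tone" is not a whole word, so \b rejects it.
        Console.WriteLine(Regex.IsMatch("tone", "\bone\b")) ' prints: False
        Console.WriteLine(Regex.IsMatch("tone", "one"))     ' prints: True
    End Sub
End Module
```

If you need the words in order of first appearance instead, the hashtable approach (or an ordered collection) is probably the simpler choice.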

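You can also display all unique dates of the form mm-dd-yy in a string using the pattern "(?<date>\d\d-\d\d-\d\d)(?!.+\k<date>)". The group name date is arbitrary; any valid name works as long as the backreference uses the same one.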
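As a quick sketch of the date pattern, the following console program runs it over a sample sentence. The sample text, the module name UniqueDatesDemo and the helper UniqueDates are invented for this illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module UniqueDatesDemo
    ' Hypothetical helper: returns the unique mm-dd-yy dates, space separated.
    Function UniqueDates(source As String) As String
        Dim pattern As String = "(?<date>\d\d-\d\d-\d\d)(?!.+\k<date>)"
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(source, pattern)
            results.Add(m.Value)
        Next
        Return String.Join(" ", results.ToArray())
    End Function

    Sub Main()
        ' Sample input made up for this example; 01-15-08 appears twice.
        Dim text As String = "Due 01-15-08, shipped 02-20-08, revised 01-15-08."
        Console.WriteLine(UniqueDates(text)) ' prints: 02-20-08 01-15-08
    End Sub
End Module
```

As with words, a repeated date is reported at its last occurrence, which is why 02-20-08 comes first in the output.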

With a small change to the regular expression you can find the duplicate words in a document: "(?<word>\b\w+\b)(?=.+\b\k<word>\b)". The positive lookahead "(?=...)" means the matched word must be followed by another instance of itself. The pattern therefore matches every occurrence of a word except its last one, so a word that occurs three times is reported twice.
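The sketch below shows that behavior and one possible way to collapse the raw matches into a list of distinct duplicated words. The module name DuplicateWordsDemo, the helper DuplicateMatches and the sample text are assumptions made for this example; the HashSet step is my addition, not part of the pattern itself.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module DuplicateWordsDemo
    ' Hypothetical helper: returns every match of the duplicate-word pattern.
    Function DuplicateMatches(source As String) As List(Of String)
        Dim pattern As String = "(?<word>\b\w+\b)(?=.+\b\k<word>\b)"
        Dim results As New List(Of String)
        For Each m As Match In Regex.Matches(source, pattern)
            results.Add(m.Value)
        Next
        Return results
    End Function

    Sub Main()
        ' "two" occurs three times, so it is matched twice.
        Dim text As String = "one two three two four two"
        Dim raw As List(Of String) = DuplicateMatches(text)
        Console.WriteLine(String.Join(" ", raw.ToArray())) ' prints: two two

        ' A HashSet keeps one entry per duplicated word.
        Dim distinctWords As New HashSet(Of String)(raw)
        Console.WriteLine(distinctWords.Count) ' prints: 1
    End Sub
End Module
```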

If you use this code, please mention "www.TheScarms.com"

© Copyright 2008 TheScarms