
Learning regexpr

ppslim
Revered One
Posts: 3914
Joined: Sun Sep 23, 2001 8:00 pm
Location: Liverpool, England

Post by ppslim »

While again not directly Tcl related, this subject is well worth talking about. It will help in a hell of a lot of ways, and it makes no fewer than 3 commands far more fun, with 2 of them requiring it full stop.

This entry actually comes from our very own "stdragon". Posted back in Sep 02 (available here), it is more worthwhile in a Tcl FAQ than at the bottom of a failed search or a long and boring trail through the forum's history.

Originally inspired by the very questions "Tutorials please", "How do I use them" and "What can I use them for", this was the reply, and replies like it are what make this forum as a whole such a powerful learning and help tool.

Regexps?

1. I learned to use them by experimentation in tclsh. If you're in a hurry and don't want to 'learn as you go' then sit down for an hour and play with every special character you see in re_syntax until you know what it does.

2. You can use them for complex string matching and substitution. That's about it, but that encompasses a lot, since most things in life can be represented as a string. Usually people use it for syntax checking (e.g. "Is the input a valid email address?" or "Is this sentence 'bad' as defined by this list of user rules?"). Other common uses are getting rid of color/control codes, extracting parts of a line into variables, and performing substitutions into kick messages, etc. (like the person's nick replacing %nick in the kick message).

3. Regexp returns 0 for no match and 1 for a match. Regsub returns the number of substitutions, e.g. 0 for no match and non-zero for a match.
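To make the return values in point 3 concrete, here is a small sketch you can paste into tclsh. The sample line and the variable names are made up for illustration.

```tcl
# regexp returns 1 on a match and 0 otherwise
set line "hello there, hello again"
set found   [regexp {hello} $line]
set missing [regexp {goodbye} $line]

# regsub, when given a result variable, returns the number of substitutions
set count [regsub -all {hello} $line "hi" changed]

puts "found=$found missing=$missing count=$count changed=$changed"
```

Note that regsub only returns the count when the last argument (here `changed`) is supplied; that variable receives the modified string.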

Looking through re_syntax is good, but it's better to find some example scripts. Depending on what you want to do, most regular expressions are very simple. There are only a few special chars you have to escape (like (), |, ., *, {}, +, and also [], ?, ^, $ and \).

Here's a quick mini tutorial:

( ) is used for match reporting.. regexp lets you specify 'match variables' that get filled in with what exactly matched. The matches within parenthesis are what get reported. Also it's just like math, they allow you to group other operations. An example of match reporting would be: regexp {(.*)!(.*)@(.*)} $from match nick user host
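A minimal sketch of match reporting, using the same expression as above. The value of `from` is an invented nick!user@host string, not something taken from a real bind.

```tcl
# Sample input, assumed to look like nick!user@host
set from "stdragon!sd@example.org"

# match receives the whole matched text, the other variables one group each
if {[regexp {(.*)!(.*)@(.*)} $from match nick user host]} {
    puts "nick=$nick user=$user host=$host"
}
```

The first variable after the string always gets the full match; the remaining ones are filled from the parenthesised groups in left-to-right order.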

| means "or". It lets your regexp match more than one thing. For instance, "hello there|hi there" would match either "hello there" or "hi there". Using ( ) for grouping, you could say "(hello|hi) there" and it would be the same thing.

. means "any single character". So the regex "..." would match any 3 characters (including spaces). To match an actual period, escape it with a backslash: \.

* means "any number of the previous thing, including 0." So if you have "baa*", what is the previous thing? "a". So that would match "ba" (0), "baa", "baaaaaaaaa", etc. However, using grouping ( ) you can match multiple things: "(baa)*" would match "baabaabaa". If you notice, .* will match anything, because . means "any char" and * means "any number". So an easy way to translate between dos-style wildcards and regexps is to replace ?'s with single dots, and * with ".*" (anchor the result with ^ and $ if the whole string must match, since a regexp otherwise matches anywhere in the string).
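The wildcard translation can be sketched as a small proc. The name `wild2re` is made up for this example, and the decision to backslash-escape every other non-alphanumeric character is my own assumption so that characters like . or + in the mask are taken literally.

```tcl
# Hypothetical helper: convert a DOS-style wildcard mask into an anchored regexp
proc wild2re {mask} {
    set re "^"
    foreach ch [split $mask ""] {
        switch -exact -- $ch {
            "?"     { append re "." }
            "*"     { append re ".*" }
            default {
                # Letters, digits and underscore are copied as-is; anything
                # else is escaped so it loses its special regexp meaning.
                if {[string is alnum -strict $ch] || $ch eq "_"} {
                    append re $ch
                } else {
                    append re "\\" $ch
                }
            }
        }
    }
    append re \$
    return $re
}

puts [wild2re {*.txt}]
puts [regexp [wild2re {*.txt}] "notes.txt"]
```

For plain wildcard masks `string match` is usually the simpler tool; a conversion like this is mainly useful when the mask has to be combined with other regexp pieces.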

+ is exactly the same as *, but it requires at least 1. So "baa+" would not match "ba" anymore.

{ } is a range operator. It's just like * and + but it lets you specify how many repeats are acceptable. For instance, "ba{2,10}" would match from "baa" (2 a's) to "baaaaaaaaaa" (10 a's).
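The three repeat operators can be compared side by side in tclsh. In this sketch I have added ^ and $ to each expression (not part of the examples above) so that the whole test string has to match; without them "ba{2,10}" would also report a match inside a longer run of a's.

```tcl
set results {}
foreach {re str} {
    {^baa*$}      ba
    {^baa+$}      ba
    {^baa+$}      baaaa
    {^(baa)*$}    baabaabaa
    {^ba{2,10}$}  ba
    {^ba{2,10}$}  baaa
} {
    set matched [regexp $re $str]
    lappend results $matched
    puts [format "%-14s %-10s -> %d" $re $str $matched]
}
```

The rows report 1, 0, 1, 1, 0, 1 in that order: * accepts zero repeats, while + and {2,10} both reject the bare "ba".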

There are a few more that are either less useful or way more complicated. For instance the section on negative lookaheads meant very little to me until the other day. It's a silly name, what it really is is an "and not" operator. For instance, if you want to match something that "contains an a and no b" you could use a negative lookahead. (Yes, for that example you can use the [ ] operator with ^b, but [ ] can't contain a full regular expression, whereas negative lookaheads can.)
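One possible way to write the "contains an a and no b" test with a negative lookahead is sketched below. The expression is my own illustration, not one from the original reply.

```tcl
# (?!.*b) is checked at the start of the string and fails if a b appears
# anywhere ahead; the rest of the expression then requires an a.
set re {^(?!.*b).*a}

foreach word {cat cab dog} {
    puts "$word -> [regexp $re $word]"
}
```

Here "cat" matches, "cab" is rejected because of the b, and "dog" is rejected because it has no a.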

So those are the basics. Anything more advanced would require exponentially more text, I think. If you have specific questions feel free to ask.

As the man said, post any further questions if you feel fit.
user
 
Posts: 1452
Joined: Tue Mar 18, 2003 9:58 pm
Location: Norway

A note to future regexp lunatics :)

Post by user »

People often stick with regexp (once they learn how to use it) even when there are better (faster) methods available to deal with a certain problem. 'scan' and 'string map' are the commands most commonly ignored by regexp fanatics.

Example using scan to chop up eggdrop's $botname:

Code: Select all

scan $botname %\[^!\]!%\[^@\]@%s nick user host
which is ~13.5 times faster than

Code: Select all

regexp {(.*)!(.*)@(.*)} $botname match nick user host
in my tclsh (8.3)
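If you want to repeat the comparison on your own machine, Tcl's `time` command will do it. This is only a sketch: the `botname` value is a made-up sample, the iteration count is arbitrary, and the ratio you see will depend on your Tcl version and hardware.

```tcl
# Sample value; in eggdrop $botname holds the bot's own nick!user@host
set botname "MyBot!bot@example.org"

# Both approaches should fill in the same three parts
scan $botname {%[^!]!%[^@]@%s} n1 u1 h1
regexp {(.*)!(.*)@(.*)} $botname match n2 u2 h2

# time runs the script the given number of times and reports the average
puts "scan:   [time {scan $botname {%[^!]!%[^@]@%s} n u h} 10000]"
puts "regexp: [time {regexp {(.*)!(.*)@(.*)} $botname m n u h} 10000]"
```

The scan format is put in braces here instead of escaping the brackets with backslashes; both spellings give scan the same format string.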

/rant :P

Post by ppslim »

Thanks to a kind suggestion from pgpkeys (#egghelp@efnet), there is also a PDF document for you to download.

I haven't had a look at it myself yet, but I'm sure it may be of more use to somebody here.

The document is called Mastering Regular Expressions
Sir_Fz
Revered One
Posts: 3794
Joined: Sun Apr 27, 2003 3:10 pm
Location: Lebanon
Contact:

Post by Sir_Fz »

well guys :), I was about to ask how to learn regexp, but I found this post and it's really handy :)

nice idea Ppslim.
caesar
Mint Rubber
Posts: 3778
Joined: Sun Oct 14, 2001 8:00 pm
Location: Mint Factory

Post by caesar »

The Mastering Regular Expressions link no longer seems to be working, does anyone have a good link to it?
Once the game is over, the king and the pawn go back in the same box.
awyeah
Revered One
Posts: 1580
Joined: Mon Apr 26, 2004 2:37 am
Location: Switzerland
Contact:

Post by awyeah »

Here is a good website to learn regular expressions (regexp) from; it includes tutorials and examples:
http://www.regular-expressions.info/

Here are some tools made purely for testing/using regular expressions, with their respective examples:
http://www.regular-expressions.info/tools.html
·­awyeah·

==================================
Facebook: jawad@idsia.ch (Jay Dee)
PS: Guys, I don't accept script helps or requests personally anymore.
==================================

Post by awyeah »

I have been dealing with regexps a lot these days. Here are a few common examples which can help you in eggdrop scripts.

Suppose you want to count the number of A's (the letter) in a string:

Code: Select all

regexp -all {A} $string
Suppose you want to count the number of the's (word) in a string:

Code: Select all

regexp -all {the} $string
Suppose you want to count more than one character:

Code: Select all

regexp -all {[abcd]} $string
#This code will count and add all the number of a's, b's, c's and d's found
Suppose you want the script to execute only if none of these characters are present:

Code: Select all

regexp {^[^abcd]*$} $string
#Returns 1 only if the string contains no a's, b's, c's or d's (negative logic).
#Note: regexp -all {[^abcd]} $string would instead count the characters that are NOT a, b, c or d.
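Sometimes while matching with regexps you can use:

Code: Select all

regexp "string" $string
regexp {string} $string
#Both of these match the literal text "string".
regexp \[string\] $string
#Careful: this one is the bracket expression [string], which matches any single one of the characters s, t, r, i, n or g.
I would advise you to use the curly brackets, since Tcl performs no variable or command substitution inside them.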

Counting special characters:

Code: Select all

#To count the number of ['s use:
regexp -all \[\\\[\] $string
#or, braced: regexp -all {\[} $string

#To count the number of \'s use:
regexp -all \[\\\\\] $string
#or, braced: regexp -all {\\} $string

#To count the number of {'s use:
regexp -all \[{\] $string
#or, braced: regexp -all {\{} $string

NOTE: Without braces you will generally need to add 3 backslashes in front of each [, ] or \ special character. For most others you need none.
Note: regexp has a -nocase switch, which can be used for ignoring cases while doing matching.

Code: Select all

regexp -nocase -all {abc} $string
#and
regexp -nocase -all {ABC} $string
#will be considered the same then
Matching range of characters:

Code: Select all

#To match a character in the range a, b, c, d, ..........z:
regexp {[a-z]} $string ;# gives 1 for MATCH, 0 for NO-MATCH (lower case match)
regexp -all {[a-z]} $string ;# returns total number of MATCHES (lower case match)
regexp -nocase -all {[a-z]} $string ;# returns total number of MATCHES (case ignored)

#To match a character in the range 0, 1, 2, 3, ..........9:
regexp {[0-9]} $string ;# gives 1 for MATCH, 0 for NO-MATCH
regexp -all {[0-9]} $string ;# returns total number of MATCHES

#Note: The nocase switch for the [0-9] would be redundant.

#To match a character in the range a, b, c,....z or 0, 1, 2...9:
regexp {[a-z0-9]} $string ;# gives 1 for MATCH, 0 for NO-MATCH
regexp -all {[a-z0-9]} $string ;# returns total number of MATCHES
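As one possible use of these range counts in an eggdrop script, here is a sketch of a caps-percentage check. The proc name `caps_percent` and the sample text are invented for the example.

```tcl
# Hypothetical helper: percentage of the letters in $text that are upper case
proc caps_percent {text} {
    set letters [regexp -all -nocase {[a-z]} $text]
    # Avoid dividing by zero when the text has no letters at all
    if {$letters == 0} { return 0 }
    set caps [regexp -all {[A-Z]} $text]
    return [expr {100 * $caps / $letters}]
}

puts [caps_percent "HELLO there"]
```

With 5 capitals out of 10 letters the call above prints 50; digits, spaces and punctuation are ignored because neither range matches them.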
Note also that it is necessary to use the proper anchors when the whole string has to match:

Code: Select all

regexp {^string_here$} $string
^ = Assert position at the start of the string
$ = Assert position at the end of the string (or before a line break when the -line or -lineanchor switch is used)
The | operator is used as a LOGICAL "OR".

Code: Select all

regexp {abc|efg|hij} $string
#This will try to match "abc" or "efg" or "hij"; if none is found it returns 0, if any one of them is found it returns 1.
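The ^ operator used inside [ ] before the first element is used as a LOGICAL "NOT".

    Code: Select all

    regexp {^[^abcd]*$} $string
    #If a, b, c and d are not present return 1, if any one of them is, return 0.
    #The * lets the class repeat; without it the pattern only matches a string of exactly one character.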
    
    Note: if you want regexp to return the matched text itself rather than 0 or 1, you can use the -inline switch. In most cases you should use it together with -all.
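A short sketch of -inline combined with -all. The sample text and the phone-number-like pattern are made up for illustration.

```tcl
set text "call 555-1234 or 555-9876"

# Without groups, -all -inline returns a list with one element per match
set numbers [regexp -all -inline {[0-9]{3}-[0-9]{4}} $text]
puts $numbers

# With groups, every match contributes the full match followed by each group
set pairs [regexp -all -inline {([0-9]{3})-([0-9]{4})} $text]
puts $pairs
```

The second form is handy with `foreach {full prefix number} $pairs { ... }` to walk through the matches one at a time.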

    Other examples:
    If you want to match certain patterns:

    Code: Select all

    regexp {^[a-z]{3,}[0-9]{2,}$} $string
    #This will match only if 3 or more characters are present in the range [a-z] and 2 or more characters in the range [0-9] of the string.
    
    #Example:
    abgfg452 > will match
    as456342 > will not match
    abc12 > will match
    
    Other examples:

    Code: Select all

    regexp {^[a-z]{3,5}[0-9]{2,8}$} $string
    #This will match only if 3, 4 or 5 characters are present in the range [a-z] and 2, 3, 4, 5, 6, 7 or 8 characters in the range [0-9] of the string.
    
    #Examples:
    adfsd3463 > will match
    adsfsdgfs325 > will not match
    wer436234 > will match
    gdtweer436322512 > will not match
    
    Other examples:

    Code: Select all

    regexp {^[a-z]{4}[0-9]{3}$} $string
    #This will match only if exactly 4 characters in the range [a-z] are followed by exactly 3 characters in the range [0-9].
    
    #Examples:
    wrew364 > will match
    we436 > will not match
    whg6743 > will not match
    wga63 > will not match
    se65 > will not match
    
    Other matchings:

    Code: Select all

    regexp {https?} $string
    #This will return 1 if "http" or "https" is present in the string, else return 0.
    
    Here are some advanced examples:

    Code: Select all

    regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})} $string
    #IP Address -- Matches 0.0.0.0 through 999.999.999.999
    
    regexp {(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)} $string
    #IP Address -- Matches 0.0.0.0 through 255.255.255.255
    
    regexp -nocase {(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]} $string
    #Matching a url (-nocase is needed because the classes only list upper case letters)
    
    regexp {[0-9]{5}(?:-[0-9]{4})?} $string
    #US Zipcode
    
    regexp -nocase {[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}} $string
    #Email address (-nocase for the same reason)
    
    regexp {(0[1-9]|[12][0-9]|3[01])[-/.](0[1-9]|1[012])[-/.](19|20)[0-9]{2}} $string
    #Date in formats: dd-mm-yyyy, dd.mm.yyyy, dd/mm/yyyy
    
    regexp {^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6011[0-9]{14}|3(?:0[0-5]|[68][0-9])[0-9]{11}|3[47][0-9]{13})$} $string
    #Matching all major credit cards
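Most of the expressions above have no ^ and $, so they report a match when the pattern occurs anywhere inside a longer string. For validating input you usually want the anchored form. A sketch built from the 0-255 pattern above (the proc name `ip_valid` and the use of non-capturing groups are my own choices):

```tcl
# Hypothetical helper: 1 if $ip is four dot-separated numbers from 0 to 255
proc ip_valid {ip} {
    set octet {(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)}
    # Three "octet." groups followed by a final octet, anchored at both ends
    set re "^(?:$octet\\.){3}$octet\$"
    return [regexp $re $ip]
}

puts [ip_valid 192.168.0.1]
puts [ip_valid 1.2.3.4567]
```

The first call prints 1 and the second prints 0. Without the anchors the second string would still match, because "1.2.3.45" inside it is a valid address.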
    
    More of these examples can be found by downloading and installing
    the software "RegexBuddy".

    Download link: http://www.regexbuddy.com/download.html

    1) After downloading, install the trial version of the software.
    2) After installation, run the software and click on the Library tab.
    3) In the long search list in the right panel, highlight any matching pattern
    of your choice, and the window on the left will show you the regular expression for that pattern.
    4) This is good software to learn regexp from.
Last edited by awyeah on Fri Jul 08, 2005 4:40 am, edited 2 times in total.

Post by awyeah »

Here are some quick and easy examples of substitutions. We can use 'regsub' (regular expression substitution) or 'string map'.

regsub is slower than string map, but it is also more advanced and flexible. For simple literal replacements, both can be used to accomplish the same thing.

If you want to remove a character from a string:

Code: Select all

#regsub
regsub -all {a} $data "" data
#Will remove all occurrences of character "a" in the string $data.

regsub -all {a} $data "b" data
#Will replace all occurrences of character "a" in the string by "b".

#Similarly, with string map (which returns the new string instead of storing it):

#string map
set data [string map {"a" ""} $data]
#Will remove all occurrences of character "a" in the string $data.

#string map
set data [string map {a b} $data]
set data [string map {"a" "b"} $data]
#Either form will replace all occurrences of character "a" in the string by "b".
Mostly, regsub and string map are used in filters, to filter out certain parts, characters or words in texts.
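The %nick style of kick message substitution mentioned earlier in the thread is a good fit for string map. In this sketch the template text, the placeholder names other than %nick, and the values are all invented.

```tcl
# Made-up template and values for illustration
set template "%nick, you have been kicked from %chan: %reason"
set nick "lamer"
set chan "#egghelp"
set reason "flooding"

# list builds the key/value pairs so the variables are substituted safely
set msg [string map [list %nick $nick %chan $chan %reason $reason] $template]
puts $msg
```

One reason to prefer string map here: in regsub the replacement text treats & and backslash sequences specially, so a nick or reason containing those characters would need extra escaping, while string map inserts the values literally.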

Similar to regexp, regsub expressions can be used with [ ] for matching each character individually.

    Code: Select all

    #This will remove all occurrences of a, b, c and d from the string and store the result in $data.
    regsub -all {[abcd]} "sgfdszasbdgds" "" data
    #Note: We are using the -all switch here so it will return '5' as per the matches.
    
    You can also strip control codes (colors, bold, underline etc.) from strings using regsub or string map filters, as you might have seen in many posts on the forum. Here are some I found on the forum:
    
    #For removing colors
    regsub -all {\003([0-9]{1,2}(,[0-9]{1,2})?)?} $str "" str
    
    #For removing control codes
    regsub -all {\017|\037|\002|\026|\006|\007} $str "" str
    
    #For removing control codes
    set str [string map {"\017" "" "\037" "" "\002" "" "\026" "" "\006" "" "\007" ""} $str]
    
    You might have noticed that removing colors takes more advanced regsub logic, which string map can't accomplish, unlike the fixed control codes above.

    Note: It is best to write control codes as their octal ASCII escapes (\002, \003 and so on).
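The two filters can be combined into one proc, sketched below. The name `strip_irc_codes` is made up for this example, and the test string is invented.

```tcl
# Hypothetical helper combining the color regsub and the string map filter
proc strip_irc_codes {str} {
    # Colors first: \003 optionally followed by fg and ,bg numbers
    regsub -all {\003([0-9]{1,2}(,[0-9]{1,2})?)?} $str "" str
    # Then the fixed single-character control codes
    return [string map {"\017" "" "\037" "" "\002" "" "\026" "" "\006" "" "\007" ""} $str]
}

puts [strip_irc_codes "\002bold\002 \0034,5red\003 plain"]
```

The call prints "bold red plain". The color step has to be a regsub because the number of digits after \003 varies, which a fixed string map table cannot express.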

    Then normally, string map and regsub can be used as filters to strip out certain special characters or to escape them with extra \'s.

    Here are common examples I found to escape special characters by creating small filters.

    Code: Select all

    #regsub
    proc filter {data} {
    regsub -all -- \\\\ $data \\\\\\\\ data
    regsub -all -- \\\[ $data \\\\\[ data
    regsub -all -- \\\] $data \\\\\] data
    regsub -all -- \\\} $data \\\\\} data
    regsub -all -- \\\{ $data \\\\\{ data
    regsub -all -- \\\$ $data \\\\\$ data
    regsub -all -- \\\" $data \\\\\" data
    return $data
    }
    #Taken from: http://www.peterre.com/characters.html
    
    #string map
    proc filter {data} {
     set data [string map {\\ \\\\ [ \\\[ ] \\\] \{ \\\{ \} \\\} $ \\\$ \" \\\"} $data]
    }
    #Taken from: spambuster.tcl
    
    A list of all special characters that can choke scripts if not used properly:

    Code: Select all

    \, [, ], {, }, $, "
    
    Note: regsub can be used in similar format as regexp:

    Code: Select all

    regsub -all "\002|\003|\017|\026|\037" $text "" text
    regsub -all {\002|\003|\017|\026|\037} $text "" text
    
    For example:

    Code: Select all

    #To strip the capital letters from a string and count them:
    set counted [regsub -all {[A-Z]} $text "" stripped]
    
    #The total number of capital letters in $text will be placed in $counted, and $stripped will hold $text with the capital letters removed.
    
    The same goes for numbers, [0-9], or both, [a-z0-9].
    Also the -nocase switch is available in regsub if you want to ignore case while matching -- only relevant for letters.
    
    string map always replaces every occurrence and does not report how many replacements it made, so it is hard to use for counting; in that respect string map does have limitations.

    For example:

    Code: Select all

    set counted [regsub -all {a} $text "" stripped]
    
    #is similar to:
    
    set counted 0
    for {set count 0} {$count < [string length $text]} {incr count} {
     if {[string equal "a" [string index $text $count]]} {
      incr counted
     }
    }
    
    (Adv: As you can see, regsub is simpler and needs less code)
    (Disadv: regsub is slower than string map)
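For counting a single character there are a few equivalent routes, sketched here with an invented sample string. The third one is a workaround of my own for string map's missing counter: compare the string length before and after removing the character.

```tcl
set text "banana bandana"

# regexp -all returns the number of matches directly
set by_regexp [regexp -all {a} $text]

# regsub -all returns the number of substitutions; the result goes to $stripped
set by_regsub [regsub -all {a} $text "" stripped]

# string map has no counter, but the length difference gives the same number
set by_map [expr {[string length $text] - [string length [string map {a ""} $text]]}]

puts "$by_regexp $by_regsub $by_map"
```

All three report 6 for this string. The length trick only works as written for single characters; for longer substrings the difference would have to be divided by the substring length.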
    
demond
Revered One
Posts: 3073
Joined: Sat Jun 12, 2004 9:58 am
Location: San Francisco, CA
Contact:

Post by demond »

attention should be paid to some subtle aspects of regexps, for example so-called "greedy matching"

by default, regexp characters '+' and '*' will match as much as possible ("greedy matching"), which might mean rather unexpected results, most likely in HTML parsing constructs like this:

Code: Select all

% set str "<tag>foo</tag>some text<tag>bar</tag>"
% regsub -all {<tag>(.*)</tag>} $str ""
%
here, we need to strip the tags and their contents, leaving the text in-between; however, because of the greedy matching, we end up with all of the characters between the first opening tag and the last closing tag stripped, effectively leaving us with an empty string - definitely not what we wanted!

the solution is to add '?' specifier after the asterisk, to avert the greedy matching and force '+'/'*' to match as little as possible:

Code: Select all

% set str "<tag>foo</tag>some text<tag>bar</tag>"
% regsub -all {<tag>(.*?)</tag>} $str ""
some text
%
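An alternative that avoids the greedy/non-greedy question is a negated character class. This sketch assumes the text between the tags never contains a "<" of its own, which holds for the sample string but not for nested markup.

```tcl
set str "<tag>foo</tag>some text<tag>bar</tag>"

# [^<]* cannot run past the next "<", so each tag pair is matched separately
set result [regsub -all {<tag>[^<]*</tag>} $str ""]
puts $result
```

This prints "some text", the same result as the (.*?) version above.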