string match

egghead · Post by **egghead** » Fri Nov 01, 2002 5:35 am

stdragon wrote:Have you tried a single regexp?

% set line "hello #egghead everybody come to #egghead2"
hello #egghead everybody come to #egghead2
% time {regexp -nocase {#(?!egghead(\W|$))} $line} 1000
9 microseconds per iteration

The code seems a bit simpler

The single line regexp works fine with the functionality of the snippets I produced. But before comparing it to the other candidates, I need it in a form where the channel name can be set separately, including the # character. This will mimic the practical situation where a proc is called with 2 arguments: a channel name and a string.
Naturally, I can produce these "preprocessing" lines myself, but they are not necessarily the best ones

Once I have these preprocessing lines, I'll put it in a proc form and compare it to the other procs.

Code: Select all

proc stringisbad { channel string } {
   # some preprocessing to strip the # from the 
   # channel name and build the regexp
...
   # regexp test returning either 0 or 1
...
}

egghead · Post by **egghead** » Fri Nov 01, 2002 9:38 am

Call me a freak, but the following snippet is also submitted for competition

Code: Select all

proc stringisbad { channel string } {
   # find first occurance of #, abort if none
   # dont like while loops, but competition is fierce :)
   while { [set index [string first "#" $string]] != -1 } {
      # strip first part from string
      set string [string range $string $index end]
      # find occurance of space and set index accordingly
      set indsp [string first { } $string]
      # if space not found set to string end, else set to word end
      if { $indsp == -1 } { set indsp end } else { incr indsp -1 }
      # retrieve channel name
      set channame [string range $string 0 $indsp]
      # compare channel names
      if { ![string equal -nocase $channame $channel] } { return 1 } 
      # update the string
      set string [string range $string $indsp end]
   }
   # iterated on all words, but nothing bad found :)
   return 0
}

stdragon · Post by **stdragon** » Fri Nov 01, 2002 12:44 pm

I think I can improve on that :)

Code: Select all

# based on egghead's code
proc string2isbad {chan line} {
  set idx 0
  while {[set idx [string first {#} $line $idx]] != -1} {
    set space [string first { } $line $idx]
    if {$space == -1} {set space end} {incr space -1}
    set word [string range $line $idx $space]
    if {[string compare -nocase $chan $word ]} {return 1}
    set idx $space
  }
  return 0
}

Of course, both of these don't check for valid word separators like commas, periods, exclamation points, question marks.

egghead · Post by **egghead** » Fri Nov 01, 2002 7:14 pm

After some further messing and testing around, 3 candidates remained:

BARE: a single bare line regexp, not (yet) generical for any channel

(stdragon)
SPDY: a speedy string comparison proc (egghead/stdragon)
SPLT: a good old line splitting and word by word checking (strikelight, minor mod by egghead)

First some differences are highlighted in what each proc considers GOOD/BAD:

Code: Select all

Testing: BARE
BARE   GOOD     join egghead
BARE   GOOD     join #egghead
BARE   GOOD     join #egghead!
BARE   GOOD     join#egghead
BARE   GOOD     join #egghead for fun
BARE   GOOD     #egghead
BARE   GOOD     #egghead!!!aaaaaaa
BARE   BAD      #egghead2
BARE   GOOD     #egghead-xxxxx

Testing: SPDY
SPDY   GOOD     join egghead
SPDY   GOOD     join #egghead
SPDY   BAD      join #egghead!
SPDY   GOOD     join#egghead
SPDY   GOOD     join #egghead for fun
SPDY   GOOD     #egghead
SPDY   BAD      #egghead!!!aaaaaaa
SPDY   BAD      #egghead2
SPDY   BAD      #egghead-xxxxx

Testing: SPLT
SPLT   GOOD     join egghead
SPLT   GOOD     join #egghead
SPLT   BAD      join #egghead!
SPLT   BAD      join#egghead
SPLT   GOOD     join #egghead for fun
SPLT   GOOD     #egghead
SPLT   BAD      #egghead!!!aaaaaaa
SPLT   BAD      #egghead2
SPLT   BAD      #egghead-xxxxx

Next the procs are tested against a list of 20 identical strings, each string being "hello #egghead join #xxx for great fun"

Code: Select all

Testing 20 strings WITH channel names (10000 times)
BARE:  855 microseconds per iteration (average of 42 microseconds per line)
SPDY:  741 microseconds per iteration (average of 37 microseconds per line)
SPLT:  807 microseconds per iteration (average of 40 microseconds per line)
BARE:  919 microseconds per iteration (average of 45 microseconds per line)
SPDY:  758 microseconds per iteration (average of 37 microseconds per line)
SPLT:  806 microseconds per iteration (average of 40 microseconds per line)

Several tests indicated that the SPDY and SPLT are the fastest, but the differences between the 3 are minor.

Next the procs were tested against 20 identical strings without any channel name in it ("hello world, this is a line"):

Code: Select all

Testing 20 strings WITHOUT channel names (10000 times)
BARE:  364 microseconds per iteration (average of 18 microseconds per line)
SPDY:  235 microseconds per iteration (average of 11 microseconds per line)
SPLT:  753 microseconds per iteration (average of 37 microseconds per line)
BARE:  366 microseconds per iteration (average of 18 microseconds per line)
SPDY:  301 microseconds per iteration (average of 15 microseconds per line)
SPLT:  750 microseconds per iteration (average of 37 microseconds per line)

In this case SPLT has some difficulties to catch up, because it has to check each word in the line up to the last one for the presence of a channel name. These differences can be amplified by adding more words to the line.

Finally, the procs are tested against 20 identical lines containing 8 times the valid channelname as words in it ("#egghead #egghead etc...):

Code: Select all

Testing 20 strings WITH channel names (1000 times)
BARE: 2221 microseconds per iteration (average of 111 microseconds per line)
SPDY: 2283 microseconds per iteration (average of 114 microseconds per line)
SPLT: 1250 microseconds per iteration (average of 62 microseconds per line)
BARE: 2646 microseconds per iteration (average of 132 microseconds per line)
SPDY: 2275 microseconds per iteration (average of 113 microseconds per line)
SPLT: 1230 microseconds per iteration (average of 61 microseconds per line)

In this extreme case the SPLT proc clearly is the fastest. These results can be further amplified by adding the channel name a couple of times more. In such case the SPLT will remain the fastest, followed by BARE and SPDY as last one.

Based on the above results both the SPDY and the SPLT are good solutions. If the proc has to be used to check general IRC conversation, then the SPDY proc is preferred over the SPLT proc. In that case the SPLT proc can be upgraded, by adding the line "if {[string first {#} $string] == -1 } { return 0 }" as the first line in the proc.

It will be very interesting to see how the results compare with a proc using an [lsearch -all -inline] (available in Tcl 8.4) followed by a channelname check.

Finally, as mentioned by stdragon, note that there are no checks for valid word separators in the SPDY and LIST proc, there is some in the BARE proc.
Adding such check, which is left as a trivial exercise for the interested reader, will raise the timing results of the SPDY and the LIST procs in an identical way, since they perform the channel name checks in the same way.

The procs:

Code: Select all

#--------------------------------------------------------------------- 
# one line regexp by stdragon
#--------------------------------------------------------------------- 

proc stringisbadBARE { channel string } {
   if { [regexp -nocase {#(?!egghead(\W|$))} $string] } { return 1 }
   return 0
}

#--------------------------------------------------------------------- 
# split lines and test by strikelight (slightly modified)
#--------------------------------------------------------------------- 

proc stringisbadSPLT { channel string } {
#   if {[string first {#} $string] == -1 } { return 0 }
   foreach word [split $string] { 
      if {![string match "*#*" $word]} { continue }
      if  {[string compare -nocase $word $channel] == 0 } { continue }
      return 1
   } 
   return 0
}

#--------------------------------------------------------------------- 
# review occurances of # in string (egghead/stdragon)
#--------------------------------------------------------------------- 

proc stringisbadSPDY {channel string } { 
   set idx 0 
   while {[set idx [string first {#} $string $idx]] != -1} {
      set space [string first { } $string $idx] 
      if {$space == -1} {set space end} {incr space -1} 
      set word [string range $string $idx $space] 
      if {[string compare -nocase $channel $word ]} {return 1} 
      set idx $space 
   } 
   return 0
}

Post by **caesar** » Tue Nov 12, 2002 1:54 pm

And finaly, wich is the good code? :]

egghead · Post by **egghead** » Tue Nov 12, 2002 5:37 pm

caesar wrote:And finaly, wich is the good code? :]

It is indeed true that the conclusion was extremenly vague and poorly written:

Based on the above results both the SPDY and the SPLT are good solutions

Why don't you give the splt proc a try:

Code: Select all

#---------------------------------------------------------------------
# split lines and test by strikelight (slightly modified) 
#---------------------------------------------------------------------

proc stringisbad { channel string } { 
   if {[string first {#} $string] == -1 } { return 0 } 
   foreach word [split $string] { 
      if {![string match "*#*" $word]} { continue } 
      if  {[string compare -nocase $word $channel] == 0 } { continue }
      return 1 
   } 
   return 0 
}

Post by **caesar** » Mon Nov 18, 2002 12:39 pm

Using a foreach word [split $arg] to chk each word for a channel, makes the eggdrop be a little slow when it's an spam. Isn't there any other alternative, maybe a faster one to do such a chk?