This is the new home of the egghelp.org community forum.
All data has been migrated (including user logins/passwords) to a new phpBB version.


For more information, see this announcement post. Click the X in the top right-corner of this box to dismiss this message.

string match

Old posts that have not been replied to for several years.
e
egghead
Master
Posts: 481
Joined: Mon Oct 29, 2001 8:00 pm
Contact:

Post by egghead »

stdragon wrote:Have you tried a single regexp?

% set line "hello #egghead everybody come to #egghead2"
hello #egghead everybody come to #egghead2
% time {regexp -nocase {#(?!egghead(\W|$))} $line} 1000
9 microseconds per iteration

The code seems a bit simpler :)
The single line regexp works fine with the functionality of the snippets I produced. But before comparing it to the other candidates, I need it in a form where the channel name can be set separately, including the # character. This will mimic the practical situation where a proc is called with 2 arguments: a channel name and a string.
Naturally, I can produce these "preprocessing" lines myself, but they are not necessarily the best ones :)
Once I have these preprocessing lines, I'll put it in a proc form and compare it to the other procs.

Code: Select all

proc stringisbad { channel string } {
   # some preprocessing to strip the # from the 
   # channel name and build the regexp
...
   # regexp test returning either 0 or 1
...
}
e
egghead
Master
Posts: 481
Joined: Mon Oct 29, 2001 8:00 pm
Contact:

Post by egghead »

Call me a freak, but the following snippet is also submitted for competition :)

Code: Select all

proc stringisbad { channel string } {
   # find first occurance of #, abort if none
   # dont like while loops, but competition is fierce :)
   while { [set index [string first "#" $string]] != -1 } {
      # strip first part from string
      set string [string range $string $index end]
      # find occurance of space and set index accordingly
      set indsp [string first { } $string]
      # if space not found set to string end, else set to word end
      if { $indsp == -1 } { set indsp end } else { incr indsp -1 }
      # retrieve channel name
      set channame [string range $string 0 $indsp]
      # compare channel names
      if { ![string equal -nocase $channame $channel] } { return 1 } 
      # update the string
      set string [string range $string $indsp end]
   }
   # iterated on all words, but nothing bad found :)
   return 0
}
User avatar
stdragon
Owner
Posts: 959
Joined: Sun Sep 23, 2001 8:00 pm
Contact:

Post by stdragon »

I think I can improve on that :)

Code: Select all

# based on egghead's code
proc string2isbad {chan line} {
  set idx 0
  while {[set idx [string first {#} $line $idx]] != -1} {
    set space [string first { } $line $idx]
    if {$space == -1} {set space end} {incr space -1}
    set word [string range $line $idx $space]
    if {[string compare -nocase $chan $word ]} {return 1}
    set idx $space
  }
  return 0
}
Of course, both of these don't check for valid word separators like commas, periods, exclamation points, question marks.
e
egghead
Master
Posts: 481
Joined: Mon Oct 29, 2001 8:00 pm
Contact:

Post by egghead »

After some further messing and testing around, 3 candidates remained:

BARE: a single bare line regexp, not (yet) generical for any channel :) (stdragon)
SPDY: a speedy string comparison proc (egghead/stdragon)
SPLT: a good old line splitting and word by word checking (strikelight, minor mod by egghead)

First some differences are highlighted in what each proc considers GOOD/BAD:

Code: Select all

Testing: BARE
BARE   GOOD     join egghead
BARE   GOOD     join #egghead
BARE   GOOD     join #egghead!
BARE   GOOD     join#egghead
BARE   GOOD     join #egghead for fun
BARE   GOOD     #egghead
BARE   GOOD     #egghead!!!aaaaaaa
BARE   BAD      #egghead2
BARE   GOOD     #egghead-xxxxx

Testing: SPDY
SPDY   GOOD     join egghead
SPDY   GOOD     join #egghead
SPDY   BAD      join #egghead!
SPDY   GOOD     join#egghead
SPDY   GOOD     join #egghead for fun
SPDY   GOOD     #egghead
SPDY   BAD      #egghead!!!aaaaaaa
SPDY   BAD      #egghead2
SPDY   BAD      #egghead-xxxxx

Testing: SPLT
SPLT   GOOD     join egghead
SPLT   GOOD     join #egghead
SPLT   BAD      join #egghead!
SPLT   BAD      join#egghead
SPLT   GOOD     join #egghead for fun
SPLT   GOOD     #egghead
SPLT   BAD      #egghead!!!aaaaaaa
SPLT   BAD      #egghead2
SPLT   BAD      #egghead-xxxxx
Next the procs are tested against a list of 20 identical strings, each string being "hello #egghead join #xxx for great fun"

Code: Select all

Testing 20 strings WITH channel names (10000 times)
BARE:  855 microseconds per iteration (average of 42 microseconds per line)
SPDY:  741 microseconds per iteration (average of 37 microseconds per line)
SPLT:  807 microseconds per iteration (average of 40 microseconds per line)
BARE:  919 microseconds per iteration (average of 45 microseconds per line)
SPDY:  758 microseconds per iteration (average of 37 microseconds per line)
SPLT:  806 microseconds per iteration (average of 40 microseconds per line)
Several tests indicated that the SPDY and SPLT are the fastest, but the differences between the 3 are minor.

Next the procs were tested against 20 identical strings without any channel name in it ("hello world, this is a line"):

Code: Select all

Testing 20 strings WITHOUT channel names (10000 times)
BARE:  364 microseconds per iteration (average of 18 microseconds per line)
SPDY:  235 microseconds per iteration (average of 11 microseconds per line)
SPLT:  753 microseconds per iteration (average of 37 microseconds per line)
BARE:  366 microseconds per iteration (average of 18 microseconds per line)
SPDY:  301 microseconds per iteration (average of 15 microseconds per line)
SPLT:  750 microseconds per iteration (average of 37 microseconds per line)
In this case SPLT has some difficulties to catch up, because it has to check each word in the line up to the last one for the presence of a channel name. These differences can be amplified by adding more words to the line.

Finally, the procs are tested against 20 identical lines containing 8 times the valid channelname as words in it ("#egghead #egghead etc...):

Code: Select all

Testing 20 strings WITH channel names (1000 times)
BARE: 2221 microseconds per iteration (average of 111 microseconds per line)
SPDY: 2283 microseconds per iteration (average of 114 microseconds per line)
SPLT: 1250 microseconds per iteration (average of 62 microseconds per line)
BARE: 2646 microseconds per iteration (average of 132 microseconds per line)
SPDY: 2275 microseconds per iteration (average of 113 microseconds per line)
SPLT: 1230 microseconds per iteration (average of 61 microseconds per line)
In this extreme case the SPLT proc clearly is the fastest. These results can be further amplified by adding the channel name a couple of times more. In such case the SPLT will remain the fastest, followed by BARE and SPDY as last one.

Based on the above results both the SPDY and the SPLT are good solutions. If the proc has to be used to check general IRC conversation, then the SPDY proc is preferred over the SPLT proc. In that case the SPLT proc can be upgraded, by adding the line "if {[string first {#} $string] == -1 } { return 0 }" as the first line in the proc.

It will be very interesting to see how the results compare with a proc using an [lsearch -all -inline] (available in Tcl 8.4) followed by a channelname check.

Finally, as mentioned by stdragon, note that there are no checks for valid word separators in the SPDY and LIST proc, there is some in the BARE proc.
Adding such check, which is left as a trivial exercise for the interested reader, will raise the timing results of the SPDY and the LIST procs in an identical way, since they perform the channel name checks in the same way.

The procs:

Code: Select all

#--------------------------------------------------------------------- 
# one line regexp by stdragon
#--------------------------------------------------------------------- 

proc stringisbadBARE { channel string } {
   if { [regexp -nocase {#(?!egghead(\W|$))} $string] } { return 1 }
   return 0
}

#--------------------------------------------------------------------- 
# split lines and test by strikelight (slightly modified)
#--------------------------------------------------------------------- 

proc stringisbadSPLT { channel string } {
#   if {[string first {#} $string] == -1 } { return 0 }
   foreach word [split $string] { 
      if {![string match "*#*" $word]} { continue }
      if  {[string compare -nocase $word $channel] == 0 } { continue }
      return 1
   } 
   return 0
}

#--------------------------------------------------------------------- 
# review occurances of # in string (egghead/stdragon)
#--------------------------------------------------------------------- 

proc stringisbadSPDY {channel string } { 
   set idx 0 
   while {[set idx [string first {#} $string $idx]] != -1} {
      set space [string first { } $string $idx] 
      if {$space == -1} {set space end} {incr space -1} 
      set word [string range $string $idx $space] 
      if {[string compare -nocase $channel $word ]} {return 1} 
      set idx $space 
   } 
   return 0
} 
User avatar
caesar
Mint Rubber
Posts: 3778
Joined: Sun Oct 14, 2001 8:00 pm
Location: Mint Factory

Post by caesar »

And finaly, wich is the good code? :]
Once the game is over, the king and the pawn go back in the same box.
e
egghead
Master
Posts: 481
Joined: Mon Oct 29, 2001 8:00 pm
Contact:

Post by egghead »

caesar wrote:And finaly, wich is the good code? :]
It is indeed true that the conclusion was extremenly vague and poorly written:
Based on the above results both the SPDY and the SPLT are good solutions
Why don't you give the splt proc a try:

Code: Select all

#---------------------------------------------------------------------
# split lines and test by strikelight (slightly modified) 
#---------------------------------------------------------------------

proc stringisbad { channel string } { 
   if {[string first {#} $string] == -1 } { return 0 } 
   foreach word [split $string] { 
      if {![string match "*#*" $word]} { continue } 
      if  {[string compare -nocase $word $channel] == 0 } { continue }
      return 1 
   } 
   return 0 
} 
User avatar
caesar
Mint Rubber
Posts: 3778
Joined: Sun Oct 14, 2001 8:00 pm
Location: Mint Factory

Post by caesar »

Using a foreach word [split $arg] to chk each word for a channel, makes the eggdrop be a little slow when it's an spam. Isn't there any other alternative, maybe a faster one to do such a chk?
Once the game is over, the king and the pawn go back in the same box.
Locked