
string match

Nexus6
Op
Posts: 114
Joined: Mon Sep 02, 2002 4:41 am
Location: Tuchola, Poland

string match

Post by Nexus6 »

I'm using a script that bans people if they say *www.* or *#* in public. It works OK, but I don't want the bot to do anything if someone says #channel on #channel. I tried this:

Code: Select all

 if {[string match "#channel" [string tolower $arg]]} {return 0}
I thought that would work, but it doesn't, and "*#channel*" doesn't fix the problem either, because people who spammed #channel2 wouldn't get banned.
How can I make the bot do nothing if someone says "Hello....#channel" or "#channel is ok"?

Thanks for ideas.
ppslim
Revered One
Posts: 3914
Joined: Sun Sep 23, 2001 8:00 pm
Location: Liverpool, England

Post by ppslim »

For this, you could use 2 string matches.

This assumes the incoming text is stored in $arg and the channel name in $chan (these should have been set in the proc line).

Code: Select all

if {[string match -nocase "*$chan *" $arg] || [string match -nocase "*$chan" $arg]} { return 0 }
Two matches are needed to keep #channel2 (as you said) from slipping through.

The first matches text where the channel name is followed by a space before the next word.

The second lets a user say the channel name as the last word.

Note that this approach is still flawed. I could go into #channel and say:
Heya #channel - Why no come to #xxx and for xxx pics
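A minimal sketch of the two-pattern check described above, wrapped in a helper so it can be tried in tclsh. The proc name says_own_channel and the test lines are illustrative assumptions, not part of the original script.

```tcl
# Hypothetical helper: returns 1 if $text mentions $chan followed by a space
# or as the very last word, 0 otherwise.
# Assumes $chan contains no glob characters (star, question mark, brackets, backslash).
proc says_own_channel {chan text} {
    if {[string match -nocase "*$chan *" $text] || [string match -nocase "*$chan" $text]} {
        return 1
    }
    return 0
}

foreach text {
    "Hello....#channel"
    "#channel is ok"
    "join #channel2 now"
    "Heya #channel - Why no come to #xxx and for xxx pics"
} {
    puts [format "%d  %s" [says_own_channel "#channel" $text] $text]
}
```

The #channel2 line is the only one reported as 0. The last line is still reported as 1, which is the flaw mentioned above: one allowed mention hides the spam that follows it.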
strikelight
Owner
Posts: 708
Joined: Mon Oct 07, 2002 10:39 am
Contact:

Re: string match

Post by strikelight »

You probably would rather be using something like:

Code: Select all

if {[lsearch -exact [split [string tolower $arg]] [string tolower "#channel"]] != -1} { return 0 }
This way the bot does nothing only when the exact channel name is mentioned as a word of its own.
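A small sketch of the same lsearch idea as a reusable check. The proc name has_exact_channel is a made-up name for illustration.

```tcl
# Hypothetical helper: returns 1 if $chan appears in $text as a whole
# space-separated word (case-insensitive), 0 otherwise.
proc has_exact_channel {chan text} {
    set words [split [string tolower $text]]
    expr {[lsearch -exact $words [string tolower $chan]] != -1}
}

puts [has_exact_channel "#channel" "#channel is ok"]
puts [has_exact_channel "#channel" "join #channel2"]
puts [has_exact_channel "#channel" "Hello....#channel"]
```

Because the comparison is per word, "Hello....#channel" is not recognised: the word there is "hello....#channel", not "#channel".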
egghead
Master
Posts: 481
Joined: Mon Oct 29, 2001 8:00 pm
Contact:

Re: string match

Post by egghead »

Nexus, ppslim has a good point that a user could come in and do "Heya #channel - Why no come to #xxx and for xxx pics".

After messing around for a while with lists, scans and regexp here is an attempt:

Code: Select all

#---------------------------------------------------------------------
# proc stringisbad checks a string for channel names (# chans only!)
# returns 1 if string contains undesired channelnames
# returns 0 if string does not contain undesired channelnames
#---------------------------------------------------------------------

proc stringisbad { string } {
   # allowed channel
   set channel #egghead
   # iteration counter (limit number of iterations)
   set count 0
   # regexp rule
   set rule {^[^#]*(#[^ ]*)(.*)}
   # iterate on the string, breaking at every "#channame"
   while {[regexp $rule $string match channame string] == 1} {
      # limit the number of iterations.
      # more than 5: consider it a bad string.
      incr count
      if { $count > 5 } { return 1 }
      # compare the channame with the allowed channelname
      if { [string compare $channel $channame] != 0 } { return 1 }
   }
   return 0
} 
running the test:

Code: Select all

#---------------------------------------------------------------------
# test
#---------------------------------------------------------------------

lappend strings "join egghead"
lappend strings "join #egghead"
lappend strings "join #egghead for fun"
lappend strings "#egghead"
lappend strings "#egghead2"
lappend strings "#egghead2 for fun"
lappend strings "join #egghead2"
lappend strings "join #egghead2 for xxx"
lappend strings "hi #egghead join #xxx "
lappend strings "join #xxx for xxx"
lappend strings "#xxx for xxx"
lappend strings "#xxx"
lappend strings "####xxx"
lappend strings "#egghead #egghead #egghead #xxxx"
lappend strings "#egghead #xxxx #egghead"
lappend strings "#"
lappend strings "xxx#"
lappend strings "##egghead"

proc mput { keyword string } {
   set line [format "%-8s %s" $keyword $string]
   puts $line
}

foreach line $strings {
   if {[stringisbad $line]} {
      mput BAD $line
   } else {
      mput GOOD $line
   }
}
results in

Code: Select all

GOOD     join egghead
GOOD     join #egghead
GOOD     join #egghead for fun
GOOD     #egghead
BAD      #egghead2
BAD      #egghead2 for fun
BAD      join #egghead2
BAD      join #egghead2 for xxx
BAD      hi #egghead join #xxx 
BAD      join #xxx for xxx
BAD      #xxx for xxx
BAD      #xxx
BAD      ####xxx
BAD      #egghead #egghead #egghead #xxxx
BAD      #egghead #xxxx #egghead
BAD      #
BAD      xxx#
BAD      ##egghead
comments, suggestions welcome.
strikelight
Owner
Posts: 708
Joined: Mon Oct 07, 2002 10:39 am
Contact:

Re: string match

Post by strikelight »

Or..

Code: Select all

set bad 0
foreach word [split $arg] {
  if {[string match "#*" $word] && [string tolower $word] != [string tolower $chan]} {
    set bad 1
  }
}
if {!$bad} { return }
or to cut down on iterations while still actually verifying each channel name..

Code: Select all

set bad 0
set mylist [lsort [split [string tolower $arg]]]
set aloc [lsearch $mylist #*]
if {$aloc == -1} { return }
set mylist [lrange $mylist $aloc end]
foreach word $mylist {
  if {![string match "#*" $word]} { break }
  if {$word != [string tolower $chan]} { set bad 1 }
}
if {!$bad} { return }
Last edited by strikelight on Wed Oct 30, 2002 9:05 pm, edited 1 time in total.
egghead
Master
Posts: 481
Joined: Mon Oct 29, 2001 8:00 pm
Contact:

Re: string match

Post by egghead »

strikelight wrote:
[snip]

Or..

Code: Select all

set bad 0
foreach word [split $arg] {
  if {[string match "#*" $word] && [string tolower $word] != [string tolower $chan]} {
    set bad 1
  }
}
if {!$bad} { return }
Aside from needing a "break" right after "set bad 1" is reached, there is another issue:
Hello #egghead, join#xxx for xxx, join#xxx!!!
:roll:

Another drawback of a word-by-word check is that it is not cheap for the typical 5 to 10 word sentence that contains no "#" character at all. Checking for the presence of a "#" in the string before going word by word may solve this.
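A rough sketch of the word-by-word loop with both suggestions applied: a cheap "#" pre-check and an early exit at the first bad word. The proc name has_foreign_channel is an assumption for illustration, and the loop keeps the "#*" pattern, so glued forms such as "join#xxx" are still not caught.

```tcl
# Hypothetical helper: returns 1 if $text contains a word starting with "#"
# that is not $chan, 0 otherwise.
proc has_foreign_channel {chan text} {
    # Cheap pre-check: a line without any "#" cannot contain a channel name.
    if {[string first "#" $text] == -1} { return 0 }
    set chan [string tolower $chan]
    foreach word [split [string tolower $text]] {
        if {[string match "#*" $word] && ![string equal $word $chan]} {
            # Stop at the first offending word instead of scanning the rest.
            return 1
        }
    }
    return 0
}

puts [has_foreign_channel "#egghead" "hello world, this is a line"]
puts [has_foreign_channel "#egghead" "join #egghead for fun"]
puts [has_foreign_channel "#egghead" "hi #egghead join #xxx"]
```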
Last edited by egghead on Wed Oct 30, 2002 9:15 pm, edited 1 time in total.
strikelight
Owner
Posts: 708
Joined: Mon Oct 07, 2002 10:39 am
Contact:

Re: string match

Post by strikelight »

egghead wrote:
strikelight wrote:
[snip]

Or..

Code: Select all

set bad 0
foreach word [split $arg] {
  if {[string match "#*" $word] && [string tolower $word] != [string tolower $chan]} {
    set bad 1
  }
}
if {!$bad} { return }
Aside from having a "break" right when the "set bad 1" has been reached, there is another issue:
Hello #egghead, join#xxx for xxx, join#xxx!!!
:roll:
I thought I had read him say something that he didn't..
in any event, you would simply change #* to *#*
Chances are you would never make a spam phrase with an allowed channel name... ie. join#channel or hello#channel...
egghead
Master
Posts: 481
Joined: Mon Oct 29, 2001 8:00 pm
Contact:

Re: string match

Post by egghead »

strikelight wrote:
or to cut down on iterations while still actually verifying each channel name..

Code: Select all

set bad 0
set mylist [lsort [split [string tolower $arg]]]
set aloc [lsearch $mylist #*]
if {$aloc == -1} { return }
set mylist [lrange $mylist $aloc end]
foreach word $mylist {
  if {![string match "#*" $word]} { break }
  if {$word != [string tolower $chan]} { set bad 1 }
}
if {!$bad} { return }
Actually, Tcl 8.4 has an [lsearch -all -inline] option which in principle should be able to produce a list of all matches immediately. Unfortunately I'm still on Tcl 8.3, so I couldn't test that idea :)
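For readers on Tcl 8.4 or later, a short sketch of what that idea might look like. The proc name channel_words is hypothetical, and this variant was not part of the timings in this thread.

```tcl
# Requires Tcl 8.4 or later for the -all and -inline options of lsearch.
# Hypothetical helper: returns every word of $text that starts with "#",
# lower-cased, as a list.
proc channel_words {text} {
    lsearch -all -inline -glob [split [string tolower $text]] "#*"
}

set found [channel_words "hi #egghead join #xxx now"]
puts "[llength $found] channel-like words: [join $found {, }]"
```

Each element of the returned list can then be compared against the allowed channel name, as in the foreach loops above.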
strikelight
Owner
Posts: 708
Joined: Mon Oct 07, 2002 10:39 am
Contact:

Re: string match

Post by strikelight »

egghead wrote:

Another drawback of doing a word by word check, is that it's not cheap on the average say 5 to 10 word sentences not containing any "#" char. Maybe doing a check for the presence of a # char in the string before going to a word by word check can solve this.
Already addressed this in my modification on my post containing the example, with the lsort, lrange, etc...
strikelight
Owner
Posts: 708
Joined: Mon Oct 07, 2002 10:39 am
Contact:

Re: string match

Post by strikelight »

egghead wrote:
Actually, Tcl8.4 has an [lsearch -all -inline] option which in principle should be able to produce immediately a list of matches. Unfortunately
I'm still on Tcl8.3, so I couldn't test that idea :)
Same here... :-?
egghead
Master
Posts: 481
Joined: Mon Oct 29, 2001 8:00 pm
Contact:

Mudding

Post by egghead »

After some further mudding around, four solutions were tested:

1. REXP: solution using regexp, first script egghead
2. SPLT: iterate on each word in a string, first script strikelight
3. MODF: iterate on each word, modified version strikelight
4. FAST: iterate on each word using string functions only (code given below)

Running the full set of lines given previously containing channel names, produced the following results on the shell (averaged over 10000 iterations).

REXP: 1611 microseconds per iteration
SPLT: 740 microseconds per iteration
MODF: 970 microseconds per iteration
FAST: 639 microseconds per iteration

Clearly the REXP solution is the most expensive one, taking roughly twice as long as each of the other 3 solutions. Another observation is that the modified script (MODF) does not perform better than the script that splits the string into words and checks each word (SPLT).

Next, a set of 20 strings NOT containing a #channel name, each line having 6 words ("hello world, this is a line"), was tested.

The timing results:
REXP: 150 microseconds per iteration
SPLT: 991 microseconds per iteration
MODF: 781 microseconds per iteration
FAST: 164 microseconds per iteration

In this case the REXP is dead cheap as it directly detects that there are no "#" characters in the line.

To mimic the behaviour of a "bind PUBM - {* *#*}", the line "if {[string first "#" $string] == -1 } { return 0 }" was put right at the start of the SPLT and MODF scripts (i.e. first checking for the presence of a # character in the string). The following results were obtained:

REXP: 150 microseconds per iteration
SPLT: 153 microseconds per iteration
MODF: 157 microseconds per iteration
FAST: 156 microseconds per iteration
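Timings like the ones above can be taken with Tcl's built-in time command. The harness below is only a sketch of how such a measurement might be set up: stringisbad_stub is a stand-in for whichever variant is being measured, and the two test lines are placeholders, not the sets used for the figures in this post.

```tcl
# Stand-in for the proc under test (hypothetical; replace with a real variant).
proc stringisbad_stub {string} {
    if {[string first "#" $string] == -1} { return 0 }
    return 1
}

# Placeholder input set.
set lines [list "hello world, this is a line" "join #egghead for fun"]

# Run the whole set once per iteration and average over 10000 iterations.
set result [time { foreach line $lines { stringisbad_stub $line } } 10000]
puts $result
```

The time command reports the average as "N microseconds per iteration", the same format as the figures quoted above.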

As stated elsewhere in this thread, the option of using [lsearch -all -inline] (Tcl 8.4) has not been tested.

All in all, the idea of splitting the line and checking each word resulted in another piece, shown below. This script produced the results under FAST. Note that this script will consider "join#egghead" as good and "join#xxx" as bad.

Code: Select all

proc stringisbad { string } {
   # omit sentences without # in it
   if {[set index [string first "#" $string]] == -1 } { return 0 }
   # remove first part of the string
   set string [string range $string $index end]
   # allowed channel
   set channel #egghead
   # split the line into a list of words
   set list [split $string]
   # iterate on the words
   foreach word $list {
      # if word does not contain a "#" then continue ...
      if { [set index [string first "#" $word]] == -1 } { continue }
      # ... else retrieve channel name
      set channame [string range $word $index end]
      # if found name is not allowed name then string is bad
      if { $channame != $channel } { return 1 }
   }
   # iterated on all words, but nothing bad found :)
   return 0
}
stere0
Halfop
Posts: 47
Joined: Sun Sep 23, 2001 8:00 pm
Location: Brazil

full

Post by stere0 »

So, which is the final full script? Can you paste it or send it to me?
[]s
ppslim
Revered One
Posts: 3914
Joined: Sun Sep 23, 2001 8:00 pm
Location: Liverpool, England

Post by ppslim »

May I include this in no-spam?

This is something people have been crying out for, but I have been unable to generate something small and fast enough to deal with high-load channels - nospam is packed as it is.
egghead
Master
Posts: 481
Joined: Mon Oct 29, 2001 8:00 pm
Contact:

Post by egghead »

If you intend to use snippets of this thread produced by myself, they are free to use (if of any use). A "bind pubm - {* *#*} proc" definitely will come in handy :)
stdragon
Owner
Posts: 959
Joined: Sun Sep 23, 2001 8:00 pm
Contact:

Post by stdragon »

Have you tried a single regexp?

% set line "hello #egghead everybody come to #egghead2"
hello #egghead everybody come to #egghead2
% time {regexp -nocase {#(?!egghead(\W|$))} $line} 1000
9 microseconds per iteration

The code seems a bit simpler :)
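A sketch of how that single regexp might be dropped into a proc with the same return convention as stringisbad. The name stringisbad_re is hypothetical and the allowed channel "egghead" is hard-coded in the pattern, as in the line above.

```tcl
# Hypothetical wrapper: returns 1 (bad) if the line contains a "#" that is not
# followed by "egghead" plus a non-word character or the end of the line.
proc stringisbad_re {line} {
    regexp -nocase {#(?!egghead(\W|$))} $line
}

foreach line {
    "join egghead"
    "join #egghead"
    "#egghead2"
    "hi #egghead join #xxx"
} {
    if {[stringisbad_re $line]} {
        puts [format "%-8s %s" BAD $line]
    } else {
        puts [format "%-8s %s" GOOD $line]
    }
}
```

The negative lookahead (?!...) is what rejects #egghead2: after "egghead" comes "2", which is neither a non-word character nor the end of the line, so that "#" counts as a foreign channel.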