split ||

Post by **arcane** » Mon Aug 11, 2003 8:58 am

Hi,
does anyone have a solution for this?

I've got a string "a||b||c||d" and I want to split it into a, b, c and d. My problem is that Tcl won't split by "||". It just splits by "|" and gives me too many parts.

Code:

set test "a||b||c||d"
set length [llength [split $test "||"]]

length is now "7".

Code:

set test "a|b|c|d"
set length [llength [split $test "||"]]

length is now "4".

I've tried everything I could think of (split "\\||", split {||}, split "\|\|"...). None worked. Can you help me?

Post by **user** » Mon Aug 11, 2003 9:28 am

Check the manual.
'split' splits on ALL the chars specified in the second argument (if any).

Code:

regexp -all -inline {[^|]+} $yourString

would return a list like you want, but doesn't care how many |'s there are between the "elements".
Another solution is replacing || with some single char not used anywhere else in your content (using 'string map' or 'regsub') and then split by that char.
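
A minimal sketch of the three behaviours described in this post, using arcane's sample string. The sentinel character \u0001 is an assumption; any single character that cannot occur in the data would do.

```tcl
set test "a||b||c||d"

# split treats "||" as a set of characters, so every single "|" is a
# separator and the empty strings between two "|" become list elements.
puts [llength [split $test "||"]]          ;# 7

# regexp -all -inline returns every run of non-"|" characters.
# Empty fields are lost: "a||||b" would give just a and b.
puts [regexp -all -inline {[^|]+} $test]   ;# a b c d

# Replace the two-character separator with one sentinel character,
# then split on that character. \u0001 is an assumed sentinel.
set sep "\u0001"
puts [split [string map [list "||" $sep] $test] $sep]   ;# a b c d
```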

Post by **ppslim** » Tue Aug 12, 2003 7:54 am

Code:

proc chunk {in chars} {
  # No separator present: the whole string is the only element.
  if {[string first $chars $in] < 0} { return [list $in] }
  set temp [list]
  set chunks 0   ;# start index of the current piece
  set chunke 0   ;# index of the next separator
  while {[set chunke [string first $chars $in $chunks]] != -1} {
    lappend temp [string range $in $chunks [expr {$chunke - 1}]]
    set chunks [expr {$chunke + [string length $chars]}]
  }
  # Append whatever follows the last separator, unless it is empty.
  if {[string length [string range $in $chunks end]]} {
    lappend temp [string range $in $chunks end]
  }
  return $temp
}

Similar to split; however, it splits on whole chunks like you asked.

% set a "123,@.456,@.789,@.abc,@.def,>.ghi,@.jklm"
123,@.456,@.789,@.abc,@.def,>.ghi,@.jklm

% chunk $a ",@."
123 456 789 abc def,>.ghi jklm

% chunk $a ",>."
123,@.456,@.789,@.abc,@.def ghi,@.jklm

Post by **user** » Tue Aug 12, 2003 9:29 am

This check:

ppslim wrote: if {[string length [string range $in $chunks end]]} {

will lead to invalid results if the string ends with the chars you're "splitting" by (that should result in an empty element at the end).
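
To make the difference concrete, here is a small sketch comparing the built-in split with a copy of ppslim's chunk (expr arguments braced, otherwise the same logic) on input that ends with the separator. The sample strings are made up for illustration.

```tcl
# The built-in split keeps the empty field after a trailing separator.
puts [llength [split "a|b|" "|"]]      ;# 3: a, b and an empty string

# Copy of ppslim's chunk proc from earlier in the thread.
proc chunk {in chars} {
    if {[string first $chars $in] < 0} { return [list $in] }
    set temp [list]
    set chunks 0
    while {[set chunke [string first $chars $in $chunks]] != -1} {
        lappend temp [string range $in $chunks [expr {$chunke - 1}]]
        set chunks [expr {$chunke + [string length $chars]}]
    }
    # This is the check in question: an empty remainder is skipped.
    if {[string length [string range $in $chunks end]]} {
        lappend temp [string range $in $chunks end]
    }
    return $temp
}

# chunk drops the trailing empty field.
puts [llength [chunk "a||b||" "||"]]   ;# 2: a and b
```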

Here's a rewrite of ppslim's proc that should produce results more like the original split:

Code:

proc chop {str {by " "}} {
	set l [string length $by]
	set i 0
	set j 0
	while {[set j [string first $by $str $i]]>-1} {
		lappend out [string range $str $i [expr {$j-1}]]
		set i [expr {$j+$l}]
	}
	if {$i<=[string len $str]} {
		lappend out [string range $str $i end]
	}
	set out
}

EDIT: I still think

Code:

proc chop {str {by "  "} {re \0}} {
  split [string map [list $by $re] $str] $re
}

is better (at least for text received from IRC)
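
A quick sketch to check that the two versions agree on a few edge cases (leading, trailing and doubled separators). The names chopLoop and chopMap are hypothetical, used only so both procs can be loaded at once, and the test strings are made up.

```tcl
# chopLoop: the loop-based version from this post, under an assumed name.
proc chopLoop {str {by " "}} {
    set l [string length $by]
    set i 0
    set j 0
    while {[set j [string first $by $str $i]] > -1} {
        lappend out [string range $str $i [expr {$j - 1}]]
        set i [expr {$j + $l}]
    }
    if {$i <= [string length $str]} {
        lappend out [string range $str $i end]
    }
    set out
}

# chopMap: the string map version from this post, under an assumed name.
proc chopMap {str {by " "} {re \0}} {
    split [string map [list $by $re] $str] $re
}

# Print the element counts and whether both procs return the same list.
foreach s {a||b||c||d ||a||b a||b|| a||||b abc} {
    set r1 [chopLoop $s "||"]
    set r2 [chopMap $s "||"]
    puts [format "%-12s loop=%d map=%d same=%d" \
        $s [llength $r1] [llength $r2] [expr {$r1 eq $r2}]]
}
```

For these inputs both procs should return identical lists, including the empty elements.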

Post by **strikelight** » Tue Aug 12, 2003 2:19 pm

user wrote: This check:
ppslim wrote: if {[string length [string range $in $chunks end]]} {
will lead to invalid results if the string ends with the chars you're "splitting" by (that should result in an empty element at the end).

Here's a rewrite of ppslim's proc that should produce results more like the original split:
Code:
proc chop {str {by " "}} {
	set l [string length $by]
	set i 0
	set j 0
	while {[set j [string first $by $str $i]]>-1} {
		lappend out [string range $str $i [expr {$j-1}]]
		set i [expr {$j+$l}]
	}
	if {$i<=[string len $str]} {
		lappend out [string range $str $i end]
	}
	set out
}
EDIT: I still think
Code:
proc chop {str {by "  "} {re \0}} {
  split [string map [list $by $re] $str] $re
}
is better (at least for text received from IRC)

It most definitely is better for ANY text... not only because of code size, but also CPU-time-wise. The previous implementation would render approximately O(8n) instructions whereas the second one only renders about O(3) instructions... so if the text was 128 chars long, the first one would be issuing about 384 instructions (worst case scenario) to the processor, as opposed to the mere 3 instructions sent out by the shorter proc. Although I would have used \x81 instead of \0 myself.

Post by **stdragon** » Tue Aug 12, 2003 9:18 pm

Just to nitpick, you're assuming that the operations in question have the same penalty time-wise, but that's wrong. If you think about it, "string map" and "split" both cycle through the entire string searching. Both procs are O(n).

Post by **strikelight** » Tue Aug 12, 2003 10:09 pm

stdragon wrote:Just to nitpick, you're assuming that the operations in question have the same penalty time-wise, but that's wrong. If you think about it, "string map" and "split" both cycle through the entire string searching. Both procs are O(n).

I was referring to the 'string map' versus the proc initially proposed by ppslim, which uses a while loop, as well as many other functions, which is obviously going to require a larger O notation.
And if you are referring to the proc which does use both split and string map and my calculation of big-O, you will see the word 'about' in my estimation (which is what O notation is).. It would be O(n+n) = O(2n) then.. and even then it's probably less, because when you think about it, if you were to use regsub in place of string map, you would find it takes longer in practical tests.. so assuming the regsub would be O(n), then string map < O(n) .. Nitpick nitpicked.

Post by **user** » Tue Aug 12, 2003 10:34 pm

strikelight wrote: It most definitely is better for ANY text...

Not if the text can contain any char. Then it's useless.
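
A sketch of the problem: if the replacement character also occurs in the input, the text is split there as well. "#" stands in for the sentinel here purely for readability (an assumption); the same would happen with \0 or \x81 if the data contained that character.

```tcl
proc chop {str {by " "} {re \0}} {
    split [string map [list $by $re] $str] $re
}

# The input has three "||"-separated fields, but the middle field
# contains the sentinel "#", so it is split as well.
set fields [chop "a||b#c||d" "||" "#"]
puts [llength $fields]   ;# 4
puts $fields             ;# a b c d
```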

strikelight wrote:the previous implementation would render approximately O(8n) instructions whereas the second one only renders about O(3) instructions ... so if the text was 128 chars long, the first one would be issuing about 384 instructions (worst case scenario) to the processor, as opposed to the mere 3 instructions sent out by the shorter proc.

By "instructions" I assume you mean command invocations, and counting them, like stdragon said, makes little sense.

Why think when we've got "time"?

I named the three procs from this thread in the order they were posted and timed them:

Code:

set a "ab||cde||fghi||jklmn||opqrst||uvwxyz0||12345678||"

foreach cmd {chop1 chop2 chop3} {
  puts "$cmd: [time [list $cmd $a ||] 10000]"
}

Result:

chop1: 377 microseconds per iteration
chop2: 118 microseconds per iteration
chop3: 36 microseconds per iteration

strikelight wrote:Although I would have used \x81 instead of \0 myself

WHY?
\x81 can be sent via irc, \0 can't. (unless it's encoded in a ctcp iirc) That's my reason for using \0.

Post by **strikelight** » Tue Aug 12, 2003 10:43 pm

user wrote:
strikelight wrote: It most definitely is better for ANY text...
Not if the text can contain any char. Then it's useless.

Hence the \x81 further on..

user wrote:
strikelight wrote:the previous implementation would render approximately O(8n) instructions whereas the second one only renders about O(3) instructions ... so if the text was 128 chars long, the first one would be issuing about 384 instructions (worst case scenario) to the processor, as opposed to the mere 3 instructions sent out by the shorter proc.
By "instructions" I assume you mean command invocations, and counting them, like stdragon said, makes little sense.

O-notation is largely used in computer science... To call it senseless is pure ignorance. I suggest researching "O Notation" on Google.

Post by **user** » Tue Aug 12, 2003 10:53 pm

strikelight wrote: Hence the \x81 further on..

I still don't get it.

strikelight wrote: O-notation is largely used in computer science... To call it senseless is pure ignorance. I suggest researching "O Notation" on Google.

I didn't call O-notation senseless. What I meant is that it's very inaccurate when applied to uncompiled Tcl code.

Post by **stdragon** » Wed Aug 13, 2003 1:28 am

strikelight wrote:
stdragon wrote:Just to nitpick, you're assuming that the operations in question have the same penalty time-wise, but that's wrong. If you think about it, "string map" and "split" both cycle through the entire string searching. Both procs are O(n).
I was referring to the 'string map' versus the proc initially proposed by ppslim, which uses a while loop, as well as many other functions, which is obviously going to require a larger O notation.

I was referring to those two procs too. Ppslim's does not require a bigger O notation, because "string map" is itself a function that uses a loop and many other functions. It is not a constant-time function. So the two procs are basically the same in terms of efficiency -- although ppslim's is slower overall because it's implemented in tcl instead of C. That doesn't change its O value.

strikelight wrote: And if you are referring to the proc which does use both split and string map and my calculation of big-O, you will see the word 'about' in my estimation (which is what O notation is).. It would be O(n+n) = O(2n) then.. and even then it's probably less, because when you think about it, if you were to use regsub in place of string map, you would find it takes longer in practical tests.. so assuming the regsub would be O(n), then string map < O(n) .. Nitpick nitpicked.

O(8n) = O(2n) = O(n) (you ignore constants). That's why I said both are O(n).

Just to clear this up: the purpose of big-O notation is to estimate the change in something like memory usage or running time relative to a change in input (n). In this case we're talking about string length. So if you double the string length, an O(n) algorithm will take double the time to finish. You can see that (2n) / (n) = 2, twice the time. If you have O(8n), you get (8 * 2n) / (8 * n) = 2 (same as O(n)). If you have an O(n^2) algorithm, you get ((2n)^2) / (n^2) = 4, which means it takes 4 times as long when you double the input.
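
The doubling argument can also be checked empirically. The following is only a rough sketch: chopLoop and chopMap are hypothetical names for the two chop procs from this thread, and the input sizes and iteration count are arbitrary assumptions. Ratios near 2 would be consistent with O(n) for both; the absolute numbers depend on the machine and Tcl version.

```tcl
proc chopLoop {str {by " "}} {
    set l [string length $by]
    set i 0
    while {[set j [string first $by $str $i]] > -1} {
        lappend out [string range $str $i [expr {$j - 1}]]
        set i [expr {$j + $l}]
    }
    lappend out [string range $str $i end]
    return $out
}

proc chopMap {str {by " "} {re \0}} {
    split [string map [list $by $re] $str] $re
}

# Inputs with n and 2n fields.
set small [string repeat "ab||" 1000]
set large [string repeat "ab||" 2000]

foreach cmd {chopLoop chopMap} {
    # time returns "N microseconds per iteration"; keep only the number.
    set t1 [lindex [time [list $cmd $small "||"] 100] 0]
    set t2 [lindex [time [list $cmd $large "||"] 100] 0]
    puts [format "%s: %.1f us -> %.1f us (ratio %.2f)" \
        $cmd $t1 $t2 [expr {double($t2) / $t1}]]
}
```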

Also, what does comparing regsub and string map have to do with anything? Ppslim's original proc didn't use regsub so I don't see where that comparison is going.. but even so, it's wrong, because regsub is not always O(n), it can be higher, like O(n^2) for certain operations, or lower, like O(1) for other operations.

nitpicked nitpick nitpick nitpicked :)

Post by **strikelight** » Wed Aug 13, 2003 1:48 am

stdragon wrote:
O(8n) = O(2n) = O(n) (you ignore constants). That's why I said both are O(n).

Yes, this is true.. However, because tcl execution is quite slow in comparison to compiled languages, the coefficients become increasingly harder to simply discard.

stdragon wrote: Also, what does comparing regsub and string map have to do with anything? Ppslim's original proc didn't use regsub so I don't see where that comparison is going.. but even so, it's wrong, because regsub is not always O(n), it can be higher, like O(n^2) for certain operations, or lower, like O(1) for other operations.

I brought it up as a comparison for the short 'chop' procedure...
ie. regsub -all $by $str $re str in place of the string map...

So either (from timed results):
a) regsub does extra work to do the same thing as done by the string map (> O(n))
b) string map doesn't take an iterative-search approach to implementing its changes (< O(n))

Nitpick nitpicked nitpicked nitpick nitpick nitpicked
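
For anyone trying the regsub variant: regsub takes a regular expression, and "|" is the alternation operator there, so a literal "||" separator has to be escaped. A minimal sketch, with \u0001 as an assumed sentinel character:

```tcl
set str "a||b||c||d"
set sep "\u0001"   ;# assumed sentinel, must not occur in the data

# string map replaces the literal text "||".
set viaMap [split [string map [list "||" $sep] $str] $sep]

# regsub needs the "|" characters escaped to match them literally.
regsub -all {\|\|} $str $sep tmp
set viaRegsub [split $tmp $sep]

puts $viaMap                           ;# a b c d
puts [expr {$viaMap eq $viaRegsub}]    ;# 1
```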

egghelp/eggheads community
