Hi,
I'm going to run some trivia on a channel, and I have about 11000 questions in a file. Some of them are repeated, and I want to remove the *clones* and leave only one copy of each. I'd also like the questions that are too long to be removed. Any ideas? I was thinking of getting the *line* length and removing the line if it's bigger than *x*, or something like this..
Once the game is over, the king and the pawn go back in the same box.
If the whole thing is read into a table, you would only have to do two foreach loops and check each line against every other. This could take a while, but it's a one-time script, so who cares? For the length... isn't there a line-length count command!? Just include this check ^-^
For clone detection:
I think it saves time if you first sort the lines and then compare each one only against the next...
C++: if you're interested in a Windows app, I could help you.
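The sort-first idea can also be tried quickly with standard Unix tools before writing any real code: sorting makes duplicates adjacent, so a single pass over neighbours is enough. This is only a sketch assuming GNU coreutils and awk; the file names and the 200-character cutoff are made-up example values.

```shell
# Make a small sample file: one duplicated question and one over-long line.
printf 'Who wrote Hamlet?\nWho wrote Hamlet?\nShort one?\n%s\n' \
  "$(printf 'x%.0s' $(seq 1 250))" > questions.txt
# Drop lines longer than 200 characters, then sort so duplicates become
# adjacent, then keep one copy of each (uniq only compares neighbouring
# lines, which is why the sort must come first).
awk 'length($0) <= 200' questions.txt | sort | uniq > filtered.txt
```

`sort -u` does the sort and the uniq in one step; the trade-off of any sort-based approach is that the original question order is lost.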
#!/usr/bin/tclsh
proc out {text} {
puts stdout $text
}
proc usage {} {
global argv0
out "Usage: $argv0 \[InputFile\] \[OutputFile\] <MaxLen>"
out "\[InputFile\] : File to filter for duplicate or long entries"
out "\[OuputFile\] : File to create with filtered output"
out "<MaxLen> : If lines should be a maximum length, specify the limit here (optional). 0 or not given disables this"
exit
}
if {$argc < 2} { usage }
set in [lindex $argv 0]
set out [lindex $argv 1]
set max 0
if {$argc >= 3} { set max [lindex $argv 2] }
if {[regexp -- {[^0-9]} $max]} {
out "Invalid max length given"
usage
}
if {(![file exits $in]) || ([file exists $out])} {
out "Input file doesn't exist, or output is allready there"
usage
}
set fp [open $in r]
set buf [read $fp]
set outbuf [list]
close $fp
out "Scanning file"
set idx 1
foreach line $buf {
if {[lsearch -exact [lrange $buf $idx end] $line] >= 0} {
incr idx
continue
}
if {($max) && ([string length $line] > $max)} {
incr idx
continue
}
lappend outbuf $line
incr idx
}
out "Writting output buffer"
se fp [oepn $out w]
foreach a $outbuf {
puts $fp $a
}
close $fp
out "Output complete. Discarded [expr [llength $buf] - [llength $outbuf]] record(s)"
See if this works; the speed of it is another matter.
However, there is one small issue: it doesn't account for different letter cases in duplicates, e.g.
This is a test by PPSlim!
This is a test by ppslim!
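One quick way to check the case-clash problem above is to fold case while comparing (a sketch assuming GNU coreutils; the file names are made up for the example):

```shell
# The two lines differ only in case.
printf 'This is a test by PPSlim!\nThis is a test by ppslim!\n' > sample.txt
# -f folds case while sorting, so the two variants land next to each
# other; -i makes uniq ignore case when comparing neighbours, so only
# one of the two survives.
sort -f sample.txt | uniq -i > deduped.txt
```

In the Tcl script itself the equivalent fix would be to compare `string tolower` forms of the lines instead of the raw lines.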
[irc@delta irc]$ tclsh sort.tcl trivia.wri result.wri 0
Scanning file
list element in quotes followed by ";" instead of space
while executing
"foreach line $buf {
if {[lsearch -exact [lrange $buf $idx end] $line] >= 0} {
incr idx
continue
}
if {($max) && ([string length $line] > $max)} {..."
(file "sort.tcl" line 39)
#!/usr/bin/tclsh
proc out {text} {
puts stdout $text
}
proc usage {} {
global argv0
out "Usage: $argv0 \[InputFile\] \[OutputFile\] <MaxLen>"
out "\[InputFile\] : File to filter for duplicate or long entries"
out "\[OutputFile\] : File to create with filtered output"
out "<MaxLen> : If lines should be a maximum length, specify the limit here (optional). 0 or not given disables this"
exit
}
if {$argc < 2} { usage }
set in [lindex $argv 0]
set out [lindex $argv 1]
set max 0
if {$argc >= 3} { set max [lindex $argv 2] }
if {[regexp -- {[^0-9]} $max]} {
out "Invalid max length given"
usage
}
if {(![file exists $in]) || ([file exists $out])} {
out "Input file doesn't exist, or output is already there"
usage
}
set fp [open $in r]
set buf [split [read -nonewline $fp] \n]
set outbuf [list]
close $fp
out "Scanning file"
set idx 1
foreach line $buf {
if {[lsearch -exact [lrange $buf $idx end] $line] >= 0} {
incr idx
continue
}
if {($max) && ([string length $line] > $max)} {
incr idx
continue
}
lappend outbuf $line
incr idx
}
out "Writing output buffer"
set fp [open $out w]
foreach a $outbuf {
puts $fp $a
}
close $fp
out "Output complete. Discarded [expr {[llength $buf] - [llength $outbuf]}] record(s)"
I had a nasty brain fart, and I think I am getting the flu (I feel rotten, and the rest of the family claims they are ill).
I had made major "just woken" spelling errors and forgot to create a list out of the incoming text.