|
|
Hi, I've been banging my head against a wall trying to figure this out and nothing seems to work.
I have two files: File A contains a bunch of student information and File B contains email aliases. File A has about 15,000 lines, with each line being a different student. File B has about 50,000 lines, of which about 49,000 are aliases. The first 1,000 lines or so in File B are miscellaneous text, but they are required.
File A is in the following format:
year;FirstName;LastName;LastName (again);firstname.lastname;firstname.lastname (again);code
File B is in the following format:
misc
misc
somename:somename@domain.com
firstname.lastname:firstname.lastname@studentdomain.com
What I need to do is this:
1) Read line from File A
2) Get firstname.lastname
3) Search in File B for firstname.lastname
4) If found, replace studentdomain.com with domain.com
5) Do until all names from File A are completed.
I know of an easy way to do it, but that would mean reading and writing File B 15,000 times. Not efficient enough. I've been trying to come up with something that can do at least 500-1000 users/hr.
I've tried a bunch of things and nothing works. I get duplicate entries and all kinds of nasties due to the nested loops.
Any help would be greatly appreciated.
Thank you.
|
|
I would read File B into a string array and, while doing so, check each line to see if it starts with firstname.lastname:firstname.lastname@. If it does, use the firstname.lastname as the key into a hashtable where the value is the line index, e.g.:
$ht = @{}
$fileb = Get-Content fileb.txt | %{$i=0}{if ($_ -match '^(\w+\.\w+):\1@') { $ht[$matches[1]] += @($i) }; $i++; $_}
Now scan through File A, pull out firstname.lastname, and use that to index into the hashtable. That will give you back an array of line indices (I'm assuming each person can have more than one alias). Go patch up the lines at those indices in the array and, when you're done, save the array back out to a new file.
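The second half of that (scan File A, patch the indexed lines, save) could look roughly like the sketch below. It is only a sketch under some assumptions: the function name Update-AliasDomain and its parameter names are made up for illustration, firstname.lastname is taken from the fifth semicolon-separated field of File A, and the index is built with a plain for loop instead of the pipeline one-liner above.

```powershell
# Hypothetical helper (name and parameters are illustrative, not from the thread).
function Update-AliasDomain {
    param(
        [string]$FileAPath,   # student list, semicolon-separated
        [string]$FileBPath,   # alias file to patch
        [string]$OutPath      # new file to write; File B itself is left untouched
    )

    # Read File B once and index alias lines by their firstname.lastname key.
    $fileb = @(Get-Content $FileBPath)
    $ht = @{}
    for ($i = 0; $i -lt $fileb.Count; $i++) {
        if ($fileb[$i] -match '^(\w+\.\w+):\1@') {
            # A name may appear on several lines, so keep an array of indices.
            $ht[$matches[1]] += @($i)
        }
    }

    # Read File A once; field 5 (index 4) is assumed to hold firstname.lastname.
    foreach ($line in Get-Content $FileAPath) {
        $name = $line.Split(';')[4]
        if ($name -and $ht.ContainsKey($name)) {
            foreach ($idx in $ht[$name]) {
                $fileb[$idx] = $fileb[$idx] -replace '@studentdomain\.com', '@domain.com'
            }
        }
    }

    # Write the patched lines out in a single pass.
    $fileb | Set-Content $OutPath
}

# Example call (file names are placeholders):
# Update-AliasDomain -FileAPath filea.txt -FileBPath fileb.txt -OutPath fileb.new.txt
```

Each file is read once and the output is written once, so the 15,000 lookups are hashtable hits rather than 15,000 passes over File B.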
HTH -- Keith
|
|
Thanks for the quick response. I'll try what you suggested and let you know how it works out.
|
|
Here's another way. First collect the students from FileA.txt and store the unique ones in an array. Then read FileB.txt and apply the change wherever a line matches one of the students. Finally, pipe all lines to a new file. There are four different methods to accomplish the second step:
# using UTF8 encoding in these samples
# extract a list of unique students, good for all four methods
gc FileA.txt -en utf8 | % {$z = @()} {
  $x,$y = $_.split(';')[4,5]
  if ($x.length -and $x -eq $y -and $z -notcontains $x) {$z += ,$x}
}

# here are the four different methods
# method A - Switch statement, -match and -contains operators
$(switch -file FileB.txt {
  {$_ -match '^(\w+\.\w+):\1@' -and $z -contains $matches[1]} {$_ -replace '(?<=@)student(?=domain)'}
  default {$_}
}) | sc FileC_A.txt -en utf8

# method B - Switch statement, -match operator against an ORed RegEx
# build the ORed and escaped RegEx pattern, good for methods B and C
$pat = &{
  $ofs = '|'
  "$($z | sort | % {[regex]::escape($_)})"
}

$(switch -regex -file FileB.txt {
  "($pat):\1@" {$_ -replace '(?<=@)student(?=domain)'}
  default {$_}
}) | sc FileC_B.txt -en utf8

# method C - Get-Content, -match operator against the ORed RegEx
gc FileB.txt -en utf8 | % {
  if ($_ -match "($pat):\1@") {
    $_ -replace '(?<=@)student(?=domain)'
  } else {$_}
} | sc FileC_C.txt -en utf8

# method D - Get-Content, -match and -contains operators
gc FileB.txt -en utf8 | % {
  if ($_ -match '(\w+\.\w+):\1@' -and $z -contains $matches[1]) {
    $_ -replace '(?<=@)student(?=domain)'
  } else {$_}
} | sc FileC_D.txt -en utf8
I wonder which is faster, including Keith's?... :)
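One way to find out would be to wrap each method in a script block and time it with Measure-Command. The helper below is just a sketch: Measure-Average is a made-up name, and averaging over three runs is an arbitrary choice to smooth out disk caching.

```powershell
# Hypothetical helper: run a script block several times and return the
# average elapsed time in milliseconds.
function Measure-Average {
    param(
        [scriptblock]$Script,
        [int]$Runs = 3
    )
    $total = 0.0
    for ($n = 0; $n -lt $Runs; $n++) {
        # Measure-Command discards the block's output and returns a TimeSpan.
        $total += (Measure-Command -Expression $Script).TotalMilliseconds
    }
    $total / $Runs
}

# Example (assumes FileB.txt exists in the current directory):
# Measure-Average { gc FileB.txt -en utf8 | % { $_ } | sc FileC_test.txt -en utf8 }
```

The numbers will depend on the machine and on file sizes, so it is worth timing against the real 50,000-line file rather than a small sample.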
-- Kiron
|
|
|