Friday, August 14, 2009

Finding email addresses in a text file

I got stuck with a project recently which entailed pulling email addresses out of a flat text file.
This request was surprisingly trickier than I thought because of all the variations of acceptable email addresses. Anyway, I ended up finding something out on the net that did most of what I needed, so I'm posting it here in case anybody else needs something similar. However, I do recommend eyeballing the output to verify it, because there are a few cases where it has problems. I'll put some more time in on this one later on, but for simple jobs this seems to get it done.


#!/bin/bash
## This is a quickie script to pull email addresses out of a flat text file.
## However, there are some shortcomings and bugs in it that still need
## to be fixed:
## 1) If the email host has more than one dot in the name, the second dot
## and everything after it get lost. For example, with foo@tampabay.rr.com
## whatever comes after the .rr is dropped.
## 2) Sometimes the recipient gets mangled. I haven't quite figured out
## the pattern to this bug, but it will require visual inspection.
##

egrep -o "\w+([._-]\w)*@\w+([._-]\w)*\.\w{2,4}"



As I said, I'm working on a better one of these and I'll post it when I can figure one out.
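
In the meantime, here is a rough sketch of one possible tightening of the pattern above. It rests on an assumption about the cause of the first bug: each ([._-]\w) group only accepts a single character after the separator, so a host like tampabay.rr.com stops matching after .rr. Changing \w to \w+ inside both groups lets every dot-separated piece be any length. The function name extract_emails is made up for illustration, and \w is a GNU grep extension, so this may not behave the same with other greps. It still needs eyes-on verification and will not cover every valid address (top-level names longer than four letters, for instance).

```shell
#!/bin/bash
# extract_emails is a hypothetical helper name, not part of the original script.
# It reads text on standard input and prints one matched address per line.
# Assumption: GNU grep, which supports \w in extended regular expressions.
extract_emails() {
    # \w+ inside each group allows multi-character pieces after '.', '_' or '-'
    grep -E -o '\w+([._-]\w+)*@\w+([._-]\w+)*\.\w{2,4}'
}

# Example: a multi-dot host and a dotted recipient in the same line.
printf 'mail foo@tampabay.rr.com or john.smith@example.com today\n' | extract_emails
```

Saved as a script, it can be fed a file with something like: extract_emails < contacts.txt (contacts.txt being a stand-in for your own file).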

Thursday, August 13, 2009

Data scrubbing with sed

This is just a quickie. I recently had to do some cleanup work on a database
that had irregular column separators. There were single tabs, multiple tabs
mixed with single spaces, and multiple spaces. Here's a quick one-liner
in sed that will clean those up and leave you with just single spaces.


#!/bin/bash
# First replace each tab with a space, then collapse runs of spaces into one.
# Note: \t in the pattern relies on GNU sed.
sed -e 's/\t/ /g' -e 's/  */ /g' "$1"
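
If GNU sed isn't available, the same cleanup can be sketched with tr instead. This is a swapped-in technique, not the sed approach above: the first tr turns every tab into a space, and tr -s then squeezes each run of spaces down to one. The function name squeeze_whitespace is made up for illustration, and it reads standard input rather than taking a file name as $1.

```shell
#!/bin/bash
# squeeze_whitespace is a hypothetical helper name, not from the original script.
# It reads standard input, converts tabs to spaces, then squeezes runs of
# spaces into a single space.
squeeze_whitespace() {
    tr '\t' ' ' | tr -s ' '
}

# Example: one row with mixed tab and space separators.
printf 'id\t\tname   city \t zip\n' | squeeze_whitespace
# prints: id name city zip
```

To run it over a file, redirect the file in: squeeze_whitespace < dump.txt (dump.txt being a stand-in for your own data file).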