Command Center: Extract email addresses from big file

Monday, August 17, 2009

Category: Parsing Data, Perl, Python, scripts — SkyHi @ Monday, August 17, 2009

1. Import the .pst file into Outlook.
2. Export the bounce folder to an Excel file.
3. On Linux, extract the bounce From addresses from the exported text:

grep -C 2 "fatal errors" nurseaug3.txt > nurseaug3a.txt

perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' emails.txt | sort -u > output.txt

find . -name "*.txt" | xargs perl -wne'while(/[\w\.\-]+@[\w\.\-]+\w+/g){print "$&\n"}' | sort -u > output.txt
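For readers who would rather stay in Python, the Perl one-liners above can be approximated as follows. This is only a sketch, not something from the original post: the function name `extract_emails` and the sample lines are made up for illustration, and the pattern is the same loose one used above, so it will also match strings that are not deliverable addresses.

```python
import re

# Same loose pattern as the Perl one-liner: word characters, dots and hyphens
# on both sides of an "@". It is not an RFC-complete validator.
EMAIL_RE = re.compile(r"[\w.\-]+@[\w.\-]+\w+")


def extract_emails(lines):
    """Collect every match from an iterable of lines, de-duplicated and sorted.

    Reading line by line keeps memory use flat on big files; the sorted set
    plays roughly the role of `sort -u` in the shell pipeline.
    """
    found = set()
    for line in lines:
        found.update(EMAIL_RE.findall(line))
    return sorted(found)


# Hypothetical sample input standing in for a bounce export.
sample = [
    "Delivery to bob@example.com failed",
    "fatal errors for alice@example.org and bob@example.com",
]
print("\n".join(extract_emails(sample)))
```

An open file handle can be passed in place of the sample list, for example `with open("emails.txt") as f: extract_emails(f)`.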

Also see:

1. http://lifehacker.com/391205/email-address-extract-grabs-addresses-from-any-file

2. Yeah, that's just a Perl script which takes in a file and checks every word to see if it's a valid email address. It prints it out if so. Here's a commented version:

#!/usr/local/bin/perl -w
use strict;
# that stuff is just to make it a perl script

# Email::Valid is a module to check for valid email addresses
# you can get it from CPAN.org along with tons of other modules
# If you're using perl on windows, i bet activestate has a version.
# The author says that it may be slow on Win32 if you have addresses
# where there is no nameserver to check them against.
use Email::Valid;

# this loops over each line in the input
while (<>) {
    # this loops over each "word" in the line (it splits on whitespace)
    for my $word ( split() ) {
        # if it's a valid address..
        if ( my $address = Email::Valid->address( $word ) ) {
            # print it out.
            print $address, "\n";
        }
    }
}

Put it in a file and call the file "getemails.pl" or something, then send all of your files to it:
./getemails.pl < somefile.txt or cat * | ./getemails.pl and wait for your list of emails to come out. I just tested it and it seems to do pretty well. -Andy
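The word-by-word approach Andy describes can be sketched in Python as well. Python's standard library has no direct counterpart to Email::Valid, so the check below is a deliberately simple regular expression assumed for illustration only; the names `SIMPLE_ADDRESS` and `valid_words` are hypothetical.

```python
import re

# Stand-in for Email::Valid, assumed for this sketch only: a basic
# "local@domain.tld" shape check, not a real validator.
SIMPLE_ADDRESS = re.compile(
    r"[A-Za-z0-9._%+\-]+@[A-Za-z0-9\-]+(?:\.[A-Za-z0-9\-]+)+"
)


def valid_words(lines):
    """Yield each whitespace-separated word that passes the shape check."""
    for line in lines:
        # split() with no arguments splits on runs of whitespace
        for word in line.split():
            # fullmatch requires the whole word to be an address, so a word
            # with brackets or trailing punctuation attached is rejected
            if SIMPLE_ADDRESS.fullmatch(word):
                yield word
```

Because this sketch tests whole words, an address wrapped in angle brackets or followed by a comma is skipped by it; the pattern-scanning one-liners shown earlier pick those up.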

#!/usr/bin/env python
'''
  emailsfromfile.py -- Get all unique email addresses from a file

  by Patrick Mylund Nielsen
  http://patrickmylund.com/projects/emailsfromfile/

  License: WTFPL (http://sam.zoy.org/wtfpl/)
'''

__version__ = '1.1'

import sys
import os
import re
import codecs

# Regular expression matching according to RFC 2822 (http://tools.ietf.org/html/rfc2822)
rfc2822_re = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""
email_prog = re.compile(rfc2822_re, re.IGNORECASE)

def isEmailAddress(string):
    return email_prog.match(string)

def main(filename, separator='\n', encoding=None):
    separator_replace = {
        'space': ' ',
        'newline': '\n',
    }
    if not os.path.isfile(filename):
        raise IOError("%s is not a file." % filename)
    results = set()
    with codecs.open(filename, 'rb', encoding) as f:
        for line in f:
            results.update(email_prog.findall(line))
    for k, v in separator_replace.items():
        separator = separator.replace(k, v)
    print(separator.join(results))

if __name__ == '__main__':
    args = len(sys.argv) - 1
    if 0 < args < 4:
        main(*sys.argv[1:])
    else:
        print("Usage: python %s <filename> [separator] [encoding]" % sys.argv[0])
        print("The default separator is a newline. To separate by space, literally enter 'space' as the separator.")

Usage

python emailsfromfile.py <filename> [separator] [encoding]

The separator and encoding parameters are optional. The separator is a new line and the file encoding is 8-bit ASCII by default. If you want to specify an encoding, you also have to set a separator; to use a new line (the default), specify newline as the separator.

Examples:

python emailsfromfile.py contacts.csv  # returns all email addresses from contacts.csv, displaying one email address per line
python emailsfromfile.py contacts.csv ,  # returns a comma-separated list of all email addresses in contacts.csv
python emailsfromfile.py contacts.csv space  # returns all email addresses from contacts.csv, separated by a space
python emailsfromfile.py contacts.csv ';' > emails.txt  # writes all of the email addresses from contacts.csv, separated by a semicolon, to emails.txt (the semicolon is quoted so the shell does not treat it as a command separator)
python emailsfromfile.py utf8-contacts.csv newline utf-8  # returns all email addresses, one per line, from the UTF-8 encoded file utf8-contacts.csv
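
The separator keywords used in these examples follow the `separator_replace` lookup in the script. As a small standalone sketch of that step (the function name `resolve_separator` is hypothetical and not part of emailsfromfile.py):

```python
def resolve_separator(separator="\n"):
    """Translate the keywords 'space' and 'newline' into real characters.

    Mirrors the replace loop in the script: any other text, such as ',' or
    ';', passes through unchanged.
    """
    keywords = {"space": " ", "newline": "\n"}
    for word, char in keywords.items():
        separator = separator.replace(word, char)
    return separator


# Join a couple of placeholder addresses with the resolved separator.
print(resolve_separator("space").join(["a@example.com", "b@example.com"]))
```

Since the keywords are substituted with `str.replace`, a mixed value such as `,space` would become a comma followed by a space.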

References:
http://www.webmasterworld.com/forum10/1195.htm
http://patrickmylund.com/projects/emailsfromfile/
