Email Web Crawler


This simple Python script lets you scan websites for email addresses.

To begin, set a seed URL as the starting point for the scan.

code:
from collections import deque
urlsToProcess = deque(['https://moz.com/top500'])
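
A deque is a natural fit here because URLs are appended on the right and taken from the left, so pages are crawled in the order they were discovered. The sketch below only illustrates that ordering; the example.com URLs are placeholders, not part of the script.

```python
from collections import deque

# seed url, as in the script
queue = deque(['https://moz.com/top500'])

# links discovered later go to the right end (placeholder urls)
queue.append('https://example.com/a')
queue.append('https://example.com/b')

# popleft() returns the oldest entry first (first in, first out)
first = queue.popleft()
second = queue.popleft()
```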

The main loop, which runs until there are no more URLs, processes each URL and looks for more URLs to feed the deque urlsToProcess.

code:
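# imports needed by this excerpt (Python 3 assumed)
from urllib.parse import urlparse
import requests
from bs4 import BeautifulSoup

while urlsToProcess:
    url = urlsToProcess.popleft()
    processed_urls.add(url)

    urlsFile.write(url + "\n")

    # split the url so relative links can be resolved later
    parts = urlparse(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    print("Crawling site: %s" % url)
    try:
        response = requests.get(url)
    except requests.exceptions.RequestException:
        # skip pages that cannot be fetched
        continue

    # parse the page to find its links
    soup = BeautifulSoup(response.text, "html.parser")

    for anchor in soup.find_all("a"):
        # extract link
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add url to the list
        if link not in urlsToProcess and link not in processed_urls:
            urlsToProcess.append(link)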
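
The manual link handling above treats anything that does not start with '/' or 'http' as a relative path, which also catches links such as mailto: addresses and '#fragment' anchors. A possible alternative, shown here only as a sketch, is to let urllib.parse.urljoin from the standard library do the resolution. The helper name resolve_link is hypothetical and is not part of the script.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def resolve_link(page_url, href):
    """Return an absolute http(s) url for href, or None if it should be skipped.

    Hypothetical helper, not taken from the original script.
    """
    if not href:
        return None
    # urljoin handles '/abs', 'relative', '../up' and '//host/path' forms
    absolute = urljoin(page_url, href)
    # drop the '#fragment' so the same page is not queued twice
    absolute, _fragment = urldefrag(absolute)
    # skip mailto:, javascript:, tel: and similar non-web schemes
    if urlparse(absolute).scheme not in ('http', 'https'):
        return None
    return absolute
```

Inside the anchor loop this could replace the startswith checks: call resolve_link(url, anchor.attrs.get("href")) and append the result only when it is not None.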

Each page is also scanned for email addresses, and any new ones are saved to a file.

code:
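    # still inside the main loop: collect every email address on the page
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))

    for email in new_emails:
        # write each address only the first time it is seen
        if email not in emails:
            emailsFile.write(email + "\n")
            emails.add(email)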
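
To try the extraction step without crawling anything, the same regular expression can be wrapped in a small function. This is a sketch under assumptions: the name extract_new_emails is hypothetical, and lowercasing the addresses is an added choice (so that Bob@Example.com and bob@example.com count as one) that the script itself does not make.

```python
import re

# same pattern as in the script, compiled once
EMAIL_RE = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.I)

def extract_new_emails(text, seen):
    """Return the emails in text that are not yet in seen, and add them to seen.

    Hypothetical helper; lowercasing is an assumption, not the script's behavior.
    """
    found = {address.lower() for address in EMAIL_RE.findall(text)}
    new = found - seen
    seen.update(new)
    return new

seen = set()
first_pass = extract_new_emails("Contact Bob@Example.com or alice@example.org.", seen)
second_pass = extract_new_emails("Contact Bob@Example.com or alice@example.org.", seen)
```

The second call returns an empty set because both addresses are already in seen, which mirrors the "if email not in emails" check above.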

To download the full script, clone or fork it from Bitbucket (zxcoders).