Advanced Data Analytics Using Python: With Machine Learning, Deep Learning and NLP Examples (2023)
Chapter 2
ETL with Python (Structured Data)
out = out + "," + str(email.get_payload()).replace("\n", " ").replace(",", " ")
print(out, "\n")
# Save the settings to the config file named in sys.argv[1]
conf = open(sys.argv[1], 'w')
conf.write("path=" + config["path"] + "\n")
conf.write("folder=" + config["folder"] + "\n")
for usr in users.keys():
    conf.write("name=" + usr + ",value=" + users[usr] + "\n")
conf.close()
Sample config file for the above code:
path=/cygdrive/c/share/enron_mail_20110402/enron_mail_20110402/maildir
folder=Inbox
name=storey-g,value=142
name=ybarbo-p,value=775
name=tycholiz-b,value=602
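Reading such a file back is the mirror image of the writing code. The following is a minimal sketch, not the book's own loader: the function name parse_config is hypothetical, each entry is assumed to sit on a single line, and values are kept as strings because the meaning of "value" is not stated here.

```python
def parse_config(lines):
    """Parse lines in the sample format into (config, users) dictionaries."""
    config = {}
    users = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("name="):
            # Per-user entry, e.g. "name=storey-g,value=142"
            name_part, value_part = line.split(",", 1)
            user = name_part.split("=", 1)[1]
            users[user] = value_part.split("=", 1)[1]
        else:
            # Plain "key=value" entry such as path or folder
            key, value = line.split("=", 1)
            config[key] = value
    return config, users


sample = [
    "path=/cygdrive/c/share/enron_mail_20110402/enron_mail_20110402/maildir",
    "folder=Inbox",
    "name=storey-g,value=142",
    "name=ybarbo-p,value=775",
    "name=tycholiz-b,value=602",
]
config, users = parse_config(sample)
print(config["folder"], sorted(users))
```

Splitting on the first "=" only keeps any later "=" characters inside a value intact.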
Topical Crawling
Topical crawlers are intelligent crawlers that retrieve information on a
given topic from anywhere on the Web. A topical crawler starts from a seed
URL, finds the links in that page, and then follows the new URLs in turn.
Because the crawling process can be distributed across users, queries, and
even client computers, topical crawlers bypass the scalability limitations
of universal search engines. At each step, a crawler can use the available
context to decide which links to follow, and it keeps looping through them
with the goal of systematically locating highly relevant, focused pages.
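The link-following loop can be illustrated with a small best-first crawler. This is a minimal sketch under stated assumptions, not the book's implementation: the pages live in a hypothetical in-memory dictionary (TOY_WEB) instead of being fetched over the network, relevance is a simple keyword overlap, and max_pages is an assumed stopping limit.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://example.com/": ("home page about sports and python",
                            ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("python etl tutorial",
                             ["http://example.com/c", "http://example.com/"]),
    "http://example.com/b": ("football scores", ["http://example.com/"]),
    "http://example.com/c": ("python etl with structured data", []),
}


def relevance(text, topic_words):
    """Fraction of the topic words that occur in the page text."""
    words = set(text.lower().split())
    return len(words & topic_words) / len(topic_words)


def topical_crawl(seed, topic_words, max_pages):
    """Best-first crawl: expand the most promising known URL next."""
    topic_words = set(topic_words)
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}              # URLs already queued; prevents endless loops
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        score = relevance(text, topic_words)
        visited.append((url, score))
        for link in links:
            if link not in seen and link in TOY_WEB:
                seen.add(link)
                # A link inherits the score of the page that points to it.
                heapq.heappush(frontier, (-score, link))
    return visited


result = topical_crawl("http://example.com/", ["python", "etl"], max_pages=3)
for url, score in result:
    print(url, score)
```

With the topic words "python" and "etl", the crawl reaches the two ETL pages and never spends its page budget on the football page, which is the behavior a topical crawler aims for.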
Web searching is a complicated task. A large body of machine learning
work is applied to measuring the similarity between pages, and a crawl is
usually bounded by limits such as the maximum number of URLs fetched or
visited.
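As a rough illustration of page similarity, the sketch below compares two pages by the cosine of their word-count vectors. The text does not name a specific measure, so cosine similarity is used here only as a simple stand-in for a learned model, and the sample strings are invented.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page is treated as similar to nothing
    return dot / (norm_a * norm_b)


same = cosine_similarity("python etl", "python etl")
partial = cosine_similarity("python etl tutorial", "python etl")
unrelated = cosine_similarity("python etl", "football scores")
print(same, partial, unrelated)
```

Scores range from 0 for pages with no words in common to 1 for pages with identical word distributions.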