I regularly listen to get-rich-entrepreneurship podcasts. They make me feel good and get me excited about innovating. Last week, I was listening to James Altucher’s podcast episode #177 with Ramit Sethi. At one point Ramit (totally paraphrasing) says your passion doesn’t fall down from the sky one day… it’s always with you and you need to find it. I ask people, what are you reading on Saturday morning when everyone else is still sleeping? That’s probably your passion. It made me think that I can probably data mine my passion in browser history and email text. I know qualitatively what I love to do (aside from family time & running), but I always want some quantitative confirmation. This is how I attacked my browser history and then my giant box of gmails.
Chrome History
I start here – http://superuser.com/questions/602252/can-chrome-browser-history-be-exported-to-an-html-file. That post, which uses the terminal (or command prompt) and SQLite, helps me export my Chrome browser history to a CSV file for the period after January 1, 2016. Then a few simple lines of Python, using pandas and the built-in urlparse module, let me discover my most visited URLs this year…
import pandas as pd
import numpy as np
from urlparse import urlparse  # Python 2; on Python 3 this is: from urllib.parse import urlparse

data = pd.read_csv('history.csv')

# define function to get base url and add column to dataframe
def getUrl(url):
    parse_object = urlparse(url)
    return parse_object.netloc

data['cleanurl'] = np.vectorize(getUrl)(data['url'])

# count visits to each url
data['cleanurl'].value_counts()
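For anyone following along on Python 3, here is a minimal sketch of the same counting step. It is not the code I ran at the time; the sample rows are made up to stand in for history.csv, and the only thing assumed about the real file is that it has a 'url' column.

```python
from urllib.parse import urlparse

import pandas as pd


def get_domain(url):
    """Return the host part of a URL, e.g. 'github.com'."""
    return urlparse(url).netloc


# Hypothetical sample standing in for history.csv (only a 'url' column is assumed).
history = pd.DataFrame({
    "url": [
        "https://github.com/pandas-dev/pandas",
        "https://github.com/numpy/numpy",
        "https://www.coursera.org/learn/machine-learning",
    ]
})

# Add a column with the host of each URL, then count visits per host.
history["cleanurl"] = history["url"].map(get_domain)
domain_counts = history["cleanurl"].value_counts()
print(domain_counts.head(20))
```

Using Series.map here instead of np.vectorize is just a style choice; both apply the function to each URL one at a time.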
Result from top 20 (cleaned up with Google Sheets)…
Ok. The first five links don’t count to me… I check those sites every five minutes (which is a problem in its own right). After that, I see two rather distinct categories: data science / computing / programming and learning (all highlighted in yellow). It’s telling, but is my passion one or the other… or is it a combination? Learning data science? Data science learning? Learning with computers? This is a good start… maybe my emails can tell me more.
Gmail Mining
For this one I start here – http://engineroom.trackmaven.com/blog/monthly-challenge-natural-language-processing/. This is a bit more challenging because I am working with a previously unfamiliar file format (mbox) and the file is 5.6GB… not huge but no longer small. Fortunately, Fletcher Heisler’s post does a fantastic job helping me get from gmails to mbox to pandas dataframe.
The resulting dataframe has 113,776 rows and 8 columns, including
u'subject', u'body', u'from', u'to', u'date', u'labels', u'epilogue'
and the date range is between January 2007 and September 2016. In order to get some telling information from the data I look at A) who are the senders of emails I am actually reading and B) a high level NLP analysis of the subject and body text of emails I’m reading.
To get started, I immediately do four things to the dataframe:
- Filter only read emails. The column named ‘labels’ tells me whether a message is read, unread, sent, chat, etc. Obviously, I’m most interested in the emails that I actually open.
- Parse the date as datetime objects so I can analyze changes over time.
- Create a new column that transforms the ‘from’ column to only the domain of that sender (i.e. ‘frank@gmail.com’ becomes ‘gmail.com’).
- Filter out the most common domains, such as
['gmail.com', 'yahoo.com', 'aol.com', 'hotmail.com', 'us.af.mil']
That last one is because my wife is in the US Air Force and if I don’t remove it the analysis is all about our email communication : )
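The four steps above can be sketched roughly as follows. This is an illustration rather than the notebook code: it assumes 'labels' holds a comma-separated string, that the labels 'Unread', 'Sent' and 'Chat' mark messages to skip, and that 'date' holds the raw email Date header. The helper names and the sample rows are hypothetical.

```python
from email.utils import parseaddr, parsedate_to_datetime

import pandas as pd

COMMON_DOMAINS = ["gmail.com", "yahoo.com", "aol.com", "hotmail.com", "us.af.mil"]
SKIP_LABELS = {"unread", "sent", "chat"}  # assumption about how Gmail labels appear


def is_read(labels):
    """True when none of the skipped labels appear in a comma-separated label string."""
    if not isinstance(labels, str):
        return False
    found = {label.strip().lower() for label in labels.split(",")}
    return not (found & SKIP_LABELS)


def parse_date(value):
    """Parse an email Date header; return None when it cannot be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


def sender_domain(sender):
    """'Frank <frank@gmail.com>' becomes 'gmail.com'; '' when no address is found."""
    if not isinstance(sender, str):
        return ""
    address = parseaddr(sender)[1]
    return address.rpartition("@")[2].lower() if "@" in address else ""


def prepare(emails):
    read = emails[emails["labels"].map(is_read)].copy()                    # 1. read emails only
    read["date"] = pd.to_datetime(read["date"].map(parse_date), utc=True)  # 2. parse dates
    read["domain"] = read["from"].map(sender_domain)                       # 3. sender domain
    return read[~read["domain"].isin(COMMON_DOMAINS)]                      # 4. drop common domains


# Hypothetical sample rows for illustration.
sample = pd.DataFrame({
    "from": ["Frank <frank@gmail.com>", "Coursera <no-reply@coursera.org>", "news@ft.com"],
    "date": ["Mon, 05 Sep 2016 08:00:00 +0000",
             "Tue, 06 Sep 2016 09:30:00 +0000",
             "Wed, 07 Sep 2016 06:00:00 +0000"],
    "labels": ["Inbox", "Inbox,Important", "Inbox,Unread"],
})
cleaned = prepare(sample)
print(cleaned[["date", "domain"]])
```

In this sample the gmail.com row is dropped by the domain filter and the ft.com row is dropped because it is unread, so only the coursera.org row survives.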
Then I look at the domains of the senders that I’m opening (reading) most frequently by year. I highlight in yellow the domains that are definitely data science and/or learning related.
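One way to get the most-read sender domains per year is a groupby on the year followed by value_counts. This is a sketch under assumptions: it expects a dataframe with a datetime 'date' column and a 'domain' column, and the function name and sample rows are hypothetical.

```python
import pandas as pd


def top_domains_by_year(emails, n=10):
    """Return the n most frequent sender domains for each year."""
    years = emails["date"].dt.year.rename("year")
    # value_counts on a grouped column sorts each year's domains by frequency.
    counts = emails.groupby(years)["domain"].value_counts()
    # Keep the first n rows (the most frequent domains) within each year.
    return counts.groupby(level="year").head(n)


# Hypothetical sample rows for illustration.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2012-03-01", "2012-04-01", "2012-05-01",
                            "2016-01-15", "2016-02-15"]),
    "domain": ["ft.com", "ft.com", "nyac.com", "coursera.org", "coursera.org"],
})
top = top_domains_by_year(sample, n=1)
print(top)
```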
Over the past five years, I’ve gotten more and more into data everything and this time-lapse analysis confirms it. In 2012/2013, I was much more interested in financial markets (Thomson Reuters daily briefing and email briefings from Financial Times) and running (stanfordalumni.org was my running coach at NBSV, NYAC.com was my running coach at New York Athletic Club, and optonline.com is the current running coach at Iona College). But then, in 2013, I discovered Coursera and R programming. I rarely receive an email from a data-related entity that I don’t open and at least skim through. The most common word in the subject line of my emails over the past few years is ‘data.’
In the table above I again highlight in yellow the words that are definitely data science and/or learning related. In 2013, I took an online course called ‘Maps and the Geospatial Revolution.’ In 2012, every morning I was reading the European version of a financial markets newsletter called ‘The Morning Benchmark’ that arrived in my inbox at 6AM. A clearer picture starts to form when I explore frequent bigrams (pairs of adjacent words that appear together most often) in the body text of my emails over the past three years. Here are the top 20…
[(('data', 'science'), 0.0017113140105257248),
 (('bg', 'bg'), 0.0013495052160767866),
 (('mountain', 'view,'), 0.0008504586030437685),
 (('view,', 'ca'), 0.0008296649941673928),
 (('read', 'more:'), 0.0007963952199651915),
 (('big', 'data'), 0.0007340143933360642),
 (('rights', 'reserved.'), 0.0007236175888978764),
 (('data', 'analyst'), 0.0006113321009654473),
 (('new', 'york'), 0.0005739036049879709),
 (("o'reilly", 'media,'), 0.0005739036049879709),
 (('copyright', '(c)'), 0.0005572687178868703),
 (('machine', 'learning'), 0.00054895127433632),
 (('reply', 'directly'), 0.0005468719134486825),
 (('new', 'york,'), 0.0005052846956959309),
 (('data', 'scientist'), 0.0004782530041566424),
 (('data', 'visualization'), 0.00047617364326900483),
 (('visit', 'support'), 0.00047409428238136725),
 (('4', 'et'), 0.00047201492149372967),
 (('@', '4'), 0.00047201492149372967),
 (('briefing', 'room'), 0.0004636974779431794)]
Some of the above bigrams make sense and others don’t, primarily because the bigrams occur in the header or footer of a commonly read newsletter. The ones that stand out to me are data science, big data, o’reilly media, machine learning, data scientist, and data visualization.
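For readers who want to try something similar, here is a minimal sketch of counting bigrams with only the standard library. It is not the code behind the numbers above: it assumes lowercasing and whitespace tokenization, does no stopword filtering, and scores each bigram as its share of all bigrams counted. The function name and sample texts are hypothetical.

```python
from collections import Counter


def bigram_frequencies(texts, top_n=20):
    """Return the top_n adjacent word pairs with their share of all pairs."""
    counts = Counter()
    total = 0
    for text in texts:
        tokens = text.lower().split()          # naive whitespace tokenization
        pairs = list(zip(tokens, tokens[1:]))  # adjacent word pairs
        counts.update(pairs)
        total += len(pairs)
    if total == 0:
        return []
    return [(pair, count / total) for pair, count in counts.most_common(top_n)]


# Hypothetical email bodies for illustration.
bodies = ["Data science and big data", "Learn data science today"]
top = bigram_frequencies(bodies, top_n=3)
print(top)
```

Because the tokens are split on whitespace only, punctuation stays attached to words, which is the same effect that produces pairs like ('view,', 'ca') in the list above.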
At this point, I’ve convinced myself (quantitatively) that I’m passionate about data-driven problem-solving and I’m passionate about sharing methodologies for doing that (see Update below). In fact, I recently started as a Udacity Forum Mentor to, hopefully, help others improve their technical skills. Now, James and Ramit, what experiments can I run to see how I can monetize this passion?
Update 9-10-2016: I was listening to Derek Sivers (and James Altucher) earlier this week and something he said lit up a bright light. He was talking about his core value(s). He said his core value is “learning things for the sake of creating things which is for the sake of learning things which then is for the sake of creating things. That loop is a thing to me that should be one word.” And I thought about this and listened to it again and again. For the first time in years I thought that I may not love teaching for the sake of teaching… but I love teaching because it allows me to constantly learn.
It makes sense to me that my passion is learning things for the sake of producing creative data-driven solutions which is for the sake of learning things which then is for the sake of producing creative data-driven solutions. Today, I’m extremely happy with that.
For those wanting to help me improve the aesthetics or efficiency of my code, here is the IPython notebook for the analysis of the gmail dataframe (after following Fletcher Heisler’s post and getting the mbox data into a pandas dataframe)…