At Fyle, we record a lot of information around product usage. This information is in different silos — we have our backend database, but we also have substantial amount of data in Mixpanel, Intercom and log files. For a comprehensive understanding of how users are using the product, we needed to be able to merge data from all of these sources, massage it and visualize the results in an internal dashboard. In this blog, I’ll explain our approach to solving this problem and, hopefully, you can map it to your use-case.
Our requirements for a dashboarding solution were:
- Must work across data sources — databases, csv, log files, external services. Must be able to clean/pre-process bad data.
- Must be able to iterate quickly. Oftentimes, someone in the team gets an idea over watercooler chat — “Hey, it would be interesting to see what % users create more than one expense in a session” or “Do users that use feature X tend to be power users?” Some of the questions may be dud, but some end up revealing something unexpected. We want to be able to try out new analyses quickly.
- Must be accessible to non-developers.
- Must be inexpensive.
- Must be able to do complex analysis.
We considered a few alternative approaches, but ended up discarding them.
- We considered building an internal site with flask backend and visualization using D3.js. However, this seemed super heavy-weight. Adding a new chart or new analysis would involve significant engineering effort.
- Excel / Google Sheets— Don’t laugh :) Excel actually came pretty close — everybody is familar with them. However, connecting disparate data sources and cleaning up data from unstructured sources like log files seemed like significant effort.
- BI tools — BI tools tend to have a big learning curve and their limitations become apparently only after you’ve spent a considerable amount of time with them. Furthermore, the problem of cleaning data and doing custom analytics still remained.
Jupyter is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. It is a spin-off from a project called IPython. This post is not a tutorial on Jupyter, there are many, many excellent sites and articles devoted to the subject. What I will do is explain why Jupyter was a good fit for our problem.
Jupyter allows you to create “notebooks” in Python (and many other languages). You can write python code to pull data from databases using libraries like psycopg, read from log files and clean up the data, read CSV files using csv module and make REST API calls to external services etc using requests. You can use libraries like pandas, numpy, scikit-learn to do sophisticated analysis of the data. Finally, you can use matplotlib for creating visualizations.
Here’s an example of a public notebook. It is a live document — it contains data and results and text describing the analysis, but also contains the code that produced the results. My mind was blown the first time I saw a notebook in action. You can find a number of public examples here and here. After exploring a few examples like this, I was convinced that Jupyter notebooks were a good option for us.
So what’s the effort involved?
Familiarity with Python helps a lot — Jupyter supports other languages as well, but I haven’t tried them out.
Setting up Jupyter on your machine — MacOS or Linux takes very little time. Pandas and dataframes takes a little getting used to, especially if you come from the SQL world, like me. Operations like joins, groupby, windowing and sessionization feel a little awkward initially. I’d recommend spending a little bit on dataframes to avoid frustration. Stackoverflow is your friend.
It took me about 4 hours to produce my first couple of charts pulling in data from our database. However, adding a new analysis/visualization typically takes 10–20 mins. This fits our requirement of being able to ask new questions quickly.
Hope this post inspires you to consider Jupyter notebooks for your internal analytics needs. In a subsequent post, I’ll talk about some specific insights on what Fyle users were doing using Jupyter notebooks and how it helped us improve the product. As always, comments and feedback are welcome.