Friday, April 25th, 2014

Python for Data Analysis
Python for Data Analysis by Wes McKinney
My rating: 5 of 5 stars

In my office, we spend a lot of time in the database. As such, we tend to become fairly adept at analyzing data with SQL: join some tables on interesting columns, group by other interesting columns, sprinkle in some aggregates, and pretty soon you have yourself a table of answers.

The relatively new windowing functions added in SQL Server 2012 let you do even fancier analysis (at the risk of needing to understand some new syntax).

Yet, sometimes, a raw table of SQL results just isn’t enough. You might want to have access to some interesting statistical functions like the standard deviation or variance. You might want to correlate your data with other data available via a web service or just create a nice-looking chart.

In these cases, SQL may not be the best choice. In recent years, Python has become an extremely popular language for doing data analysis. With libraries like pandas, numpy, and matplotlib, Python has rapidly become a credible challenger to established special-purpose environments like R and SAS. Combined with these libraries, the dynamic ease-of-use of Python becomes perfect for the sort of data analysis tasks we sometimes find ourselves trying to approach.

This book is an excellent introduction to the subject. It’s written by the creator and lead developer on the pandas project, which provides the table-like data structures that make this sort of thing so comfortable for us SQL developers.
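To give a feel for why those table-like structures are comfortable for SQL developers, here is a minimal sketch of the join, group, and aggregate routine in pandas. It is my own illustration rather than an example from the book: the `orders` and `customers` tables, their columns, and their values are all made up, and it assumes a reasonably recent pandas install.

```python
import pandas as pd

# Hypothetical tables standing in for the results of two database queries.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 10, 20, 20],
    "amount": [100.0, 200.0, 50.0, 150.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 20],
    "region": ["East", "West"],
})

# Roughly: SELECT ... FROM orders o JOIN customers c ON o.customer_id = c.customer_id
joined = orders.merge(customers, on="customer_id", how="inner")

# Roughly: GROUP BY region, with SUM and AVG plus the sample standard
# deviation and variance.
summary = joined.groupby("region")["amount"].agg(["sum", "mean", "std", "var"])
print(summary)
```

The result is itself a small table, indexed by region with one column per aggregate, so it can be joined, filtered, or charted further.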

By focusing on data analysis, McKinney assumes no real knowledge of Python programming and even provides a “Python Language Essentials” appendix at the back of the book. Assuming that you’re already a competent programmer in Java, C#, VB.Net, or some other procedural language, you can easily pick up the basics of Python just by working through the examples in the book.

And that’s highly recommended because the examples are a lot of fun! It’s fun to grab stock data from Yahoo, run it through some computations, and then graph the result. McKinney strongly encourages the use of the IPython shell and I echo that (especially the use of the notebook). It adds a strong sense of interaction to the standard Python REPL; indeed, it feels a lot like tweaking queries and hitting F5 in SQL Server Management Studio. IPython notebooks are now easily my favorite way to experiment with new programming methods.
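As a rough sketch of the "run it through some computations" step, the snippet below works on a handful of made-up closing prices instead of a real download, so it runs without a network connection. The numbers, dates, and variable names are my own assumptions, not the book's example, and the `.rolling()` call assumes a recent pandas release.

```python
import pandas as pd

# Made-up closing prices for illustration only; not real market data.
close = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 107.0],
    index=pd.date_range("2014-04-21", periods=5, freq="D"),
    name="close",
)

# Day-over-day percentage change; the first value is NaN because
# there is no previous day to compare against.
daily_return = close.pct_change()

# Three-day moving average; the first two values are NaN until the
# window has filled.
rolling_mean = close.rolling(window=3).mean()

result = pd.DataFrame({
    "close": close,
    "daily_return": daily_return,
    "rolling_mean": rolling_mean,
})
print(result)
```

With matplotlib installed, calling `result["close"].plot()` in a notebook should draw the line chart.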

Working through this book is a great way to spend a weekend. And when you’re done, you’ll be able to dazzle your designers and product owners when they ask for data about something. Instead of making them squint at a grid of SQL results, you can hand them charts and graphs that will make it look like you spent hours in Excel.
