Data OOO

by zenquaker

I am now one week into my new exercise habit. As usual, I am starting out small, with stretching, pushups, and crunches. Now that it’s been a week, I will ramp it up and add bicycling afterwards. But more importantly, I am also one week into my Quantified Self habit of collecting data on myself.

Again, as usual, I am starting out small. At the moment I am using a phone app to log my weight in the morning, my pulse before and after exercise, whether I stretched, how many pushups, and how many crunches. I will also be tracking how long my bicycle rides are. For now I’m tracking the length of the ride in time, but the plan is to set up specific routes and measure their distance using Google Maps. Together with the ride time, that distance will give me average speed. That’s about ten data points a day. In those terms it doesn’t seem like starting out small, but it really is. Of course, that’s not all I’m doing. There’s other data I was already tracking in a loose sense, from my budget tracking to my web browsing history. I’m still tracking that, and I’m being a bit more careful about it so that I don’t lose any of it.
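The speed calculation is trivial, but for the record, here’s a sketch of it (the route distance and ride time below are made-up numbers, since the actual routes aren’t measured yet):

```python
def average_speed(distance_miles: float, ride_minutes: float) -> float:
    """Average speed in mph, from a route's measured distance and the logged ride time."""
    if ride_minutes <= 0:
        raise ValueError("ride time must be positive")
    return distance_miles / (ride_minutes / 60.0)

# e.g. a hypothetical 7.5-mile route ridden in 30 minutes:
print(average_speed(7.5, 30))  # 15.0 mph
```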

That’s what I’m tracking, but that doesn’t cover how I’m tracking it. I’ve got a phone app, a desktop app that runs on two different computers, spreadsheets that I’m in the process of converting to relational databases, stuff in proprietary online databases, an email program running on two machines, and two different web browsers. It’s a mess. So I decided to design one database to bind them all. Of course this will require numerous bespoke programs to find them all and in the darkness bind them.

I did some calculations, looking at what I was planning on tracking and the current volume of events in each category. I have no idea how long I’m going to track data like this, but since it’s data about my life the outer limit is when I die. I expect to live to around 80 to 100, so I could be collecting data for 40-60 years. Given that range, I could expect between 1.6 and 2.4 million data points. So it seemed that I should put some thought into the design of this database.
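The arithmetic behind that range works out to roughly 110 events a day once everything (exercise logs, budget entries, browsing history, and so on) is counted together. That per-day figure is my own back-of-the-envelope assumption; a quick sketch:

```python
EVENTS_PER_DAY = 110   # assumed combined rate across all data sources
DAYS_PER_YEAR = 365

def lifetime_points(years: int, per_day: int = EVENTS_PER_DAY) -> int:
    """Total data points accumulated over a given number of years."""
    return years * DAYS_PER_YEAR * per_day

print(lifetime_points(40))  # 1606000, about 1.6 million
print(lifetime_points(60))  # 2409000, about 2.4 million
```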

I deal with databases a lot, but typically in the sense that I have to figure out how to pull data from them. I rarely have to make a database myself, and when I do they’re pretty simple. So I decided to research database normalization, something I don’t generally have to deal with directly. Normalization rules are a formalization of removing redundancy from databases. If databases are redundant, they are bigger than they need to be and run into the database corollary of Parkinson’s law (data expands to fill the space available for storage). Redundancy also invites inconsistency: update one copy of a repeated fact and miss another, and the database contradicts itself.
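To see what removing redundancy buys you, here’s a toy sketch in SQLite (the table names and readings are invented, not my actual schema). The flat version repeats each metric’s name and unit on every row; the normalized version stores that description once, so a change in one place shows up everywhere:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: "pulse"/"bpm" gets re-stored with every single reading.
cur.execute("CREATE TABLE flat_log (day TEXT, metric TEXT, unit TEXT, value REAL)")
cur.executemany("INSERT INTO flat_log VALUES (?, ?, ?, ?)",
                [("2013-06-01", "pulse", "bpm", 72),
                 ("2013-06-02", "pulse", "bpm", 70)])

# Normalized: the metric's description lives in exactly one row.
cur.execute("CREATE TABLE metric (id INTEGER PRIMARY KEY, name TEXT, unit TEXT)")
cur.execute("""CREATE TABLE observation (
                   day TEXT,
                   metric_id INTEGER REFERENCES metric(id),
                   value REAL)""")
cur.execute("INSERT INTO metric (name, unit) VALUES ('pulse', 'bpm')")
metric_id = cur.lastrowid
cur.executemany("INSERT INTO observation VALUES (?, ?, ?)",
                [("2013-06-01", metric_id, 72),
                 ("2013-06-02", metric_id, 70)])

# Rename the unit once, and every reading picks it up via the join.
cur.execute("UPDATE metric SET unit = 'beats/min' WHERE id = ?", (metric_id,))
rows = cur.execute("""SELECT o.day, m.name, m.unit, o.value
                      FROM observation o
                      JOIN metric m ON m.id = o.metric_id
                      ORDER BY o.day""").fetchall()
print(rows)
```

In the flat table, the same rename would mean touching every historical row, and missing any of them leaves the data contradicting itself.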

I worked through that over the past couple weeks and came up with a database structure involving thirteen different tables. I normalized through fourth normal form, although I decided not to do fifth normal form. Fifth normal form would have required splitting the data into different pieces based on the type of data being collected, which seemed contrary to the point of having all the different data types in one place. Skipping fifth normal form really just means I’ll have to be careful in the documentation and applications for the database. That shouldn’t be a big problem since I’m the only one that’s going to use it.
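Fourth normal form, the last one I applied, is about independent multivalued facts. A toy illustration (the activity and its facts here are hypothetical, not my real tables): if an activity’s valid metrics and its scheduled days are stored in one table, every combination has to appear, or the table falsely implies some metric only applies on some days.

```python
from itertools import product

# Two independent multivalued facts about one activity (made-up data):
metrics = {"cycling": ["duration", "distance"]}
days = {"cycling": ["Mon", "Wed", "Fri"]}

# One combined table must hold the full cross product of the two facts.
combined = [("cycling", m, d)
            for m, d in product(metrics["cycling"], days["cycling"])]
print(len(combined))  # 6 rows to state 2 + 3 independent facts
```

Splitting into two tables, one for metrics and one for days, states the same five facts in five rows, which is what decomposing to fourth normal form accomplishes.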

I found out something interesting learning about database normalization. The Sharp Floor program which has been causing so much gnashing of teeth at work includes a database that is a mess. The database it was replacing had 13 main tables and some ancillary tables bringing the full count up to a couple dozen, and included data from four different sources. The new database, with some added features and only using three of the data sources (and supposedly removing distinctions between them), has almost 300 tables. Which are poorly documented.

At work we figured this was because the contractors sub-contracted the database design to Microsoft, and they have a reputation for totally normalizing tables. In researching normalization rules, I ran across a Microsoft website that includes best practices for database design. They have a whole host of guidelines that go beyond database normalization. Interestingly, most of them seem to be ways to avoid documenting peculiarities in the data by putting that information into the structure of the database. Sort of like an expansion of the fifth normal form I skipped. Having the information in the structure of the database was apparently their excuse for not documenting it in a form that someone who wasn’t familiar with Microsoft’s “best practices” could understand. Job security must be nice.

Anyway, there’s still the question of designing the applications to feed the data from the various other data sources into the master database I have designed. I’ll talk more about those when I’ve had a chance to write them.