Sunday, January 3, 2016

Data Pipeline and ETL tasks in Go using Ratchet

As Data Engineers and Software Engineers we might define Data Pipelines differently. This Wikipedia entry defines a pipeline as a set of data processing elements connected in series, where the output of one element is the input of the next one; often executed in parallel. That is pretty close to how I'll define it here. More specifically, I'll use The Golang Blog's informal definition:

A series of stages connected by channels, where each stage is a group of goroutines running the same function. In each stage, the goroutines receive values from upstream via inbound channels; perform some function on that data, usually producing new values; and send values downstream via outbound channels.
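That definition can be sketched in a few lines of plain Go, with no library at all. Here's a minimal two-stage pipeline (the stage names are illustrative):

```go
package main

import "fmt"

// generate is the first stage: it emits values downstream
// on its outbound channel, then closes it.
func generate(nums ...int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, n := range nums {
			out <- n
		}
	}()
	return out
}

// square is the next stage: it receives values from upstream,
// transforms them, and sends new values downstream.
func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for n := range in {
			out <- n * n
		}
	}()
	return out
}

func main() {
	for v := range square(generate(2, 3)) {
		fmt.Println(v) // prints 4, then 9
	}
}
```

Each stage runs in its own goroutine and the channels carry the data between them; Ratchet builds on exactly this pattern.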

Ratchet is a Go library that abstracts these pipeline concepts, giving you a clearcut framework to work with. You will satisfy a data processing interface and be provided with some convenience methods for managing the flow of data.

Perhaps you have been tasked with creating automated pipelines that fetch data from disparate sources -- such as remote databases or web services -- and then cleaning up, transforming and aggregating the unorganized data into specialized databases or datasources (an RDBMS, CSV files, etc.) so your team can use that organized data for analytics and other purposes.

Ratchet is an open-source project by Daily Burn. It provides a better way of extracting data from the databases and web services an application uses, then transforming and loading that data into reporting-oriented formats. It makes doing these tasks more robust than creating basic scripts or using limited 3rd-party services that don't give you full control over your data processing.

Install Ratchet by running:

 $ go get github.com/dailyburn/ratchet

Example project layout:

    ├── main.go             (Main package. Pipeline functions)
    ├── packages            (Your reporting package)
    │   ├── models.go       (JSON structs for transformer.go)
    │   ├── queries.go      (SQL query functions for SQLReaders)
    │   └── transformer.go  (Custom DataProcessors)

Ratchet consists of a Pipeline with a series of PipelineStages, which each handle one or more DataProcessors. DataProcessors each run in their own goroutine so all of the data is processed concurrently. The DataProcessors send and receive JSON for convenience. Ratchet gives you some useful data processors and an interface for you to implement custom ones.

Your Ratchet tasks will likely use a combination of the provided DataProcessors and your own custom ones. If you're creating an ETL task, the custom DataProcessors will usually be for the Transform stage, while the Extract and Load stages can use the built-in Reader/Writer DataProcessors such as SQLReader and SQLWriter. But there are times when you will create custom DataProcessors even for the extract stages, such as when you're making calls to 3rd-party REST APIs.

There are other DataProcessors provided for your convenience, such as ones for Google's BigQuery. You get to mix and match things as they make sense for your application. See the full list of provided DataProcessors.

You will typically begin your Ratchet code by importing the necessary packages. For the sake of example I will assume you are also going to access an RDBMS, such as MySQL, to perform some SQL reads and/or writes:

 import (
     "github.com/dailyburn/ratchet"
     "github.com/dailyburn/ratchet/processors"
     _ "github.com/go-sql-driver/mysql" // driver registered for database/sql
 )

Since an SQLReader takes a *sql.DB as the first parameter, you will want to start by creating one. If you only need a basic SQL string then you may want to use NewSQLReader. Example:

 func UsersQuery(minId int, maxId int) string {
     return fmt.Sprintf(`SELECT id, name FROM users 
         WHERE id >= %v AND id <= %v`, minId, maxId)
 }

We simply created a function that takes arguments, generates some SQL and returns a string. We can call it like:

 users := processors.NewSQLReader(someDB, mypkg.UsersQuery(5, 10))

If you need a query to use the JSON results generated from a previous stage, use NewDynamicSQLReader.

When you need to create your own data processors, you implicitly satisfy the DataProcessor interface by implementing its ProcessData and Finish methods. You will create a struct and attach these methods to it. It's up to you to determine how your structure will hold any state you need. Also, because ProcessData deals with receiving JSON from the previous stage and then passing JSON on to the next stage, you will need to create structs for Unmarshaling and Marshaling the JSON.

Tip: Creating structs for all the JSON you will be processing can be tedious so I highly recommend the tool json2go.

The data package includes some types and functions for using JSON with Ratchet. It provides wrapper functions for json.Unmarshal() and json.Marshal(), called data.ParseJSON() and data.NewJSON(), that add logging. It also defines the type data.JSON as a simple name for a byte slice: type JSON []byte. Example:

 package mypkg

 import "github.com/dailyburn/ratchet/data"

 type myTransformer struct{}

 // Expose our DataProcessor for clients to use
 func NewMyTransformer() *myTransformer {
     return &myTransformer{}
 }

 // Consider moving these two structs to models.go
 type ReceivedData struct {
     UserID int `json:"user_id,omitempty"`
 }

 type TransformedData struct {
     UserID       int    `json:"user_id,omitempty"`
     SomeNewField string `json:"some_new_field"`
 }

 func (t *myTransformer) ProcessData(d data.JSON,
                                     outputChan chan data.JSON,
                                     killChan chan error) {

     // Step 1: Unmarshal json into slice of ReceivedData structs
     var users []ReceivedData
     var transforms []TransformedData
     err := data.ParseJSON(d, &users)
     if err != nil {
         killChan <- err
     }

     // Step 2: Loop through slice and transform data
     for _, user := range users {
         transform := TransformedData{}
         transform.UserID = user.UserID
         transform.SomeNewField = "whatever"
         transforms = append(transforms, transform)
     }

     // Step 3: Marshal transformed data and send to next stage
     dd, err := data.NewJSON(transforms)
     if err != nil {
         killChan <- err
     } else {
         outputChan <- dd
     }
 }

 func (t *myTransformer) Finish(outputChan chan data.JSON,
                                killChan chan error) {}

Notice the idiomatic NewMyTransformer() function that returns a pointer to our DataProcessor's struct, initialized to its zero value. Because the function name starts with a capital letter it is exported, so package users can call it like:

 transform := mypkg.NewMyTransformer()

Finish() is called by Ratchet after the previous stage is done sending its data. You can often implement it as an empty method. Finish is more useful when you want to wait until all the data has been received before doing something with it, which you may really want when working with more than one input source. In that case, you will typically use ProcessData() for validating and storing the incoming data on the receiver struct, then do the second and third steps above inside Finish() instead. Here is how we could rewrite the example to batch things up using a non-empty struct and a pointer receiver, completing the transformation stage in Finish():

 type myTransformer struct{
     BatchedUsers []User
 }

 func NewMyTransformer() *myTransformer {
     return &myTransformer{}
 }

 func (t *myTransformer) ProcessData(d data.JSON,
                                     outputChan chan data.JSON,
                                     killChan chan error) {

     // Step 1: Unmarshal the JSON into a User slice
     var users []User
     err := data.ParseJSON(d, &users)
     if err != nil {
         killChan <- err
     }

     // Step 2: Append via the pointer receiver
     t.BatchedUsers = append(t.BatchedUsers, users...)
 }

 func (t *myTransformer) Finish(outputChan chan data.JSON,
                                killChan chan error) {

     var transforms []TransformedData

     // Step 3: Loop through the batch and transform data
     for _, user := range t.BatchedUsers {
         transform := TransformedData{}
         transform.UserID = user.UserID
         transform.SomeNewField = "whatever"
         transforms = append(transforms, transform)
     }

     // Step 4: Marshal transformed data and send to next stage.
     // Only write the results if there's at least one row/record.
     // You can change the batch size by setting loadDP.BatchSize
     if len(transforms) > 0 {
         dd, err := data.NewJSON(transforms)
         if err != nil {
             killChan <- err
         } else {
             outputChan <- dd
         }
     }
 }

Once you have your data processors set up, you just need to pass them into a new pipeline for processing. If you have one reader, one transformer, and one loader, you can use a basic 3-stage pipeline via NewPipeline():

 pipeline := ratchet.NewPipeline(extractDP, transformDP, loadDP)
 err := <-pipeline.Run()

If you do not even need a transform stage, you can have just extract and load stages: ratchet.NewPipeline(extractDP, loadDP). This is enough if your extracted SQL fields match the table you will load the data into.

Things aren't always that simple, so Ratchet provides a more flexible branching pipeline, via NewPipelineLayout(), which takes a variable number of NewPipelineStage()s, which in turn take a variable number of DataProcessors. Inside the pipeline stages you wrap your data processor instances with the Do() function and call the Outputs() method on its return value, following these rules:

  • DataProcessors in a non-final PipelineStage must use Outputs().
  • A DataProcessor must be pointed to by one of the previous Outputs() (except in the first stage).
  • Outputs() must point to a DataProcessor in the next immediate stage.
  • DataProcessors in the final stage must not use Outputs().

Here's how the basic 3 stage pipeline shown above would look as a branching pipeline:

 layout, err := ratchet.NewPipelineLayout(
     ratchet.NewPipelineStage( // Stage 1
         ratchet.Do(extractDP).Outputs(transformDP),
     ),
     ratchet.NewPipelineStage( // Stage 2
         ratchet.Do(transformDP).Outputs(loadDP),
     ),
     ratchet.NewPipelineStage( // Stage 3
         ratchet.Do(loadDP),
     ),
 )

 pipeline := ratchet.NewBranchingPipeline(layout)
 err = <-pipeline.Run()

Fortunately, you can do a lot more than that with a branching pipeline. Outputs() can take multiple parameters to shove data into and you will often have multiple calls to Do() in the intermediate stages, for handling disparate data.
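Under the hood, that kind of branching is just channel fan-out. Here's a library-free sketch of one value being fed to two downstream stages, the way Outputs(a, b) does (all names are illustrative):

```go
package main

import "fmt"

// broadcast sends each value received from in to every out channel,
// similar to how a DataProcessor's Outputs(a, b) feeds two stages.
func broadcast(in <-chan string, outs ...chan string) {
	for v := range in {
		for _, out := range outs {
			out <- v
		}
	}
	// Tell every downstream stage there's no more data coming.
	for _, out := range outs {
		close(out)
	}
}

func main() {
	in := make(chan string, 1)
	a := make(chan string, 1)
	b := make(chan string, 1)

	in <- "row1"
	close(in)

	broadcast(in, a, b)
	fmt.Println(<-a, <-b) // row1 row1
}
```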

Sometimes you will want to pass the original object through the various stages in case you need to use it in a later stage. This is very easy to do using the NewPassthrough DataProcessor. In the following example, passing passthrough to Outputs() makes aDP get passed to the next stage along with the other values passed to Outputs():

 passthrough := processors.NewPassthrough()

 // ...then, inside one of your pipeline stages:
     ratchet.Do(aDP).Outputs(anotherDP, passthrough),

Ratchet provides a logging facility that's very useful for debugging Ratchet tasks. It is often helpful to temporarily place the following line into your ProcessData() implementation so that you can see output from the calls to logger.Debug():

 logger.LogLevel = logger.LevelDebug

Even better, create a way to set the different levels from the CLI. For example: --log-level="debug", and so on for LevelError, LevelInfo, LevelStatus and LevelSilent. I recommend LevelSilent in production, or whenever you just need the job to run faster.
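A minimal sketch of such a CLI switch using the standard flag package; the level constants and their ordering here are illustrative stand-ins for the logger package's values, not its actual API:

```go
package main

import (
	"flag"
	"fmt"
)

// Illustrative level constants mirroring the logger levels
// named in the text (debug, info, status, error, silent).
const (
	LevelDebug = iota
	LevelInfo
	LevelStatus
	LevelError
	LevelSilent
)

var levels = map[string]int{
	"debug":  LevelDebug,
	"info":   LevelInfo,
	"status": LevelStatus,
	"error":  LevelError,
	"silent": LevelSilent,
}

func main() {
	logLevel := flag.String("log-level", "info", "debug|info|status|error|silent")
	flag.Parse()

	lvl, ok := levels[*logLevel]
	if !ok {
		fmt.Println("unknown log level:", *logLevel)
		return
	}
	// With the real library you would assign logger.LogLevel here.
	fmt.Println("using log level", lvl)
}
```

Run your task with --log-level=silent on production and --log-level=debug while developing.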

Be sure to read the log output (ratchet_default.log); it shows, for example, the SQL INSERT statements that were actually executed. The logger package provides some other useful logging functions as well.

Another debugging/development tip is to temporarily write the final stage output to a CSVWriter that goes to standard output. This allows you to quickly prototype a Ratchet task without having to set up or write to your final database table(s) yet. Example:

 // Setup all the DataProcessors
 users := processors.NewSQLReader(inputDB, mypkg.UsersQuery(5, 10))
 bq := processors.NewDynamicBigQueryReader(bqConfig, mypkg.BQQuery)
 bq.UnflattenResults = true
 transformer := mypkg.NewMyTransformer()
 writeCSV := processors.NewCSVWriter(os.Stdout)

 // Create a new Pipeline using the DataProcessors
 layout, err := ratchet.NewPipelineLayout(
     ratchet.NewPipelineStage(
         ratchet.Do(users).Outputs(bq),
     ),
     ratchet.NewPipelineStage(
         ratchet.Do(bq).Outputs(transformer),
     ),
     ratchet.NewPipelineStage(
         ratchet.Do(transformer).Outputs(writeCSV),
     ),
     ratchet.NewPipelineStage(
         ratchet.Do(writeCSV),
     ),
 )

I have covered all of the main concepts. Please see the Ratchet documentation for more information. If you have any questions please post them in the comment section below.

Sunday, May 31, 2015

Full-Site HTTPS Lessons

There may be more to consider than you think before attempting to convert your web site to HTTPS. Even if you already use HTTPS, you might benefit from this information if you aren't sure you set it up properly. I will cover both the obvious and the not-so-obvious things you need to know so you can avoid common pitfalls and plan out your conversion more completely.

Every web site these days should support https:// on every page in order to be more secure and to provide privacy to its users. It can be very easy or fairly difficult to convert an existing site to HTTPS. If it happens to be difficult for your site, don't be discouraged: the work benefits everybody, because your site will be more secure and enjoy a better reputation.

Google has done a great job of explaining why you should use HTTPS on all your pages (HTTPS Everywhere). Man-in-the-Middle attacks aren't only easy when you aren't using HTTPS at all; they're also easy when you're not using it everywhere, or not using it properly. For example, encrypting passwords isn't enough if your cookies aren't secure, and your cookies can't be secure if you're not using HTTPS everywhere. We don't want our faithful users to have their sessions hijacked while they're browsing our site at a coffee shop.

What I would like to talk about are the lessons I learned from setting this up on an existing site with millions of users at the time. I'll start by saying that you should not procrastinate on doing this. It is much easier to set up HTTPS everywhere if you start early, but it is possible to do on any website -- it will just take more work.

I recommend reading this entire article once through first, before trying any of it out. Use this information at your own risk since it is just from my experience and your mileage may vary.

The first thing you'll need is an SSL certificate. Find out if you will need a wildcard cert and so forth. Your site without a subdomain is called a "naked domain". You may need to ensure your certificates cover subdomains so visitors don't get certificate errors when they go to the www. address for your site -- or maybe you handle this by redirecting www to the naked domain in your web server config. And while it's not necessary, you should also decide whether you want to use Extended Validation (EV) certificates. In short, do your homework before purchasing your certificate.

Once you have your certificates figured out, you start by making every page redirect from http to https. This is usually as simple as adding the appropriate redirect directives to wherever you terminate SSL (NGINX, Apache, etc). In NGINX this is usually as simple as adding  return 301 https://$domain$request_uri;  to your HTTP server block; then in your HTTPS server block you'll add directives to enable ssl, listen on port 443, link to your certificate and key and add any needed headers. Refer to your web server's documentation.

It is a myth that having https on every page is going to be too slow. So please do not try to only add https to a subset of pages, such as just the home page or login page. Doing so will only add complexity to your configurations without making your site secure enough.

In your web server config, you will also specify the protocols to support. You should disable support for SSLv2 and SSLv3. With NGINX, just specify the protocols you want and leave out what you don't, otherwise it may default to including SSLv3. So do something like: ssl_protocols TLSv1 TLSv1.1 TLSv1.2;

It may seem confusing that we're using the terms SSL and TLS interchangeably. Just note that you'll want to configure support for TLS encryption, not SSL. See also There IS NO SSL.

We're about half-way there. Unfortunately, simply serving every web page over HTTPS isn't enough. All of the important resources that load on those pages (images and scripts) must be https too, or you aren't really secure yet. If you skip this, you will see a lot of Mixed-Content warnings in the browser console and a warning on the lock icon. So having https:// in the address bar isn't enough. Don't try to cheat, because you will fool neither attackers nor your users.

You should start by configuring all of this on your dev machine (or at least a staging machine), rather than production, because you will likely encounter errors that take a while to resolve and you don't want to leave your site in a broken state. For example, if it turns out that your site has hundreds of hard-coded http:// URL strings in the database, then you'll first need to run a migration script to convert them to "https://". (Yes https://, because protocol-relative // urls are now an anti-pattern.)

If your site has hundreds of thousands of users, running such a migration script could take hours or even days, so start it early if you need one. Once the migration finishes, you will also need to refactor your code to start saving URLs as https:// so you don't run into the same problem again. Depending on when you deploy, you may need to run the migration script once more to catch up.

DO NOT skip this step: To prevent session hijacking, you'll want to enable the Secure and HttpOnly attributes on your session cookies. These are security flags that you add to your cookies by appending 'Secure; HttpOnly' to the Set-Cookie HTTP response header. The Secure flag makes the cookie only accessible over HTTPS, so users can't unknowingly "leak" their cookies by typing http:// to get to your site before it redirects to https://. The HttpOnly flag makes cookies invisible to client-side scripts, so cookies can't be stolen by XSS exploits.

To test that your cookies have the Secure and HttpOnly flags, simply check the Set-Cookie value in the response from curl'ing your website. In a browser such as Google Chrome, you can also open the developer tools and inspect your cookies; you should see checkmarks under the Secure and HttpOnly columns.

Your development sandboxes should use https:// since it's best that development machines closely match production. This helps prevent developers from hard-coding http:// URLs and not realizing it until there are https warnings on production. It means your developer machines will need to run NGINX or whatever web server you use in production; creating a self-signed certificate makes this easy enough. There are additional benefits to this, such as using a local https:// hostname instead of http://localhost:3000.

Lastly, make sure that any middleware and front-end code that does redirects, does so as https. You can test this on the command-line by running:

 $ curl -I -L https://yoursite/whatever 

and checking whether any Location: lines show http:// URLs. Assuming /whatever is a controller that does redirects, you want to be sure it never redirects back to http:// for any request. I've had to ensure https:// redirects at both the WSGI-middleware layer and in the front-end controller code.

To further test that you set everything up correctly, run your domain through an online SSL scanner. It will give you a grade, and your goal is to get an 'A'. For a command-line tool, check your package manager for sslscan (Homebrew, apt-get, etc). You can also use an online checker to verify that your site uses strong certificates. If you get bad scores, make sure you're using SHA-2 certificates and 256-bit TLS ciphers.

Further considerations are disabling support for weak ciphers, using at least SHA-2 (even on your intermediate certs), and looking into SSLStrip attacks and HSTS headers. Also look into Perfect Forward Secrecy (PFS), so attackers can't use your private key to decrypt archived sessions. Finally, make sure the browsers you support work with everything you have set up.

You should also start linking to the https:// version of your site -- in emails and so on -- so your servers don't incur any unnecessary HTTP redirects and because you should be proud that you can provide https links to your web site.

So please get started on this as soon as possible. Don't put off the inevitable.

UPDATE: Further reading: Let's Encrypt and QUIC.

Saturday, April 26, 2014

Using Browserify with jQuery, Backbone.js and more.

This article aims to help you understand how to use Browserify in your projects to manage your front-end dependencies as well as your back-end ones. It assumes you know what Browserify does and that you are unsure exactly how to use it in a real project to handle all of your dependencies, particularly with client-side frameworks. I will show examples with jQuery and Backbone.js, but what I explain will apply somewhat to other frameworks and libraries too.

Using Browserify with jQuery

A good first step toward understanding how to use browserify is with jQuery. Delete any script tags that include jquery.js in your html on the client-side. Instead, install the jquery node module (into ./node_modules) and save a reference to it in package.json:

 $ npm install jquery --save

jQuery 2.1.0 is capable of using browserify's module.exports. That is, you can install the npm package called "jquery" instead of the older workarounds such as the "jquery-browserify" npm package.

Then require jQuery in any module that needs it:

 var $ = require('jquery');

Older versions of jQuery need to be handed a window object, as in var $ = require('jquery')(window); or var $ = require('jquery/dist/jquery')(window);

When you omit the relative path, require() resolves to the module in node_modules, allowing us to just require('jquery') rather than specify a path to jquery.

Whenever any of your JavaScript files need a reference to $ or jQuery you simply create them as needed, with the above jquery require line or with var jQuery = require('jquery'); if you use the name jQuery instead of $.

Now you browserify your code:

 $ npm install -g browserify
 $ browserify whatever.js -o bundle.js

You'll then use a script tag to include bundle.js instead of jquery.js. That's it!

Using Browserify with Backbone.js

Install Backbone.js, Underscore.js and jQuery as node modules:

 $ npm install backbone underscore jquery --save

We explicitly npm installed underscore, instead of letting backbone install it, so that we can require('underscore') later when needed. And we used the --save flag to store these versions in package.json because it's a good habit to get into.

If you structured your Backbone code using a single global object, like app, then you'll need to replace all the lines (if any) that say:

 var app = app || {};

with something like:

 var $ = require('jquery');
 var Backbone = require('backbone');
 var _ = require('underscore');
 Backbone.$ = $;

in all of your JavaScript files. Each file may only need certain objects, and so we should only put the ones each module needs. Being explicit about which dependencies are needed is part of what good modularity is all about. For example, if you only need a reference to Backbone in one file then only do:

 var Backbone = require('backbone');

Backbone can run fine without jQuery, but you can explicitly tell Backbone to use jQuery when needed by doing Backbone.$ = $;

If you explicitly need the '_' object because you're invoking underscore directly, then also include the var _ = require('underscore');

Because of the way Browserify encapsulates every module into its own scope, we no longer have to pass things like jQuery into an IIFE to avoid global conflicts, and we can even put 'use strict' at the TOP of our modules. So instead of:

 var jQuery = require('jquery');
 var Backbone = require('backbone');
 Backbone.$ = jQuery;

 (function($) {
     'use strict';

      module.exports = Backbone.View.extend({

we can simply write:

 'use strict';

 var $ = require('jquery');
 var Backbone = require('backbone');
 Backbone.$ = $;

 module.exports = Backbone.View.extend({ ... });

Notice how the above code also needed var Backbone = require('backbone'); because it references Backbone in this file. Everything not assigned to module.exports stays private to its module. In Node.js, module.exports is simply the object that gets returned as the result of a require call.

Next delete any script tags that include your user-defined Backbone javascript files and replace with just:

 <script src="js/bundle.js"></script>

Generate the bundle.js file:

 $ browserify ./public/js/app.js -o public/js/bundle.js

If we were using the global app object instead of browserify, we'd have to make sure js/views/app.js gets included before js/app.js in the script tag include order. But with browserify we take care of the ordering differently. For example, in public/js/app.js put:

 var $ = require('jquery');
 var AppView = require('./views/app');

 // The only thing that should be in a DOMReady
 $(function() {
     new AppView();
 });

Then in public/js/views/app.js put:

 'use strict';

 var Backbone = require('backbone');
 var $ = require('jquery');
 Backbone.$ = $;

 // examples of requiring some user-defined Backbone stuff
 var Todos = require('../collections/todos');
 var TodoView = require('./todo');

 //var app = app || {};

 //app.AppView = Backbone.View.extend({
 module.exports = Backbone.View.extend({

     initialize: function() {
         //this.listenTo(app.Todos, 'add', this.addTodoTask);
         this.listenTo(Todos, 'add', this.addTodoTask);
     }
 });

Notice how I commented out the old app global object stuff and replaced it with Browserify'd code. You basically just need to replace app.whatever with module.exports = whatever. And every file that used to reference the global app object now needs to require() the relevant file and store it in a local object. So, continuing the example, '../collections/todos.js' might contain:

 'use strict';

 var Backbone = require('backbone');
 var Todo = require('../models/todo');
 //var app = app || {};

 var TodoList = Backbone.Collection.extend({
     //model: app.Todo,
     model: Todo
 });

 module.exports = new TodoList();

Again I commented out the global app object to show that's where browserify's stuff goes.

Since a Backbone.js project -- or any project, for that matter -- can become large, it's a good idea to install watchify. It takes the same command-line arguments as browserify but automatically updates the bundle.js file whenever you make changes, and it updates faster since it only rebuilds what changed.

Additionally, using the power of Browserify transforms, you can integrate Bower packages and more. This is useful if you need a dependency that uses CSS or something that Browserify isn't yet suited for. Perhaps I'll write more about this next time.

Monday, March 17, 2014

Headless Dropbox

You can install Dropbox on a Linux server or any Linux system that doesn't have an external monitor or a GUI Windowing System.

I recently installed Dropbox on my Digital Ocean Droplet, running Ubuntu 12.10 server. I did not figure this all out on my own; I read the sources mentioned at the bottom of this post. However, some of the information was incomplete or contained broken links, and I ran into other issues. So this blog post is more about how I got it working, in hopes that you won't encounter the same issues I did.

First you will need to link your Linux machine with your Dropbox account by downloading and running a Dropbox server script called dropboxd.

If "uname -m" says i686:
 $ cd  # changes to your $HOME directory
 $ wget -O - "https://www.dropbox.com/download?plat=lnx.x86" | tar xzf -
 $ ~/.dropbox-dist/dropboxd

If "uname -m" says x86_64:
 $ cd
 $ wget -O - "https://www.dropbox.com/download?plat=lnx.x86_64" | tar xzf -
 $ ~/.dropbox-dist/dropboxd
The running dropboxd command should output a link for you to paste into a browser to link that system. I had issues with this at first and just had to keep rerunning the command and leaving it running until it finally printed the link; this was probably due to connection issues that you hopefully won't have. You can then verify that the machine is linked by logging into your Dropbox web account and looking under the "Security" tab for linked devices.

The link uses a token so you don't have to store your Dropbox password on the Linux instance.

Once Dropbox is linked you should see your ~/Dropbox directory starting to sync. At this point, you can temporarily stop the syncing by hitting ^C (Control-C) to stop dropboxd. Don't worry we'll finish syncing in a minute.

Let's download a nice python script, dropbox.py, that will help us manage Dropbox better:
 $ wget -O ~/bin/dropbox.py "https://www.dropbox.com/download?dl=packages/dropbox.py"
 $ chmod 755 ~/bin/dropbox.py
and download a bash script that we'll use to start and stop Dropbox:
 $ wget -O ~/bin/dropbox_temp "\
 $ chmod u+x ~/bin/dropbox_temp
Edit ~/bin/dropbox_temp in Vim (or whatever editor you like) and change DROPBOX_USERS="user1 user2" to whatever user(s) on your system you want to run dropbox clients for.

Now move dropbox_temp script:
 $ sudo mv ~/bin/dropbox_temp /etc/init.d/dropbox
and have it start when we boot:
 $ sudo update-rc.d dropbox defaults
Close the dropboxd script if you still have it running manually: ^C.

Check whether Dropbox is running and, if not, start it:
 $ sudo service dropbox status
 $ sudo service dropbox start
If it is running, check its sync status:
 $ dropbox.py status
If the status command says Dropbox is still downloading or syncing a ton of files, you might want to exclude some directories from syncing to this Linux machine by using Dropbox's Selective Sync.

Selective Sync allows you to exclude certain folders that you don't want to sync to your Linux server, e.g., maybe you don't need all your photos on it:
 $ cd ~/Dropbox
 $ ls
 Photos Projects Public iphone-photos
Let's exclude the photos and Public directories:
 $ dropbox.py exclude add Photos
 $ dropbox.py exclude add iphone-photos
 $ dropbox.py exclude add Public
 $ ls
At this point, I would wait until dropbox.py status says "Up to date" before editing files in your Dropbox folder, just to make sure nothing gets corrupted. Once Dropbox has finished downloading everything, you should try to sync files to AND from Dropbox. Make sure to test both directions, and also test nested directories and your most important directories. I once had an issue where files in ~/Dropbox/Projects/notes/ wouldn't sync outgoing, yet ~/Dropbox/Projects/code/ files would. If syncing doesn't work, you'll need to reinstall Dropbox (instructions are below). It shouldn't be a firewall issue, because the docs specify that Dropbox runs on port 80.

Alright, if you made it this far and everything seems to be working: great!

You may also want to delete any cached deleted files from the exclude step (NOTE: don't try this unless you need it, and only once you're sure Dropbox is already working):
 $ sudo service dropbox stop
 $ rm -rf ~/Dropbox/.dropbox.cache/*
 $ sudo service dropbox start
Don't delete the .dropbox.cache dir itself, just its contents.

If you ever need to delete the dropbox instance:
  • ^C to stop the dropboxd daemon
  • rm -rf ~/.dropbox* ~/Dropbox
  • Revoke the token that was issued for your session by clicking 'unlink' next to your Droplet's host name at
It's important that you actually delete the ~/.dropbox* directories, because your old key is stored there and you'll need to generate a new one the next time you run the original wget commands. So don't think of keeping them around as a shortcut to reinstalling.


Sunday, March 2, 2014

The Path to Homebrew

The home page and documentation for Homebrew show how to install and use Homebrew. However, they currently don't seem to explain exactly how pathing works. This can trip up a lot of newcomers, who might give up on Homebrew or fumble around the internet and land on bad advice - such as using sudo and editing /etc/paths. All of this is unnecessary and potentially dangerous.

You just really need to understand a few basic concepts:
  • Never run brew as sudo. Not "sudo brew install" nor "sudo brew link".
  • The "Cellar" is a place that all your "kegs" go. Homebrew installs packages to their own directory (in the Cellar) and then symlinks their files into /usr/local/.
  • Change /usr/local/* to be owned by $USER, not root, so you can have write permissions and not need sudo.
  • The $PATH entry for /usr/local/bin should occur before /usr/bin.
Here's an example of installing Python and setting paths correctly. I'm using OS X 10.9.2 (Mavericks) and Homebrew 0.9.5:
 $ sudo chown -R $USER /usr/local/*
 $ brew doctor
 $ brew update
 $ brew install python --with-brewed-openssl
 $ ls /usr/local/Cellar/python
 $ python --version
 Python 2.7.5
Wait, I expected to see python 2.7.6 now. What happened?
 $ which python
 /usr/bin/python
But the Homebrew docs said we'd be using a symlink from /usr/local/bin/ that points at the Cellar, instead of using /usr/bin:
 $ ls -l /usr/local/bin/python
 lrwxr-xr-x 1 rkulla admin 33 Mar 2 06:37 /usr/local/bin/python ->
Aha. So it must be a PATH issue:
 $ echo $PATH

 $ cat /etc/paths
When using Homebrew, we actually want /usr/local/bin to come before /usr/bin in $PATH. But don't edit /etc/paths. Instead, edit ~/.bashrc and prepend /usr/local/bin to $PATH, like:
 export PATH=/usr/local/bin:$PATH
Next, run:
 $ source ~/.bashrc
 $ echo $PATH
 /usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin
Don't worry that there are duplicate entries for /usr/local/bin; it doesn't matter. What matters is that you've modified your PATH safely, and more cleanly than the other ways.
 $ ls -l $(which python)
 lrwxr-xr-x 1 rkulla admin 33 Mar 2 06:37 /usr/local/bin/python ->
Yay! And if you ever want to use the old python you can just run: /usr/bin/python.

Best of all, from now on whenever you need to install anything with brew, you can just run:
 $ brew install whatever
and not have to fumble with permissions or worry about overwriting any global system files.

I also like to source ~/.bashrc from ~/.bash_profile. For why, see: .bash_profile vs .bashrc
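For reference, that's just a few lines in ~/.bash_profile (a minimal sketch; the file test guards against a missing ~/.bashrc):

```shell
# ~/.bash_profile: load ~/.bashrc in login shells too,
# so the PATH change above applies everywhere
if [ -f ~/.bashrc ]; then
  . ~/.bashrc
fi
```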

Monday, January 2, 2012

The Wallpaper Algorithm Using HTML5 and JavaScript

There's a fun little book called The New Turing Omnibus - 66 Excursions in Computer Science from 1993. An omnibus is "a volume containing several novels or other items previously published separately". The first item in this book is the following algorithm:

1. input corna, cornb
2. input side
3. for i ← 1 to 100
     for j ← 1 to 100
       x ← corna + i × side/100
       y ← cornb + j × side/100
       c ← int(x² + y²)
       if c even
         then plot(i, j)

All that's needed to translate this algorithm into a running computer program is a Turing-complete programming language and a way to plot pixels on a display. So if you want to try it yourself before seeing one solution, stop reading now. I'm going to demonstrate how I implemented this algorithm using JavaScript and HTML5's canvas element. The book only shows one implementation, in Pascal, so I thought I'd do something a little more modern.

We'll be creating two files. The first one will be the html5 content:
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Wallpaper Algorithm</title>
</head>
<body>
  <button id="myButton">Show Wallpaper</button>

  <canvas id="myCanvas">
    Your browser does not support the canvas element.
  </canvas>

  <script src="wallpaper-algorithm.js"></script>
</body>
</html>

Next create a file called wallpaper-algorithm.js with:
document.getElementById("myButton").onclick = function () {
  this.style.display = 'none'; // hide the button once clicked
  wallpaper();
};

function wallpaper() {

  var browserWidth = window.innerWidth;
  var browserHeight = window.innerHeight;
  var canvas = document.getElementById("myCanvas");
  var ctx = canvas.getContext("2d");
  var corna = 0;
  var cornb = 0;
  var side = 345;
  var xmul;

  canvas.setAttribute('width', browserWidth);
  canvas.setAttribute('height', browserHeight);

  // Fill the background black; the loops below plot the other three colors
  ctx.fillStyle = 'black';
  ctx.fillRect(0, 0, browserWidth, browserHeight);
  for (var i = 1; i <= browserWidth; i++) {
    var x = corna + i * side / 100;
    xmul = x * x;

    for (var j = 1; j <= browserHeight; j++) {
      var y = cornb + j * side / 100;

      if (i % 2 == 0) {
        ctx.fillStyle = 'white';
      } else if (j % 2 == 0) {
        ctx.fillStyle = 'lime';
      } else {
        ctx.fillStyle = 'purple';
      }

      if (Math.round(xmul + y * y) % 2 == 0) {
        ctx.fillRect(i, j, 1, 1);
      }
    }
  }
}

As an exercise, the book says to write the algorithm using three colors instead of two. That was easy enough so above I had it use four colors (black, white, lime and purple).

The book also says to make corna, cornb and side as user inputs, but for the sake of example, I hard-coded them. The variables corna and cornb represent the lower left-hand corner of the square of interest. The variable side is the length of the square's sides. Try side values of 5, 25, 33, 41, 45, 51 and 325 for some neat patterns.

Feel free to experiment with the algorithm and have fun with it. All-in-all it's a great example of what can be done with a simple algorithm and screen graphics. It's also a great demonstration of HTML5's canvas element coupled with JavaScript.

For exact details on how the algorithm works, buy the book. I think Google Books has a free version online as well.

Tuesday, December 6, 2011

I See Dead Code

I recently removed a lot of dead code from our code base at work. There were several similar directories, dead files, and unused code within live files.

All this unused code was a major nuisance for over a year. It would cause grep to show too much information -- most of it meaningless -- and it would even cause your IDE to jump to the old unused class file, rather than the one you actually wanted.
People were afraid to remove this code for fear of breaking things. After all, it was tied to our main shared library and could cripple over 200 of our websites at once. But enough was enough, and I decided to embark on removing it. I have a lot of experience with removing dead code, but rarely with this much risk or this much code. I had only one day to figure it out, delete it, and test it. Otherwise, it would have to be rolled back, for who knows how long, until anyone would attempt to remove it again.

Before removing code like this, I ask myself if it's good code that we might want to start using soon - or even right now under a configuration option. If it's useful code I wouldn't just delete it. Rather, I'd extract it out into classes that could become part of our library.

In my case it really was all just useless code. Some programmer from long ago had been a little too happy to copy and paste code to new files, leaving the old dead files lying around.

Some other code was living in live files but was effectively in a "dead block" because it would only run under a condition that could never happen. I decided to remove this as well, because such code will only confuse programmers every time they see it. In fact, I was about to spend time converting it to work with our new platform. What a waste of time that would have been! Fortunately, dead code like this is often easy to identify, because you go to refactor it and can't figure out how to trigger that condition from the user interface. This is why we shouldn't rely solely on unit tests.

A coworker suggested that I keep the condition in the code at first, replacing the body of useless code with code to log a message such as "This condition should never happen." This way we could monitor the logs once it was deployed for a while, and really make sure the code wasn't being used by anything.

The easiest dead code to spot is commented-out code. Go ahead and remove it any time you see it. It will only sit there forever, no one will ever uncomment it, and if someone really does need it, that's what source control is for.

For those times that it's hard to tell if code is dead or not, the main tool I used to figure it out is grep. More specifically I use ack-grep. The key is to trust your grepping abilities and be extremely thorough. If you can prove that certain code isn't being referenced via other files, that's a huge step toward having the confidence to remove it.
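As a concrete sketch of that workflow, here's roughly how I'd prove a class is unreferenced before deleting it. The class name and file layout below are hypothetical; only the grep technique is the point:

```shell
#!/bin/sh
# Build a tiny hypothetical codebase so the sketch is self-contained
SRC_DIR=$(mktemp -d)
printf 'class OldBillingReport {}\n' > "$SRC_DIR/old_billing_report.php"
printf 'echo "nothing references the class";\n' > "$SRC_DIR/index.php"

# List every file mentioning the class, then exclude its own definition.
# (ack/ack-grep does the same job and skips VCS directories by default.)
refs=$(grep -rl 'OldBillingReport' "$SRC_DIR" | grep -v 'old_billing_report.php' || true)

if [ -z "$refs" ]; then
  echo "no references found: candidate for removal"
else
  echo "still referenced in: $refs"
fi
```

Being this systematic about it, file by file, is what builds the confidence to actually hit delete.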

Of course, it's not always that easy. Sometimes the code is referenced but that's okay; replacing it to use the proper files instead is certainly worth the effort.

Once you have removed the code, and all is working in your development sandbox, it's time to deploy it. It really isn't much different than writing new code or refactoring existing live code. Any time you commit code it could cause other things to break. This is why we do regression testing.

I highly recommend that you wait to push this kind of change out when there's no other important code that needs to be deployed too. If the "new" code is found to not be in a working state, then your code removal attempt will be the first source of blame and you will likely have to rollback and risk never getting to try again -- even if it wasn't the problem.

So go ahead and remove that dead code the next time you see it. You'll thank yourself later and so will your coworkers.
