Sunday, October 8, 2017

Using Amazon Redshift with Looker

This post is an attempt to share some lessons I learned while migrating a Looker instance to Amazon Redshift.

It turns out that Looker isn't just great for Business Users to avoid having to write SQL, it's also great for DevOps to create Admin Dashboards into Redshift that everyone can use to avoid manually writing many complex queries against Redshift. I'll start by setting up Looker/Redshift connections, then Admin dashboards, then end with optimizing it all.

Redshift currently makes sense to use with Looker because it's an OLAP (Online Analytical Processing) database, as opposed to OLTP DBs like MySQL. If you plan to use other AWS services, such as Kinesis Firehose to store event data in Redshift, using Redshift as your data warehouse enables you to perform JOINs on that event data. Redshift already supports Window Functions (great for more real-time/ELT operations with Looker Derived Tables), CTE's (for more readable queries), etc. MySQL 8 will support many of these features but Redshift still has the advantage of being designed as a columnar database while offering additional analytical advantages.

The cloud-database space moves rapidly so keep an eye on Google, Oracle and others as they compete with Amazon. If you have a cloud-native app that uses another provider, that's fine since you can extract all that data in ETL/ELT but the main thing is to have a single data warehouse because Looker should only connect to one warehouse.


Fortunately, you don't need to recreate your existing Looker account just to switch connections -- as long as the table and field names will (at least mostly) be the same between environments. To switch connections go to Admin>Database>Connections and create a new connection (leaving your old connection there for now):

  • The connection name has to be different. (I appended -redshift to the name)
  • For Dialect put "Amazon Redshift"
  • For Host:Port just put the host part of your Redshift endpoint, e.g.,
  • For Database enter the name of the production Redshift DB to connect to
  • Enter the Schema you will connect to. (schemas are just like in Postgres)
  • You can probably leave the other options as defauls for now

At this point click `test connection`. If it says "Can connect", great. It should be pretty instant. If it takes a while and then fails with: Cannot connect: connection refused: Java::OrgPostgresqlUtil::PSQLException: Connection refused..." then it means Looker doesn't have access to your cluster yet.

If your production Redshift cluster is locked down, such that can't connect to it yet, then whitelist Looker's IPs inside your Redshift VPC Security Group (VPC Security Groups work just like EC2 Security Groups). Looker required 5 IPs to be whitelisted, which worked but made it harder to tell what the IPs were for and we already had another IP listed for Kinesis that looked like Type=Redshift, Protocol=TCP, Port Range=5439 Source=<ip>/27. So instead of adding more IPs to the Inbound section, I created a new VPC Security Group called "Looker" and whitelisted those 5 IPs in there:

 Redshift TCP 5439 <ip>/32
 Redshift TCP 5439 <ip>/32
 Redshift TCP 5439 <ip>/32
 Redshift TCP 5439 <ip>/32
 Redshift TCP 5439 <ip>/32

(Get the list of IPs to whitelist from here.)

I attached that new security group to the Redshift Cluster. This is key because you can't attach the new Looker security group to the Redshift security group (even though you can add it as a tag or type its group ID or name in since that would only act like a tag).

In the Develop menu dropdown, you should have a menu option called Content Validator (If you don't, ask Looker support to add access). Go into it and make sure your current Looker connection is validating. You will use this to find any errors when you switch your connection over to your new Redshift connection and update your LookML.

BTW, I highly recommend considering hosting your own Looker Git repo, so that you can view all of the diffs your team makes to the LookML.

IMPORTANT: before you can switch connections, you'll have to be in development mode and changing connections will BREAK PRODUCTION Looker for now. So don't attempt this until you can bring Looker down for maintenance!

To switch DBs, go to Develop>Manage LookML Projects. Click 'Edit Configuration' for the project you want to change connections for, change to the new connection, and click Save. It will probably say: model '<your model>' is not allowed to use connection '<your schema>' because you still need to update the model file(s) to use your new connection. Go to the model file from Develop>Project. Where it says connection: "<your connection>", change it to your new Redshift connection and click Save.

Next you'll need to update any table or field names in the View files to reference your new name. For example, if your old instances of sql_table_name were referencing a MySQL DB name, they will now need to reference your Redshift schema. Use search to find any other old references to change.

Now you should just need to port any SQL syntax to Redshift. For example, if your old connection was MySQL, look for errors such as The Amazon Redshift database encountered an error while running this query ... and convert it to Redshift syntax. One example I encountered was a query that used SELECT AVG(some_table.paid) which no longer works because Redshift doesn't want you to do an AVG on a bool (whereas the 'paid' field in MySQL was a TINYINT(1). So I cast it by going into its View file where a Measure defined the 'paid' field and changed: sql: ${paid} ;; to sql: ${paid}::int::double precision ;;. Remember: Redshift is not the exact same as Postgres where you could have just cast it to an int. Redshift's AVG() gives you an int result if the input is int. If you're used to using TINYINT(1) bools in MySQL, you can use a Dimension of type number: sql: case when ${TABLE}.paid then 1 else 0 end ;; rather than a yesno so that a sum Measure can just reference ${paid} and an average Measure can just use ${paid}::double precision. YMMV, but keep this in mind.

If you have lot of old LookML, keep going through and fixing all the syntax issues switching to Redshift caused. If your old connection was Postgres then there's not as much to change than if it were MySQL. For example, MySQL date functions such as now() will need to be changed to getdate(), period_diff() will be datediff(), date_format() will be to_char(), etc.

Redshift Admin Dashboards

Looker puts out really cool admin "blocks" that allow both admins and regular users to get insights into things such as ETL cycles, recent load failures, table architectures, and more; all within Looker without having to grant everyone access to the AWS Console.

The recommended way to create a Redshift Admin dashboard and Performance Overview in Looker is to follow these instructions but I will clarify some points that were missing for me. Start by creating a separate DB connection for your Redshift cluster to be used by this block. This step isn't absolutely necessary but it's a good idea since it increases performance because the Admin Block uses PDTs, so having a separate connection will reduce extra query load on your main connection. This connection will mostly be a duplicate, except you'll make a new user just for this connection since this user will have permissions to the metadata tables that the regular Looker user shouldn't. Everything else can be duplicated: the database and even the scratch schema.

I created a user called "looker" as my main connection's user. Don't use an admin user for Looker or it will have access to everything. For instructions on how to do this see this article. I found their explanation of granting SELECT on all tables to Looker inadequate so I did GRANT SELECT ON ALL TABLES IN SCHEMA <my_schema> TO looker;

If you haven't already, create a scratch schema that your PDT's, including the Redshift Admin Block, will use:

psql> CREATE SCHEMA looker_scratch AUTHORIZATION looker;
psql> ALTER USER looker SET search_path TO '$user',looker_scratch,<my_schema>,public;

In Admin>Connections select your connection and click the checkbox to enable PDTs. Under Temp Database enter 'looker_scratch' and click 'Update Connection'. Then click 'test connection'. If you get an error like: Failed to create/write to pdt connection registration table looker_scratch.connection_reg_3 then run:

psql> select has_table_privilege('looker', 'looker_scratch.connection_reg_r3', 'insert');

If that says 'f', then run:

psql> \dt looker_scratch.*

If that says the owner is someone else, what likely happened was that you first made a connection in Looker that had a user of someone else and that user created the connection_reg_r3 table first, so it's still the owner. One way to fix that is:

psql> DROP SCHEMA looker_scratch CASCADE;
psql> CREATE SCHEMA looker_scratch AUTHORIZATION looker;
psql> ALTER USER looker SET search_path TO '$user',looker_scratch,<my_schema>,public;

Now when you click 'test connection' with the 'looker' user, the looker user will be the owner of the look_scratch tables and should now pass with: Can use persistent derived tables in "looker_scratch"

We'll now create a new Redshift user called 'looker_admin' for this admin block to use and give that user access to the redshift tables and scratch_schema it will use:

psql> CREATE USER looker_admin WITH PASSWORD 'some_password';

psql> GRANT SELECT ON TABLE pg_class TO looker_admin;
psql> GRANT SELECT ON TABLE pg_namespace TO looker_admin;
psql> GRANT SELECT ON TABLE pg_database TO looker_admin;

psql> GRANT ALL ON SCHEMA looker_scratch TO looker_admin;
psql> GRANT ALL ON ALL TABLES IN SCHEMA looker_scratch TO looker_admin;

Looker's SQL Runner will run as the 'looker' user, not 'looker_admin'. So don't be confused if use it with queries like SELECT COUNT(*) FROM stl_load_commits, which will show a count of zero because these are virtual tables that only users with superuser privileges can see. Redshift's convention is to have rows generated by another user show up as invisible, rather than showing an error.

You're now ready to start making the Redshift Admin block work. This requires adding 3 files: A Model called redshift_model, a Dashboard called redshift_admin, and a View called redshift_views.

Make sure the model's 'connection' string points to the separate connection we made above for the looker_admin user. Also make sure to go into Develop>Manage LookML Projects and click 'Edit Configuration' next to the redshift_model and select the same connection for use with it. You should now see the admin dashboard when you go to the dashboard section of your user. It should show you sections like Table Load Summary, Recent Files Loaded, Recent Load Errors, and Database Consumption.

But wait, there are more useful dashboards to add: redshift_performance and redshift_query_inspection. Edit both dashboard's LookML if you need to change query_timezone from their default America/Los_Angeles to something else.

If you end up seeing duplicate dashboards for all your models, it's probably because your other model still does:

include: "*.view.lkml"
include: "*.dashboard.lookml"

which makes it include every dashboard, including your new redshift ones. There currently isn't a way to exclude redshift_ from the includes, so come up with a new naming convention like how the redshift_ LookML does. I changed the include in my main model to use a prefix "somePrefix_", include: "somePrefix_*.dashboard.lookml". This meant I needed to create any LookML dashboards to have a prefix now and for consistency I did the same for the Views. It also meant going into my main Model and changing all references to the views to use the prefix. It's really annoying but not as annoying to me as having duplicate dashboards.

After doing that, some of your existing Looks and Dashboards in the 'Browse' section will likely be broken. Go into Develop>Content Validator and click 'Replace' next to any errors and rename the explores to your new prefix. Once you commit, if you get any LookML errors fix them before you push. For example, I had a couple of Views referencing ${foo} where I needed to change it to ${somePrefix_foo} now. Don't blindly do that for every ${foo}, just errors, because most ${foo}'s are probably for Dimensions which don't need prefixes but sometimes a ${foo} is actually a reference to a View from another file. Once you're resolved these, click 'Validate Again' until the errors are gone.

Finally, you might want to adjust the time of day these Redshift Admin PDTs rebuild. Go into 'redshift_views' and change the value of sql_trigger_value.


Let me start by saying that I am not a Redshift expert and that the following are from my own understanding of the docs and my own experiments. Nonetheless, it may help your understanding because these are the topics that tend to trip up developers the most.

Whether or not you're using a star-schema design with Redshift, you can still leverage Looker and optimize your tables in similar ways. For example, we can see which sortkeys and distkeys a table has with:

$ psql [cluster credentials]
# set search_path to '$user', public, <my_schema>
# select "column", type, encoding, distkey, sortkey from pg_table_def where tablename = 'foo';

However, it's useful to first know which queries are actually slow. To do this, log into the AWS Console and go to your Redshift cluster and then the Queries tab. When you find a slow query, try running it directly, e.g. in Looker's SQL Runner or via the psql command. However, keep in mind that first runs of your query will be slower because Redshift compiles and then caches them. Therefore, you'll want to do your benchmarking against the warm cache versions.

While you're still in the Queries tab, click on the actual query and then under Query Execution Details you'll see the "Explain plan". Look for the tab called Actual. Now you can click on the plan's steps to drill down further and see if there are any warnings or how many rows are being scanned - which is indicative of your query reading too many rows from disk, so you may need to change your sortkeys or run a VACUUM. You may also need to adjust your query to filter better...

Redshift is a columnar database, which means that each block holds data for only a single column. The "zone map" is held separately from the block, like an index. The zone map holds only two data points per block, the highest and lowest values in the block. Redshift consults the zone map when executing queries, and excludes the blocks that the zone map indicates won’t be returned by the WHERE clause filter.

You might notice that your ad-hoc queries run perfectly fast in warm cache, but perhaps your Look or Dashboard in Looker is much slower. This is probably because Looker's visualizations can be slow to render in the browser. Fortunately, once you (or anyone else) loads something in Looker, it will remain cached until your caching-policy resets the cache, e.g. once per day after ETL run.

Aside from incorrect queries, the general place to look into why queries are slow in redshift is usually with a distkey/diststyle and second most would be sortkeys. The distyle of EVEN is the default and means "distributed across all the nodes". The default is generally fine because you don't have bottlenecks or 90% disk space and CPU on one node while other nodes are empty. It makes all nodes participate to the best of their ability. However, it can take longer to do joins.

Unfortunately, it's not easy (or even realistic) to know ahead of time which distkeys and sortkeys to add because you have no way of knowing which queries people will plan to write against your tables a year from now. And since you have to add distkeys and sortkeys at table creation time, there are two main things you'll need to do: 1) Make it as easy as possible to change your schema, so you can edit the keys and repopulate the data 2) Use EXPLAIN on the queries you do have, to figure out which columns could benefit from different keys.

When you run JOINs, think in terms of distributed joins. Say you have a query like:

FROM groups JOIN users ON groups.g_id = users.g_id

and EXPLAIN output shows Hash Full Join DS_DIST_BOTH. It means both outer and inner, which redistributes part of the data, allocating different ranges to different nodes. This is usually the default in Redshift if you don't optimize, and it's one of the least optimized! So you will want to add distkeys and sort keys. Also, BCAST (broadcast) means "duplicate the entire table to all the nodes" but isn't seen as often yet some queries need broadcast. Most use DIST (distributed) which is generally better to use diststyle: ALL to broadcast once, rather than every query run.

Why is this so important? Redshift is not designed to run on a single node. It's designed to run in a sharded cluster and it's expected to have very bad numbers within only one node. This is one reason you'll get slow queries. In MPP databases, data is partitioned across multiple servers & nodes with communication across a network rather than disk. The nice thing is that your SQL queries will look as though you're selecting from a single database, this is thanks to using a "distribution key".

Say you commonly JOIN or GROUP BY using the 'state' column of your table. You may choose to distribute on that column. However, if you set the DISTKEY to the state column the distribution won't be as even. In a large table, this amount of distribution "skew" could have an adverse impact on query processing. From the columns used in your queries, choose a column that causes the least amount of skew as the DISTKEY. A column with many distinct values, like a timestamp, would be a good first choice. Avoid columns with few distinct values such as credit card types, error messages or days of the week.

While the general rule of thumb is to add distkeys to the fields you'll be JOIN'ing on the most and to add sortkeys to fields you'll be filtering the most in WHERE clauses, you need to benchmark before/after and really look into the EXPLAIN output to be sure you're not making things worse. There is more to consider than just query speed, such as CPU usage and other things that can cause problems or drive up your AWS bill.

Pick a few important queries you want to optimize your databases for. You can’t optimize your table for all queries, unfortunately. Even though it will rarely be the best performance, a table with no DISTKEY/SORTKEY can be an ok all-round performer. It’s a good option not to define DISTKEY and SORTKEY until you really understand the nature of your data and queries.

That all said, sortkeys are less likely to backfire. If you don't use sortkeys, every query you'd do has to read all the files from disk (like a full table scan) before it can decide if it's something it can use. This sortkeys are comparable to indexes. If you use sortkeys you can more easily reason about and measure the min/max values of each of the 1mb blocks, so you can skip blocks. So when you WHERE to filter by some value, Redshift wouldn't have to read everything to do it. Take:

SELECT COUNT(1) FROM users WHERE state = 'California'

Most of the time of this query is spent in 'Seq Scan on users'.

The Redshift table STL_EXPLAIN gives you actual EXPLAIN output of all your queries. This is very useful because you can audit this list rather than figuring out all the queries and manually put EXPLAINs in front of them. It shows the time cost of query plans. Don't pay too much attention to the number, but you can use it to reason about what is probably the most time-consuming task. The number at the top contains the numbers at the bottom:

Plannode           | cost   | info
Aggregate          | 68718  | 
  Seq Scan on users | 62500  | Filters: state = 'California'

See also the STL_SCAN table. Helps you figure out which columns will contribute the most to optimizing. It helps you see what are the top columns that take a lot of time being scanned and have the greatest potential for improvement. Fortunately, we created some Admin Dashboards above in this post, so that you can peer into all of this more easily.

Sunday, January 3, 2016

Data Pipeline and ETL tasks in Go using Ratchet

As Data Engineers and Software Engineers we might define Data Pipelines differently. This Wikipedia entry defines a pipeline as a set of data processing elements connected in series, where the output of one element is the input of the next one; often executed in parallel. That is pretty close to how I'll define it here. More specifically, I'll use The Golang Blog's informal definition:

A series of stages connected by channels, where each stage is a group of goroutines running the same function. In each stage, the goroutines receive values from upstream via inbound channels; perform some function on that data, usually producing new values; and send values downstream via outbound channels.

Ratchet is a Go library that abstracts these pipeline concepts, giving you a clearcut framework to work with. You will satisfy a data processing interface and be provided with some convenience methods for managing the flow of data.

Perhaps you have been tasked with creating automated pipelines that fetch data from disparate sources -- such as remote databases or web services -- and then cleaning up, transforming and aggregating the unorganized data into specialized databases or datasources (an RDBMS, CSV files, etc.) so your team can use that organized data for analytics and other purposes.

Ratchet is an open-source project by Daily Burn. It provides a better way of extracting data from the databases and web services an application uses, then transforming and loading that data into reporting-oriented formats. It makes doing these tasks more robust than creating basic scripts or using limited 3rd-party services that don't give you full control over your data processing.

Install ratchet by running:

 $ go get

Example project layout:

    ├── main.go             (Main package. Pipeline functions)
    ├── packages            (Your reporting package)
    │   ├── models.go       (JSON structs for transformer.go)
    │   ├── queries.go      (SQL query functions for SQLReaders)
    │   └── transformer.go  (Custom DataProcessors)

Ratchet consists of a Pipeline with a series of PipelineStages, which each handle one or more DataProcessors. DataProcessors each run in their own goroutine so all of the data is processed concurrently. The DataProcessors send and receive JSON for convenience. Ratchet gives you some useful data processors and an interface for you to implement custom ones.

Your Ratchet tasks will likely use a combination of the provided DataProcessors and your own custom ones. If you're creating an ETL task, the custom DataProcessors will usually be for the Transform stage while the Extract and Load stages could use the built-in Reader/Writer DataProcessors such as SQLReader and SQLWriter. But there are times when you will create custom Data Processors even for the extract stages, such as when you're making calls to 3rd party REST Apis, and the like.

There are other DataProcessors provided for your convenience, such as ones for Google's BigQuery and more. You will get to mix-and-match things as they make sense to your application. See the full-list of provided DataProcessors.

You will typically begin your Ratchet code by including the necessary packages. For the sake of example I will assume you are also going to access an RDBMS, such as MySQL, to perform some SQL reads and/or writes on:

 import (
     _ ""

Since an SQLReader takes a *sql.DB as the first parameter, you will want to start by creating one. If you only need a basic SQL string then you may want to use NewSQLReader. Example:

 func UsersQuery(minId int, maxId int) string {
     return fmt.Sprintf(`SELECT id, name FROM users 
         WHERE id >= %v AND id <= %v`, minId, maxId)

We simply created a function that takes arguments, generates some SQL and return a string. We can call it like:

 users := processors.NewSQLReader(someDB, mypkg.UsersQuery(5, 10))

If you need a query to use the JSON results generated from a previous stage, use NewDynamicSQLReader.

When you need to create your own data processors, you implicitly satisfy the DataProcessor interface by implementing its ProcessData and Finish methods. You will create a struct and attach these methods to it. It's up to you to determine how your structure will hold any state you need. Also, because ProcessData deals with receiving JSON from the previous stage and then passing JSON on to the next stage, you will need to create structs for Unmarshaling and Marshaling the JSON.

Tip: Creating structs for all the JSON you will be processing can be tedious so I highly recommend the tool json2go.

The data package includes some types and functions for using JSON with Ratchet. It provides wrapper functions for json.Unmarshal() and json.Marshal() called data.ParseJSON() and data.NewJSON() that provide additional logging. It also creates the type data.JSON as a simple name to a byte slice: type JSON []byte. Example:

 package mypkg

 import ""

 type myTransformer struct{}

 // Expose our DataProcessor for clients to use
 func NewMyTransformer() *myTransformer {
     return &myTransformer{}

 // Consider moving these two structs to models.go
 type ReceivedData struct {
     UserID int `json:"user_id,omitempty"`
 type TransformedData struct {
     UserID         int    `json:"user_id,omitempty"`
     SomeNewField   string `json:"some_new_field"`

 func (t *myTransformer) ProcessData(d data.JSON, 
                                     outputChan chan data.JSON,
                                     killChan chan error) {

     // Step 1: Unmarshal json into slice of ReceivedData structs
     var users []ReceivedData
     var transforms []TransformedData 
     err := data.ParseJSON(d, &users)
     if err != nil {
         killChan <- err

     // Step 2: Loop through slice and transform data
     for _, user := range users {
         transform := TransformedData{}
         transform.UserID = user.UserID;
         transform.SomeNewField = "whatever"
         transforms = append(transforms, transform)

     // Step 3: Marshal transformed data and send to next stage
     dd, err := data.NewJSON(transforms)

     if err != nil {
         killChan <- err
     } else {
         outputChan <- dd

 func (t *myTransformer) Finish(outputChan chan data.JSON,
                               killChan chan error) {}

Notice the idiomatic NewMyTransformer() function that returns a pointer to our DataProcessor's struct with its zero value. Since it starts with a capital letter, it will be used by package users like:

 transform := mypkg.NewMyTransformer()

Finish() is called by Ratchet after a previous stage is done sending its data. You can often implement this as an empty method. Finish is more useful when you want to wait until all the data has been received before doing something with it, and if you're working with more than one input source you may really want this. If that's the case, you will typically use ProcessData() for validating and storing the incoming data into the receiver struct and then doing the second and third steps above inside of Finish() instead. Here is an example of how we could rewrite the above to batch things using a non-empty struct and a pointer reciver, then complete the transformation stage in the Finish():

 type myTransformer struct{
     BatchedUsers []User

 func NewMyTransformer() *myTransformer {
     return &myTransformer{}

 func (t *myTransformer) ProcessData(d data.JSON, 
                                     outputChan chan data.JSON,
                                     killChan chan error) {

     // Step 1: Unmarshal the JSON into a User slice
     var users []User
     err := data.ParseJSON(d, &users)
     if err != nil {
         killChan <- err

     // Step 2: append via pointer receiver
     t.BatchedUsers = append(t.BatchedUsers, users...)

 func (t *myTransformer) Finish(outputChan chan data.JSON,
                               killChan chan error) {

     var transforms []TransformedData

     // Step 3: Loop through slice and transform data
     for _, user := range t.BatchedUsers {
         transform := TransformedData{}
         transform.UserID = user.UserID;
         transform.SomeNewField = "whatever"
         transforms = append(transforms, transform)

     // Step 4: Marshal transformed data and send to next stage
     // Write the results if more than one row/record.
     // You can change the batch size by setting loadDP.BatchSize
     if len(transforms) > 0 {
         dd, err := data.NewJSON(transforms)

         if err != nil {
             killChan <- err
         } else {
             outputChan <- dd

Once you have your data processors setup, you will just need to pass them into a new pipeline for processing. If you just have one reader, one transformer and one loader, you can use a basic 3 stage pipeline using NewPipeline():

 pipeline := ratchet.NewPipeline(extractDP, transformDP, loadDP)
 err := <-pipeline.Run()

If you do not even need a transform stage then you can just have extract and load stages: ratchet.NewPipeline(extractDP, loadDP). This is enough if your extracted SQL fields match and table you will load the data into.

Things aren't always that simple so Ratchet provides a more flexible branching pipeline, via NewPipelineLayout(), that takes a variable amount of NewPipelineStage()'s which in turn take a variable amount of DataProcessors. Inside the pipeline stages you will wrap your data processor instances with the Do() function and pass its returned value to the Outputs() method, following these rules:

  • DataProcessors in a non-final PipelineStage must use Outputs().
  • A DataProcessor must be pointed to by one of the previous Outputs() (except in the first stage).
  • Outputs() must point to a DataProcessor in the next immediate stage.
  • DataProcessors in the final stage must not use Outputs().

Here's how the basic 3 stage pipeline shown above would look as a branching pipeline:

 layout, err := ratchet.NewPipelineLayout(
     ratchet.NewPipelineStage( // Stage 1
     ratchet.NewPipelineStage( // Stage 2
     ratchet.NewPipelineStage( // Stage 3

 pipeline := ratchet.NewBranchingPipeline(layout)
 err = <-pipeline.Run()

Fortunately, you can do a lot more than that with a branching pipeline. Outputs() can take multiple parameters to shove data into and you will often have multiple calls to Do() in the intermediate stages, for handling disparate data.

Sometimes you will want to pass the original object through the various stages in case you need to use it in a later stage. This is very easy to do using NewPassthrough DataProcessor. In the following example, passing passthrough to Outputs() makes it so aDP gets passed to the next stage along with the other values passed to Outputs():

 passthrough := processors.NewPassthrough()


     ratchet.Do(aDP).Outputs(anotherDP, passthrough),

Ratchet provides a logging facility that's very useful for debugging Ratchet tasks. It is often helpful to temporarily place the following line into your ProcessData() implementation so that you can see output from the calls to logger.Debug():

 logger.LogLevel = logger.LevelDebug

Even better is for you to create a way to set the different levels from the CLI. For example: --log-level="debug". And so on for LevelError, LevelInfo, LevelStatus, LevelSilent. I recommend using LevelSilent on production or whenever you just need the job to run faster.

Be sure to read the log output (ratchet_default.log) because it shows the SQL INSERT Data that was actually executed, for example. The logger package provides some other useful logging functions for you to use as well.

Another debugging/development tip is to temporarily write the final stage output to a CSVWriter that goes to standard output. This allows you to quickly prototype a Ratchet task without having to set up or write to your final database table(s) yet. Example:

 // Setup all the DataProcessors
 users := processors.NewSQLReader(inputDB, mypkg.UsersQuery())
 bq := processors.NewDynamicBigQueryReader(bqConfig, mypkg.BQQuery)
 bq.UnflattenResults = true
 transformer := mypkg.NewMyTransformer()
 writeCSV := processors.NewCSVWriter(os.Stdout)

 // Create a new Pipeline using the DataProcessors
 layout, err := ratchet.NewPipelineLayout(

I have covered all of the main concepts. Please see the Ratchet documentation for more information. If you have any questions please post them in the comment section below.

Sunday, May 31, 2015

Full-Site HTTPS Lessons

There may be more to consider than you think before attempting to convert your web site to HTTPS. Even if you already use HTTPS, you might benefit from this information if you aren't sure you set it up properly. I will cover both the obvious and the not-so-obvious things you need to know so you can avoid common pitfalls and plan out your conversion more completely.

Every web site these days should support https:// on every page in order to be more secure and to provide privacy to its users. It can be be very easy or fairly difficult to convert an existing site to HTTPS. If it happens to be difficult for your site, don't be discouraged. It will benefit everybody because your site will be more secure and enjoy a better reputation.

Google has done a great job of explaining why you should use HTTPS on all your pages (HTTPS Everywhere). Man-in-the-Middle aren't just easy if you aren't using HTTPS at all, but also if you're not using it everywhere or properly. For example, encrypting passwords isn't enough if your cookies aren't secure, and your cookies can't be secure if you're not using https everywhere. We don't want our faithful users to have their sessions hijacked while they're browsing our site at a coffee shop.

What I would like to talk about are the lessons that I learned from setting this up at, which was an existing site with millions of users at the time. I'll start by saying that you should not procrastinate on doing this. It is much easier to set up HTTPS everywhere if you start early but it is possible to do it on any website -- it will just take more work.

I recommend reading this entire article once through first, before trying any of it out. Use this information at your own risk since it is just from my experience and your mileage may vary.

The first thing you'll need is an SSL certificate. Find out if you will need a wildcard cert and so forth. Your site without a subdomain is called a "naked domain" (e.g. You may need to ensure your certificates include subdomains to avoid site visitors getting certificate errors if they go to the www. address for your site - or maybe you handle this by redirecting www to the naked domain in your web server config. And while not necessary, you should also decide if you want to use Extended Validation (EV) certificates. Anyway, do your homework before purchasing your certificate.

Once you have your certificates figured out, you start by making every page redirect from http to https. This is usually as simple as adding the appropriate redirect directives to wherever you terminate SSL (NGINX, Apache, etc). In NGINX this is usually as simple as adding  return 301 https://$domain$request_uri;  to your HTTP server block; then in your HTTPS server block you'll add directives to enable ssl, listen on port 443, link to your certificate and key and add any needed headers. Refer to your web server's documentation.

It is a myth that having https on every page is going to be too slow. So please do not try to only add https to a subset of pages, such as just the home page or login page. Doing so will only add complexity to your configurations without making your site secure enough.

In your web server config, you will also specify the protocols to support. You should disable support for SSL v2 and v3. With NGINX, just specify the protocols you want, and leave out what you don't want, otherwise it defaults to SSSv3. So do something like: ssl_protocols TLSv1 TLSv1.1 TLSv1.2;

It may seem confusing that we're using the terms SSL and TLS interchangeably. Just note that you'll want to configure support for TLS encryption, not SSL. See also There IS NO SSL.

We're about half-way there. Unfortunately, simply having every web page be HTTPS isn't enough. You also need to have all your important resources that load on those pages (images and scripts) be https or you aren't really secure yet. Plus if you don't do this, you will see a lot of Mixed-Content warnings in the browser in the console and a warning on the lock icon. So having https:// in the address bar isn't enough. Don't try to cheat because you will not fool attackers nor your users.

You should start by first configuring this all on your dev machine (or at least a staging machine), rather than production. This is because you will likely encounter errors that can take a while to resolve and you don't want to leave your site in a broken state. For example, if it turns out that your site has hundreds of hard-coded http:// URL strings in the database then you'll need first run a migration script to convert these to "https://". (Yes https://, because protocol-relative // urls are now an anti-pattern.)

If your site has hundreds of thousands of users, then running such a migration script could take hours or even days. So you will want to run a migration script first if needed. Once your migration finishes, you will also need to refactor your code to start saving your URLs as https so you don't run into the same problem again. Depending on when you deploy, you may need to run your migration script once more to catch up.

DO NOT skip this step: To prevent session hijacking, you'll want to enable the Secure and HttpOnly attributes on your session cookies. These are security flags that you add to your cookies by appending 'Secure; HttpOnly' to the cookie HTTP response header. The Secure flag makes the cookie only accessible over HTTPS so users can't unknowingly "leak" their cookies by typing http:// to get to your site before it redirects to https://. The HttpOnly flag makes cookies invisible to client-side scripts so cookies can't be stolen by XSS exploits.

To test that your cookies have the Secure and HttpOnly flags, simply check the Set-cookie value in the response from curl'ing your website. In a browser such as Google Chrome, you can also go into the developer tools and inspect your cookies and you should see checkmarks under the Secure and HttpOnly columns.

Your development sandboxes should use https:// since it's best that development machines closely match production. This will help prevent developers from hard-coding http:// url's but not realize it until there are https warnings on production. This means that your developer machines will need to be running NGINX or whatever web server you use in production. Creating a self-signed certificate will make this easy enough. There are additional benefits to this, such as using a hostname like instead of http://localhost:3000.

Lastly, make sure that any middleware and front-end code that does redirects, does so as https. You can test this on the command-line by running:

 $ curl -I -L https://yoursite/whatever 

and checking if any Location: lines show http:// URLs. Assuming that /whatever is a controller that does redirects, you want to be sure that it doesn't do any redirecting back to http:// at any request. I've had to ensure https:// redirects at the WSGI-middleware and at the front-end controller code layers.  

To further test that you set everything up correctly, run your domain through a scanner such as It will give you different grades and your goal is to get an 'A'. For a command-line tool, check your package installer for sslscan (Homebrew, apt-get, etc). Additionally you can check that your site uses strong certificates by using If you get bad scores, make sure you're using SHA-2 certificates and 256-bit TLS ciphers.

Further considerations are making sure you disable support for weak ciphers, using at least SHA-2 (even on your intermediate certs) and looking into SSLStrip and HSTS headers. Also, look into Perfect Forward Secrecy (PFS) so attackers can't use private key to decrypt archived sessions. Finally, make sure your supported browsers support everything you have set up.

You should also start linking to the https:// version of your site -- in emails and so on -- so your servers don't incur any unnecessary HTTP redirects and because you should be proud that you can provide https links to your web site.

So please get started on this as soon as possible. Don't put off the inevitable.

UPDATE: Further reading: Let's Encrypt and QUIC.

Saturday, April 26, 2014

Using Browserify with jQuery, Backbone.js and more.

This article aims to help you understand how to use Browserify in your projects to manage your front-end dependencies, as well as your back-end ones. It assumes you know what Browserify does and that you are confused about exactly how to use in a real project to handle all of your dependencies, particularly with client-side frameworks. I will show examples with jQuery and Backbone.js, but what I explain will apply somewhat to other frameworks and libraries too.

Using Browserify with jQuery

A good first step toward understanding how to use browserify is with jQuery. Delete any script tags that include jquery.js in your html on the client-side. Instead, install the jquery node module (into ./node_modules) and save a reference to it in package.json:

 $ npm install jquery --save

jQuery 2.1.0 is capable of using browserify's module.exports. That is, you can install the npm package called "jquery" instead of the older workarounds such as the "jquery-browserify" npm package.

jQuery will need a window object:

 var $ = require('jquery');

Older jQuery's will need to do this as var $ = require('jquery')(window);or var $ = require('jquery/dist/jquery')(window);

If you omit the relative path it'll resolve to the module in node_modules, allowing us to just require('jquery') rather than specify a path to jquery.

Whenever any of your JavaScript files need a reference to $ or jQuery you simply create them as needed, with the above jquery require line or with var jQuery = require('jquery'); if you use the name jQuery instead of $.

Now you browserify your code:

 $ npm install -g browserify
 $ browserify whatever.js -o bundle.js

You'll then use a script tag to include bundle.js instead of jquery.js. That's it!

Using Browserify with Backbone.js

Install Backbone.js, Underscore.js and jQuery as node modules:

 $ npm install backbone underscore jquery --save

We explicitly npm installed underscore, instead of letting backbone install it, so that we can require('underscore') later when needed. And we used the --save flag to store these versions in package.json because it's a good habit to get into.

If you structured your Backbone code using a single global object, like app, then you'll need to replace all the lines (if any) that say:

 var app = app || {};

with something like:

 var $ = require('jquery');
 var Backbone = require('backbone');
 var _ = require('underscore');
 Backbone.$ = $;

in all of your JavaScript files. Each file may only need certain objects, and so we should only put the ones each module needs. Being explicit about which dependencies are needed is part of what good modularity is all about. For example, if you only need a reference to Backbone in one file then only do:

 var Backbone = require('backbone');

Backbone can run fine without jQuery, but you can explicitly tell Backbone to use jQuery when needed by doing Backbone.$ = $;

If you explicitly need the '_' object because you're invoking underscore directly, then also include the var _ = require('underscore');

Because of the way Browserify encapsulates every module into its own scope, we no longer have to pass things like jQuery into an IIFE to avoid global conflicts, and we can even put 'use strict' at the TOP of our modules. So instead of:

 var jQuery = require('jquery');
 var Backbone = require('backbone');
 Backbone.$ = jQuery;

 (function($) {
     'use strict';

      module.exports = Backbone.View.extend({

we can simply write:

 'use strict';

 var $ = require('jquery');
 var Backbone = require('backbone');
 Backbone.$ = $;

 module.exports = Backbone.View.extend({ ... });

Notice how the above code also needed var Backbone = require('backbone'); because it references Backbone in this file. Everything not in module.exports will stay private to its module. In Node.JS, module.exports is simply the object that gets returned as the result of a require call.

Next delete any script tags that include your user-defined Backbone javascript files and replace with just:

 <script src="js/bundle.js"></script>

Generate the bundle.js file

 $ browserify ./public/js/app.js -o public/js/bundle.js

If we were using the global app object instead of browserify, we'd have to make sure js/views/app.js gets included before js/app.js in the script tag include order. But with browserify we take care of the ordering differently. For example, in public/js/app.js put:

 var $ = require('jquery');
 AppView = require('./views/app');
 // The only thing that should be in a DOMReady
 $(function() {
     new AppView();

Then in public/js/views/app.js put:

 'use strict';

 var Backbone = require('backbone');
 var $ = require('jquery');
 Backbone.$ = $;

 // examples of requiring some user-defined Backbone stuff
 var Todos = require('../collections/todos');
 var TodoView = require('./todo');

 //var app = app || {};

 //app.AppView = Backbone.View.extend({
 module.exports = Backbone.View.extend({

     initialize: function() {
         //this.listenTo(app.Todos, 'add', this.addTodoTask);
         this.listenTo(Todos, 'add', this.addTodoTask)

Notice how I commented out the old app global object stuff and replaced it with Browserify'd code. You basically just need to replace app.whatever with module.exports = whatever. And for every file that used to reference the global app object, you now need to require() that file and store it in a local object. So continuing the example, above '../collections/todo.js' might contain:

 'use strict';

 var Backbone = require('backbone');
 var Todo = require('../models/todo');
 //var app = app || {};

 var TodoList = Backbone.Collection.extend({
     //model: app.Todo,
     model: Todo,


 module.exports = new TodoList();

Again I commented out the global app object to show that's where browserify's stuff goes.

Since a Backbone.js project can become large, or any project for that matter, it's a good idea to install watchify. It takes the same command-line arguments as browserify but will automatically update the bundle.js file whenever you make changes, and it will update it faster since it only updates the changes.

Additionally, using the power of Browserify transforms, you can integrate Bower packages and more. This is useful if you need a dependency that uses CSS or something that Browserify isn't yet suited for. Perhaps I'll write more about this next time.

Monday, March 17, 2014

Headless Dropbox

You can install Dropbox on a Linux server or any Linux system that doesn't have an external monitor or a GUI Windowing System.

I recently installed Dropbox on my Digital Ocean Droplet, running Ubuntu 12.10 server. I did not figure this all out on my own. I read the sources mentioned at the bottom of this post. However, some of the information was incomplete or contained broken links, and I ran into other issues. So this blog post is more of about how I got it working in hopes that you won't encounter the same issues I did.

First you will need to link your Linux machine with your Dropbox account by downloading and running a Dropbox server script called dropboxd:
 $ cd  # changes to your $HOME directory
 $ wget -O - "" | tar xzf -
 $ ~/.dropbox-dist/dropboxd

 Use the above command if you're running "uname -m" says i686.
 $ cd
 $ wget -O - "" | tar xzf -
 $ ~/.dropbox-dist/dropboxd

 Use the above command if "uname -m" says x86_64.
The running dropboxd command should output a link for you to paste into a browser to link that system. I had issues with this at first and just had to keep rerunning the command over time and leaving it running until it finally started saying something. This was probably due to connection issues that you hopefully won't have. You can then verify that it's linked by logging into your dropbox web account and going under the "Security" tab to see linked devices.

The link uses a token so you don't have to store your Dropbox password on the Linux instance.

Once Dropbox is linked you should see your ~/Dropbox directory starting to sync. At this point, you can temporarily stop the syncing by hitting ^C (Control-C) to stop dropboxd. Don't worry we'll finish syncing in a minute.

Let's download a nice python script to help us manage Dropbox better:
 $ wget -O ~/bin/ ""
 $ chmod 755 ~/bin/
and download a bash script that we'll use to start and stop Dropbox:
 $ wget -O ~/bin/dropbox_temp "\
 $ chmod u+x ~/bin/dropbox_temp
Edit ~/bin/dropbox_temp in Vim (or whatever editor you like) and change DROPBOX_USERS="user1 user2" to whatever user(s) on your system you want to run dropbox clients for.

Now move dropbox_temp script:
 $ sudo mv ~/bin/dropbox_temp /etc/init.d/dropbox
and have it start when we boot:
 $ sudo update-rc.d dropbox defaults
Close the dropboxd script if you still have it running manually: ^C.

Check if dropbox is running, if it's not then start it:
 $ sudo service dropbox status
 $ sudo service dropbox start
If it's is running then run:
 $ status
If the status command says Dropbox is now downloading or syncing a ton of files, you might want to remove some directories from being sync'd to this linux machine by using Dropbox's Selective Sync.

Selective Sync allows you to exclude certain folders that you don't want to sync to your Linux server, e.g., maybe you don't need all your photos on it:
 $ cd ~/Dropbox
 $ ls
 Photos Projects Public iphone-photos
Let's exclude the photos and Public directories:
 $ exclude add Photos
 $ exclude add iphone-photos
 $ exclude add Public
 $ ls
At this point, I would wait until status says "Up to date" before editing files in your Dropbox folder. Just to make sure nothing gets corrupted. Once Dropbox has finished downloading everything, you should try to sync files to AND from Dropbox. Make sure to test both directions and also test nested directories and your most important directories. I had an issue once where files in ~/Dropbox/Projects/notes/ wouldn't sync outgoing, yet ~/Dropbox/Projects/code/ files would. If it doesn't, you'll need to reinstall Dropbox (instructions are below). It shouldn't be a firewall issue because the docs specify that Dropbox runs on port 80.

Alright, if you made this far and everything seems working, the great!

You may also want to delete any cached deleted files from the exclude step (NOTE: don't even try this unless you need it and at least until you're sure dropbox is already working):
 $ sudo service dropbox stop
 $ rm -rf ~/Dropbox/.dropbox.cache/*
 $ sudo service dropbox start
Don't delete the .dropbox.cache dir itself, just its contents.

If you ever need to delete the dropbox instance:
  • ^C to stop the dropboxd daemon
  • rm -rf ~/.dropbox* ~/Dropbox
  • Revoke the token that was issued for your session by clicking 'unlink' next to your Droplet's host name at
It's important that you actually delete the ~/.drobox* folders, because your old key is stored there, and you have to generate a new one next time you run the original wget commands. So don't think of keeping it around as a shortcut to reinstalling.


Sunday, March 2, 2014

The Path to Homebrew

The home page and documentation for Homebrew show how to install and use Homebrew. However, they currently don't seem to explain exactly how pathing works. This can trip up a lot of newcomers, who might give up on Homebrew or fumble around the internet and land on bad advice - such as using sudo and editing /etc/paths. All of this is unnecessary and potentially dangerous.

You just really need to understand a few basic concepts:
  • Never run brew as sudo. Not "sudo brew install" nor "sudo brew link".
  • The "Cellar" is a place that all your "kegs" go. Homebrew installs packages to their own directory (in the Cellar) and then symlinks their files into /usr/local/.
  • Change /usr/local/* to be owned by $USER, not root, so you can have write permissions and not need sudo.
  • The $PATH entry for /usr/local/bin should occur before /usr/bin.
Here's an example of installing Python and setting paths correctly. I'm using OS X 10.9.2 (Mavericks) and Homebrew 0.9.5:
 $ sudo chown -R $USER /usr/local/*
 $ brew doctor
 $ brew update
 $ brew install python --with-brewed-openssl
 $ ls /usr/local/Cellar/python
 $ python
 >>> 2.7.5
Wait, I expected to see python 2.7.6 now. What happened?
 $ which python
But the Homebrew docs said we will be using a symlink from /usr/local/bin/ that points at the Cellar instead of using /usr/bin:
 $ ls -l /usr/local/bin/python
 lrwxr-xr-x 1 rkulla admin 33 Mar 2 06:37 /usr/local/bin/python ->
Aha. So it must be a PATH issue:
 $ echo PATH

 $ cat /etc/paths
When using Homebrew, we actually now want to change the position of /usr/local/bin to be before /usr/bin in $PATH. But don't edit /etc/paths. Instead edit ~/.bashrc and prepend /usr/local/bin to $PATH, like:
Next, run:
 $ source ~/.bashrc
 $ echo $PATH
 $ /usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin
Don't worry that there's duplicate entries for /usr/local/bin. It doesn't matter. What matters is that you've modified your PATH safely and ultimately cleaner than the other ways.
 $ ls -l $(which python)
 lrwxr-xr-x 1 rkulla admin 33 Mar 2 06:37 /usr/local/bin/python ->
Yay! And if you ever want to use the old python you can just run: /usr/bin/python.

Best of all from now on whenever you need to install anything with brew, you can just run:
 $ brew install whatever
and not have to fumble with permissions or worry about overwriting any global system files.

I also like to source ~/.bashrc from ~/.bash_profile. For why see: .bash_profile vs .bashrc

Monday, January 2, 2012

The Wallpaper Algorithm Using HTML5 and JavaScript

There's a fun little book called The New Turing Omnibus - 66 Excursions in Computer Science from 1993. An omnibus is "a volume containing several novels or other items previously published separately". The first item in this book is the following algorithm:

1. input corna, cornb
2. input side
3. for i ← 1 to 100
     1. for j ← 1 to 100
       x ← corna + i x side/100
       y ← cornb + j x side/100
       c ← int (x² + y²)
       if c even
         then plot(i, j)

All that is needed to translate this algorithm to a running computer program is to have a Turing-complete programming language and a way to plot pixels on a display. So if you want to try it out for yourself first before seeing one solution, stop reading now and try it. I'm going to demonstrating how I implemented this algorithm by using JavaScript and HTML5's canvas element. The book only shows one implementation using Pascal so I thought I'd do something a little more modern.

We'll be creating two files. The first one will be the html5 content:
<html lang="en">
  <meta charset="UTF-8">

  <button id="myButton">Show Wallpaper</button>

  <canvas id="myCanvas">
  Your browser does not support the canvas element.

  <script src="wallpaper-algorithm.js"></script>

Next create a file called wallpaper-algorithm.js with:
document.getElementById("myButton").onclick = function () {'none';

function wallpaper() {

  var browserWidth = window.innerWidth;
  var browserHeight = window.innerHeight;
  var canvas = document.getElementById("myCanvas");
  var ctx = canvas.getContext("2d");
  var corna = 0;
  var cornb = 0;
  var side = 345;
  var xmul;

  canvas.setAttribute('width', browserWidth);
  canvas.setAttribute('height', browserHeight); = 'black';
  for (var i = 1; i <= browserWidth; i++) {
    x = corna + i * side / 100;
    xmul = x * x;

    for (var j = 1; j <= browserHeight; j++) {
      y = cornb + j * side / 100;

      if (i % 2 == 0) {
        ctx.fillStyle = 'white';
      } else if (j % 2 == 0) {
        ctx.fillStyle = 'lime';
      } else {
        ctx.fillStyle = 'purple';

      if (Math.round(xmul + y * y) % 2 == 0) {
          ctx.fillRect(i, j, 1, 1);



As an exercise, the book says to write the algorithm using three colors instead of two. That was easy enough so above I had it use four colors (black, white, lime and purple).

The book also says to make corna, cornb and side as user inputs, but for the sake of example, I hard-coded them. The variables corna and cornb represent the lower left-hand corner of the square of interest. The variable side is the length of the square's sides. Try side values of 5, 25, 33, 41, 45, 51 and 325 for some neat patterns.

Feel free to experiment with the algorithm and have fun with it. All-in-all it's a great example of what can be done with a simple algorithm and screen graphics. It's also a great demonstration of HTML5's canvas element coupled with JavaScript.

For exact details on how the algorithm works, buy the book. I think Google Books has a free version online as well.

About Me