The Node Version Dilemma

Whether you have been using node for years or are just starting, you are probably trying to figure out which version and distribution to use moving forward. The official Joyent distribution has two versions, 0.10 and 0.12, and the new community effort io.js has an active, almost weekly release schedule.

0.10 is the Current Safe Bet

If your production is running 0.10, keep it for now. If you are using node in production with a version older than 0.10, you should upgrade to the latest 0.10. Version 0.10 is by far the most stable, reliable, and well-understood release available today. We have been running 0.10 under heavy production load with extreme spikes for over 2 years. We started developing our production stack on 0.10.0 but it took until 0.10.21 for it to be production ready.

If you are not yet running node in production but are planning to go live in the next three months, use 0.10 for now. Upgrading to other versions later is going to be pretty simple (especially if you use a well maintained framework and published modules). It is unlikely that 0.12 or io.js are going to be production ready within that time-frame, and moving from 0.12 or io.js back to 0.10 is going to be painful (if at all possible).

If you are starting now, or if you are not planning on a significant load anytime soon (e.g. next three months), go ahead and start with the latest io.js. However, make sure to fully appreciate the associated risks. I consider building with io.js today similar to betting on node in the 0.2-0.4 days. It was clearly the future, but it was also experimental and unstable.

0.12 is DOA

The long awaited 0.12 release came about a year after it was originally expected. It represents significant improvements and is clearly the foundation of a future stable node release. However, given our experience with both 0.8 and 0.10, 0.12 will take about six months of active usage and development to reach a level of stability similar to that of the current 0.10 release.

I have already started seeing issues reported with 0.12 (and io.js which shares much of the code). Some of these issues are complex and involve changes in timing and event emitters that only appear in specific edge cases (these are not necessarily bugs, just very fine breaking changes). However, it is these edge cases that you should be most worried about when going to production since statistically, the larger your scale the higher the risk of hitting them.

I am skeptical about 0.12 reaching the required levels of stability given the resources available to work on it today (with the majority of core contributors focused on io.js). If 0.12 has a chance of moving forward, it is only once the foundation work is done and ready to support it, or via a merge with the io.js distribution. Either way, it is not a move worth making at this time.

io.js and the Future

There is little doubt that io.js represents the future of node. If you look at the work, people, and culture around it, it is pretty obviously heading in the right direction. I expect there to be only one prominent version of node within 6 to 9 months, either by making the official distribution obsolete or by merging with it. The question is when to make the switch.

The problem is that for most people, keeping track of what is going on with io.js right now is practically impossible. There is just so much activity going on. I would not call it noise because it is clearly well thought out, well executed, and well communicated. Given that io.js was born out of the lack of progress on the official distribution, I am certainly not advocating slowing down or artificially blocking progress. What I am looking for is the organic maturity that is evident when a project naturally slows itself down. And that can take a while.

Over the next few weeks and months, companies will come out and share their io.js deployment stories. What is important to remember is that the mere fact that someone deployed io.js in production doesn’t really mean it is ready. The more meaningful story is when these companies share their experiences 3 months later, 6 months later, and a year later. When Walmart deployed 0.10 to production, it was extremely unstable, but the risk was very low given the overall architecture and mitigation available. It would have been a mistake to use that initial deployment announcement as an indication that your environment is equally risk tolerant and ready.

If you are in a position to take a measured risk and put io.js through the load of a production environment, you certainly should. We all depend on those early adopters. But when sharing that information with the community, make sure to provide the full picture, the technical details, and the reasons why you felt it was a low risk decision.

Why I Do Not Support a Node Foundation

I’ve been aware of the node foundation plans for a while. I was part of the initial discussion group with Joyent back in May, was part of the node technical advisory board (for a bit), and had extensive discussions about this with pretty much every major player in the community. I have opted to keep my opinions offline until now because I didn’t want my (strongly held) positions to become the “opposition” and add more friction to what was already a pretty messy process. But now that the decisions have been made by both the io.js folks (to fork) and Joyent (to form a foundation), I am free to rant publicly.

For the sake of full disclosure, I am generally opposed to any foundation.

This comes from extensive first-hand experience with participating in and forming similar foundations. I was an active participant in the early OpenID Foundation, I represented Yahoo in the formation of the OpenSocial foundation and wrote the intellectual property and working group process documents, I was a founding member and first president of the Open Web Foundation, and I had extensive engagement with the W3C organization. These experiences taught me that foundations are an unnecessary evil.

My main problem with foundations is that as soon as money is involved, the organization takes on a life of its own, and the mechanism will do anything to sustain itself. The first words out of IBM’s Todd Moore’s mouth about the foundation were that the next step is going to be to hire an executive director. I don’t know if Mr. Moore was expressing the position of the foundation or his own agenda but this is exactly the kind of misguided attitude that dooms such efforts.

Consider this – once you hire people to work for the foundation, these people’s livelihood depends on the foundation’s financial stability. This means they spend a significant amount of time raising funds and ensuring their paying members are happy and getting value out of it. No matter how much you try to balance the needs of the community the foundation was allegedly created to serve, it inevitably becomes a voice for its moneybags.

This is not to say all foundations are evil or unnecessary.

There are many examples of foundations that add value and support their communities effectively. The important distinction is what triggers the creation of the foundation and who the main players behind it are. In the node foundation case, the triggers were lack of technical progress on node and some concerns about ownership of the node trademarks. Both of these issues could have been quickly resolved without a foundation.

The great thing about the io.js effort is that it grew out of strong frustration with specific shortcomings in the Joyent process. Namely, the governance model, the lack of code of conduct, the sharp drop in contributions, and a release process that was predictable in its unpredictability. These concerns triggered the initial discussions about a foundation but when it came time to actually address them, the io.js community realized that all they needed to do was to simply focus on fixing things. They didn’t set up a foundation, raise money, or register marks. They simply created the work space needed to get shit done.

On the trademark side, Joyent has long claimed that their work to protect the node marks provided an important service to the community and to node. I disagree. I think trademarks should only be used to protect business interests, not to put someone in a benevolent position to decide what’s in the best interest of a community, especially one as diverse as node. I don’t think a node trademark adds any value. What exactly do we need protection from?

(As an aside, I’d like to point out that my disagreement with Joyent on the trademark policy and foundation plans does not take away from my gratitude and appreciation for everything they have done for node and the huge impact their support had on its success so far).

Node is a subset of the JavaScript community which is flourishing without any active trademark protection. Can you imagine what would have happened to the language and innovation (especially the recent work) if the Oracle corporation who owns the mark for JavaScript dictated to people what is a certified version of the language? I think there is value in someone registering important marks (and then not defending them) only so no one else can do it for evil purposes. Joyent owning the marks and not protecting them would be the ideal. While Oracle owns the JavaScript mark, I cannot find any record of that mark being used or enforced.

Now, I can understand why IBM, a company that never turned down an opportunity to exploit and make a buck, wants a foundation. It’s how they manage their relationships with communities to promote their business agenda. I am not calling them evil – just that as a large corporation with a lot of history, I am confident their best interest is not aligned with mine. I would like to note that I am no hater of corporations – I’ve happily worked for Citi, Yahoo, and Walmart to name a few.

What I cannot understand is why the community should want a foundation. What will a foundation provide that we are not already doing a fantastic job at? I am hearing foundation supporters talk about events, sponsorship, marketing, and training. Sounds like a lot of people excited about potential funds flowing their way.

Node and JavaScript events are doing amazing with nothing but grassroots efforts all around the world. Companies are eager to sponsor node development by hiring full and part time developers to work on node and io.js. Node is one of the fastest growing technologies without anyone hiring an ad agency or paying for marketing. And between the free node schools effort and the paid offering of many node companies, along with a growing selection of books, training is well taken care of.

When my employer was approached to be a founding member of the foundation, I recommended they pass on the grounds that it adds no value to them. Walmart already employs two node core developers, along with almost 100 developers who use node on a daily basis and contribute significantly to open source. Walmart has also been a top sponsor of NodeConf for the last few years. Since they are not going to double their support, should Walmart direct all these funds to the foundation instead? How would that increase their influence and improve an already fantastic community?

(I do not speak for, or necessarily represent the position of the Walmart corporation).

The only real argument made so far in support of a foundation is the issue of controlling the trademarks. It could have been easily resolved by Joyent releasing them to the public domain and allowing the community and the market to sort things out. I can tell you for a fact that my employer would not have had any problems dealing with the “ensuing confusion and chaos”. Every other platform has multiple flavors competing, some open source and some commercial. What makes node so special it needs trademark protection?

Since the node foundation is a foregone conclusion, we’ll just wait and see what value it adds. Meanwhile, we should stay alert to make sure the sponsoring corporations are not fucking node up.

Got comments? I’m @eranhammer.

Notes on Managing Remote Teams

The node.js services team we built at Walmart received a lot of attention for our open source contributions and for pushing node forward in the enterprise. What gets little attention is the success we had in building and managing a distributed team.

The following notes are based on over six years of firsthand experience with remote teams during which I’ve spent 3 years as a remote employee and over 3 years building and managing a remote team. Since these notes are based on my personal experience alone, they do not assert any industry-wide conclusions on the effectiveness of remote work and remote teams. However, I think that these notes will help guide you in deciding whether remote work is suitable for your needs and culture, and how to be successful at it.

The Walmart Mobile Services team included between 2 and 20 people over three and a half years, both remote and office-based. This is what I learned.

Misguided perceptions

Many people, regardless of their actual experience with remote teams, have pretty strong opinions about their suitability and success. Many companies I talked to raised largely unfounded concerns about remote workers that are based on anecdotal negative experiences. For example, I was told of a bad experience with one designer who failed to produce results when working from home and of an engineering manager who tried to manage a local team remotely.

The first thing to remember is that in any engineering team, about 10% will be underperformers regardless of their work location. Unless your experience is based on a sufficient number of people and spans a few years, it would be a mistake to reach any kind of conclusion. The second thing to remember is that hybrid teams – teams with both remote and local employees – rarely work out and require constant care. If your negative perception of remote teams falls into one of these two cases, you should reconsider.

Hybrid teams

Most remote employees are part of a hybrid team where some members work remotely and some are part of an office. This doesn’t work. In most cases, the ratio is heavily skewed towards the office group. The problem with managing hybrid teams is the inherent difficulty in enforcing remote culture within a common physical space. It is challenging to force people to use online tools to communicate with peers who are sitting right next to them.

What usually happens is that when something goes wrong, the manager will walk over to someone in the office and discuss the problem. That discussion will grow to include more local people but will completely exclude the remote folks. Not only will this alienate the remote members, it will eliminate their ability to contribute to solving the problem, add value, and participate until eventually they will be considered poorly performing employees.

The same problem applies to remote managers. Senior managers will often bypass the team’s remote manager and walk directly over to a local team member. When I first joined Walmart and managed a team where I was the only remote team member, I always found out about outages and problems a day or two later. I wasn’t given a chance to deal with issues because when something broke, everyone huddled locally and my lack of presence created the impression of being absent instead of simply working remotely.

Over time, I have found hybrid teams to be too difficult to manage. No matter how much we pushed to get everyone communicating online, regardless of their location, the office people always defaulted to getting up and walking over to their local peers instead of using the online communication tools. Upper management was never able to control their habit of walking over to the first available developer to look into issues for them. The only solution was to force everyone to be remote at least part of the time so they will change their habits and develop some empathy for their remote peers.

Productivity and cost

Our experience hiring about 20 node developers over the last 2 years showed that by building a remote team, we were able to hire better talent at lower cost compared to hiring the same team locally in the Bay Area:

  • Constant access to talent – we receive about 5 unsolicited employment requests from qualified developers a month (that’s a lot for a team of 20, and at Walmart). The candidates we interviewed already wanted to join the team, and with the exception of two people, we didn’t need to revisit an offer or lose someone to a competing offer. Our candidates mostly arrived from areas with either weak technology presence or limited options (jobs, industries, or technologies). Everyone was really eager to join the team.
  • Lower hiring costs – combining strong community outreach with remote positions produced a constant stream of candidates and removed the need to pay recruiters to source resumes. The quality of the people we were able to hire has been above industry average, at an annual payroll cost about 20% lower than local wages. I don’t have office space cost figures but that adds up to significant savings during growth periods.
  • 100% retention – over three years, not a single team member left. While this is certainly largely due to a great work environment and competitive pay, it is also due to a certain lock-in for remote employees, coming from markets with low availability of jobs or getting spoiled by the many benefits of working remotely and the limited availability of remote positions.
  • Extended coverage – our team is spread out over 4 time zones which means a normal 8-hour day is stretched to 12 hours. Add to that the flexible schedule options and the different work habits of people (morning people vs night people) and we have about 18 hours a day of team availability without asking people to work late or take night shifts.
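
The arithmetic behind the extended coverage point can be sketched in a few lines of Python. This is only an illustration: the four UTC offsets and the 9-to-5 local working day are assumptions picked for the example, not the team's actual locations or schedules, and `coverage_hours` is a hypothetical helper name.

```python
def coverage_hours(utc_offsets, start_local=9, end_local=17):
    """Total hours per day during which at least one person is working.

    utc_offsets: one UTC offset (in hours) per team location (assumed values).
    start_local/end_local: the assumed local working day, 9:00 to 17:00.
    """
    # Convert each local working day to an interval on a shared UTC axis.
    # End values above 24 simply spill into the next UTC day.
    intervals = sorted((start_local - off, end_local - off) for off in utc_offsets)

    # Merge overlapping intervals and add up the merged lengths.
    total = 0
    cur_start, cur_end = intervals[0]
    for start, end in intervals[1:]:
        if start <= cur_end:
            cur_end = max(cur_end, end)
        else:
            total += cur_end - cur_start
            cur_start, cur_end = start, end
    total += cur_end - cur_start

    # A day only has 24 hours, however widely the team is spread.
    return min(total, 24)


# Four hypothetical locations whose offsets span four hours.
print(coverage_hours([-4, -5, -6, -8]))
```

With offsets that span four hours, an 8-hour day becomes 12 hours of combined coverage. Flexible schedules would widen each person's interval further, which is where the larger availability figure comes from.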

Maintaining team cohesion

Remote teams lack the social glue that an office provides. This can make work very difficult during stressful times and especially for new hires. For a long time we had very little team cohesion. Most of the interactions were between individuals and me (as manager). This became a concern when we started growing the team. Here is what worked well for us:

  • Encourage everyone on the team to join a team nonsense channel. You can call it “random” or “general” or “nonsense” but the goal is to have a place where people can post jokes, silly pictures, abuse a bunch of chatroom bots, etc. It is critical that people interact with each other in a casual, non-work-related manner as much as they have serious conversations about work. An office provides that via water cooler chats and lunch breaks so for remote teams, we need to find other venues.
  • Organize a few face-to-face meetings for small groups. This should happen naturally based on business needs where it is helpful to fly in a few members of the team for a couple of days. By having a subset of the team meet, people can form strong one-on-one connections that are harder in a larger group. It also removes the need to organize a large offsite with a lot of preparations and content.
  • Set up an annual team offsite. And by offsite I don’t mean renting a hotel conference room and having everyone give a talk. I mean NodeConf. Find a community event that is not the office, that takes a few days, and that provides plenty of off time for people to hang out and chat. For the last 2 years, we flew the entire team out to CA for NodeConf. Having a non-work context with other people makes everyone more relaxed and against that backdrop, the team constantly found their way to hang out together. The presence of others made it easier not to always obsess about work.
  • We spend about $5000/year/person on travel costs which isn’t significantly higher than travel cost at top companies for Bay Area employees attending conferences and other work travel.

The personal impact

It’s hard to overstate the quality of life remote work provides. For most commuters, it eliminates anywhere between one and three hours of being in the car or train. Those extra hours mean you can spend time with your family (having dinner with your kids every day is amazing) or develop a crazy hobby (like running a zoo, which was probably going too far). The flexibility also makes travel easier and allows you to take “days off” without actually missing work because the tools and expectations remove the need to be in one place all the time.

There is some downside. If you look for remote work because your area lacks opportunities, moving away from that job will be more difficult. The need to relocate in order to leave a remote position isn’t ideal (and many employers won’t pay for it). While you will face this at some point regardless, it might be easier to move to a rich job market first, and then find a job. The longer you stay at a good remote position, the harder it is to pack up and move.

I heard a lot about the work discipline and strong ethics required to do remote work because of the temptation to sit home and play video games all day instead. I have not found this to be an issue for me or anyone on the team. I also don’t think being remote makes a difference. If you are lazy, you are going to find ways to surf the web all day even if you are in an office.

And last, I hear leaving the house and seeing others is something people like to do. It could be challenging for people to be physically disconnected from others. I find meetups, lunch with friends, and conferences to be a good way to balance out the more insulated work environment at home. If you need the social energy of being with other people, and working out of a local co-working space isn’t available or isn’t for you, remote work might be a challenge.

A checklist

  • A team can be all remote or all local:
    • Hybrid doesn’t work.
    • Remote doesn’t mean no office – you can still have a place for people to come in when they want to but they must use remote tools all the time.
    • You don’t need to convert your entire organization to remote, but you have to do it in entire teams.
  • Don’t do remote to save money, do it to get better, happier talent:
    • Use payroll cost savings to cover team travel and events.
  • Leverage geographic diversity:
    • Spread over time zones for extended support.
    • Reach out to small tech communities for the best talent.
    • Diversify your team with access to people who don’t want to be another screw in the Bay Area tech machine.
  • Use the right tools:
    • Pick communication tools that work for your team and empower remote culture.
    • GitHub and Slack work really well.
  • Give it time:
    • Building remote culture that works well can take a few months (to a year). Make sure to allow it to mature organically.

Got questions? I’m @eranhammer.

Before the Drama

I am going to comment on the recent node fork. Soon. I am not happy about it. I also don’t think it’s bad. I’ve been involved in the conversations with most sides since May and, being (probably) the only “guy in the middle”, I think I can provide a perspective that is more complete than most. However, before I do that I would like to defuse the drama.

Given my position at Walmart and the fact that I knew for half a year that a fork was highly likely, you can imagine I had a few internal conversations about node and its future inside and outside of Walmart. A large(st) enterprise has to ensure its investments are durable and sound. I shared the situation with my senior management and the message I delivered to them is the same one I am going to deliver to you now.

If there was no new release of node ever again, I would still use and recommend it. I understand people’s desire for faster releases and quicker availability of new JavaScript features but I consider these to be “rich people’s problems”. I spend most of my time writing and managing node development and I feel empowered and productive with the platform I have today.

Can things get better? Absolutely! But this concept of an evolving language and platform is pretty new. I never imagined new features while working for a decade on Wall St. building high frequency trading systems in C++. The language barely changed (remember when template support matured in 1998?). New features were mostly better optimizers and IDEs but not really the language or the platform. I am not being dismissive of progress, but I want to make sure people understand that the node we have today is pretty fucking awesome.

If you are in a decision making position and the recent events make you reconsider adopting node, don’t. Do it – you will not regret it. The current version of node is already fantastic. Again, it can get better, but after two Black Friday events running on this version of node at the biggest eCommerce scale (we did kick some ass this year against major competition) I can tell you without any hesitation that node is production ready today. Cross that. A year ago.

I also want to point out to all the delicate, sensitive souls out there who keep complaining about “all the drama” and “why can’t we all just get along” that the node community drama is amateur hour compared to other platforms. We don’t have lawsuits for hundreds of millions of dollars like Java. We don’t have key members of the project writing pages and pages of nasty blog posts calling the entire platform shit like Rails. We don’t have insane multinational standard bodies debating features of the platform over 10 years like C++. And we don’t start every mailing list response calling the new guy asking the question a fucking asshole like PHP.

I am not dismissing the importance of what is going on, but these events and the way they have evolved show tremendous maturity and civility that I have not seen in other communities (and unlike most of the brilliant commentators on Hacker News, I have been writing code since 1983). All this drama is a healthy debate about the future of our platform and community and the way it has been handled is something to be proud of.

I am completely behind node. It might be called something else in the future, and there is probably going to be more than one server-side JavaScript platform (which is a good thing), but the foundation of running node-style code to build powerful server solutions is not going away. It is the future of the web.

Wide Open (or, Are You In?)

Earlier this year I confronted the painful realization that my baby framework grew into a mature ecosystem – one I no longer had the capacity to maintain on my own. It started with dragging open issues for more than a few days, to a growing pile of sticky notes on my monitor of ideas I’d like to try, to (and most problematic) no longer remembering how big chunks of the code work.

The problem is, how to successfully move from a one-man-show to a community driven project, without giving up on the stability, consistency, and philosophy of the framework.


I believe the only practical model for running a successful open source project is the Consensus-Dictator-Fork (CDF) model. It’s a fancy name for how most open source projects work. Decisions are made by consensus whenever possible. This usually covers 95% of the decisions by the simple mechanism of proposing a change and asking for strong objections. When strong objections are raised and consensus does not emerge, the project BDFL (benevolent dictator for life) makes the final call. If enough people object to that decision, they fork the project and create their own. It is a naturally self-balancing system.

CDF doesn’t solve the problem, but it provides the first building block.

Federation of Modules

The second building block is npm, the package management service. npm makes it trivial to break a large system into smaller modules, each with its own owners and publishing schedule. Instead of applying CDF to the entire ecosystem, I realized I can apply it individually to each module the framework consists of. Each module author has the autonomy to drive their module forward, allowing its dependents to vote with their feet (i.e. package.json files) and switch to another module if they don’t like the changes and fail to influence through consensus.

The smaller the surface area of the hapi module is, the more time I have to focus on the problems I care most about, and the more power is being delegated to other people to drive the many different areas of the framework forward. This is not just about which utilities we use, but about moving core functionality out of the main module. It redefines “core” from the top-level module people require, to the collection of modules used.

The Right People

Before splitting the hapi module into many smaller pieces, I had to identify who is going to take over these new modules. What profile am I looking for? After all, big chunks require more experience while smaller chunks increase the need to communicate. The answer to that was simple, but non-intuitive: everyone who is interested in contributing.

The reason this is non-intuitive is that we are stuck thinking about open source contributions and ownership in terms of merit and meritocracy. These are extremely unhelpful terms and for too many people they translate to “the rich keep getting richer”. Open source meritocracy is supposed to be about letting proven people lead, but in practice it creates a chicken-and-egg problem that makes it virtually impossible for new leaders to emerge.

I wanted to make participation and leadership within the hapi.js community truly open to everyone who is committed to making a contribution. This requires two things: that no one feels excluded or unwelcome (for any reason), and that everyone feels they are good enough to step up.

Rock Stars

Leading a successful open source project often translates to increased professional success, which in turn translates to more money, influence, freedom, and happiness. It is far easier to find a great engineering job with a famous open source project on your resume. Having a project on npm with 100K daily downloads can open doors 10 years of experience sometimes can’t. Numbers might not get you the job but they will get you through the door.

The best way to make open source sustainable is to make it rewarding. By extending the invitation to lead to others, we create opportunity and rewards that are otherwise very hard to achieve. It is far easier to take over an open source project with existing user base and traction than to start from scratch and battle for attention. The long tail is a lonely place.

Leap of Courage

So how do we convince people to make that leap? To take over a popular open source module and lead it, in public, where every mistake is visible and often amplified? How do we make them feel comfortable to try new things and take risks? And how do we trust them not to screw up? Simple – by having their back.

npm shrinkwrap is not just a command line tool. It is an instrument of social change. By locking down the proven, trusted, stable versions of the framework dependencies, module leads can make mistakes without devastating consequences. By putting safeguards in place to prevent instability, we can hand over core building blocks of the framework to people without demanding “proven track record” or “merit”. We no longer need to be exclusive in who we invite to join our round table, and if it really doesn’t work out, CDF to the rescue.
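
To make the safeguard concrete, here is a rough Python sketch of the difference between resolving a dependency from a version range and resolving it from a locked version. It is a deliberately simplified model and not npm's actual resolver: the caret rule below ignores 0.x and prerelease versions, and the version list, the bad 1.3.0 release, and the function names are all hypothetical.

```python
def parse(version):
    """Turn '1.2.3' into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies_caret(version, base):
    """Simplified caret range: same major version, and not older than base."""
    v, b = parse(version), parse(base)
    return v[0] == b[0] and v >= b


def resolve(published, base, locked=None):
    """Pick the version an install would use.

    Without a lock, take the newest published version inside the range.
    With a lock (the role the shrinkwrap file plays), take exactly that one.
    """
    if locked is not None:
        return locked
    candidates = [v for v in published if satisfies_caret(v, base)]
    return max(candidates, key=parse)


# Hypothetical module history: 1.3.0 is a new lead's release with a mistake in it.
published = ["1.2.0", "1.2.1", "1.3.0", "2.0.0"]

print(resolve(published, "1.2.0"))                  # follows the range
print(resolve(published, "1.2.0", locked="1.2.1"))  # stays on the proven version
```

In this model the unlocked install picks up 1.3.0 as soon as it is published, while the locked install keeps running 1.2.1 until someone deliberately updates the lock. That gap is what gives a new module lead room to make a mistake and fix it.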

Support System

Putting risk controls in place is half the solution. We can’t throw people into the deep end with a lifesaver and expect them to get better at swimming. It is not enough to protect the ecosystem. We have to nurture and grow it. The other half of the solution is the new hapi.js mentoring program. By assigning new contributors and leads an experienced, one-on-one mentor, we increase their confidence and skill, and create a mutually rewarding environment where contributions get exponentially better.

The mentorship program is brand new, but it was based on extensive research and conversations with people to ensure a welcoming, safe, and diverse environment.

Making Room

With everything in place, I spent the last two months smashing hapi into about 20 new modules, all looking for a new lead maintainer. With version 7.1.0 we are now all set to embark on this new, wide open community strategy.

If you ever wanted to participate in a large, meaningful, and highly visible open source project, but did not feel confident (or safe), this is your cue. We are turning a successful project on its head with the sole criterion of making it a welcoming place for everyone. I am sure we have plenty to improve and iterate on, but the groundwork has been laid, and the doors are open.

Are you in?

Thank You

It’s hard to describe the joy of writing a successful open source project. There is the satisfaction from people using your code and knowing that all that effort wasn’t for naught. There is the gain in reputation and boost to one’s career that translate most directly to the bottom line. But all these measurable metrics still fall short of the true joy.

For the longest time, hapi was about me. The most gratifying feeling is knowing that it really isn’t anymore. It is a community of kind, smart, and dedicated people who made a decision to use it and be part of it. I still get to lead but it is not mine anymore. It has taken a life of its own and that’s the most rewarding return you can ever expect. Creating something that far exceeds my own abilities.

I’ve been thinking a great deal lately about how I want to cash in on this success and I’ve decided to use my influence to fix diversity in tech. I think the hapi community is leading the way in stability (enterprise grade with that startup smell), quality (fanatical code coverage), and openness (almost twenty leads working together in the open). It should be enough, but what’s enough? So I decided to set my own personal sight on diversity.

As a gay man (yes, still a white male) I have been fortunate to never know true adversity. But I can easily empathize with those who do not feel comfortable stepping forward and joining the crowd (and let’s face it, white males do not have the best track record). I have been listening to non-male developers I respect and I’ve been reading research on the topic. To me, fixing diversity is the next big challenge.

I plan to use whatever influence I’ve earned, whatever open source karma I’ve got, to help nudge gender diversity in tech. Not because I have a daughter (I hate this stupid excuse), but because it is the right thing. It’s that simple. And I do have a lot to gain from it because the few women I got to work with in my career have been outstanding and I would like to see more.

So thank you for putting me in a position where I can try and make a difference. The support and adoption the hapi framework is getting is one of the highlights of my career and probably what I will be known for in the near future (much better than that authentication protocol I once wrote).

I would much rather be known for creating the most diverse community in tech. Now that’s something worth aspiring for.

Performance at Rest

Disclaimer: the author is the lead developer of a consistently poorly performing node web framework (as measured by framework benchmarks). I mean, it really sucks.

Benchmarking frameworks is fucking stupid.

Every few months someone comes up with yet another system to benchmark web frameworks. They set up a few simple scenarios like serving static content, a JSON reply, and sometimes rendering views or setting cookies. The typical examples contain almost no business logic. It is a theoretical test of how fast a framework performs when it accomplishes nothing.

In this scenario, the lighter the framework is – that is, the less functionality it is offering out of the box – the faster it is going to perform. It is pathetically obvious. It is one thing to compare the performance of various algorithms but when the biggest factor is how much other “stuff” is performed, you don’t need to write tests – you need to RTFM.

To those who occasionally bring up hapi’s poor performance on these ridiculous charts, I make two points.

First, hapi is slower than bare node and express because it does more. Don’t you want protection against your process going out of memory? What about event queue delay protection? What about client request timeouts? Server response timeouts? Protection against aborted requests? Built-in request lifecycle logging? Input validation? Security headers? Which one of these is optional? If you say most – hapi is clearly not for you.

Second, the Walmart mobile servers built using hapi were able to handle all mobile Black Friday traffic with about 10 CPU cores and 28GB of RAM (of course we used more but they were sitting idle at 0.75% load most of the time). This is mind blowing traffic going through insignificant computing power. Why would anyone spend engineering resources trying to optimize it when it is clearly performant enough?

But this post is not about how stupid framework benchmarking is.

To understand what makes benchmarking node different, you need to understand what is under the hood. Node is built using Google’s v8 JavaScript engine. v8 is a highly complex virtual machine with an ever changing runtime optimizer. Picking one coding style over another can carry with it double digit performance gains. For example, using a for-loop is often 80% faster than a functional for-each. This matters because a big part of making node applications faster requires constant tweaking to benefit from optimizations and avoid the blacklist of unoptimized code (e.g. any function with try-catch).
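A rough way to see the coding-style effect for yourself is a micro-benchmark like the sketch below. The function names (`sumWithFor`, `sumWithForEach`, `timeIt`) and the array size are my own choices for illustration, and the gap you measure will depend heavily on the node and v8 version, so treat the output as a data point, not a rule.

```javascript
// Micro-benchmark sketch: sum an array with a for-loop and with forEach.
// All names and sizes here are illustrative assumptions.
function sumWithFor(items) {
    var total = 0;
    for (var i = 0; i < items.length; ++i) {
        total += items[i];
    }
    return total;
}

function sumWithForEach(items) {
    var total = 0;
    items.forEach(function (item) {
        total += item;
    });
    return total;
}

// Run fn(items) `rounds` times and return the elapsed time in milliseconds
function timeIt(fn, items, rounds) {
    var start = process.hrtime();
    for (var r = 0; r < rounds; ++r) {
        fn(items);
    }
    var diff = process.hrtime(start);      // [seconds, nanoseconds]
    return diff[0] * 1e3 + diff[1] / 1e6;
}

var items = [];
for (var n = 0; n < 100000; ++n) {
    items.push(n);
}

// Warm up both functions so the optimizer has seen them before timing
timeIt(sumWithFor, items, 20);
timeIt(sumWithForEach, items, 20);

console.log('for-loop:', timeIt(sumWithFor, items, 200).toFixed(1), 'ms');
console.log('forEach: ', timeIt(sumWithForEach, items, 200).toFixed(1), 'ms');
```

Both functions return the same sum, so any difference in the timings comes from how the loop is expressed and how the optimizer treats it.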

In addition to the optimizer, v8 has to perform continuous garbage collection. This is required to free up memory taken by objects that are no longer being used. In order to minimize its impact on performance, v8 tries to limit garbage collection to application idle time. Also, the longer an object “survives” garbage collection the less likely it is to be removed quickly when it is no longer needed. And the more stuff you do, the more objects are generated and need to be cleaned up.

The other critical component is the node event loop. The event loop is the “single thread” running your code. It is not exactly a single thread but as far as your application is concerned, it is a single threaded engine. Everything that happens in node is called from the event loop. It is a queue of I/O events and timers which trigger your callbacks – basically, your entire node application is nothing but a collection of callbacks.

What allows node to handle a large number of requests is the fact that most activities block the event loop for a very short period of time. For example, typical web requests require some database items. When those are fetched, node puts the request on hold and handles other requests until the database comes back with the item. Node requires this downtime to handle multiple requests. v8 requires this downtime to perform garbage collection.

When v8 is performing garbage collection, the event loop is paused. When a callback takes a long time to return control back to the event loop, all other callbacks, including expired timeouts, are paused. If your business logic performs some calculation that takes 100ms to perform, you will not be able to handle even 10 requests per second. Simple math.
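The sketch below shows both halves of that claim. `blockFor` is a hypothetical stand-in for CPU-bound business logic (a busy-wait, nothing more), and `maxRequestsPerSecond` is just the arithmetic: 1000ms divided by the time each request holds the event loop.

```javascript
// Busy-wait to simulate CPU-bound business logic (hypothetical helper)
function blockFor(ms) {
    var end = Date.now() + ms;
    while (Date.now() < end) {
        // spin: nothing else can run on the event loop meanwhile
    }
}

// Ceiling on requests per second when each one blocks the loop for blockingMs
function maxRequestsPerSecond(blockingMs) {
    return 1000 / blockingMs;
}

var scheduled = Date.now();
setTimeout(function () {
    // Due after 10ms, but it cannot fire until blockFor() returns control
    console.log('timer fired after', Date.now() - scheduled, 'ms');
}, 10);

blockFor(100);
console.log('ceiling:', maxRequestsPerSecond(100), 'requests per second');
```

The timer is scheduled for 10ms but only fires once the blocking call has finished, and the ceiling is a best case: it leaves no time at all for the framework, I/O callbacks, or garbage collection.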

Why does this matter for benchmarking? Because these benchmark systems focus on performance at maximum load. They basically measure how many requests a server can handle under heavy load. The goal is to squeeze everything you can out of your computing resources. The problem is that under 100% CPU, node’s performance is dreadful.

At very high CPU loads, node’s event loop is fighting with the v8 garbage collector over resources. They can’t both run at the same time. This means that instead of getting the most out of your resources, you are wasting energy switching between two competing forces. In fact, the vast majority of node applications should be kept at CPU load levels of under 50%. If you want to maximize your resources, run multiple processes on the same hardware (with enough margin for the operating system).

If our production servers show more than single digit CPU load, we consider that a significant problem. If your node process is CPU bound, you are doing something wrong, your deployment is misconfigured, or you don’t have enough capacity.

What makes things worse when doing this sort of benchmarking is that the load is almost exclusively blocking because there is no business logic to go and create that downtime. Most of the internal framework facilities, such as parsing headers, cookies, and payload processing are blocking activities that require better downtime management than an application with empty business logic provides.

There is still great value in benchmarking applications. But if performance under load isn’t meaningful, what is? That’s where performance at rest comes in.

Performance at rest is the best-case-scenario of your application under no load. It’s how fast you can drive from point A to point B without anyone else on the road. It is a very significant number because it directly translates to user experience and relative performance. In other words, if your server can do an unlimited number of requests per second, but they each take 60 seconds to complete, your amazing capacity means nothing because all your users will leave.

Measuring performance at rest is actually a bit more involved than just running a single request and measuring how long it takes to complete. This has a lot to do with the v8 garbage collector and the v8 runtime optimizer. These two are working for and against your application. The first time you make a request, nothing is optimized and your code will be very slow. The 10th time you make a request, the garbage collector might kick in and pause it in the middle. Testing once is not enough to get real numbers.

What you want to do is come up with a scenario in which you are making multiple requests continuously over time, while keeping your server CPU load within an acceptable range.
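One way to set up such a scenario is sketched below. It is only an outline under my own assumptions: `measureAtRest`, `summarize`, and `percentile` are hypothetical helpers, the request is a stand-in you would replace with a real HTTP call to your server, and the pause between requests is what keeps the load low. The first few samples are discarded because they are taken before the optimizer has warmed up.

```javascript
// Nearest-rank percentile over an already sorted array of numbers
function percentile(sorted, p) {
    var rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.max(rank, 1) - 1];
}

// Drop the first `warmup` samples, then report median and 99th percentile
function summarize(latencies, warmup) {
    var steady = latencies.slice(warmup).sort(function (a, b) {
        return a - b;
    });
    return {
        samples: steady.length,
        median: percentile(steady, 50),
        p99: percentile(steady, 99)
    };
}

// Issue `count` requests one at a time, pausing `pauseMs` between them so the
// server stays far from saturation. `request(done)` is supplied by the caller.
function measureAtRest(request, count, pauseMs, callback) {
    var latencies = [];
    var next = function () {
        if (latencies.length === count) {
            return callback(latencies);
        }
        var start = process.hrtime();
        request(function () {
            var diff = process.hrtime(start);
            latencies.push(diff[0] * 1e3 + diff[1] / 1e6);   // milliseconds
            setTimeout(next, pauseMs);
        });
    };
    next();
}

// Stand-in request for illustration; swap in a real call to your server
measureAtRest(function (done) { setImmediate(done); }, 50, 5, function (latencies) {
    console.log(summarize(latencies, 10));
});
```

Reporting a median and a high percentile rather than a single number makes the occasional garbage collection pause visible instead of letting it skew or hide in an average. You would still need to watch the server CPU separately to confirm it stays in the range you consider acceptable.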

This is where slow performance indicates a problem. If under these conditions, and with the feature-set you require, your web framework is performing poorly, it should be fixed or replaced. If the overhead of the framework is making your requests too slow at rest, the framework is either too heavy for your use case, or is underperforming and should be fixed.

Understanding your application’s performance is critical. Benchmarking without taking into account the very nature of your platform is harmful.

Dear CEO (of a node-powered corporation)

First, congrats! You didn’t force your developers to only use those “proven” technologies and allowed some innovation to invade your organization. You now get to join the club of companies using node. That’s pretty awesome. Node is going to significantly improve your company’s productivity, ability to hire top talent, keep your developers happy, and get back to building products, not boilerplate and abstractions.

But as with any cutting edge technology, node comes with its own risks. Node is proven but it is also very new. It is in its most critical phase of achieving mass adoption right before it is fully baked. This means complexity is at its highest level, right when the contribution payoff is at its lowest. In other words, most developers are not motivated enough or skilled enough to move it forward.

This is where you come in. But first, a quick story.

A couple of weeks ago the folks at ^Lift Security identified a flaw in v8, the JavaScript engine node is built on top of. This particular flaw caused memory to leak when a certain exception was thrown, and it was an exception particularly easy to reproduce. In other words, it made it pretty easy to take down an entire site built on node if it wasn’t set up with sufficient capacity and restart automation.

The good news was that this security hole was quickly identified, corrected, and a patch released. The bad news is that the patched version introduced a new bug. This is par for the course in software development. Shit happens.

The patched version came out on a Thursday. Most companies grabbed it on Friday. On Saturday morning, when I upgraded my own development environment, I discovered that this new version breaks a feature in hapi, our enterprise-grade open source node framework. The specifics of the bug are somewhat “amusing” – it caused timeouts set with fractional milliseconds to basically get the entire node event loop stuck. Now, why would anyone set a timeout using a floating point number? Well, that was another, very old bug in hapi that never mattered before.

What makes this combination of bugs even more “amusing” is that it was in the code responsible for keeping server load under control. With these two bugs, servers would stop working altogether under load instead of handling it. Slightly different from the intended outcome.

So – Saturday morning, major security bug announced, companies upgrading their environments, and our framework cannot work on the new, safer version.

Under past circumstances, we would have contacted the core team via an issue and IRC, and waited for them to find the time to identify and fix the bug. And usually that would work well. The problem is, I am among those responsible for the development of a system that’s becoming more and more critical to the bottom line of a gigantic operation. Sound familiar? This is an unacceptable risk.

But this story has a happy ending! Within an hour of me identifying the issue, Chris Dickinson – our in-house node core contributor – was able to identify the root cause, and together we released a patched version of hapi with a workaround. This is the kind of SLA an operation like Walmart requires.

Back to you.

Node is ready, today, for taking on the most critical components of your business. But like any cutting edge technology, it comes with risks. These risks can be easily mitigated by making sure you have the right team and right resources available to you. Access to a node core contributor is absolutely essential. This is not a luxury.

Let me make it absolutely clear: if you use node for any serious business (and I will leave it up to you to define what “serious” means), you are being irresponsible to your company and shareholders if you do not secure the appropriate access to node core resources under an SLA.

There are a few ways to gain such access.

The best of course (but also the one with the biggest commitment and probably highest price tag) is to hire a full time developer to work exclusively on node core. But like any business decision, this has to be justified and will likely only make sense at a price point that’s as expensive as (or cheaper than) paying someone else for the same SLA.

If you are not quite there yet, consider contracting a part time consultant or hire a company with such resources under an SLA that fits your needs. It is pretty easy to find such providers. Joyent provides this service as part of their SmartDataCenter product (as well as some limited support for Linux). NodeSource is a new company (made out of some of the most experienced node developers) offering a comprehensive solution. There are a few more, just ask around.

This is not only smart business, it is also the right thing to do. It provides crucial support to a technology you directly benefit from. It is the easiest way for you to pay back and support the community. It will also earn your company plenty of good karma points, which you will find handy when it’s time to hire the best talent.

Not sure how to go about this?

Names and Diversity

(Previously titled Nipples and Poop)

Last month I got to experience a childhood dream, one I never imagined possible. I got to sit in the front row and watch Monty Python live on stage. Twice! It was magical. It was the best 40th birthday gift to myself possible – getting to relive being 10 with the full emotional impact of reliving well memorized moments.

I grew up watching VHS tapes of the Flying Circus. It had tremendous influence over my humor, but more importantly, the way I look at life. The absurdity of it all. The total disregard for institutions and sacred cows. If you’ve ever spent an evening with me, I am sure you’ve heard some fucked up stories about something I did against the very fabric of the institution I was part of – school, army, college, work. It’s who I am.

When I set course on hapi, an explicit goal was to change the way enterprise software is created.

Not just technically, but culturally. The configuration architecture was designed to make it simpler for entry level developers to jump right into complex requirements. The plugin architecture was designed to support a large team by breaking up large monolithic systems into smaller, self-contained parts. And the module names, logos, and references were designed to make people smile and stop taking enterprise engineering so fucking seriously.

Not everyone finds the same jokes funny.

People who grew up loving the Ren and Stimpy cartoons come into the hapi world with a grin on their face. A sense of giddiness from bringing that world of silliness into their day job. Others find it silly and just ignore it dismissively. That’s ok. The trick is to know who you are going to offend and lose as the price of making a joke.

When I was asked to name a hapi plugin that takes automatic core dumps when the process fails, I named it ‘poop‘. It was a perfect pun. We now have a module that very serious ops people at large companies, my employer included, have to use and they have to say ‘poop’ in their very serious meetings. This is powerful change, and it is because it is silly.

Sure, some people find it offensive enough not to use, and that’s fine. It’s a tiny module that is trivial to recreate. It’s not like I named the entire framework ‘doodie’. But the key here is that the group of people who might find ‘poop’ offensive isn’t exclusively any segment of the population. People who take themselves too seriously are not a protected class.

That’s not the case with ‘nipple’.

The nipple module was initially created as an internal component that no one was meant to use except for those working on hapi core. I know this sounds like an excuse for picking an inappropriate name, and it is, but it was also what was going through my mind – a public private joke. And I’m sorry for that.

The problem is that, in the larger context of a community built around the hapi framework, this turns off women from using and contributing to the project. That’s unacceptable! There is no acceptable rationale for creating an environment hostile to any segment of the population.

Creating an environment in which a woman is forced to say “nipple” to a predominately male audience is unacceptable. I don’t think that requires any explanation. It might also create a situation considered sexual harassment in many places. This has nothing to do with political correctness which is all about appearances.

What is interesting about the ‘nipple’ experience is that no one brought this issue up. I’ve had very open, frank conversations with women about making a significant shift in diversity within the hapi community and while other topics came up, this didn’t (even though it turned out to be on their mind). But when I asked plainly on Twitter what people thought, the response was strong, quick, and overwhelming.

The issue only came up as part of my review of all hapi language for potentially offensive words or expressions. I have made it my goal to dramatically change the makeup of the hapi community. I want to create a project that’s the role model of inclusiveness and diversity. The gold standard in how to build the most inclusive and safe environment in open source. Clearly we have a long way to go.

A big part of that includes reaching out to people and soliciting contribution. You change a community by starting with the diversity of its leadership. So I set to contact people from under-represented groups within the hapi leadership. All of a sudden, I felt a bit uncomfortable asking a female developer if she wanted to take lead on ‘nipple’. It stopped being funny in my head.

An hour after asking for feedback, the ‘nipple’ module was renamed to ‘wreck’, a pun on ‘req’ (common short name for ‘request’ in node). It’s still silly. We are going to continue and review the language used around the project and solicit feedback. I am going to continue asking questions, and I am confident we’ll get this right.

Bringing this topic up surfaced some unhappiness with our use of non-descriptive (and outright silly) names for modules. Turns out, a lot of people don’t share my sense of humor. No surprise there. But that’s missing an important point. hapi was created to be silly, to change the stiff corporate culture, one silly module name at a time. We take our code more seriously than most.

Looking at the audience at the Monty Python show, gender diversity was very much present. Silly humor doesn’t automatically translate to a boy’s club environment. The burden is clearly on me (us) to make sure that’s the case, but I am not ready to give up on silly.

I think the line between ‘nipple’ and ‘poop’ is clear, between offensive and silly, but this perspective, of course, is open to a community debate.

Open Source ain’t Charity

We’re spending real money on open source. Since hapi has been almost exclusively developed by the mobile team at Walmart, we had to justify the significant expense in open source the same way we justify any other expenditures. We had to develop success parameters that enable us to demonstrate the value and to make on-going investment sustainable.

The formula we constructed produced an adoption menu where the size of the company using our framework translated to “free” engineering resources. For example, every five startups using hapi translated to the value of one full time developer, while every ten large companies translated to one full time senior developer. We measure adoption primarily through engagement on issues, not just logos on the website.
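As a purely illustrative sketch, the arithmetic of that menu looks something like the following. The function name `adoptionValue` and the shape of the `ratios` object are my own inventions for this example; only the two ratios (five startups per developer, ten large companies per senior developer) come from the description above.

```javascript
// Ratios quoted in the text; the object shape is an assumption for illustration
var ratios = {
    startupsPerDeveloper: 5,
    largeCompaniesPerSeniorDeveloper: 10
};

// Convert adopter counts into equivalent full time engineering resources
function adoptionValue(startups, largeCompanies, ratios) {
    return {
        developers: startups / ratios.startupsPerDeveloper,
        seniorDevelopers: largeCompanies / ratios.largeCompaniesPerSeniorDeveloper
    };
}

// Hypothetical adopter counts, not real hapi numbers
console.log(adoptionValue(20, 30, ratios));
```

Keeping the ratios as a separate input makes it easy to rerun the same comparison when the weights are revised.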

These numbers change a couple times a year as the nature of contributions evolves, but they provide a solid baseline for progressive comparison. By having a clear way to measure ROI, we can justify more resources. It allows us to clearly show that by paying developers to work on hapi full time, we get back twice (or more) that much in engineering value. Same goes for sponsoring conferences. It all has to translate back to measurable engagement.

Of course, not everything is just numbers. Since Walmart tends to adopt hapi features about six months after they have been introduced, the value of external early adopters means a significant quality and stability boost. We are also among the top work destinations for node developers. We have been getting about a dozen qualified candidates for every node opening we advertise. But while these benefits are important, they are very hard to quantify and we rarely rely on them to justify investments.

When we’re asked to sponsor an event we look at the community the event is serving and the impact a sponsorship can have on our adoption benchmarks. Unlike many other companies, we don’t have an evangelism budget. We sell goods, not APIs or services, and our current interaction with the developer community is limited to hiring.

If this all sounds very cold and calculated, it’s because it is. Looking for clear ROI isn’t anti-community but pro-sustainability. It’s easy to get your boss to sponsor a community event or a conference, to print shirts and stickers for your open source project, or throw a release party for a new framework. What’s hard is to get the same level of investment a year, two years, or three years later.

What is even harder is to justify hiring a full time node contributor and other resources dedicated solely to external efforts. But with a strong, proven foundation of open source investments, even that becomes an obviously smart move – by the numbers.